title: Spatiotemporal blocking of the bouncy particle sampler for efficient inference in state space models
authors: Goldman, Jacob Vorstrup; Singh, Sumeetpal Sidhu
date: 2021-01-08

Abstract. We propose a novel blocked version of the continuous-time bouncy particle sampler of Bouchard-Côté et al. [2018] which is applicable to any differentiable probability density. This alternative implementation is motivated by blocked Gibbs sampling for state space models [Singh et al., 2017] and leads to a significant improvement in effective sample size per second and, furthermore, allows for substantial parallelization of the resulting algorithm. The new algorithms are particularly efficient for latent state inference in high-dimensional state space models, where blocking in both space and time is necessary to avoid degeneracy of MCMC. The efficiency of our blocked bouncy particle sampler, in comparison with both the standard implementation of the bouncy particle sampler and the particle Gibbs algorithm of Andrieu et al. [2010], is illustrated numerically for both simulated data and a challenging real-world financial dataset.

Markovian state space models are a class of probabilistic graphical models applied in biology, signal processing, target tracking, finance and more; see Cappé et al. [2006] for a technical overview. In our setup, a latent process (x_n, n ≥ 1) on R^d evolves according to a state transition density p(x_n | x_{n-1}), with p(·) denoting a generic density. The dimension of the latent process is its spatial dimension, although often no physically relevant interpretation is available. We indirectly observe the latent process through a noisy set of observations (y_n, 1 ≤ n ≤ N) defined on R^m, where each realization depends only on the current value of the latent state, y_n | x_n ∼ p(y_n | x_n). For convenience we introduce the sequence notation i:j = (i, i+1, ..., j) for j > i. Unless otherwise mentioned, the observation sequence y_{1:N} is fixed throughout. Given the observation sequence, we define smoothing as the off-line estimation of the conditional joint probability density p(x_{l:m} | y_{1:N}), with 1 ≤ l ≤ m ≤ N. We will be interested in the case where the target is the full conditional p(x_{1:N} | y_{1:N}). Smoothing is generally a hard problem due to the high dimensionality of the state space and the spatiotemporal interdependence of the latent states; below we give a brief historical overview and subsequently detail our contributions. Sequential Monte Carlo methods form the backbone of most smoothing algorithms. A popular early example is the sequential importance resampling smoother of Kitagawa [1996], which utilises the entire trajectories and final particle weights of a particle filter to generate smoothed estimates. This method suffers from severe particle degeneracy, as each resampling step weakly decreases the number of distinct paths available to estimate the joint posterior. A solution was the algorithm of Godsill et al. [2004], which introduces a sequence of backward passes incorporating the state transition. This algorithm has computational cost that is linear in the length of the time series, the number of particles and the number of samples. Similar algorithms, like the general two-filter smoother of Briers et al. [2010], have equivalent computational costs.
In Finke and Singh [2017], an approximate localization scheme is proposed for the forward-backward algorithm, including theoretical results that guarantee bounds on the asymptotic variance and bias for models that are sufficiently local. In the landmark paper of Andrieu et al. [2010], the authors introduced particle Markov chain Monte Carlo, which combines particle filters with either Metropolis-Hastings or Gibbs algorithms. The latter algorithm, denoted particle Gibbs, generates a single trajectory chosen according to the final particle weights of a particle filter run conditionally on a fixed trajectory. Particle Gibbs is stable if the number of particles grows at least linearly with the length of the time series; further theoretical analysis of ergodicity and asymptotic variance is provided in Andrieu et al. [2018] and Chopin and Singh [2015]. More recently, couplings of conditional particle filters have been introduced in Jacob et al. [2020], Lee et al. [2020], and provide unbiased estimators with asymptotically exact confidence intervals. Unfortunately, the performance of particle Gibbs depends entirely on the efficiency of the conditional particle filter, which, like the particle filter, can suffer from weight degeneracy. If the spatial dimension is large, the curse of dimensionality described in Bengtsson et al. [2008] implies that infeasibly many particles are required to effectively approximate the posterior; localization of proposals by exploiting spatial conditional independence was subsequently introduced in Rebeschini et al. [2015], but this method is not generically applicable. As an alternative, the space-time particle filter of Beskos et al. [2017] is applicable if the likelihood can be written in a product form of terms that depend on an increasing number of latent dimensions. In the data assimilation field, a very popular method for high-dimensional filtering is the ensemble Kalman filter, but the theoretical understanding of this algorithm is still quite limited; see however Del Moral and Tugaut [2018], de Wiljes et al. [2018] for recent work in this regard. Overall, there is no generically applicable, asymptotically exact approach that makes the particle filter viable in high-dimensional time-series models. In comparison with filtering, which is known to be uniformly stable in time under reasonable assumptions, see Van Leeuwen et al. [2019], the difficulty of smoothing increases as the length of the time series increases. In such scenarios, Whiteley [2010], in the Royal Statistical Society's discussion of Andrieu et al. [2010], proposed to incorporate a backward pass similar to the algorithm of Godsill et al. [2004] to avoid particle paucity in the early trajectories; for low spatial dimensions, the resulting algorithm was shown in Lee et al. [2020] to be computationally efficient and stable as the time horizon grows. A conceptually similar method that updates the fixed reference trajectory has been developed in Lindsten et al. [2014]. As an alternative to the manipulation of particle lineages, applying the particle Gibbs algorithm inside a generic Gibbs sampler over temporal blocks is proposed in Singh et al. [2017], where the authors furthermore show a stable mixing rate as the length of the time series increases. Singh et al. [2017] also show that the sharing of states via overlapping blocks increases the mixing rate as the overlap increases.
While the issue of long time series has been addressed successfully by the algorithms detailed above, the curse of spatial dimensionality means that particle Gibbs and its more sophisticated extensions are currently unworkable in practical smoothing applications featuring high spatial dimensions. As a solution to the issues in high dimension, we propose a novel blocked sampling scheme based on irreversible, continuous-time piecewise deterministic Markov processes. Methods based on this class of stochastic process were originally introduced as event-chain Monte Carlo in the statistical physics literature by Bernard et al. [2009], and have subsequently been developed further in the computational statistics literature, see for example Bouchard-Côté et al. [2018], Bierkens et al. [2019], Wu and Robert [2020], Power and Goldman [2019]. In practice, the algorithms alternate persistent dynamics of the state variable with jumps in its direction at random event-times, and they depend only on evaluations of the gradient of the log-posterior. Local versions of these samplers, see Bouchard-Côté et al. [2018] and Bierkens et al. [2020], can exploit any additive structure of the log-posterior density to update trajectories more efficiently; however, as discussed above, long-range dependencies between states indicate that sharing of information is desirable to achieve efficient mixing. To allow for such sharing of information, we introduce a blocked version of the bouncy particle sampler of Bouchard-Côté et al. [2018] that utilises arbitrarily designed overlapping blocks. (Our resulting algorithm is different from the approach presented in Zhao and Bouchard-Côté [2019], where the BPS is run on conditional distributions in a Metropolis-within-Gibbs type fashion.) The blocking scheme is implementable without any additional assumptions on the target distribution, and is therefore useful for generic target densities, particularly in cases where the associated factor graph is highly dense. As our second contribution, we introduce an alternative implementation of the blocked sampler that leverages partitions of the blocking strategy to simultaneously update entire sets of blocks. The number of competing exponential clocks in the resulting sampler is independent of dimension, i.e. O(1) clocks for any target, and the sampler allows, for the first time for a piecewise-deterministic Monte Carlo algorithm, updates at event-times to be carried out in parallel. Our numerical examples indicate that the blocked samplers can achieve noteworthy improvements over the bouncy particle sampler, both in terms of mixing time and effective sample size per unit of time, even without the use of parallelization. In addition, the blocked sampler provides efficient sampling of state space models when particle Gibbs methods, which are widely considered state of the art for state space models, fail due to high spatial dimensions. In what follows, a subscript on a variable x denotes a temporal index, while a superscript indicates a spatial index. By x ∼ N(0, 1) we mean that x is distributed as a standard normal variable, whereas by N(x; 0, 1) we mean the evaluation at x of the standard normal density; this notation is extended to other densities. We use the standard sequence notation i:j = (i, i+1, ..., j-1, j) and [n] = (1, 2, ..., n-1, n). A generic Poisson process is denoted by Π, and the associated, possibly inhomogeneous, rate function is the function t → λ(t).
Let M_{m,n} be the space of m × n real-valued matrices, with m referring to rows and n to columns, respectively. We denote by ∘ the Hadamard (entrywise) product operator. The standard Frobenius norm of a matrix X ∈ M_{m,n} is

||X||_F = \sqrt{tr(X^T X)} = \sqrt{\sum_i \sum_j x_{i,j}^2},

and the Frobenius inner product with another matrix Y ∈ M_{m,n} is subsequently

⟨X, Y⟩_F = tr(X^T Y) = \sum_i \sum_j x_{i,j} y_{i,j}.

The class of state space models we consider has differentiable transition and observation densities

p(x_n | x_{n-1}) = exp{−f(x_{n-1}, x_n)},   p(y_n | x_n) = exp{−g(x_n, y_n)}.

It is thus natural to work in log-space for the remainder of the paper, and we note in this regard that all probability density functions are assumed to be normalised. The exponential notation is therefore merely a notational convenience to avoid repeated mentions of log-densities. We also only require access to derivatives of f and g, which may have more convenient expressions than the full probability distribution. The negative log of the joint state density of the entire latent state x ∈ M_{d,N} is called the potential energy U : M_{d,N} → R, and is given by

U(x) = \sum_{n=1}^{N} f(x_{n-1}, x_n) + g(x_n, y_n).

To ease notation we drop the explicit dependence on y_{1:N} when writing the log conditional joint state density from now on. We will often need to refer to the derivative ∂U/∂x, which we denote as the matrix map ∇U : M_{d,N} → M_{d,N}, where the entry in the k'th row and n'th column is given by the partial derivative ∇U(x)_{k,n} = ∂U(x)/∂x_n^k. Again, we remind the reader that a subscript on a variable x denotes a temporal index, while a superscript indicates a spatial index. Recall that [n] = (1, 2, ..., n-1, n). A blocking strategy B is a cover of the index set of the latent states I = [d] × [N], consisting solely of rectangular subsets. A generic block B is always of the form i:j × l:m with i < j, l < m, with the coordinates referring to the spatial and temporal dimensions, respectively. The size of a block is the ordered pair (|i:j|, |l:m|). Blocks are allowed to overlap; we call the interior of a block the indices that it does not share with any other block. The neighbourhood set N(B) of a block B is the set of blocks with which it shares at least one index, and always includes the block itself. A blocking strategy is temporal if each block in the strategy is of the form 1:d × l:m; these are the most natural strategies to consider for state space models and will be the general focus in the rest of the paper, but the methods presented below work for arbitrary strategies. To improve mixing of blocked samplers in general, it is often necessary to design a blocking strategy such that within-block correlation between variables is large while the correlation with out-of-block variables is small, see for example Liu et al. [1994] or Turek et al. [2017]. For state space models, this naturally implies blocking across time, and in Figure 1 a temporal strategy with overlap ξ and interior δ is illustrated. We can in this case divide the blocks into even (B_k of Figure 1 with even index k) and odd subsets such that each subset consists of non-overlapping blocks, see again Figure 1. As analyzed in Singh et al. [2017] for blocked Gibbs samplers, temporal overlap leads to improved sharing of information across time and subsequently improved mixing. If the spatial dimension is very high, it can be necessary to block in the spatial domain as well; blocking strategies should in this case aim to exploit any spatial decorrelation if possible.

Figure 1: A temporal blocking strategy with overlap ξ and interior δ between blocks highlighted.
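To make the block bookkeeping concrete, the following is a minimal sketch in Python/NumPy (our own illustration, not the authors' code; the function names are ours) of a temporal blocking strategy of the kind shown in Figure 1, together with the per-coordinate block counts used by the φ variable introduced in the next section.

```python
import numpy as np

def temporal_blocking(d, N, width, overlap):
    """Enumerate a temporal blocking strategy: every block spans all d spatial
    dimensions (rows) and `width` consecutive time points (columns), with
    consecutive blocks sharing `overlap` time points."""
    stride = width - overlap
    starts = list(range(0, N - width + 1, stride))
    if starts[-1] + width < N:
        starts.append(N - width)          # shift the last block so it reaches time N
    return [(np.arange(d), np.arange(s, s + width)) for s in starts]

def block_counts(blocks, d, N):
    """Per-coordinate block counts (the phi variable of the next section):
    phi[k, n] = number of blocks containing coordinate (k, n)."""
    phi = np.zeros((d, N))
    for rows, cols in blocks:
        phi[np.ix_(rows, cols)] += 1
    return phi

# Example in the spirit of Figure 1: width 20, overlap 10, so coordinates in the
# shared regions belong to two blocks (phi = 2) and interior coordinates to one.
blocks = temporal_blocking(d=3, N=100, width=20, overlap=10)
evens, odds = blocks[::2], blocks[1::2]   # each subset is internally non-overlapping
phi = block_counts(blocks, d=3, N=100)
```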
The strategy will be efficient if the overlap ξ is large enough to incorporate relevant information from neighbours. A few more remarks on notation: the restriction of x ∈ M_{d,N} to a block B = i:j × l:m is the submatrix x_B ∈ M_{|i:j|,|l:m|} obtained by deleting all but the rows i:j and columns l:m of x. Similarly, the block restriction of ∇U is the map ∇U_B : M_{d,N} → M_{|i:j|,|l:m|} defined by ∇U_B(x) = (∇U(x))_B. We extend this notation to the state and the velocity, with the submatrix under consideration indicated by a subscript B. In this section we derive conditions under which the bouncy particle sampler of Peters et al. [2012], Bouchard-Côté et al. [2018] can be run in blocked fashion; the resulting algorithm therefore applies to any differentiable target distribution π. If B consists of a single block 1:d × 1:N, the description below reduces to the standard bouncy particle sampler, and it is therefore redundant to describe both. The class of piecewise-deterministic Markov processes we consider is a coupling of the solution x(t) of the ordinary differential equation dx(t)/dt = v(t) and a Markov jump process v(t), where both the transition operator Q(v, dv) and the rate process λ(t) depend on x(t) as well; v(t) will henceforth be called the velocity. The joint process (x(t), v(t)) takes values in M_{d,N} × M_{d,N}. Given an initialization (x(0), v(0)), the state flows as (x(t), v(t)) = (x(0) + t · v(0), v(0)) until an event τ is generated by an inhomogeneous Poisson process with rate λ(t). At this point the velocity changes to v(τ) ∼ Q(v(0), dv), and the process re-initializes at (x(τ), v(τ)). To use such a process for Markov chain Monte Carlo, the jump rate λ(t) and the transition kernel Q of v(t) are chosen such that the marginal stationary distribution of (x(t))_{t∈[0,∞)} is the target distribution of interest. Exactly as in Metropolis-Hastings algorithms, we always want to move into regions of higher probability, but desire to change direction, by a new choice of velocity vector, as we enter regions of declining probability. This in turn implies that the rate is determined by the directional derivative of the energy U in the direction of v, while the transition kernel Q is a deterministic involution or a random velocity change; for general details see Vanetti et al. [2017]. Blocking of this process corresponds to a localization of the rate function and the transition kernel, such that each block is equipped with its own random clock and a corresponding local velocity updating mechanism. At an event, only the velocities within a single block are changed, while the overall invariant distribution is preserved. In comparison with discrete-time blocking, which updates the variables one block at a time while keeping every other variable fixed, in continuous time the block direction is changed while keeping every other direction fixed. For dimensions that lie in multiple blocks, the additional clocks imply an excess number of events compared to the base case of no overlap; the φ variable introduced below adjusts for this discrepancy by speeding up the velocity of the shared dimensions. Intuitively, as a dimension shared by k blocks will have events k times as often, it should move at k times the speed to compensate. This also aligns exactly with discrete-time blocked sampling, where dimensions shared between two blocks are updated twice as often. We now present the blocked bouncy particle sampler in detail.
We assume that the velocity is distributed such that each v_n^k ∼ N(0, 1) in stationarity, and the stationary joint distribution of all velocities has density p_v(v). For a blocking strategy B, we introduce the auxiliary variable φ ∈ M_{d,N} with entries φ_n^k = #{B ∈ B | (k, n) ∈ B}; that is, φ_n^k counts the number of blocks that include the k'th spatial dimension and the n'th temporal dimension. Given φ, the block-augmented flow of the ordinary differential equation driving x(t) is t → x + t · (φ ∘ v); as mentioned, individual dimensions of x are sped up in proportion to how many blocks include them. With {x}_+ ≡ max{0, x}, the rate function of the Poisson process Π_B attached to a block B is

λ_B(t) = {⟨∇U_B(x + t · (φ ∘ v)), v_B⟩_F}_+,

and the associated superposition of all blocks is the Poisson process Π_B = ∪_{B∈B} Π_B. Events generated by this superposition are denoted τ_b, with b referring to a bounce. Note that the inner product corresponds to the directional derivative ∂U(x + t · v)/∂t restricted to B. For the transition kernel, we define reflect_x^B(v) as the (deterministic) reflection of the velocity v_B in the hyperplane tangent to the block gradient at x:

reflect_x^B(v)_B = v_B − 2 (⟨∇U_B(x), v_B⟩_F / ||∇U_B(x)||_F^2) ∇U_B(x),

while the remaining components of v are unchanged by reflect_x^B(v). (Note that only the velocities that correspond to the block B are updated.) The variable v will also be updated by full velocity resampling via an independent homogeneous Poisson process with rate γ to alleviate issues with reducibility, see [Bouchard-Côté et al., 2018, Section 2.2]; these event-times are denoted τ_r, with r referring to refreshment. Without writing the refreshment operator, the infinitesimal generator of (x(t), v(t)) is

L_bBPS f(x, v) = ⟨φ ∘ v, ∇_x f(x, v)⟩_F + Σ_{B∈B} λ_B(x, v) [f(x, reflect_x^B(v)) − f(x, v)],    (1)

with λ_B(x, v) = {⟨∇U_B(x), v_B⟩_F}_+; this is the sum of the block-augmented linear flow φ ∘ v driving x(t) and the sum of the Markov jump processes updating the block-restricted velocities v_B.

Proposition 1. Consider a blocking strategy B and a target density π(x) ∝ exp{−U(x)}. With the generator defined in Equation (1), the blocked bouncy particle sampler has invariant distribution π ⊗ p_v.

Proof. See Section A.1.

The method most closely corresponding to the blocked bouncy particle sampler is the factor algorithm presented in [Bouchard-Côté et al., 2018, Section 3.1]. If the target distribution factorises over a finite set of individual factors F such that

U(x) = Σ_{f∈F} U_f(x_f),

where x_f corresponds to the restriction of x to the components in the factor f, the local bouncy particle sampler of Bouchard-Côté et al. [2018] can be applied. Note that the derivation of the local bouncy particle sampler in Bouchard-Côté et al. [2018] is only considered under the above sum structure for the log-density U(x), and where each block is the complete set of variables x_f for a factor. This contrasts with the blocked sampler, where blocks are allowed to share variables arbitrarily and without the need for the energy to satisfy a sum structure. In practice the blocked sampler functions as a hybrid between the Zig-Zag sampler of Bierkens et al. [2019] and the bouncy particle sampler: it incorporates the reflection operator when performing bounces, which allows the velocity vector of multiple dimensions to be updated at event-times, but it combines this with a more local rate structure akin to that of the Zig-Zag sampler. In particular, if |B| = 1 for all B ∈ B and φ_n^k = 1 for all (k, n) ∈ [d] × [N], the algorithm reduces to a process very close to the Zig-Zag sampler, with the velocity vector components "flipping" at their individual reflection event-times (but with an invariant normal distribution for the velocities, compared to the binary uniform distribution of the standard Zig-Zag sampler).
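Given access to ∇U, the block rate and the block reflection are short computations; the sketch below is our own illustrative Python/NumPy code (not the authors' implementation), writing the rate as {⟨∇U_B(x), v_B⟩}_+ as in Equation (1).

```python
import numpy as np

def block_rate(grad_U, x, v, rows, cols):
    """Block event rate: positive part of the directional derivative of U
    restricted to the block, {<grad U_B(x), v_B>_F}_+."""
    gB = grad_U(x)[np.ix_(rows, cols)]
    vB = v[np.ix_(rows, cols)]
    return max(0.0, float(np.sum(gB * vB)))

def reflect_block(grad_U, x, v, rows, cols):
    """Reflect the block velocity v_B in the hyperplane orthogonal to the block
    gradient grad U_B(x); velocities outside the block are left unchanged."""
    v_new = v.copy()
    gB = grad_U(x)[np.ix_(rows, cols)]
    vB = v[np.ix_(rows, cols)]
    v_new[np.ix_(rows, cols)] = vB - 2.0 * np.sum(gB * vB) / np.sum(gB * gB) * gB
    return v_new
```

Here grad_U is any callable returning the d × N gradient matrix ∇U(x); the reflection is norm-preserving on the block, which is what makes the Gaussian velocity distribution invariant.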
In this sense (with single-coordinate blocks), the Zig-Zag sampler is naturally blocked, but it does not allow for sharing of information across dimensions. In Algorithm 1 the blocked bouncy particle sampler is presented in an implementable form. Due to the simplicity of the flow, the computational challenge of the algorithm is to generate correctly distributed event-times via Poisson thinning. The thinning procedure of Lewis and Shedler [1979] for simulating inhomogeneous Poisson processes has two steps: find a bounding process for which direct simulation is available, and subsequently use rejection sampling to keep correctly distributed event-times. To employ thinning, a local upper bound t → λ̄_B(t) for each block needs to be estimated. For some fixed lookahead time θ > 0 and current position (x, v), the local bounds must satisfy λ_B(t) ≤ λ̄_B(t) for all t ∈ [0, θ]; after θ time has passed, the bounds are recomputed at the new location (x + θ(φ ∘ v), v) if no reflection or refreshment has occurred in the interim. In all of our numerical examples we use a worst-case bound of this form. In some particular cases universal global bounds can be derived, but generally the bounds have to be estimated by evaluating the rate function at some future time-point. If the blocks are individually log-concave densities, evaluating the rate at the lookahead time, λ_B(θ), gives a valid bound until an event occurs; alternatively, one can apply the convex optimization procedure described in [Bouchard-Côté et al., 2018, Section 2.3.1]. If blocks are overlapping, the local bounds of blocks in the neighbourhood N(B) become invalid after a reflection and require updating. The generic process is given in Algorithm 2. Given the global bounding function λ̄ = Σ_{B∈B} λ̄_B, an event-time τ is simulated from the Poisson process with rate λ̄, a block B is selected with probability proportional to its share of the bound, λ̄_B(τ)/λ̄(τ), and finally a reflection is carried out with probability λ_B(τ)/λ̄_B(τ), the true rate relative to the local bound. Given the local rate functions, the dominant cost is the unsorted proportional sampling of a block, which is done in O(|B|). We propose to choose θ such that the expected number of events generated by the bounding process on the interval [0, θ] equals 1, as we can always decrease the computational cost of calculating bounds by changing θ to a value that brings this expectation closer to 1. In our numerical examples we have tuned θ to approximately satisfy this requirement. As mentioned in the introduction, Singh et al. [2017] show that the even-odd blocking strategy with overlaps improves mixing, and it furthermore allows for parallelization of updates in the case of Kalman smoothers or particle filter-based smoothing algorithms. Conversely, the current crop of piecewise-deterministic Markov process-based samplers are all purely sequential, in the sense that at each event-time only the velocity of a single factor or dimension is updated; these samplers therefore fail to exploit any available conditional independence structure. We will in this section provide an alternative implementation (see Algorithm 3) of the blocked bouncy particle sampler that mimics the even-odd strategy of discrete-time blocked samplers, extends to the fully spatially blocked setting, and allows for parallel implementation of updates at event-times.
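Returning briefly to the thinning mechanism of Algorithms 1 and 2, the following is a minimal sketch (ours, under stated assumptions, not the authors' code) of one lookahead window of the simulation: `rate` and `rate_bound` are assumed helpers returning the true block rate λ_B at a given state and a constant bound λ̄_B valid on [0, θ], respectively.

```python
import numpy as np

def simulate_segment(x, v, phi, blocks, rate, rate_bound, theta, rng):
    """One lookahead window of Poisson thinning: propose candidate events from
    the bounding process, pick a block proportional to its bound, and accept a
    reflection with probability true rate / bound."""
    bounds = np.array([rate_bound(x, v, B, theta) for B in blocks])
    total = bounds.sum()
    t = 0.0
    while True:
        t += rng.exponential(1.0 / total)            # candidate event of the bounding process
        if t > theta:                                # no event in this window: just flow for theta
            return x + theta * (phi * v), v, theta, None
        i = rng.choice(len(blocks), p=bounds / total)  # block chosen by its share of the bound
        x_t = x + t * (phi * v)                      # position at the candidate time
        if rng.uniform() < rate(x_t, v, blocks[i]) / bounds[i]:
            return x_t, v, t, blocks[i]              # accepted: the caller reflects v in this block
```

Rejected candidates are simply skipped, which is exactly the Lewis-Shedler construction; refreshment events and bound recomputation after a reflection are omitted here for brevity.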
To utilise this method, we need a partition of the blocking strategy into sub-blocking strategies such that no two blocks in any sub-blocking strategy share any variables. We capture this no-overlap condition precisely in the following assumption.

Assumption 1. Consider a blocking strategy B. We assume given a partition ∪_{k=1}^K B_k = B of the blocking strategy that satisfies, for each sub-blocking strategy B_k, k = 1, 2, ..., K, and for all blocks B, B' ∈ B_k with B ≠ B', that B ∩ B' = ∅.

This assumption also applies to fully spatiotemporal blocking schemes and not just temporal strategies. For illustrative purposes we only describe in detail the simplest even-odd scheme of temporal blocking, which corresponds to K = 2 sub-blocking strategies such that no temporally adjacent blocks are in the same sub-blocking strategy. As shown in Figure 1, each block is assigned a unique integer number k. We then partition the strategy into two sets of blocks based on whether k is even or odd, and denote the sub-blocking strategies {B_odd, B_even}. In Figure 1 we illustrate such a strategy, where the top row shows the even blocks and the lower row the odd blocks. Note that individual even blocks have no state variables in common (similarly for individual odd blocks). Furthermore, for a Markovian state space model, each block is chosen to be a consecutive time sequence of states. For such a sub-blocking strategy, we take the maximum rate among all blocks inside the sub-blocking strategy, Λ_κ(x, v) = max_{B∈B_κ} λ_B(x, v) for κ ∈ {odd, even}, and denote the associated Poisson processes Π_{B_odd} and Π_{B_even}. By construction, we have two exponential clocks, one for the set of blocks B_odd and one for B_even. To detail what happens at an event time, consider an event generated by the superposition of Π_{B_odd} and Π_{B_even}, and say Π_{B_odd} generated the event. Then for each block B ∈ B_odd, the following kernel Q_x^B(v, dv') is used to update the velocity of that block:

Q_x^B(v, dv') = (λ_B(x, v)/Λ_odd(x, v)) δ_{reflect_x^B(v)}(dv') + (1 − λ_B(x, v)/Λ_odd(x, v)) δ_v(dv').

Algorithm 1: Blocked Bouncy Particle Sampler. Data: initialize (x_0, v_0) arbitrarily, set instantaneous runtime t = 0, index j = 0, total execution time T > 0, lookahead time θ > 0 and valid bound time Θ = θ. // Variable t denotes the instantaneous runtime, θ is the lookahead time for computing Poisson rate bounds, and Θ > t are integer multiples of θ.

Algorithm 2: LocalBounds(x, v, θ, B). Data: (x, v), θ > 0 and a set of blocks B.

This simultaneous velocity update of all the blocks in the particular set of blocks is permissible since the blocks of each set have no states in common, i.e. they do not overlap. In Algorithm 3 we provide pseudo-code for a fully implementable version of the blocked bouncy particle sampler under an even-odd partition of the blocking strategy. We show invariance for the particular case considered above; the result holds in general for any partition satisfying Assumption 1.

Proposition 2. Let {B_odd, B_even} be a temporal strategy for π and B satisfying Assumption 1. Then the Markov process whose generator combines the block-augmented linear flow with, for each sub-blocking strategy B_κ, κ ∈ {odd, even}, a jump process of rate Λ_κ(x, v) that applies the kernels Q_x^B to every block B ∈ B_κ, has invariant distribution π ⊗ p_v.

Proof. See Section A.2.

In contrast to the basic blocked BPS, the generator of Proposition 2 has a single overall event-time generated from the sum of the odd and even strategies, but multiple coinciding event-times for the blocks contained in the sub-blocking strategy that generated the event. The even-odd algorithm therefore corresponds to an implementation that "lines up" the event-times in a way that is beneficial for a parallel implementation.
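The update at an event-time of one sub-blocking strategy is easy to state in code; the sketch below (our own illustration, reusing the block_rate/reflect_block helpers from the earlier sketch with ∇U bound in, and with Lambda_bar the current bound on the sub-strategy's maximum rate) makes explicit why the loop over blocks can be distributed across processors.

```python
def even_odd_event_update(x, v, sub_strategy, rate, reflect, Lambda_bar, rng):
    """At an event of one sub-blocking strategy, every block in that strategy
    reflects independently with probability rate(block) / Lambda_bar. Because
    the blocks do not overlap, each iteration touches disjoint entries of v
    and the loop is embarrassingly parallel."""
    for rows, cols in sub_strategy:
        lam = rate(x, v, rows, cols)
        if rng.uniform() < lam / Lambda_bar:
            v = reflect(x, v, rows, cols)   # modifies only this block's entries of v
    return v
```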
Relative to the blocked bouncy particle sampler, the even-odd implementation iterates over every block in the sub-blocking strategy that generated the event, updating the velocity of each block with probability given by the ratio of the block's rate λ_B, evaluated at the current state (x, v), to the rate of the sub-blocking strategy given by the max-bound. It therefore becomes possible to parallelize the updating step, for example with multiple processors allocated to each sub-blocking strategy, say one processor per block of the sub-blocking strategy. In contrast to the generator in Proposition 1, the event rate of the sampler in Proposition 2 is the maximum over the rates in a sub-blocking strategy, which should grow more slowly than the sum rate in Proposition 1 as the global dimension d, and thus the number of blocks, grows. If the spatial dimension is significant, it will be necessary to also carry out spatial blocking. Under a full spatiotemporal strategy, the above implementation naturally extends to a four-clock system, consisting of alternating even-odd temporal strategies over each 'row' of spatial blocks, such that no blocks from the same sub-strategy overlap; this in turn guarantees that Assumption 1 is satisfied. In practice, Λ_κ is not available, as we cannot evaluate the gradient in continuous time. Similarly to Algorithm 1, we employ a lookahead time θ and a trivial global bound for the Poisson rates that is valid on the interval (t, t + θ], where as before t is the instantaneous runtime. For any fixed θ > 0, assuming (x(t), v(t)) = (x, v), let the globally valid bound Λ̄_κ, with κ ∈ {odd, even}, be any value satisfying Λ_κ(x + s(φ ∘ v), v) ≤ Λ̄_κ for all s ∈ (0, θ]. As in Algorithm 1, we use this rate to define a bounding Poisson process and apply thinning to find the appropriate events, see line 4 in the algorithm. In practical implementations of piecewise-deterministic algorithms, tight bounds for the event-times are in general necessary to avoid wasteful computation from false events. Our max-type bound is tighter than the sum-type bound, and we can therefore use a larger lookahead time θ. (Again, θ should be chosen such that the expected number of events generated by the bound in an interval of size θ is 1.) With the max-type bound, the even-odd implementation will have longer inter-event times compared to the blocked BPS. The growth of the max-type bound as a function of the number of blocks is studied in the following result; in particular, under plausible assumptions on the tail-decay of the target distribution we can bound the expected rate.

Lemma 1. Suppose the block rates have sub-exponential tails under the stationary distribution, i.e. P_{π⊗p_v}(λ_B(x, v) > s) ≤ C exp(−αs) for every block B and some constants C < ∞ and α > 0. Then both the odd and even sub-blocking strategies, indicated by the subscript κ, satisfy E_{π⊗p_v}[Λ_κ] = O(log |B_κ|).

In the Gaussian case, the rate function is a mixture of a point-mass at zero and a density proportional to the modified Bessel function of the second kind with order depending on the dimension, and this function is known to have sub-exponential decay for any d, see for example Yang and Chu [2017]. We note that the key point of Lemma 1 is to verify that utilizing the maximum over blocks does not lead to pathological behaviour. To elaborate on the computational costs of the samplers, we compare the cost of running the samplers for one second of trajectory. Since inter-event times are exponential, Lemma 1 indicates that we can expect O(log |B_κ|) events per unit of time (line 3.4) via Λ_κ, each costing O(|B_κ|) evaluations of blocks (line 3.18) through the kernels Q_x^B. Thus the total cost of the even-odd sampler per second is O(|B_κ| log |B_κ|).
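Before comparing with the local BPS, the following sketch (our own interpretation of the four-clock spatiotemporal scheme described above, not the authors' code; it assumes the overlaps are at most half the block sizes so that parity classes do not overlap) shows one way to build a rectangular block grid and assign each block to one of four sub-strategies by the parity of its spatial and temporal grid indices.

```python
import numpy as np

def spatiotemporal_blocks(d, N, height, width, s_overlap, t_overlap):
    """Rectangular blocks of `height` spatial dims x `width` time points, with
    the given spatial/temporal overlaps between neighbouring blocks (d >= height,
    N >= width assumed)."""
    def starts(total, size, overlap):
        step = size - overlap
        s = list(range(0, total - size + 1, step))
        if s[-1] + size < total:
            s.append(total - size)        # shift the last block to the boundary
        return s
    return [((i, np.arange(r, r + height)), (j, np.arange(c, c + width)))
            for i, r in enumerate(starts(d, height, s_overlap))
            for j, c in enumerate(starts(N, width, t_overlap))]

def four_clock_partition(blocks):
    """Assign each block to one of four sub-strategies by the parity of its
    (spatial, temporal) grid indices: even-even, even-odd, odd-even, odd-odd."""
    subs = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for (i, rows), (j, cols) in blocks:
        subs[(i % 2, j % 2)].append((rows, cols))
    return subs
```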
In comparison, the local bouncy particle sampler has a rate function defined as

Λ(x, v) = Σ_{f∈F} {⟨∇U_f(x), v_f⟩}_+,

where F is the set of factors of U and ∇U_f(x) is the gradient of the factor U_f(x_f). In this case, the event rate grows as O(|F|), by the inequality E_{π⊗p_v}[Λ(x, v)] ≤ Σ_{f∈F} E_{π⊗p_v}[{⟨∇U_f(x), v_f⟩}_+] = O(|F|), combined with O(1) costs per event-time, for a total cost of O(|F|) per sampler second. However, we note again that each of the O(|B_κ|) evaluations of the blocks can be carried out fully in parallel, as no velocities are shared across ringing blocks. Furthermore, in the continuous-time Markov chain Monte Carlo literature the metric of effective sample size per unit of trajectory length has been considered, and it is at this stage not known theoretically how the blocked BPS and the local BPS differ under this alternative measure of efficiency.

Algorithm 3: Even-Odd Blocked Bouncy Particle Sampler. Data: initialize (x_0, v_0) arbitrarily, set instantaneous runtime t = 0, index j = 0, total execution time T > 0, lookahead time θ > 0 and bound time Θ = θ.

We will in the following two sections provide two numerical experiments comparing the blocked BPS, the local BPS, and particle Gibbs. As we are primarily interested in latent state estimation, we have not considered parameter inference in the examples below. A natural approach would be to run a Metropolis-within-Gibbs sampler that proposes an update to the parameter vector after, for example, running the continuous-time sampler for a second. The proposal of the parameter vector could be done in discrete time, or alternatively using the BPS for the parameter vector; this latter strategy was proposed for continuous-time Markov chains in Zhao and Bouchard-Côté [2019]. Alternatively, the parameters could be inferred jointly in continuous time together with the latent states; the parameter vector could be included in the blocking strategy, in particular if the parameter vector is also dynamic across time.

We consider an autoregressive model of order 1 given by x_n = A x_{n-1} + η_n, η_n ∼ N(0, I_d), with A an autoregressive matrix with entries

A_ij = kern(i, j) / (ψ + Σ_{l=1}^d kern(i, l)),   kern(i, j) = exp(−|i − j|² / (2σ²)),

with ψ > 0 a constant, and finally x_0 ∼ N(0, I_d). First, we want to compare the empirical mixing speed of the blocked and factor bouncy particle samplers. We consider a simulated model with d = 3, N = 1000, σ² = 5, and ψ = 0.1. We initialise each sampler at the zero vector, run until T = 1000, and thin at every 0.1 sampler seconds. In Table 1 we provide detailed specifications of the setups for the various algorithms and results from a representative run of the algorithms.

                        Local BPS   Blocked BPS   Even-odd
  Temporal width        20          20            20
  Spatial width         3           3             3
  Temporal overlap      -           10            10
  Spatial overlap       -           0             0
  Relative performance  0.48        0.67          1.00

Table 1: Specification of implementations and results for the autoregressive Gaussian model with T = 1000 and d = 3. Performance is measured in terms of ESS/s relative to the even-odd bBPS.

In Figure 2a, we plot the log of the mean square error as a function of time for increasing block overlap; empirically, the blocked sampler with block width 20 and overlap 10 reaches stationarity around 3 times faster than the factor version. In Figure 2b, we compare the mean squared jumping distance of the first spatial dimension after discarding the first 25% of samples. For the overlapping sections, the exploration is, due to the shared overlap and φ, happening at twice the speed, and accordingly with four times the mean-square jumping distance compared to the factor algorithm.
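For reference, the latent autoregressive process used in this experiment can be simulated directly from the definitions above; the sketch below (our own code, with the observation model omitted since only the latent dynamics are specified here) uses the parameters of the first experiment.

```python
import numpy as np

def make_A(d, sigma2, psi):
    """Autoregressive matrix with Gaussian-kernel weights:
    A[i, j] = kern(i, j) / (psi + sum_l kern(i, l))."""
    idx = np.arange(d)
    kern = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / sigma2)
    return kern / (psi + kern.sum(axis=1, keepdims=True))

def simulate_ar1(d, N, sigma2, psi, rng):
    """Simulate x_n = A x_{n-1} + eta_n with eta_n ~ N(0, I_d) and x_0 ~ N(0, I_d)."""
    A = make_A(d, sigma2, psi)
    x = np.zeros((d, N))
    x_prev = rng.standard_normal(d)           # x_0
    for n in range(N):
        x_prev = A @ x_prev + rng.standard_normal(d)
        x[:, n] = x_prev
    return x

# Parameters of the first experiment: d = 3, N = 1000, sigma^2 = 5, psi = 0.1.
x = simulate_ar1(d=3, N=1000, sigma2=5.0, psi=0.1, rng=np.random.default_rng(0))
```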
Figure 2b: Mean square jump distance for the standard bouncy particle sampler and its blocked counterpart with overlaps 9 and 10, showcasing the impact φ has on exploration. In particular, the dips in the overlap-9 case correspond to the variables that are part of a single block only, and are subsequently not sped up. We show a subset of 200 time points to enhance detail.

In terms of effective sample size per second, the blocked and even-odd samplers are about 30-40% and 100% more efficient, respectively, than the factor sampler, without using any parallel implementation. It is observed in general, for any choice of d and T, that the benefits of speeding up the dimensions compensate for the increased computational cost due to the overlaps. We also note that for models like this, where the spatial dimension is low, there is not a strong argument for using PDMP-based methods, as particle Gibbs with a basic particle filter will be more than adequate. Second, we consider the case where d = 200 and T = 100 to illustrate the benefits of spatial blocking in high-dimensional scenarios. In this case we also include a spatiotemporal blocking strategy, and the details of the example and a representative simulation are provided in Table 2. The model and example parameters are otherwise as described above. The spatiotemporally blocked sampler significantly outperforms the other implementations, with effective sample size per second typically 2-4 times larger, evidenced over multiple runs with random trajectories generated under the model. The even-odd temporally blocked strategy is often still efficient even when the number of dimensions per block is up to 400, but the relative ESS/s is on aggregate lower than for the spatiotemporally blocked version. Furthermore, this discrepancy will only increase under models with even higher spatial dimension. As before, no concurrent implementation was used, indicating that additional improvements in performance are possible for the partitioned blocking schemes when parallelized over multiple processors.

We will in this section consider an example based on a stochastic volatility model of the Dow Jones Industrial Average (DJIA) equity index to explore the efficiency of the even-odd implementation of the BPS in comparison with two benchmark implementations of particle Gibbs when the spatial dimension is moderate and the length of the time series is long. Stochastic volatility models are widely used in finance and econometrics. They model the volatility of financial assets as a dynamic latent process to capture the time-varying and persistent nature of changes in asset returns. We will analyse a general model proposed by Ishihara and Omori [2012] that incorporates heavy-tailed observations and leverage effects, see Cont [2001] for an empirical discussion of these effects. To test the blocked algorithms on a reasonably challenging dataset, we attempt to estimate the latent volatility of the 29 continuously available constituents of the DJIA between April 1st 2017 and April 6th 2020, for a total of 29 × 757 = 21953 latent states. This period is characterized both by relatively low volatility and by historically high levels of uncertainty due to the COVID-19 pandemic, see WHO [2020] for example. Let x_n ∈ R^d be an unobserved vector of volatilities, y_n ∈ R^d the observed asset log returns, and Σ the 2d × 2d covariance matrix of the joint noise (η_n, ε_n) driving the state and observation equations. The off-diagonal block matrices of Σ introduce leverage effects in the model if they are negative definite.
Finally, for some ν ∈ N, γ_n ∼ Γ(ν/2, ν/2) is a memory-less stochastic process independent of (η_n, ε_n). The resulting observation noise is multivariate t-distributed with ν degrees of freedom; details are in Ishihara and Omori [2012]. For the initialization we assume that x_1 ∼ N(0, (I_d − AA)^{-1} Σ_η). Define y_n^γ = √γ_n y_n as the observations whenever γ_n is known; it follows that y_n^γ = Λ_n ε_n, and inference can be carried out with this observation sequence instead. Conditional on γ_{1:N} and using basic properties of multivariate Gaussians, the transition distributions can be written in a form in which the past observation feeds into the next realized state, implying a more complicated dependence structure. Furthermore, the state transition is non-linear in the previous state variable due to the leverage effect. For the blocking strategy, we use a spatiotemporal strategy with blocks 9 time points wide and 7 spatial dimensions high, where each block has temporal overlap 4 and spatial overlap 3, giving a total of 151 × 6 = 906 blocks. Due to the better performance of the partitioned blocked bouncy particle sampler in the previous example, we only compare this method with blocked particle Gibbs, see Singh et al. [2017], and with the particle Gibbs with ancestor sampling algorithm of Lindsten et al. [2014], both using a bootstrap particle filter as the proposal mechanism. For the blocked particle Gibbs sampler, we let the blocks be 25 observations wide with overlap 5. For a fair comparison, we set the number of particles to 500, which for both samplers leads to an average time per sample quite close to that of the spatiotemporally blocked bouncy particle sampler. We generated 2000 samples via each algorithm, initialized each at the d × N zero vector, and for the velocity used the d × N vector of ones. Typically, estimation of latent states will be carried out inside a Gibbs sampling algorithm that also estimates parameters, meaning that prior knowledge of the states is retained, whereas this example tests the significantly more difficult case of no prior information on the latent states. In Figure 4a, we illustrate the posterior energy. The blocked particle Gibbs sampler moves in a wide band of posterior energy, but never reaches levels of higher posterior probability. This is in contrast to the results reported in Singh et al. [2017], where the dimension of the hidden state is much lower and thus the state transition density has better forgetting properties than in our higher-dimensional example. Even if this issue could be remedied, see Bunch et al. [2015], implementing particle Gibbs with both temporal and spatial blocking appears non-trivial, in contrast to the ease with which it can be achieved with the BPS. The ancestor sampling-based particle Gibbs sampler similarly does not generate proposals that have high posterior probability. Conversely, the bBPS reaches stationarity in fewer than 100 samples and subsequently mixes across the posterior: the auto-correlation function, plotted in Figure 4b, reaches zero around a lag of 20 samples, indicating adequate mixing for a posterior of this dimension. In Figure 5, we plot the correlation matrix of the assets, and also the estimated latent volatility via the posterior mean. It is quite clear that the volatilities show weaker correlation across the assets, but they appear to preserve some of the structure seen in the correlation matrix of the log returns.
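For completeness, the sketch below is our own hedged illustration of a model of this type, not the exact specification of Ishihara and Omori [2012]: it assumes the common form y_n = γ_n^{-1/2} diag(exp(x_n/2)) ε_n with x_{n+1} = A x_n + η_n, where (ε_n, η_n) are jointly Gaussian with the 2d × 2d covariance Σ (leverage entering through its off-diagonal blocks) and γ_n ∼ Γ(ν/2, ν/2) produces the t-distributed returns. It is intended only to make the roles of Σ and γ_n concrete.

```python
import numpy as np

def simulate_sv(A, Sigma, nu, N, rng):
    """Hedged sketch of a multivariate SV model with cross leverage and
    heavy-tailed returns (assumed form, see lead-in)."""
    d = A.shape[0]
    L = np.linalg.cholesky(Sigma)                    # Sigma is 2d x 2d
    x = np.zeros((d, N))
    y = np.zeros((d, N))
    x_cur = np.zeros(d)                              # simple initialization for the sketch
    for n in range(N):
        eps_eta = L @ rng.standard_normal(2 * d)     # joint (eps_n, eta_n) draw
        eps, eta = eps_eta[:d], eps_eta[d:]
        gamma = rng.gamma(shape=nu / 2.0, scale=2.0 / nu)   # Gamma(nu/2, rate nu/2)
        x[:, n] = x_cur
        y[:, n] = np.exp(x_cur / 2.0) * eps / np.sqrt(gamma)  # t-distributed via the gamma mixture
        x_cur = A @ x_cur + eta                      # leverage: eta correlated with eps
    return x, y
```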
Acknowledgement. JVG acknowledges financial support from an EPSRC Doctoral Training Award.

A.1 Proof of Proposition 1

Proof. Denoting the domain of L_bBPS by D(L_bBPS), invariance follows if, for all f ∈ D ⊂ D(L_bBPS), with D a core for (L_bBPS, D(L_bBPS)),

∫ L_bBPS f(x, v) π(dx) p_v(dv) = 0,

see [Ethier and Kurtz, 2009, Chapter 9]. Identification of the core for the generators associated with piecewise-deterministic Markov processes is quite technical; we refer to [Durmus et al., 2018, Section 7] and [Holderrieth, 2019, Section 5] for details. From now on we assume that the test function f is sufficiently regular and integrable for the manipulations below. The proof in essence follows that of Proposition 1 in Bouchard-Côté et al. [2018]. For the first part of the generator, associated with the linear flow, consider first the integral with respect to π(x), where Z is the normalising constant Z = ∫ e^{−U(x)} dx. Applying integration by parts, and using the integrability of the functions in the domain of L_bBPS, we obtain

∫ ⟨φ ∘ v, ∇_x f(x, v)⟩_F π(dx) = ∫ f(x, v) ⟨φ ∘ v, ∇U(x)⟩_F π(dx).

For the second part, note that

Σ_{B∈B} ⟨∇U_B(x), v_B⟩_F = ⟨φ ∘ v, ∇U(x)⟩_F,

where the identity follows by the definition of φ. Consider then the integral of the jump dynamics generator,

Σ_{B∈B} ∫ λ_B(x, v) [f(x, reflect_x^B(v)) − f(x, v)] π(dx) p_v(dv).

Using that (reflect_x^B)^{-1} = reflect_x^B by involution, and the norm-preserving property of the restricted reflection operator, we get

∫ λ_B(x, v) f(x, reflect_x^B(v)) π(dx) p_v(dv) = ∫ λ_B(x, reflect_x^B(v)) f(x, v) π(dx) p_v(dv),

so, using the identity above together with λ_B(x, reflect_x^B(v)) − λ_B(x, v) = −⟨∇U_B(x), v_B⟩_F, the jump term equals −∫ f(x, v) ⟨φ ∘ v, ∇U(x)⟩_F π(dx) p_v(dv), which cancels the flow term and implies the result.

A.2 Proof of Proposition 2

Proof. We will show that the eoBPS is a special case of the blocked bouncy particle sampler; we again suppress the dependence on refreshments. Writing out the integral with respect to Q_x^B and rearranging, the invariance condition reduces to that of the blocked bouncy particle sampler, which by Proposition 1 gives the result.

A.3 Proof of Lemma 1

Proof. We consider only the odd strategy; the even case translates seamlessly. By Hölder's inequality, we have for any p ∈ N

E_{π⊗p_v}[Λ_odd] = E_{π⊗p_v}[max_{B∈B_odd} λ_B] ≤ (Σ_{B∈B_odd} E_{π⊗p_v}[λ_B^p])^{1/p}.

For any B ∈ B_odd, by adapting the proof of Lemma 1.10 in Rigollet and Hütter [2015] and using the sure positivity of the rate function, the p'th moments satisfy E_{π⊗p_v}[λ_B^p]^{1/p} = O(p). Optimizing over p leads to p* = log |B_odd|, which together with the bound above gives E_{π⊗p_v}[Λ_odd] = O(log |B_odd|), implying the result.

The daily asset prices (p_n)_{n=1,2,...,N} are collected from eSignal Data Access and transformed to log returns via the relation y_n = log(p_n / p_{n-1}). As our paper is centered on latent state estimation, we have foregone a full Bayesian parameter estimation. Instead, for all unobserved quantities we have used the parameters proposed in Ishihara and Omori [2012], Section 3; these are based on previous empirical studies by the authors and correspond quite closely to what is inferred during their parameter estimation procedure on S&P 500 data. In particular, we set the persistence parameter to α = 0.99 and use ν = 15 degrees of freedom for the multivariate t-distribution. For the unobserved volatility covariance matrix Σ_η, the cross-asset correlation is set at 0.7, and the same standard deviation, 0.2, is assumed for each asset. For the leverage matrix, the intra-asset parameter is set at Σ_{ρ,ii} = −0.4 and the cross-asset leverage at Σ_{ρ,ij} = −0.3. We estimate the return covariance matrix Σ directly from the observed log returns over the entire period. The values we arrive at from this procedure are again close to what is empirically observed, indicating that these are reasonable parameter values to use for a latent state estimate. If we were to run a spatially blocked scheme, we could for example apply a hierarchical clustering algorithm like that of Ward [1963] to learn the relationship between the assets, and then rearrange the order of the assets to match the order of the clustering procedure.
As discussed, this should have a beneficial effect on mixing, as the blocks become more localised.

References

- Particle Markov chain Monte Carlo methods
- Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers
- Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems
- Event-chain Monte Carlo algorithms for hard-sphere systems
- A stable particle filter for a class of high-dimensional state-space models
- The zig-zag process and super-efficient sampling for Bayesian analysis of big data
- A piecewise deterministic Monte Carlo method for diffusion bridges
- The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method
- Smoothing algorithms for state-space models
- Particle Gibbs with refreshed backward simulation
- Inference in Hidden Markov Models (Springer Series in Statistics)
- On particle Gibbs sampling
- Empirical properties of asset returns: stylized facts and statistical issues
- Long-time stability and accuracy of the ensemble Kalman-Bucy filter for fully observed processes and small measurement noise
- On the stability and the uniform propagation of chaos properties of ensemble Kalman-Bucy filters
- Piecewise deterministic Markov processes and their invariant measure
- Accelerated sampling on discrete spaces with nonreversible Markov processes
- Can local particle filters beat the curse of dimensionality?
- High dimensional statistics
- Blocking strategies and stability of particle Gibbs samplers
- Automated parameter blocking for efficient Markov chain Monte Carlo sampling
- Particle filters for high-dimensional geoscience applications: A review
- Piecewise deterministic Markov chain Monte Carlo
- Hierarchical grouping to optimize an objective function
- Discussion on particle Markov chain Monte Carlo methods
- WHO. Coronavirus disease 2019 (COVID-19): situation report
- Coordinate sampler: a non-reversible Gibbs-like MCMC sampler
- On approximating the modified Bessel function of the second kind
- Analysis of high-dimensional continuous time Markov chains using the local bouncy particle sampler