key: cord-0637760-ux0lgm0m
authors: Rosato, Conor; Harris, John; Panovska-Griffiths, Jasmina; Maskell, Simon
title: Inference of Stochastic Disease Transmission Models Using Particle-MCMC and a Gradient Based Proposal
date: 2022-05-15
journal: nan
DOI: nan
sha: fb02bfc93d2f682a1c3e2a242c6b3df0903c3647
doc_id: 637760
cord_uid: ux0lgm0m

© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

State-space models have been widely used to model the dynamics of communicable diseases in populations of interest by fitting to time-series data. Particle filters have enabled these models to incorporate stochasticity and so better reflect the true nature of population behaviours. Relevant parameters, such as the spread of the disease, $R_t$, and recovery rates, can be inferred using Particle MCMC. The standard method uses a Metropolis-Hastings random-walk proposal, which can struggle to reach the stationary distribution in a reasonable time when there are multiple parameters. In this paper we obtain full Bayesian parameter estimates using gradient information and the No-U-Turn Sampler (NUTS) when proposing new parameters of stochastic non-linear Susceptible-Exposed-Infected-Recovered (SEIR) and SIR models. Although NUTS makes more than one target evaluation per iteration, we show that it can provide more accurate estimates in a shorter run time than Metropolis-Hastings.

Calibration of epidemiological models using disease-specific data is important when striving to understand how a disease evolves with time. Inferring parameters such as the effective contact and recovery rates, especially at the beginning of an outbreak, is vital for public health officials when assessing the deadliness of a disease.

Two common approaches to modelling diseases are: agent-based models (ABMs) [1], which simulate interactions between individuals within the population according to a set of rules; and the compartmental approach, which splits the population into unobservable compartments, for example Susceptible, Infected and Recovered (SIR) [2], and allows a fraction of each compartment to progress to the next compartment at every timestep t.

State-space models (SSMs) are a favoured method for representing the dependence of latent states in non-linear dynamical systems in a wide range of research fields [3]. An SSM consists of a state equation, parameterised by θ, which models how the dynamical system moves from the previous state to the current state, x_t, at time t:

x_t ∼ p(x_t | x_{t−1}),    (1)

and an observation equation which describes how the observation, y_t, is linked to the state:

y_t ∼ p(y_t | x_t).    (2)

Particle filters [4], [5] are a popular method for inferring the time-dependent hidden states of SSMs and have previously been applied to compartmental disease transmission models [6]-[9].

The two overarching methods used for estimating parameters in the context of epidemiological models, as outlined in a recent systematic review [10], are optimisation and sampling algorithms. Optimisation algorithms include grid search and iterative, descent-guided optimisation, while sampling methods include Markov Chain Monte Carlo (MCMC) techniques, particle-MCMC (p-MCMC), Approximate Bayesian Computation (ABC) and history matching.
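To make the SSM notation in (1) and (2) concrete, the sketch below simulates a generic state-space model from user-supplied transition and observation samplers. The function and argument names (simulate_ssm, transition, observe) are illustrative choices and not part of the paper.

```python
import numpy as np

def simulate_ssm(transition, observe, x0, T, rng=None):
    """Draw a latent trajectory x_1:T from (1) and observations y_1:T from (2).

    transition(x_prev, rng) -> x_t   samples the state equation (1)
    observe(x_t, rng)       -> y_t   samples the observation equation (2)
    """
    rng = np.random.default_rng(rng)
    x, xs, ys = x0, [], []
    for _ in range(T):
        x = transition(x, rng)          # state equation (1)
        xs.append(x)
        ys.append(observe(x, rng))      # observation equation (2)
    return np.array(xs), np.array(ys)

# Example: a scalar random walk observed in Gaussian noise.
xs, ys = simulate_ssm(
    transition=lambda x, rng: x + 0.1 * rng.standard_normal(),
    observe=lambda x, rng: x + 0.5 * rng.standard_normal(),
    x0=0.0, T=100, rng=1)
```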
One difference between these two groups of techniques is that optimisation algorithms find point estimates of the parameters, while sampling methods can produce full Bayesian parameter estimates which in turn capture parameter uncertainty.

MCMC algorithms work by drawing a sample, based on the previous sample, from a target distribution, π(θ). An accept/reject step determines whether to keep or discard the drawn sample. Metropolis-Hastings (MH) randomly samples from π(θ) without reference to its structure. It can therefore take the Markov chain a long time to reach its stationary distribution if θ has multiple dimensions. Hamiltonian Monte Carlo (HMC) [11] and the No-U-Turn Sampler (NUTS) [12] use gradient information about the log-posterior of θ and so can explore π(θ) efficiently. Probabilistic programming languages (PPLs), such as Stan [13], have made MCMC an accessible tool for calibrating epidemiological models [14], [15] because these PPLs use NUTS.

Particle-MCMC (p-MCMC) [16] combines the particle filter and MCMC to make state and parameter estimates, performing Bayesian inference in non-linear, non-Gaussian scenarios where standard MCMC methods can fail. A detailed explanation is given in Section II. Particle-MCMC has been used to infer parameters of simulated genealogies and time-series data [17], a dengue outbreak [18], non-communicable diseases in Egypt [19], cholera in Bangladesh [20], COVID-19 in the United Kingdom (UK) [21], the 2009 A/H1N1 pandemic in England [22] and an ABM in [23]. It has been suggested in [24], [25] that a potential reason why p-MCMC has yet to become a mainstream method for modelling epidemics is the complexity of the mathematics behind the algorithm; [25] makes the methodology more accessible by providing a step-by-step tutorial. The standard proposal used when calibrating epidemiological models with p-MCMC is the MH random walk [17], [19]-[23], [25], which inherits the same drawbacks as explained above for MCMC. In this piece of work we focus on estimating the parameters of stochastic compartmental disease models using a recently developed variant of p-MCMC that uses NUTS as the proposal [26].

The paper is organised as follows: Section II explains how to combine p-MCMC and NUTS in more detail, before Section III describes the models we consider and the results obtained. Section IV then concludes the paper.

In this section we outline how to construct the p-MCMC algorithm and show how an unbiased estimate of the log-likelihood, given by the particle filter, can be used within the MH accept/reject step of the MCMC algorithm. We also show how to estimate the gradient of the log-likelihood, which is used when proposing a new vector of θ with NUTS.

Bayesian calibration of the SSM in (1) and (2) is undertaken with the goal of estimating the parameter posterior distribution p(θ|y), which involves finding the set of θ that best represents the data, y. This can be done by using Bayes' theorem:

p(θ | y_{1:T}) = p(y_{1:T} | θ) p(θ) / p(y_{1:T}),    (3)

where p(θ) is the prior. In the well-understood MH proposal, a set of parameters θ' is drawn from a proposal distribution q(θ'|θ), which is commonly chosen to be a Gaussian centred on the current position, θ, with a variance parameter, also known as the step-size, chosen by the user. An accept/reject step determines whether to accept θ' as part of the Markov chain or to reject it and revert to θ. If θ' is accepted, the inputs for the next iteration are θ = θ' and p(y_{1:T}|θ) = p(y_{1:T}|θ'). If θ' is rejected, the inputs remain θ and p(y_{1:T}|θ). This process is repeated for M MCMC iterations, and at each iteration two values are stored: the current value of θ and the particle filter estimate of p(y_{1:T}|θ). The acceptance probability is

α(θ, θ') = min{ 1, [π(θ') q(θ | θ')] / [π(θ) q(θ' | θ)] },    (4)

where π(θ) = p(θ|y). Evaluating π(θ) requires a tractable likelihood, which in many cases is not available. The particle filter (explained in Section II-C) provides a method to approximate the marginal likelihood, p(y_{1:T}|θ), which allows us to formulate the acceptance step as

α(θ, θ') = min{ 1, [p̂(y_{1:T} | θ') p(θ') q(θ | θ')] / [p̂(y_{1:T} | θ) p(θ) q(θ' | θ)] }.    (5)

This is the particle marginal Metropolis-Hastings (PMMH) algorithm described in [16] and presented in Algorithm 1.
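As a concrete illustration of the PMMH loop in Algorithm 1, the sketch below runs a Gaussian random-walk MH chain over θ, accepting with the probability in (5). The helper estimate_loglik is a hypothetical stand-in for the particle filter of Algorithm 2 (replaced here by a toy Gaussian likelihood so the snippet runs), and log_prior is likewise illustrative rather than the priors used in the paper.

```python
import numpy as np

def log_prior(theta):
    # Illustrative flat prior on theta > 0 (not the priors from Table I).
    return 0.0 if np.all(theta > 0) else -np.inf

def estimate_loglik(theta, y):
    # Stand-in for the particle filter's unbiased log-likelihood estimate.
    return float(np.sum(-0.5 * (y - theta[0]) ** 2))

def pmmh(y, theta0, n_iters=1000, step_size=0.005, seed=0):
    """Particle marginal Metropolis-Hastings with a random-walk proposal."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    loglik = estimate_loglik(theta, y)
    chain = np.empty((n_iters, theta.size))
    for m in range(n_iters):
        theta_prop = theta + step_size * rng.standard_normal(theta.size)
        loglik_prop = estimate_loglik(theta_prop, y)
        # Acceptance probability (5); the symmetric Gaussian proposal cancels.
        log_alpha = (loglik_prop + log_prior(theta_prop)) - (loglik + log_prior(theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, loglik = theta_prop, loglik_prop   # accept theta'
        chain[m] = theta   # on rejection the previous theta is stored again
    return chain

# Example usage with synthetic data.
chain = pmmh(y=np.array([0.9, 1.1, 1.0]), theta0=[0.5])
```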
NUTS [12] is a gradient-based method and an extension of HMC [27], [28]. Both methods use the leapfrog numerical integrator to generate new samples by discretising Hamilton's equations,

dθ/dt = ∂H/∂m,    dm/dt = −∂H/∂θ,

where m is the momentum. The Hamiltonian H is defined to be

H(θ, m) = K(m) + U(θ),

where K(m) is the kinetic energy and U(θ) is the potential energy. Hamilton's equations are a pair of differential equations that describe a system in terms of its position and momentum, where the potential of the system is defined by U(θ) = −log(π(θ)). After sampling a random initial momentum, m, from a Gaussian, a new state θ' is proposed using the gradient of π(θ). The leapfrog method is symplectic, meaning it preserves the geometry of the system, which leads to high acceptance rates, and reversible, meaning detailed balance is maintained. The samples generated by HMC are governed by two user-defined parameters: the number of leapfrog steps, L, and the step-size, ∆. NUTS adaptively calibrates L by building trajectories that take steps both forwards and backwards, doubling the number of steps at each iteration. This is done until the trajectory starts to double back on itself, at which point a state is sampled from the complete balanced tree. The result is a reversible sampling process that maintains detailed balance and is described in Algorithm 3.
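The sketch below shows the leapfrog discretisation of Hamilton's equations and a single HMC transition with a fixed number of steps L; NUTS replaces the fixed L with the adaptive doubling scheme described above. The function names and the unit-Gaussian kinetic energy are assumptions made for illustration rather than details taken from the paper.

```python
import numpy as np

def leapfrog(theta, m, grad_log_post, delta, L):
    """Integrate Hamilton's equations with U(theta) = -log pi(theta), K(m) = 0.5 * m.m."""
    theta, m = np.copy(theta), np.copy(m)
    for _ in range(L):
        m = m + 0.5 * delta * grad_log_post(theta)    # half step in momentum
        theta = theta + delta * m                     # full step in position
        m = m + 0.5 * delta * grad_log_post(theta)    # second half step in momentum
    return theta, m

def hmc_step(theta, log_post, grad_log_post, delta, L, rng):
    """One HMC transition: propose via leapfrog, then accept/reject on the Hamiltonian."""
    m0 = rng.standard_normal(theta.size)              # draw momentum from a Gaussian
    theta_prop, m_prop = leapfrog(theta, m0, grad_log_post, delta, L)
    h_start = -log_post(theta) + 0.5 * m0 @ m0        # H at the start of the trajectory
    h_end = -log_post(theta_prop) + 0.5 * m_prop @ m_prop
    accept = np.log(rng.uniform()) < (h_start - h_end)
    return (theta_prop if accept else theta), accept

# Example usage on a standard Gaussian target.
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(100):
    theta, _ = hmc_step(theta, lambda t: -0.5 * t @ t, lambda t: -t, 0.1, 20, rng)
```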
The benefit of using a particle filter is its ability to handle SSMs that are non-linear and non-Gaussian. It does this by representing the probability density function over the hidden states, x_t, with a set of N random samples (particles), each being an independent hypothesis of x_t. The particle filter propagates particles forward in time using the model function (often referred to as the dynamics) in (1), which encodes prior information on how the state evolves with time. Because θ is static, some noise needs to be introduced to improve the diversity among the drawn samples. An observation model (often referred to as the likelihood), seen in (2), is used to associate a measurement with the state, x_t, at every timestep. The dynamics and likelihood can be parameterised by θ, giving p(x_t | x_{t−1}, θ) and p(y_t | x_t, θ).

The ith particle has a state sequence, x_{1:t}^{(θ,i)}, which is modelled using a Markov chain such that the current state at time t depends only on the previous state at time t−1. The ith sample has an associated importance weight, w_t^{(θ,i)}, which is determined by how probable the sample is given the dynamics and the new measurement. Broadly speaking, large weights correspond to more accurate representations of the dynamical system and lower weights the opposite. A set of N particles can then be represented as {x_{1:t}^{(θ,i)}, w_t^{(θ,i)}}, i = 1, ..., N. At t = 0 the weights are set to 1/N, but for t > 0 they are recursively calculated at every timestep using

w_t^{(θ,i)} = w_{t−1}^{(θ,i)} p(y_t | x_t^{(θ,i)}) p(x_t^{(θ,i)} | x_{t−1}^{(θ,i)}, θ) / q(x_t^{(θ,i)} | x_{t−1}^{(θ,i)}, y_t),    (9)

where q(x_t^{(θ,i)} | x_{t−1}^{(θ,i)}, y_t) is the proposal. Different options exist for formulating a proposal, and these are described in [26]. In this paper, however, we choose the proposal to be equal to the dynamic model, q(x_t^{(θ,i)} | x_{t−1}^{(θ,i)}, y_t) = p(x_t^{(θ,i)} | x_{t−1}^{(θ,i)}, θ), so that (9) simplifies to

w_t^{(θ,i)} = w_{t−1}^{(θ,i)} p(y_t | x_t^{(θ,i)}).    (10)

The joint distribution, p(y_{1:t}, x_{1:t} | θ), can be estimated using the unnormalised weights from (10) such that

p(y_{1:t}, x_{1:t} | θ) ≈ Σ_{i=1}^{N} w_t^{(θ,i)} δ(x_{1:t} − x_{1:t}^{(θ,i)}),    (11)

which is unbiased. The posterior distribution, p(x_{1:t} | y_{1:t}, θ), can then be estimated by

p(x_{1:t} | y_{1:t}, θ) ≈ Σ_{i=1}^{N} w̃_t^{(θ,i)} δ(x_{1:t} − x_{1:t}^{(θ,i)}),    (12)

which is similar in form to (11). The marginal likelihood is estimated by the sum of the unnormalised weights,

p(y_{1:t} | θ) ≈ Σ_{i=1}^{N} w_t^{(θ,i)},    (13)

and the normalised weights (which differ from those in (10)) can be calculated using

w̃_t^{(θ,i)} = w_t^{(θ,i)} / Σ_{j=1}^{N} w_t^{(θ,j)},    (14)

which are used when estimating the integral

∫ f(x_{1:t}) p(x_{1:t} | y_{1:t}, θ) dx_{1:t} ≈ Σ_{i=1}^{N} w̃_t^{(θ,i)} f(x_{1:t}^{(θ,i)}).    (15)

We note that while (11) is an unbiased estimate, (15) is biased (due to it being a ratio of estimates).

1) Resampling: The method described so far in Section II-C is the sequential importance sampling (SIS) algorithm. One limitation of SIS is that the normalised weights in (14) become increasingly skewed at each increment of t, with one weight approaching 1 and the others 0. A widely used way of detecting this is to calculate the effective number of samples at each timestep,

N_eff = 1 / Σ_{i=1}^{N} (w̃_t^{(θ,i)})^2,    (16)

and compare N_eff with a user-defined threshold. If N_eff is less than the threshold, resampling is performed. Multinomial resampling is a popular method which involves drawing from the current particle set N times, with probabilities proportional to the corresponding normalised weights. This allows particles that have a better guess of x_t, and therefore higher weights, to be replicated, while particles with lower weights die off. After resampling, the normalised weights are set to 1/N and the unnormalised weights to (1/N) Σ_{j=1}^{N} w_t^{(θ,j)}, which preserves the likelihood estimate in (13).
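Below is a minimal bootstrap particle filter in the spirit of Algorithm 2, using the weight update (10), the effective-sample-size test (16) and multinomial resampling, and returning the log of the likelihood estimate (13). The transition and likelihood functions are illustrative stand-ins supplied by the caller, and the gradient calculations of Section II-D are deliberately omitted here.

```python
import numpy as np

def bootstrap_particle_filter(y, sample_dynamics, log_likelihood, x0,
                              n_particles=256, resample_threshold=0.5, seed=0):
    """Estimate log p(y_{1:T} | theta) with the bootstrap proposal (dynamics = proposal).

    sample_dynamics(x, rng) -> array of new particle states drawn from the dynamics (1)
    log_likelihood(y_t, x)  -> array of log p(y_t | x) values, one per particle
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_particles, float(x0))                     # particle states
    log_w = np.full(n_particles, -np.log(n_particles))      # unnormalised weights start at 1/N
    for y_t in y:
        x = sample_dynamics(x, rng)                         # propagate with the dynamics
        log_w = log_w + log_likelihood(y_t, x)              # weight update (10), in log space
        log_sum_w = np.logaddexp.reduce(log_w)              # log of the sum of unnormalised weights
        w_norm = np.exp(log_w - log_sum_w)                  # normalised weights (14)
        if 1.0 / np.sum(w_norm ** 2) < resample_threshold * n_particles:   # N_eff test (16)
            idx = rng.choice(n_particles, size=n_particles, p=w_norm)      # multinomial resampling
            x = x[idx]
            log_w = np.full(n_particles, log_sum_w - np.log(n_particles))  # reset, preserving the sum
    return np.logaddexp.reduce(log_w)                       # log-likelihood estimate (13) at t = T

# Example usage: random-walk state observed in Gaussian noise.
loglik = bootstrap_particle_filter(
    y=np.array([0.10, 0.20, 0.15]),
    sample_dynamics=lambda x, rng: x + 0.1 * rng.standard_normal(x.shape),
    log_likelihood=lambda y_t, x: -0.5 * ((y_t - x) / 0.5) ** 2 - np.log(0.5 * np.sqrt(2 * np.pi)),
    x0=0.0)
```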
In this section we outline how to calculate the likelihood and the gradient of the likelihood w.r.t. θ.

1) Likelihood: The likelihood can be calculated from (13), which is the sum of the unnormalised weights at t = T. It is a byproduct of running the particle filter, so no additional calculations need to be undertaken.

2) Gradients: As previously described, MH randomly samples θ' from a proposal distribution q(θ'|θ), which is determined by a user-defined step-size. If θ has multiple dimensions, or the correlation between its components is high, the sampler can take a long time to explore π(θ), for example by becoming stuck at local maxima. It can also be time consuming to select a suitable step-size for θ: choosing one that results in large jumps can mean θ' rarely gets accepted, while too small a step-size may result in π(θ) being poorly explored. This problem is accentuated if θ is high dimensional. The work in [26] shows how to differentiate the particle filter so that gradients of the log-posterior of θ can be used to efficiently propose θ'. The gradient of the log-posterior of θ is

∇ log p(θ | y_{1:t}) = ∇ log p(θ) + ∇ log p(y_{1:t} | θ),    (17)

where ∇ log p(θ) is the gradient of the log-prior and ∇ log p(y_{1:t}|θ) is the gradient of the log-likelihood (working with logs is often found to be more numerically stable). Note that we assume ∇ log p(θ) can be calculated explicitly. Differentiating the weights gives an approximation to the gradient of the likelihood:

d/dθ p(y_{1:t} | θ) ≈ Σ_{i=1}^{N} d/dθ w_t^{(θ,i)}.    (18)

Applying the chain rule to (13) and (18) gives

∇ log p(y_{1:t} | θ) ≈ [Σ_{i=1}^{N} d/dθ w_t^{(θ,i)}] / [Σ_{i=1}^{N} w_t^{(θ,i)}].

In the general case we need to calculate the derivatives of the dynamics, d/dθ p(x_t^{(θ,i)} | x_{t−1}^{(θ,i)}, θ), and proposal, d/dθ q(x_t^{(θ,i)} | x_{t−1}^{(θ,i)}, y_t), which are described in detail in [26]. However, similarly to (10), because we set the proposal equal to the dynamic model, the derivative of the weights simplifies to

d/dθ w_t^{(θ,i)} = [d/dθ w_{t−1}^{(θ,i)}] p(y_t | x_t^{(θ,i)}) + w_{t−1}^{(θ,i)} d/dθ p(y_t | x_t^{(θ,i)}).    (21)

Therefore, in this paper we only describe how to calculate (21).

a) Derivatives of Particle States: It is evident that p(y_t | x_t^{(θ,i)}) does not depend explicitly on θ, but x_t^{(θ,i)} does. Therefore, in order for the derivative of the likelihood w.r.t. θ to be non-zero, we first need to calculate dx_t^{(θ,i)}/dθ. If we write the dynamics as a deterministic function of the previous state, the parameters and the sampled noise, x_t^{(θ,i)} = f(x_{t−1}^{(θ,i)}, θ, ε_t), then the derivative can be calculated recursively by

dx_t^{(θ,i)}/dθ = ∂f/∂θ + (∂f/∂x_{t−1}^{(θ,i)}) dx_{t−1}^{(θ,i)}/dθ.    (25)

b) Derivatives of Gaussian Likelihood: We now describe how to calculate the derivative of a Gaussian likelihood, p(y_t | x_t^{(θ,i)}) = N(f; n, R), where f and n are defined by the observation model and R is the observation variance. The derivatives of log N(f; n, R) are given in Appendix A.

3) Differentiating the resampling step: It is stated in the literature that the resampling step, described in Section II-C1, cannot be differentiated. We will not outline here why this is the case, but direct the reader to [26] for a comprehensive explanation. [26] also outlines how the resampling step can be differentiated when fixing the random number seed and employing multinomial resampling. The approach adopted here does result in discontinuities in an unbiased estimate of the log-likelihood, but it also runs in O(N) time. Unlike the standard particle filter described in Section II-C, after resampling we also need to resample:
• the derivatives of the particle states (see (25)), dx_{1:t}^{(θ,i)}/dθ;
• the weight derivatives (see (21)), d/dθ w_{1:t}^{(θ,i)}.
This ensures that any derivative calculations made after resampling are a function of the parent particle.

In this piece of work we extend the SIR model in [8] to include an exposed compartment, as shown in Section III-B. The SEIR model follows a discrete-time approximation similar to the ones outlined in [8], [9]. Stochasticity is introduced into the dynamics by including a noise term for each time-varying parameter, which mimics the interactions between people in the population, P. These noise terms are independent across parameters and, for a parameter θ, are drawn from ε_θ ∼ N(0, √θ/P); for example, the noise on β is drawn from ε_β ∼ N(0, √β/P). Table I gives a description of the parameters used in the statistical models below. Prior information for these parameters is also included and is taken from [15].

Table I: Parameters of the statistical models and their priors.
Parameter | Description | Prior information
P | The total population | -
S(t) | The proportion of people in the susceptible compartment | -
E(t) | The proportion of people in the exposed compartment | Unif(0.00016, 0.00024)
I(t) | The proportion of people in the infectious compartment | Unif(0.00016, 0.00024)
R(t) | The proportion of people in the recovered compartment | -
β | Mean rate at which an infected person infects others per day | HalfNormal(0.0, 0.5)
γ | The proportion of infected people recovering per day | Normal(4.0, 5.0)
δ | Length of the incubation period | Normal(4.0, 5.0)
R_0 | The total number of people an infected person infects | -

We model the log of the observations by (36), where b_l, φ_l and σ_l are all non-negative constants. For the purpose of this analysis, b_l, φ_l and σ_l are known, which allows us to formulate the likelihood used by the particle filter.
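The paper's discrete-time SEIR update equations do not survive the extraction, but the sketch below illustrates one possible discretisation consistent with the description above: proportions flow between compartments at rates governed by β, δ and γ, with Gaussian noise of the form ε_θ ∼ N(0, √θ/P) perturbing each time-varying parameter. The exact update used by the authors may differ, so treat this as an assumption-laden illustration rather than the paper's model.

```python
import numpy as np

def seir_step(s, e, i, r, beta, delta, gamma, population, rng):
    """One day of a stochastic discrete-time SEIR model (illustrative only).

    s, e, i, r are proportions of the population; delta is treated as the incubation
    rate (1/period) and gamma as the recovery rate. Noise N(0, sqrt(param)/P) is
    added to each rate, clipped at zero.
    """
    noisy = lambda param: max(param + rng.normal(0.0, np.sqrt(param) / population), 0.0)
    new_exposed = noisy(beta) * s * i       # S -> E
    new_infectious = noisy(delta) * e       # E -> I
    new_recovered = noisy(gamma) * i        # I -> R
    s_next = s - new_exposed
    e_next = e + new_exposed - new_infectious
    i_next = i + new_infectious - new_recovered
    r_next = r + new_recovered
    return s_next, e_next, i_next, r_next

# Example: simulate 100 days starting from a small seed of infections.
rng = np.random.default_rng(0)
state = (1.0 - 10 / 5000, 0.0, 10 / 5000, 0.0)
trajectory = [state]
for _ in range(100):
    state = seir_step(*state, beta=0.254, delta=0.4, gamma=0.111, population=5000, rng=rng)
    trajectory.append(state)
```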
D. Results

1) SIR: We first compare the NUTS and MH proposals in p-MCMC by running the same simulation, with the parameter and prior choices described in [8], for the SIR model of Section III-A. We simulate (29) and (30) with the initial proportion of infected individuals set to 10/5000. Figure 2(a) shows the percentage of the population in each compartment throughout the duration of the epidemic. Figure 2(b) shows the simulated syndromic observations, y_{1:T}, which are a fraction of the infected compartment. They are drawn from (36) with the parameters b_l, φ_l and σ_l set to 0.25, 1.07 and 0.0012, respectively.

[Figure 3: A graphical representation of the SEIR transmission model.]

We run the particle filter to obtain estimates of the log-likelihood, and of the gradient of the log-likelihood w.r.t. β, across a range of 100 values of β equally spaced between 0.248 and 0.260. The aim is for the log-likelihood to be maximised, and the gradient to be 0, at β = 0.254. The results can be seen in Figure 2(c) and (d).

The Mean Squared Error (MSE) between the true and inferred values of θ for both MH and NUTS, as well as the time taken in seconds when using different numbers of particles, N, can be seen in Table II. The MSE and computational time are averaged over 10 runs, each using a different random number seed and 50 MCMC iterations. Using 10 different random number seeds produces different realisations of when resampling occurs (as explained in Section II-D3), so the chains converge to the true values of θ at different rates. Averaging over the 10 runs therefore gives a more accurate representation of the typical convergence accuracy and of the time taken. This is exemplified in Figure 4 when using MH (subplots a and b) and NUTS (subplots c and d) for the SIR model. Each line is an independent trace plot of an MCMC chain with the same initial starting values of 0.15 and 0.21 for β and γ, respectively. Note that for the MH proposal we use the same step-sizes for β and γ as stated in [8], namely 0.005 and 0.001, respectively. It is evident when looking at Table II that NUTS provides more accurate estimates of θ than MH, and does so in a shorter run time (Table II).

2) SEIR: Secondly, we show how well the Markov chain converges to the true parameters when running with different numbers of particles, N. Table III outlines the results for the SEIR model of Section III-B when using NUTS with ∆ set to 0.0055. We run the simulation for 2000 samples, with the first 1000 discarded as burn-in, and the true values of β, γ and δ set to 0.254, 0.111 and 0.4, respectively. Figures 5(a) and (b) show the trace plots for β when N = 16 and N = 256, respectively. By eye, it is evident that using N = 256 results in a Markov chain with better mixing, which in turn explores π(θ) more efficiently than N = 16. This is backed up by the acceptance rates in Table III; the corresponding autocorrelation function (ACF) plots are shown in Figures 5(c) and (d), respectively. ACF plots show the auto-correlation between samples in the Markov chain as a function of a user-specified lag, which we select to be 100. A plot that trends towards 0 in fewer lags is seen to be better, which is the case when N = 256. Table III also outlines the integrated auto-correlation time (IACT) for each parameter. This is a numerical measure of the area under the ACF plot, and a lower value represents less correlation between consecutive samples. The effective sample size (ESS) for different numbers of particles is also shown in Table III, as well as the time in seconds that p-MCMC takes to run. The ESS gives an indication of the number of independent samples it would take to have the same estimation power as the set of auto-correlated samples. Dividing the ESS by the time taken in seconds shows how efficient the algorithm is, by calculating how many effectively independent samples are drawn per second. The predicted marginal distributions for the parameters β, γ and δ, when using N = 256 and NUTS, can be seen in Figures 6(a), (b) and (c), respectively. We also show the predicted distribution of the basic reproduction number, R_0, in Figure 6(d); this is calculated as R_0 = β/γ.
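The convergence diagnostics quoted in Table III can be computed directly from the stored chain; a minimal sketch is given below. The truncation rule for the IACT sum (a fixed maximum lag of 100, matching the lag used for the ACF plots) is a simplifying assumption; packages such as ArviZ implement more careful estimators.

```python
import numpy as np

def autocorrelation(chain, max_lag=100):
    """ACF of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    var = np.dot(x, x) / x.size
    return np.array([np.dot(x[:x.size - k], x[k:]) / (x.size * var)
                     for k in range(max_lag + 1)])

def iact(chain, max_lag=100):
    """Integrated autocorrelation time: 1 + 2 * sum of the ACF over positive lags."""
    rho = autocorrelation(chain, max_lag)
    return 1.0 + 2.0 * np.sum(rho[1:])

def ess(chain, max_lag=100):
    """Effective sample size: chain length divided by the IACT."""
    return len(chain) / iact(chain, max_lag)

# Example usage on a synthetic autocorrelated chain (an AR(1) process).
rng = np.random.default_rng(0)
chain = np.zeros(1000)
for t in range(1, chain.size):
    chain[t] = 0.9 * chain[t - 1] + rng.standard_normal()
print(iact(chain), ess(chain))
```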
We have shown that using NUTS as the proposal can improve accuracy when inferring the parameters of a simple SIR model, relative to the standard MH proposal. We have also outlined multiple methods for assessing the convergence of MCMC chains. These are a necessity when ascertaining whether the results obtained from the p-MCMC algorithm are useful.

Another popular method for modelling epidemics, the Reed-Frost model, assumes that the number of people becoming infected with a disease follows a Binomial distribution. We do not include this model in our analysis because (25) is non-differentiable when the particle state equations follow a discrete distribution. A fruitful direction for future work would be to extend the ideas described in [29], which leverage optimal transport to mimic the multinomial distribution, so that they also mimic the Binomial distribution. In this piece of work we have only considered simulated data, to provide a proof of concept. A next step would be to apply these methods to more sophisticated epidemiological models, as seen in [15], and to use real data pertinent to disease outbreaks. One last direction for future work could involve ingesting multiple data streams with different reporting latencies. It has been shown in [30] that certain combinations of datasets can improve accuracy when predicting COVID-19 related deaths.

Acknowledgements: The authors would like to acknowledge Lee Devlin and the UK Health Security Agency (UKHSA) for their support.

References
[1] Covasim: an agent-based model of covid-19 dynamics and interventions
[2] Containing papers of a mathematical and physical character
[3] Sequential Monte Carlo methods in practice
[4] Novel approach to nonlinear/non-gaussian bayesian state estimation
[5] A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking
[6] Tracking epidemics with google flu trends data and a state-space seir model
[7] Detecting disease outbreaks using a combined bayesian network and particle filter approach
[8] Comparison of the performance of particle filter algorithms applied to tracking of a disease epidemic
[9] Sequential monte carlo filtering estimation of ebola progression in west africa
[10] Calibration of individual-based models to epidemiological data: A systematic review
[11] Handbook of markov chain monte carlo
[12] The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo
[13] Stan: A probabilistic programming language
[14] Contemporary statistical inference for infectious disease models using stan
[15] Refining epidemiological forecasts with simple scoring rules
[16] Particle markov chain monte carlo methods
[17] Inference for nonlinear epidemiological models using genealogies and time series
[18] Parameter elimination in particle gibbs sampling
[19] Quantifying noncommunicable diseases' burden in egypt using state-space model
[20] Predictive modeling of cholera outbreaks in bangladesh
[21] Key epidemiological drivers and impact of interventions in the 2020 sars-cov-2 epidemic in england
[22] Capturing the time-varying drivers of an epidemic using stochastic dynamical systems
[23] Bayesian estimation of agent-based models via adaptive particle markov chain monte carlo
[24] Tooling-up for infectious disease transmission modelling
[25] Introduction to particle markov-chain monte carlo for disease dynamics modellers
[26] Efficient learning of the parameters of non-linear models using differentiable resampling in particle filters
[27] Hybrid monte carlo
[28] A conceptual introduction to hamiltonian monte carlo
[29] Differentiable particle filtering via entropy-regularized optimal transport
[30] Fusing low-latency data feeds with death data to accurately nowcast covid-19 related deaths

[Algorithm listings: only fragments survive the extraction. Algorithm 1 (PMMH) includes the steps: run Algorithm 2 with θ' to obtain an estimate of log p(y_{1:T}|θ'); compute the acceptance probability α(θ, θ') using (5); accept if u < α(θ, θ'). Algorithm 2 (the differentiable particle filter) includes the steps: if N_eff (16) ≤ threshold, resample; sample the new particles x_t^i and calculate the particle derivatives using (25); calculate the particle weights and the derivatives of the particle weights using (9) and (21).]