Laplace-aided variational inference for differential equation models

Hyunjoo Yang and Jaeyong Lee

April 7, 2021

Abstract. An ordinary differential equation (ODE) model, whose regression curves are a set of solution curves for some ODEs, poses a challenge in parameter estimation. The challenge, due to the frequent absence of analytic solutions and the complicated likelihood surface, tends to be more severe for larger models with many parameters and variables. Yang and Lee (2021) proposed a state-space model with variational Bayes (SSVB) for ODEs, capable of fast and stable estimation in fairly large ODE models. The method performs excellently in parameter estimation but underestimates the posterior covariance, a weakness that originates from the mean-field variational method. This paper proposes a way to overcome the weakness by using the Laplace approximation. In numerical experiments, the covariance modified by the Laplace approximation showed a high degree of improvement when checked against the covariances obtained by a standard Markov chain Monte Carlo method. With the improved covariance estimation, the SSVB renders fairly accurate posterior approximations.

1 Introduction

An ordinary differential equation (ODE) model assumes that the regression curves fitting the observed data are a set of solution curves for some specific ODEs, with the ODE parameters $\theta \in \Theta \subset \mathbb{R}^q$. Despite the good interpretability of the ODE itself, parameter estimation from the observed data poses a challenge due to the frequent absence of analytic solutions and the complicated likelihood surface. In particular, as the model contains more parameters and variables, these tendencies become more severe, making fast and accurate parameter estimation increasingly difficult.

For fast inference, ODE parameter estimation algorithms have been developed in ways that avoid computing numerical solutions. The two-step approach, on which the frequentist methods are mainly based, approximates the ODE solutions by other nonlinear regression techniques such as spline expansion (Ramsay and Silverman, 2005; Ramsay et al., 2007), local polynomial regression (Liang and Wu, 2008; Liang et al., 2010), and so on. In the Bayesian framework, similarly, methods using Gaussian processes as regression curves have been developed (Calderhead et al., 2008; Dondelinger et al., 2013; Wang and Barber, 2014), while other methods improve Markov chain Monte Carlo (MCMC) using the parallel tempering algorithm (Campbell and Steele, 2012) or sequential Monte Carlo (Lee et al., 2018). An important limitation of the existing algorithms is that performance evaluations and comparisons have been made almost exclusively on small models, such as the FitzHugh-Nagumo model.

Yang and Lee (2021) proposed a Bayesian method, the state-space model with variational Bayes (SSVB), capable of fast and stable estimation in fairly large ODE models. Its two main strategies are an approximation by a state-space model and the variational Bayes method. First, the SSVB relaxes the original ODE model of $p$ variables to the following state-space model with the tuning parameter $\tau$:

$$y_i = x_i + e_i, \qquad e_i \overset{iid}{\sim} N(0, \lambda^{-1} I_p), \qquad i = 0, 1, \dots, n,$$
$$x_{i+1} = g(x_i, t_i, \theta) + w_i, \qquad w_i \overset{iid}{\sim} N(0, \tau I_p), \qquad i = 0, 1, \dots, n-1. \qquad (1)$$

In the model, the 4th-order Runge-Kutta method was chosen as the approximating function $g(\cdot)$:

$$g(x_i, t_i, \theta) = x_i + \frac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4),$$
$$K_1 = h_{i+1} f(x_i, t_i, \theta), \qquad K_2 = h_{i+1} f\!\left(x_i + \tfrac{K_1}{2},\, t_i + \tfrac{h_{i+1}}{2},\, \theta\right),$$
$$K_3 = h_{i+1} f\!\left(x_i + \tfrac{K_2}{2},\, t_i + \tfrac{h_{i+1}}{2},\, \theta\right), \qquad K_4 = h_{i+1} f(x_i + K_3,\, t_i + h_{i+1},\, \theta),$$

where $f$ is the ODE right-hand side and $h_{i+1} = t_{i+1} - t_i$ for $i = 0, 1, \dots, n-1$.
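For concreteness, the following is a minimal R sketch of the one-step Runge-Kutta map $g(\cdot)$ above. The function names rk4_step and f are illustrative, not part of the SSVB implementation.

```r
# One 4th-order Runge-Kutta step g(x, t, theta) for a generic ODE.
# 'f' is a user-supplied right-hand side f(x, t, theta) returning a
# numeric vector of the same length as x; 'h' is the step length.
rk4_step <- function(x, t, theta, h, f) {
  K1 <- h * f(x, t, theta)
  K2 <- h * f(x + K1 / 2, t + h / 2, theta)
  K3 <- h * f(x + K2 / 2, t + h / 2, theta)
  K4 <- h * f(x + K3, t + h, theta)
  x + (K1 + 2 * K2 + 2 * K3 + K4) / 6
}
```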
The relaxed model allows fast estimation by avoiding the computation of a complete numerical solution in the likelihood. Second, as a Bayesian method, the SSVB algorithm exploits the variational Bayes method for computing the posterior. The variational Bayes method also enables fast estimation by converting posterior inference into an optimization problem rather than an MCMC-like sampling problem. Furthermore, it increases the accuracy of the inference, since the concentration of the mean-field approximation makes it advantageous to estimate good combinations of the initial values $x_0$ and the ODE parameters $\theta$, which is crucial to reproducing the true ODE curves. For more details, see Yang and Lee (2021). Indeed, their simulation studies showed the SSVB's fast and accurate performance with strong stability, even in a large model with more than 30 parameters. In particular, it was markedly superior in reproducing the ODE curves, while none of the competing estimators (Haario et al., 2006; Ramsay et al., 2007; Hoffman and Gelman, 2014; Lee et al., 2018) provided valid inferences for the same large model. Taking advantage of this, the algorithm was also successfully applied to a time-varying SIR model (Yang and Lee, 2021) with many parameters for the COVID-19 epidemic data.

Despite the excellent performance, however, the SSVB based on the mean-field variational method has a well-known weakness: the underestimation of posterior covariances. The variational distribution, in which all the parameters are mutually independent, provides no information about the correlations between different parameters, and above all, by the structure of the objective function, it underestimates the variances representing the uncertainty of the inference (Blei et al., 2017). The good performance of the SSVB can be seen as a trade-off for this over-approximation.
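A standard bivariate Gaussian example (an illustration we add here, not taken from the original development) makes the underestimation explicit. For a Gaussian target with correlation $\rho$, the mean-field optimum has precision equal to the diagonal entries $\Lambda_{ii}$ of the precision matrix $\Lambda = \Sigma^{-1}$, so it recovers the conditional rather than the marginal variances:

$$p(z_1, z_2) = N\!\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix},\; \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}\right) \;\Longrightarrow\; q^*(z_i) = N\!\left(\mu_i,\; \Lambda_{ii}^{-1}\right) = N\!\left(\mu_i,\; \sigma_i^2(1-\rho^2)\right).$$

The variational variance $\sigma_i^2(1-\rho^2)$ thus understates the true marginal variance $\sigma_i^2$ by exactly the factor $1-\rho^2$, and the shortfall worsens as the posterior correlation grows.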
To improve the underestimated variances caused by the mean-field assumption, Giordano et al. (2015) proposed linear response variational Bayes (LRVB), generalizing linear response methods that originated in statistical physics. However, the LRVB is difficult to apply to the SSVB because the state-space model has strong dependencies between neighboring latent variables. Since the method relies on the second moment estimates from the variational distribution, the stronger the correlations between variables, the worse its performance becomes.

This paper proposes a solution to the variance underestimation problem using the Laplace approximation, a density approximation method. Using the second derivatives at the mode, the density function is approximated by that of a multivariate normal distribution. The idea of applying it to the SSVB is to treat the SSVB's parameter estimates as the mode of the posterior density and to take the Laplace-approximated covariance as a better covariance estimate. In experiments, the modified covariance showed a high degree of improvement when checked against the covariances obtained by standard MCMC methods. With the improved covariance estimation, the SSVB can be a promising ODE estimation algorithm.

Section 2 describes the Laplace approximation method applied to the SSVB. In addition to this direct application, the Laplace approximation applied to the original ODE model rather than the relaxed model (1) is also presented as a secondary method. The experimental results comparing the covariance estimates are provided in Section 3. In Section 4, the application to the time-varying SIR model of Yang and Lee (2021) for the COVID-19 data is provided as a real-world data analysis. Discussions are given in Section 5, while details of the computations are relegated to the Appendix.

2 Laplace approximation for the posterior covariance

In terms of the probability density function, the Laplace approximation approximates a density by that of a Gaussian distribution. It uses the second derivative at the mode of the density as the precision matrix, the inverse of the covariance matrix. Following MacKay (2003), let $p^*(\cdot)$ be an unnormalized density of a $k$-dimensional random vector $X$ with mode $x^* \in \mathbb{R}^k$. A second-order Taylor expansion of $\log p^*(x)$ about the point $x^*$ (where the first-order term vanishes) produces the following approximate equation:

$$\log p^*(x) \approx \log p^*(x^*) - \frac{1}{2}(x - x^*)^T H (x - x^*),$$

where $H$ is a $k \times k$ symmetric matrix whose elements are

$$H_{jk} = -\left.\frac{\partial^2}{\partial x_j \, \partial x_k} \log p^*(x)\right|_{x = x^*}.$$

As a result, the unnormalized pdf $p^*(x)$ is approximated by that of a multivariate normal distribution with mean vector $x^*$ and covariance matrix $H^{-1}$.

The Laplace approximation presupposes that the mode $x^*$ is known. In applying it to the correction of the posterior covariance of the SSVB, we use the point estimate from the SSVB, that is, the variational mean parameter, as the mode. This choice is based on the assumption that the point estimate of the SSVB, which reproduces the solution curve very well, is close to the mode with the highest posterior density.

2.1 Laplace approximation for the relaxed model

Since the SSVB algorithm uses a state-space model relaxed from the ODE model, two posterior distributions can be considered for the Laplace approximation: one from the state-space model of the SSVB, and the other from the original ODE model. When a prior of the form (3) is placed on $(\theta, x_0, \lambda)$, the posterior of the SSVB model (1) can be obtained accordingly. For the Laplace approximation, the second derivatives of the log-posterior, i.e., the Hessian matrix $H$, can be calculated by the chain rule; the terms involve the Jacobian matrices of $g$. Here, $J_{g\,\mathrm{wrt}\,x}(\cdot)$ and $J_{g\,\mathrm{wrt}\,\theta}(\cdot)$ stand for the Jacobian matrices of $g(\cdot)$ with respect to $x$ and $\theta$, respectively (for the detailed computation, see the Appendix of Yang and Lee (2021)). For simplicity of notation, the argument $t_i$ in $g(x_i, t_i, \theta)$ corresponding to the time of $x_i$ is omitted. The full equations of the second derivatives are given in Appendix A. The computational details for the Hessian matrix of $g$ with respect to $(x^T, \theta^T)^T$ appearing in the equations are given in Appendix B. As mentioned above, $H^{-1}$ evaluated at the estimates $(\hat{\theta}, \hat{X})$ from the SSVB is regarded as the modified covariance.

2.2 Laplace approximation for the original ODE model

Naturally, instead of the relaxed model, we can also consider the posterior based on the original ODE model as the likelihood. Here, $x_i := x(t_i; \theta, x_0)$ for $i = 0, 1, \dots, n$ are the points on the ODE solution curve $x(t; \theta, x_0)$ determined by the initial values $x_0$ as well as $\theta$. With the same prior (3), the posterior of the original ODE model and its logarithm can be written analogously, and the second derivatives of the log-posterior for the Laplace approximation can be calculated in terms of $J_{x_i\,\mathrm{wrt}\,x_0}$ and $J_{x_i\,\mathrm{wrt}\,\theta}$, the Jacobian matrices of $x_i$ with respect to $x_0$ and $\theta$, respectively. These computations include the second derivatives of an ODE solution $x(t; \theta, x_0)$ with respect to $(\theta^T, x_0^T)^T$. Such calculations, also known as sensitivity analysis, can be computed as the solution of an ODE system extended from the original ODE. Following Dickinson and Gelinas (1976) and Barrio (2006), the details of the calculation are given in Appendix C. As in the case of the relaxed model, the posterior covariance estimate is calculated by regarding the point estimate of the SSVB as the mode.
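As a minimal sketch of the correction step, assuming a user-supplied log-posterior function and the SSVB point estimate: the method above works with analytic second derivatives, but the same quantity can be obtained by numerical differentiation (numDeriv here is only an illustration, not the implementation used in this paper).

```r
library(numDeriv)  # numerical Hessian; stands in for the analytic derivatives

# log_post:  unnormalized log-posterior of the full parameter vector
# ssvb_est:  the SSVB point estimate, treated as the posterior mode x*
laplace_covariance <- function(log_post, ssvb_est) {
  H <- -numDeriv::hessian(log_post, ssvb_est)  # negative Hessian at the mode
  solve(H)                                     # Laplace covariance H^{-1}
}

# Toy check against an exact N(0, 2^2) target: returns approximately 4.
log_post <- function(x) dnorm(x, mean = 0, sd = 2, log = TRUE)
laplace_covariance(log_post, ssvb_est = 0)
```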
3 Experiments

To determine whether the proposed covariances are close to the true posterior covariance, we need information about the true one. Since the exact covariance is virtually impossible to calculate due to the ODE model's complex likelihood, sampling methods based on MCMC are used for comparison. The two sampling methods are the DRAM (delayed rejection & adaptive Metropolis) algorithm of Haario et al. (2006) and an HMC (Hamiltonian Monte Carlo) algorithm (Neal, 2011). The DRAM algorithm is a combination of the delayed rejection (DR) algorithm, which delays rejection and considers more candidates at each sampling iteration, and the adaptive Metropolis (AM) algorithm, which adapts the covariance of the proposal distribution to reflect the samples obtained so far. The computation is conducted through the R package FME (Soetaert and Petzoldt, 2010), with at most two candidates considered in each DR update. HMC is an MCMC method based on Hamiltonian dynamics, taking the density function as a potential energy function. It is implemented through the rstan package (Stan Development Team, 2020) with the default options for the no-U-turn sampler (NUTS) of Hoffman and Gelman (2014), an adaptive variant of HMC.

Since the purpose is to compare the covariance estimates, the starting values for the above two algorithms are given 'nicely'. In large ODE models, MCMC-based methods frequently fail to estimate the parameters, often not even reaching the region of dominant posterior probability; this also motivated the development of the SSVB. Therefore, to prevent such failures, the starting values are set to the point estimates from the well-performing SSVB algorithm.

Experiments were conducted on two ODE models: the FitzHugh-Nagumo model and the Lorenz-96 model. For each experiment, the correlation structures and variances from the methods were compared.

3.1 FitzHugh-Nagumo model

The FitzHugh-Nagumo model, with two variables and three parameters, is a popular model in ODE parameter estimation studies; its standard form is

$$\dot{V} = c\left(V - \frac{V^3}{3} + R\right), \qquad \dot{R} = -\frac{1}{c}(V - a + bR),$$

with $\theta = (a, b, c)^T$. A data set was generated from the true model with $\theta = (0.2, 0.2, 3)^T$ and $x_0 = (-1, -1)^T$ along the 201 equidistant time points $t_0 = 0, t_1 = 0.1, \dots, t_{200} = 20$, with error variance $1/\lambda = 0.25$. All the methods were run under the same prior of the form (3). For the DRAM algorithm, a chain of length 1,000 was obtained by retaining every 30th iteration from 30,000 iterations after a burn-in of 5,000. The NUTS algorithm, which mixes better, also provided a chain of length 1,000, retaining every 10th iteration from 10,000 iterations after a burn-in of 5,000. For the SSVB, the tuning parameter $\tau$ was set to $0.1^5$.
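The data-generating setup can be sketched in a few lines of R, using the standard FitzHugh-Nagumo right-hand side above; deSolve's ode() performs the numerical integration, and the seed and object names are arbitrary.

```r
library(deSolve)

# FitzHugh-Nagumo right-hand side in the form expected by deSolve::ode().
fn_rhs <- function(t, x, theta) {
  V <- x[1]; R <- x[2]
  a <- theta[1]; b <- theta[2]; cc <- theta[3]
  list(c(cc * (V - V^3 / 3 + R),   # dV/dt
         -(V - a + b * R) / cc))   # dR/dt
}

times <- seq(0, 20, by = 0.1)     # 201 equidistant time points
truth <- ode(y = c(V = -1, R = -1), times = times,
             func = fn_rhs, parms = c(a = 0.2, b = 0.2, c = 3))

set.seed(1)                       # arbitrary seed
y_obs <- truth[, c("V", "R")] +
  matrix(rnorm(2 * length(times), sd = sqrt(0.25)), ncol = 2)
```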
Figure 1 shows the resulting correlation structures. As mentioned earlier, the SSVB method under the mean-field assumption provides no information about the correlation between parameters: its correlation matrix would be filled with zeros except for the diagonal. With the Laplace approximation, however, the modified correlation showed a pattern similar to those from the DRAM and NUTS, which were taken as the standard. The Laplace approximation applied to the original ODE model, which has the same posterior as the DRAM and NUTS, also provided a similar structure. Though the exact correlation structure is unknown, the similarity of the results indicates that the proposed methods render reasonable posterior covariances.

Second, the scale of the posterior variances, representing the uncertainty of the inference, was checked. To visualize intuitively how much the underestimated uncertainty of the SSVB was recovered, Figure 2 plots, for each parameter, the estimates from the five methods relative to the largest one on a horizontal line. Looking closely, the SSVB modified by the Laplace approximation generally showed larger values than the others. This seems to be because the SSVB is based on the relaxed model rather than the original ODE model, a reading also supported by the fact that the Laplace approximation of the original ODE model provided estimates closer to those of the MCMC methods, which share the same likelihood. What is clear is that the Laplace-approximated covariances are much better estimates than the underestimated variances of the plain SSVB. In both the correlation structure and the scale of variance, the Laplace approximation performs reasonably well.

A repeat experiment was conducted to confirm stability. We obtained the covariance matrices in the above four ways for 30 different data sets, corresponding to indices 1-30 among the data sets used for the simulation study in Yang and Lee (2021). The Frobenius norm was used to measure the similarity between matrices. Figure 3 shows the histogram of the Frobenius norms between the two covariances obtained by the Laplace method and the MCMC method; the results of the case with the largest difference among the 30 are also shown. The correlation structures were quite similar even in the most different case, and the magnitude of variance was also much closer to the MCMC methods than under the mean-field assumption. The Laplace approximation method appears stable in improving the posterior covariance.
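The matrix comparison in the repeat experiments is elementary; the following is a sketch of the Frobenius-norm check, with hypothetical object names.

```r
# Frobenius-norm distance between two covariance estimates of the same
# parameters; cov_laplace and cov_mcmc are square matrices of equal size.
frobenius_dist <- function(cov_laplace, cov_mcmc) {
  norm(cov_laplace - cov_mcmc, type = "F")
}

# Over the 30 replicated data sets, with lists of estimated matrices:
# dists <- mapply(frobenius_dist, laplace_list, mcmc_list)
# hist(dists)   # cf. Figure 3
```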
3.2 Lorenz-96 model

The Lorenz-96 model, also chosen in Yang and Lee (2021), is a toy model whose size can be adjusted by selecting the number of variables $p \ (\geq 3)$. In this paper, the experiment was conducted with 4 variables and 12 parameters, considering that the DRAM and NUTS are much slower to run in larger models. Unlike the FitzHugh-Nagumo experiment above, the Laplace approximation applied to the original ODE model was excluded here: in the computation, some correlation coefficients fell outside the range $[-1, 1]$. We believe that numerical errors accumulate in the calculation of the first and second derivatives of the ODE solution $x(t; \theta, x_0)$ with respect to $(\theta^T, x_0^T)^T$. These values are obtained as the solution of an extended ODE system with more than $\frac{1}{2}p(p+q)(p+q+1) = 544$ variables (see Appendix C). In addition to the high sensitivity of the ODE system itself, such a large extended system can be expected to cause large accumulated errors in the numerical solutions. The magnitude of the variance estimates, irregularly far from the results of the other algorithms, also supported this conjecture.

Unlike in the FitzHugh-Nagumo experiment, the tendency of the covariance modified from the relaxed model to exceed the MCMC estimates was not evident in this case. This seems to be the effect of the step size of 2, which makes the relaxed model closer to the original ODE model. In any case, the Laplace approximation clearly provides a meaningful correction of the overall scale.

A repeat experiment was also performed on the Lorenz-96 model. Similar to the FitzHugh-Nagumo model, Figure 6 shows the results for the data sets corresponding to indices 1 to 30 used for the simulation study in Yang and Lee (2021). Even in the case with the largest difference, the correlation structures were very similar except for a few small correlations, and the improvement in variance magnitude was also evident for that case. Considerable stability was confirmed once again.

4 Application to the time-varying SIR model

In Yang and Lee (2021), a highly flexible SIR model was devised for fitting COVID-19 data. The number of infectious people $I(t)$ and the number of removed people $R(t)$ are modeled with time-varying rates $\beta(t)$ and $\gamma(t)$ expressed in cubic B-spline basis functions. With 30 basis functions each, a total of 60 basis coefficients, the ODE parameters, were properly estimated by the SSVB algorithm for the South Korea data. For details, see Yang and Lee (2021).

When applying the Laplace approximation to the time-varying SIR model, however, the much larger number of parameters than in the previous experiments causes numerical problems in the inverse matrix computation. The straightforward computation following Section 2.1 results in a precision matrix $H$ that is not positive definite, and hence a non-positive definite covariance matrix $H^{-1}$; indeed, the resulting covariance matrix included some negative variances. Two strategies were used to correct this.

The first is the inverse of a partitioned matrix. When $D$ is nonsingular, a $2 \times 2$ block matrix can be inverted as (Ouellette, 1981)

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} (A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1} \end{pmatrix}.$$

When the precision matrix $H$ is partitioned into these four blocks, with the square block $A$ corresponding to the position of $(\theta, x_0)$ and excluding the other state variables $(x_1, \dots, x_n, \lambda)$, the covariance matrix for $(\theta, x_0)$ can be obtained by inverting $A - BD^{-1}C$. Compared to inverting the whole $H$, the numerical error can be reduced as the dimension of the matrix to be inverted becomes smaller. Indeed, the recovery of the underestimated variances differed greatly depending on whether this strategy was used.

The second is to enforce positive definiteness. The resulting $A - BD^{-1}C$ was not positive definite, and neither was its inverse, the covariance matrix. For this case, the R package Matrix provides the nearPD function, which computes the nearest positive definite matrix to a given matrix based on Higham (2002). Using the nearest positive definite matrix, a valid covariance estimate was obtained, and the resulting correlation structure was interpretable; in particular, the coefficients for $\beta(t)$ and $\gamma(t)$ sharing the same basis function were more correlated with each other.
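A minimal sketch of the two numerical strategies, assuming the rows and columns of H have been arranged so that idx marks the $(\theta, x_0)$ coordinates; the function and variable names are illustrative.

```r
library(Matrix)  # provides nearPD (Higham, 2002)

# Covariance of the (theta, x0) block via the partitioned inverse:
# invert only the Schur complement A - B D^{-1} C instead of all of H.
block_covariance <- function(H, idx) {
  A <- H[idx, idx, drop = FALSE]
  B <- H[idx, -idx, drop = FALSE]
  C <- H[-idx, idx, drop = FALSE]
  D <- H[-idx, -idx, drop = FALSE]
  S <- A - B %*% solve(D, C)          # Schur complement of D in H
  cov_block <- solve(S)
  # Enforce positive definiteness if numerical error has broken it.
  as.matrix(nearPD(cov_block)$mat)
}
```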
With posterior samples $(\theta^*, x_0^*)$ drawn using the resulting covariance matrix, posterior samples of the ODE solution curve, $x^*(t; \theta^*, x_0^*)$, can be obtained. Figure 8 shows the 95% posterior credible intervals for the ODE solution curves with the data of Korea. A total of 1,000 sample curves were drawn, and the intervals based on the (25th, 975th) largest values at the observation times are plotted. Due to the sensitivity of the ODE solution, the credible intervals, narrow at the beginning, generally widen over time. With the plausible covariance, it has become possible to represent the uncertainty of the regression curve graphically.

5 Discussion

In this paper, we proposed improving the posterior covariance estimation, a weakness of the SSVB method, using the Laplace approximation. To confirm how valid these modifications are, the MCMC-based methods were chosen as a reasonable standard in place of the true posterior covariance, which is virtually impossible to know. Although MCMC methods do not perform well in ODE parameter estimation itself due to the limitations of sampling, they can be steered toward an ideal chain by choosing a starting point close to the true parameters.

The effect of the Laplace approximation appears quite valid. In the experiments, the correlation structure of the SSVB, which is just an identity matrix under the mean-field assumption, became very similar to that of the MCMC methods. Furthermore, the underestimated variances, representing the uncertainty of the inference, jumped to a scale comparable to that of the MCMC methods. Though we cannot say this is the perfect answer, it is clearly a far better estimate than that of the plain SSVB under the mean-field assumption.

For larger ODE models, such as the Lorenz-96 model with 10 variables, the proposed Laplace method can be applied, but the results are not included in this paper because the execution time of the MCMC methods, which serve as the standard for assessing the improvement, is too long. From the results on the smaller models compared in Section 3, it can nevertheless be inferred that the Laplace approximation could also provide a better covariance estimate in larger ODE models. As an example, the proposed method was applied to the time-varying SIR model with 60 ODE parameters for the COVID-19 data of Korea; with some strategies to overcome the numerical problems, we were able to obtain a better covariance estimate and use it to construct posterior credible intervals.

The Laplace approximation is quick to execute, as long as the second derivatives and the mode are given. Its success is fundamentally based on the excellent parameter estimation of the SSVB, whose point estimate serves as the mode. By combining a fast and stable parameter estimation method with a good enough density approximation, we obtain a fairly useful ODE estimation algorithm.

B Appendix: Hessian of g with respect to (x^T, θ^T)^T

B.1 Second derivatives of a composite function

If $y = f(u)$ with $f$ scalar-valued and $u = g(x)$, then the second derivative of $f \circ g$ is

$$\frac{\partial^2 (f \circ g)}{\partial x \, \partial x^T} = J_g(x)^T \, H_f(g(x)) \, J_g(x) + \sum_j \frac{\partial f}{\partial u_j} \, H_{g_j}(x),$$

where $J_g$ is the Jacobian of $g$, $H_f$ the Hessian of $f$, and $H_{g_j}$ the Hessian of the $j$-th component of $g$.

B.2 The case of step size m = 1

$$g(x, t, \theta) = x + \frac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4),$$

with $K_1, \dots, K_4$ as in Section 1. Let $u = (x^T, \theta^T)^T$ and suppose the Hessian matrices $H_{f_j} = \partial^2 f_j / \partial u \, \partial u^T$, $j = 1, \dots, p$, are given. The Hessian of each component $g_j$ then follows by applying the chain rule of B.1 through the compositions defining $K_1, \dots, K_4$.

B.3 The case of step size m ≥ 2

First, without considering the step size $m$, define iterated one-step maps, e.g. $g^{(1)} = g$ and $g^{(r)}(x, t, \theta) = g(g^{(r-1)}(x, t, \theta), t + (r-1)h, \theta)$ for $r \geq 2$. The Jacobian and Hessian matrices of these functions with respect to $x$ can then be computed recursively from $J_{g_j}(\cdot)$ and $H_{g_j}(\cdot)$ above via the chain rule. Returning to the main development, for a given step size $m \geq 2$ in the proposed method, we use $g^{(m)}(x, t, \theta)$ computed with $h/m$ in place of $h$ in the above formulas.

C Appendix: Sensitivity analysis of ODE systems

C.1 Jacobian of x(t; θ, x_0) with respect to θ and x_0

When a $p$-variable ODE system $\dot{x}(t) = f(x(t), t, \theta)$ is given, or componentwise $\dot{x}_j(t) = f_j(x(t), t, \theta)$, $j = 1, \dots, p$, consider the partial derivative of the $j$-th solution curve $x_j(t; \theta, x_0)$ with respect to the $k$-th parameter $\theta_k$,

$$Z_{kj}(t) \equiv \frac{\partial x_j(t; \theta, x_0)}{\partial \theta_k}, \qquad j = 1, \dots, p.$$

Differentiating the ODE system with respect to $\theta_k$ and exchanging the order of differentiation gives another ODE system for the Jacobian of $x(t; \theta, x_0)$,

$$\dot{Z}_{kj}(t) = \frac{\partial f_j}{\partial \theta_k} + \sum_{\ell=1}^{p} \frac{\partial f_j}{\partial x_\ell} Z_{k\ell}, \qquad j = 1, \dots, p.$$

The initial conditions for the system are $Z_{kj}(0) = 0$ for all $j = 1, \dots, p$, from

$$Z_{kj}(0) = \lim_{\Delta\theta_k \to 0} \frac{x_j(0; \theta_k + \Delta\theta_k) - x_j(0; \theta_k)}{\Delta\theta_k},$$

since $x_j(0; \theta_k + \Delta\theta_k) - x_j(0; \theta_k) = 0$.
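As an illustration of this sensitivity system (a toy sketch we add here, not the paper's implementation), consider the scalar ODE $\dot{x} = \theta x$, whose sensitivity $Z = \partial x / \partial \theta$ satisfies $\dot{Z} = x + \theta Z$ with $Z(0) = 0$. The augmented system can be integrated with deSolve and checked against the closed form $Z(t) = x_0 t e^{\theta t}$:

```r
library(deSolve)

# Augmented state (x, Z): xdot = theta * x, Zdot = x + theta * Z.
aug_rhs <- function(t, state, parms) {
  x <- state[1]; Z <- state[2]; theta <- parms[1]
  list(c(theta * x, x + theta * Z))
}

out <- ode(y = c(x = 1, Z = 0), times = seq(0, 2, 0.1),
           func = aug_rhs, parms = c(theta = 0.5))

# Compare the numerical sensitivity with the exact Z(t) = t * exp(theta * t).
max(abs(out[, "Z"] - out[, "time"] * exp(0.5 * out[, "time"])))
# small, on the order of the solver tolerance
```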
The partial derivative with respect to $x_0 = (x_{01}, \dots, x_{0p})^T$ can be obtained in the same way, with the initial conditions for $Z_{\ell j} \equiv \partial x_j / \partial x_{0\ell}$ as follows:

when $j = \ell$: $x_\ell(0; x_0 + \Delta x_{0\ell}) - x_\ell(0; x_0) = \Delta x_{0\ell}$, so $Z_{\ell\ell}(0) = 1$;
when $j \neq \ell$: $x_j(0; x_0 + \Delta x_{0\ell}) - x_j(0; x_0) = 0$, so $Z_{\ell j}(0) = 0$.

C.2 Hessian of x(t; θ, x_0) with respect to θ and x_0

In the same way as above, ODE systems for the Hessian of $x(t; \theta, x_0)$ can be obtained. For example, defining

$$W^j_{rk} \equiv \frac{\partial Z_{kj}(t)}{\partial x_{0r}} = \frac{\partial^2 x_j(t)}{\partial x_{0r} \, \partial \theta_k},$$

differentiating the sensitivity system of C.1 once more yields an extended ODE system for $W$ whose right-hand side is a function of $(x(t; \theta, x_0), t, \theta, Z(t; \theta, x_0))$. Analogous definitions and extended systems hold for the second derivatives with respect to $\theta$ only and with respect to $x_0$ only.

A Appendix: The full second derivatives for the Laplace approximation of the relaxed model

Among the blocks of the Hessian, the terms for $i = 1, \dots, n-1$ involve the Jacobian and Hessian matrices of $g$ via the chain rule, and the final diagonal block is

$$\frac{\partial^2}{\partial x_n \, \partial x_n^T} = \left(\frac{1}{\tau} + \lambda\right) I_p.$$

References

Barrio (2006). Sensitivity analysis of ODEs/DAEs using the Taylor series method.
Blei et al. (2017). Variational inference: A review for statisticians.
Calderhead et al. (2008). Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes.
Campbell and Steele (2012). Smooth functional tempering for nonlinear differential equation models.
Dickinson and Gelinas (1976). Sensitivity analysis of ordinary differential equation systems - a direct method.
Dondelinger et al. (2013). ODE parameter inference using adaptive gradient matching with Gaussian processes.
Giordano et al. (2015). Linear response methods for accurate covariance estimates from mean field variational Bayes.
Haario et al. (2006). DRAM: Efficient adaptive MCMC.
Higham (2002). Computing the nearest correlation matrix - a problem from finance.
Hoffman and Gelman (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.
Lee et al. (2018). Inference for differential equation models using relaxation via dynamical systems.
Liang and Wu (2008). Parameter estimation for differential equation models using a framework of measurement error in regression models.
Liang et al. (2010). Estimation of constant and time-varying dynamic parameters of HIV infection in a nonlinear differential equation model.
MacKay (2003). Laplace's method. In Information Theory, Inference and Learning Algorithms.
Neal (2011). MCMC using Hamiltonian dynamics.
Ouellette (1981). Schur complements and statistics.
Ramsay and Silverman (2005). Functional Data Analysis.
Ramsay et al. (2007). Parameter estimation for differential equations: a generalized smoothing approach.
Soetaert and Petzoldt (2010). Inverse modelling, sensitivity and Monte Carlo analysis in R using package FME.
Stan Development Team (2020). RStan: the R interface to Stan.
Wang and Barber (2014). Gaussian processes for Bayesian estimation in ordinary differential equations.