key: cord-0597708-a7o2ur3l authors: Tronarp, Filip; Bosch, Nathanael; Hennig, Philipp title: Fenrir: Physics-Enhanced Regression for Initial Value Problems date: 2022-02-02 journal: nan DOI: nan sha: aa33ccbc43e359fa6ab4b6880aacb0c5d7a4b315 doc_id: 597708 cord_uid: a7o2ur3l We show how probabilistic numerics can be used to convert an initial value problem into a Gauss--Markov process parametrised by the dynamics of the initial value problem. Consequently, the often difficult problem of parameter estimation in ordinary differential equations is reduced to hyperparameter estimation in Gauss--Markov regression, which tends to be considerably easier. The method's relation and benefits in comparison to classical numerical integration and gradient matching approaches is elucidated. In particular, the method can, in contrast to gradient matching, handle partial observations, and has certain routes for escaping local optima not available to classical numerical integration. Experimental results demonstrate that the method is on par or moderately better than competing approaches. Consider the following initial value problem (IVP) where the vector field f θ : [0, T ] × R d → R d and the initial condition ϕ θ (0) = y 0 (θ) are both parametrised by θ. In this article, the concern lies in estimating θ from noisy measurements of the following form where t ∈ T D ⊂ [0, T ] is the finite set of measurement nodes and H is a measurement matrix of appropriate dimension. This is a ubiquitous problem in science and engineering. Examples include ecology (Benson, 1979) , pharmacokinetics (Gelman et al., 1996) , process engineering (Åström & Eykhoff, 1971) , and brain imaging (Friston, 2002) . 1 Department of Computer Science, University of Tübingen, Tübingen, Germany. Correspondence to: Filip Tronarp . The parameter θ may be estimated by maximising M. A persistent challenge in likelihood-based inference in initial value problems is the fact that ϕ θ , and therefore the likelihood, are intractable (Bard, 1974) . A standard appproach to approximating the likelihood is based on solving the IVP numerically (Hairer et al., 1987) . However, in optimisation-based inference it has been observed that this leads to many local optima (Cao et al., 2011) , and can lead to divergence of the optimiser (Dass et al., 2017) . On the other hand, slow convergence and poor mixing has been observed for Monte Carlo-based inference (Alahmadi et al., 2020) , which have led some authors to favour likelihood-free methods (Toni et al., 2009 ). Another alternative is gradient matching (Voit, 2000) with splines (Varah, 1982; Gugushvili & Klaassen, 2012) or Gaussian processes (Calderhead et al., 2009; Dondelinger et al., 2013; Gorbach et al., 2017; Wenk et al., 2020) . In the present work, a probabilistic numerics approach is developed for computing the marginal likelihood. Probabilistic numerics aims at producing probability measures for solutions of numerical problems, thus giving a probabilistic description of the numerical error (Hennig et al., 2015; . The marginal likelihood may be viewed as L D integrated against a Dirac measure located at ϕ θ according to While this representation is not immediately advantageous, it is instructive for understanding the probabilistic numerics approach. Namely, it produces an approximation to the Dirac measure, giving the following approximate marginal likelihood (Kersting et al., 2020a) M N (θ, κ) = L D (R θ , y) δ N (y | θ, κ) dy, where δ N is the output of a suitably chosen probabilistic numerical method, which is parametrised by κ. It should be noted that L D only depends on point evaluations of y on the grid T D . Therefore, it is sufficient to operate on the finite dimensional distributions of δ N to compute M N . The aim of this article is to show how both δ N and M N can be computed efficiently. The method consists of two parts: 1. Efficiently construct a Gauss-Markov representation of δ N (y | θ, κ) using probabilistic numerics. 2. Compute M N (θ, κ) and its derivatives via Gauss-Markov regression and automatic differentiation. The first step essentially takes the initial value problem and produces a physics-enhanced Gauss-Markov prior. The second step utilises this prior in standard Gauss-Markov regression to estimate parameters and reconstruct the trajectory (Särkkä & Solin, 2019) . Therefore, the method is called Physics-enhanced regression in initial value problems, or Fenrir for short. Here physics is used to refer to any mechanistic information pertaining to the dynamics of the data generating process. The method is illustrated in Figure 1 . The rest of the article is organised as follows. Probabilistic numerical solvers are reviewed in Section 2. In Section 3 it is shown how to use probablistic numerics to construct a physics-enhanced Gauss-Markov prior for initial value problems, thus reducing the marginal likelihood to Gauss-Markov regression. Related work is discussed in Section 4, which is followed by experimental results in Section 5. Finally, concluding remarks are given in Section 6. In the Bayesian formulation, an IVP solver is completely specified by a prior and the definition of the data, on which it is conditioned. The latter is obtained by means of an information operator (Cockayne et al., 2019) . For constructing a probabilistic numerical solver, we follow the account of Tronarp et al. (2019b; . The probabilistic numerics prior is defined as the output of the following stochastic state-space model where x ∈ R d(ν+1) models the solution and its ν first derivatives and E m are selection matrices for the mth derivative of the prior model for the solution of (1), which is denoted by y. Furthermore, x † θ denotes the initial condition of x, A ∈ R d(ν+1)×d(ν+1) and B ∈ R d(ν+1) are model matrices and w is a standard Wiener process in R d (Øksendal, 2003) . The state x is a Markov process by construction, with transition density given by (Särkkä & Solin, 2019) which facilitates fast computation for the probabilistic solver and our subsequent marginal likelihood approximation. Additional details on priors for probabilistic solutions of initial value problems can be found in Appendix A.1. In order to define a data model for probabilistic numerical solvers, a grid needs to be coupled with an information operator. The canonical information operator for initial value problems is given by (Tronarp et al., 2021 ) but there are alternatives that, for instance, also take geometric invariants into account (Bosch et al., 2021b) . Note that Z map solutions of the initial value problem to the zero function, which is a known value. In fact, the set of functions starting at y 0 (θ) which are mapped to the zero function by Z constitutes the set of solutions to the initial value problem (Arnol'd, 1992) . 1 An appropriate data model for a probabilistic numerical solver is thus given by where z(t) = 0 is enforced only on the chosen grid as to arrive at a practical algorithm. It should be noted that the grid T PN does not need to be specified a priori but can be constructed adaptively to control the solution error (Schober et al., 2019; Bosch et al., 2021a) . The prior (5), data model (6), and data definition (7) define a non-linear Gauss-Markov regression problem according to Tronarp et al. (2019b) x where ∆ n = t n − t n−1 is the step-size of the nth step, x(t 0 ) = x † θ by convention, and N (· , 0) denotes the Dirac distribution. The probablistic numerical solver for (1) associated with the prior (5) and the data (7) is on the grid T PN given by where c(θ, κ) is a norming constant. Due to the potential non-linearity of the vector field, this object is generally intractable. However, when the vector field is linear, say then the densities of the time marginals can be computed efficiently via Kalman filtering and Rauch-Tung-Striebel smoothing (Kalman, 1960; Rauch et al., 1965) . This fact is exploited for approximate inference when the vector field is non-linear as well. Indeed several linearisation approaches have been employed (Schober et al., 2019; Tronarp et al., 2019b; , which have been demonstrated to yield accurate solvers both empirically (Schober et al., 2019; Bosch et al., 2021a; Krämer & Hennig, 2020) and theoretically (Kersting et al., 2020b; Tronarp et al., 2021) . 1 Typically, we assume that the vector field is regular enough for there to be a unique solution of the initial value problem. The Kalman filtering recursion for (8) when the vector field is affine as in (10), recursively computes the densities π(x(t n ) | z(t 1:n )) = N (µ θ (t n ), Σ θ (t n )), which are the time marginals conditioned on all past data up to the present. The recursion is initialised by setting µ θ (t 0 ) = x † θ , Σ θ (t 0 ) = 0, and then alternates between prediction and update. • Prediction: • Update: The following parameters can be computed from the outputs of the Kalman filter They are used for the smoothing recursion and the representation of the probabilistic numerics posterior. In this section, it is shown that the probabilistic numerical solver yields a Gauss-Markov process approximation to (1). Consequently, inference given measurements (2) reduces to a Gauss-Markov regression problem with a physicsenhanced prior as determined by the probabilistic solver. Linearising the vector field allows for approximate computation of the time marginal densities via the Rauch-Tung-Striebel smoother. These linearisations imply a Gauss-Markov representation of the approximate posterior, which in fact is used in the Bayesian derivation of the smoothing algorithm (Särkkä, 2013 , c.f. proof of theorem 8.2). In any case, the probabilistic solver converts the original initial value problem into a Gauss-Markov process, which on the grid T PN is given by where and (G, P ) are given by (12). Note that γ N is represented as a Gauss-Markov process running backwards in time. It represents a probabilistic approximation to the solution of the IVP and its derivatives, in terms of a conditional distribution given numerical data (7) and the parameter θ. In the previous section, the approximate Dirac δ N (y | θ) was implicitly defined through γ N in (9). The purpose here is to turn this into an implementable algorithm for approximating the marginal likelihood. For ease of notation it is assumed that T D ⊂ T PN , in which case, Additionally, the calibration parameter κ is also included in the marginal likelihood approximation. In practice, γ N is replaced by its approximation γ N in (13). This results in the following approximation to the marginal likelihood Consequently, the problem of computing the marginal likelihood and trajectory estimates is reduced to inference in the following linear state-space model where the backwards transition densities can be read from (13). Therefore, estimating the trajectory of the solution (1) can also be done via Kalman filtering and smoothing. Furthermore, the marginal likelihood approximation can be computed via the Kalman filter through the prediction error decomposition (Schweppe, 1965) . Complete details on how to compute trajectory estimates and marginal likelihoods in (15) are given in Appendix B. Computational complexity The computation of γ N and M N can be implemented with Gauss-Markov regression with a state dimension of d(ν + 1). Therefore, assuming the measurement dimension is smaller, the computational complexity of the method is O(N d 3 (ν +1) 3 ). That is, it is linear in the number of data points, in contrast to cubic complexity for standard Gaussian process regresison. Further speed-ups may be obtainable by exploiting structural simplifications for certain probabilistic solvers (Krämer et al., 2021) . Hyperparameter estimation The present method provides a marginal likelihood (14); its derivatives can be computed with automatic differentiation. Consequently, Fenrir interacts with various inference methods, such as gradientbased optimisation or Markov Chain Monte Carlo, in a plugand-play fashion. In this paper, the maximum likelihood approach is examined. The marginal likelihood approximation (14) confers other benefits than providing a cost function for parameter inference. Namely, the possibility for a probabilistically motivated model comparisons, such as likelihood ratio testing for nested models (King, 1998) , or via various information criteria (Akaike, 1974; Stoica & Selen, 2004) . Three different approaches to parameter estimation in initial value problems can be discerned, namely (a) numerical integration, (b) gradient matching, and (c) probabilistic numerics. In order to get a comprehensive lay of the land of parameter estimation in ordinary differnetial equations, these approaches are reviewed in this section. Particular care is taken to highlighting similarities and differences. The traditional approach is to estimate the parameters via non-linear regression (Biegler et al., 1986) , where the correct solution to (1) is replaced by a numerical approximation, say Runge-Kutta (Hairer et al., 1987) . Thus the marginal likelihood approximation reads That is, likelihood computation via numerical integration computes the Dirac approximation δ N in (4) by approximating the location of the Dirac in (3). The main idea of gradient matching is to decompose the inference procedure into two steps: 1. Fit a curveŷ(t) to the data u(t), t ∈ T D . 2. Estimate the parameter θ by minimising the deviation from the differential equation: This procedure is vaguely formulated, purposely so. Indeed, different alternatives for these steps have surfaced throughout the years. Spline smoothing The first approach was to implement the curve fitting step with splines (Varah, 1982) or kernel regression (Gugushvili & Klaassen, 2012) , whereafter the gradient matching step is posed as a non-linear least squares problem. Another variant is to couple the curve fitting step with the gradient matching step, resulting both in higher accuracy and higher computational cost (Ramsay et al., 2007) . Gaussian process regression The effort to formulate gradient matching probabilistically was spear-headed by Calderhead et al. (2009) , where Gaussian process regresion is combined with a product of experts approach. This method was improved upon by Dondelinger et al. (2013) via joint sampling for GP and ODE parameters. It was subsequently shown that a mean-field formulation can offer computational speed-ups (Gorbach et al., 2017) . In search for a generative model There has been effort put to formulating Gaussian process-based gradient matching as inference in a generative model. First by Barber & Wang (2014) , who instead formulate a model directly linking state derivatives to measurements. However, their approach suffers from identifiability problems, as demonstrated by Macdonald et al. (2015) . It was later demonstrated by Wenk et al. (2019) that identifiability issues are also present for the product of experts approach. They propose to resolve this issue by formulating an alternative model; this approach was pursued further by Wenk et al. (2020). Relation to gradient matching It might be tempting to interpret the probabilistic numerics approach as a variant of gradient matching. But gradient matching fits a curve to the data and then the differential operator to the curve, while for probabilistic numerics the order of operation is reversed: 1. Fit a curve by attempting to satisfy the differential equation at a finite set of points. 2. Fit the parameters of the differential operator by using the aforementioned curve and the data likelihood. The first step is implemented by probabilistic numerics, resulting in a physics-enhanced Gaussian process prior, whereas the second step reduces to Gauss-Markov regression. By directly incorporating the physics of the problem into the prior, it is ensured that inference is done in a wellposed probability model. Consequently, issues regarding model specification and identifiability (Macdonald et al., 2015; Wenk et al., 2019) , that have been recurring in gradient matching, are avoided. Relation to numerical integration The difference between probabilistic numerics and numerical integration for computing the likelihood comes down to the Dirac approximation δ N . As can be seen in (16), numerical integration does so by simply approximating the locations of the Dirac. On the other hand, probabilistic numerics approximates the Dirac with a distribution of non-zero width, often Gaussian in practice. This has a smoothing effect on the likelihood and parallells can be drawn with the smoothing method in non-convex optimisation (Mobahi & Ma, 2012) . But the present method is not equivalent. For example, the smoothing is not with respect to the variable of interest θ, but rather with respect to the function ϕ θ . Previous probabilistic numerics approaches The probabilistic numerics approach to approximate the marginal likelihood has been explored to some extent by Kersting et al. (2020a) . However, the present approach confers certain advantages over the former, the most notable being that the Gauss-Markov representation of the probabilistic solvers ensures all computations cost at most O(N ). A probabilistic numerics approach has also been developed for estimating time varying parameters in the context of latent force modelling (Schmidt et al., 2021) . However, for the constant parameter problem, using linearised models can cause divergence in certain situations (Ljung, 1979 ). An alternative to the inference-based methods hitherto discussed is to model the error by stochastic perturbation of numerical integrators (Chkrebtii et al., 2016; Conrad et al., 2017; Matsuda & Miyatake, 2021; Teymur et al., 2018) . This section investigates the utility and performance of Fenrir in a range of numerical experiments. It is structured as follows. Section 5.1 evaluates Fenrir on two standard benchmark problems. Section 5.2 demonstrates the utility of the proposed marginal likelihood for model selection. Section 5.3 considers systems with only partially observable states and shows that Fenrir, unlike most gradient matching methods, is still applicable. Finally, Section 5.4 investigates highly oscillatory systems which present a particular challenge for numerical integration-based methods. Implementation The implementation of the probabilistic numerical IVP solvers follows a number of practices for numerically stable implementation established by Krämer & Hennig (2020) . All experiments are implemented in the Julia programming language (Bezanson et al., 2017) . Runge-Kutta reference solutions are computed with Differ-entialEquations.jl (Rackauckas & Nie, 2017) , and numerical optimizers are provided by Optim.jl (Mogensen & Riseth, 2018) . All experiments run on a single, consumer-level CPU. Code is publicly available on GitHub. 2 This experiment evaluates Fenrir on two benchmark problems that have been extensively studied in the both the gradient matching and the numerical integration literature (Calderhead et al., 2009; Wenk et al., 2020) , namely the Lotka-Volterra predator-prey model and the FitzHugh-Nagumo neuronal model. Detailed system descriptions, along with the ground-truth parameters, initial values, and the chosen observation noise levels, are provided in Appendix C.2. We perform 100 experiments for each experimental setup, in which noisy observations are drawn from the numerically computed, true system trajectories. The inference task then consists in estimating initial values and parameters from noisy state observations. The quality of the resulting parameter estimates is evaluated using the trajectory RMSE (tRMSE) metric as defined in Definition C.1. We compare Fenrir to the probabilistic gradient matching method ODIN (Wenk et al., 2020) and to a non-linear least squares regression using a Runge-Kutta solver, referred to as RK (Bard, 1974) . ODIN results are computed using the code published by Wenk et al. (2020); RK is described in more detail in Appendix C.1. All methods optimise their re-2 Code will be published upon acceptance. spective objectives with the L-BFGS algorithm (Nocedal & Wright, 2006) . More details are provided in Appendix C.2. Results of the experiment are shown in Figure 2 . In the median, Fenrir performs on par with ODIN and RK on Lotka-Volterra, but both RK and Fenrir outperform ODIN on FitzHugh-Nagumo and achieve more accurate state estimates as well as lower trajectory RMSEs. Both RK and Fenrir suffer from outliers, but this issue appears to be less severe for Fenrir; see also Figure 9 in Appendix C.2. For a given set of noisy observations, the true parametric form of the underlying dynamical system is often not known exactly. Instead, a set of plausible models has to be evaluated against the observed data in order to find the most fitting candidate. It has been previously shown that probabilistic gradient matching can be used for model selection, by comparing estimated noise parameters which are supposed to account for model mismatch (Wenk et al., 2020) . However, as Fenrir operates on a physics-informed probability model, model selection can be accomplished by statistically rigorous methods such as likelihood ratio testing (King, 1998) . The experiment follows the setup proposed by Wenk et al. (2020) . We consider the Lotka-Volterra system as ground truth from which we numerically simulate experimental data, and generate a set of four candidate models by combining the true ODEs with two additional, incorrect equations -all equations and parameters are provided in Appendix C.3. We obtain four models, We perform 100 individual model selection experiments to evaluate Fenrir's robustness regarding the observation noise. The resulting marginal likelihoods are shown in Figure 3 . We observe that Fenrir consistently attributes the lowest negative log-likelihood to the correct model M 11 , and is thus able to accurately identify the true model. Here, we evaluate Fenrir on an epidemeological model in which the system state can only be partially observed. We consider a compartmental SEIR model that describes the fractions of a population that are susceptible (S), exposed (E), infected (I; i.e. diagnosed with a positive test), and recovered (R) over time (Hethcote, 2000) . Such compartmental models are commonly used to model the development of infectious diseases, and variants of the SEIR model have been used to explain COVID-19 outbreaks (Menda et al., 2021) . The definition of the dynamics, ground-truth initial values, and parameters are provided in Appendix C.4. At each point in time, only the infected and recovered population can be (approximately) observed, but the exposed and susceptible population is unknown. Since Fenrir's "dynamics-first" approach only requires the observation to be linearly dependent on the system states (see Equations (2) and (15c)), no particular adjustments are needed for this experiment. Similarly, the Runge-Kutta-based approach considered in Section 5.1 is also applicable and will be used for comparison. However, most gradient matching methods require all dimensions of the system states to be measurable in order to construct an interpolant, and are therefore not applicable to problems with partial observability. Fenrir performs on par with the non-probabilistic Runge-Kutta baseline (RK) and is able to infer accurate parameter estimates from only partial observations of the SEIR system. ered population, which are furthermore given only from day 30 onwards. The results of 100 experiments are shown in Figure 5 . Fenrir is able to consistently infer accurate parameter and trajectory estimates from noisy, partial observations of the dynamical system. Finally, we evaluate Fenrir on a partially observable pendulum system that exhibits fast oscillations. Problems of this form are known to be challenging for simulation-based methods such as the previously considered Runge-Kutta least-squares approach which, with poor initialization, often fail to capture the high frequencies (Benson, 1979) . While gradient-matching methods are expected to be more robust to such problems, they require fully observable states and are therefore not applicable in the present setting. Thus, we investigate Fenrir's capabilities of performing trajectory, parameter, and initial value inference under these challenges. Figure 7 . Negative log-likelihood and optimization trajectory. By first increasing its diffusion parameter, Fenrir is able to recover the true pendulum length parameter L = 1 by minimising the negative log-likelihood using L-BFGS. The likelihood (i.e. the negative exponential of the main plot) is shown in the inset figure. Figure 6 visualizes the problem setup and a single experiment; a detailed description of the dynamics and the chosen hyperparameters is provided in Appendix C.5. In the shown example, the non-linear least squares regression converges towards the constant zero function and is unable to capture the high frequencies of the data. On the other hand, by first optimizing the diffusion and observation noise parameters separately, Fenrir interpolates the experimental data and is then able to accurately approximate the true system parameters. The chosen optimization trajectory is visualised with the corresponding loss landscape in Figure 7 . Figure 8 shows inferred parameters for a wider range of starting values; for simplicity, the initial value y 0 is assumed to be known here. RK often fails to converge towards the groundtruth, whereas Fenrir is able to recover the true parameter for a wide range of starting values. Both RK and Fenrir are evaluated on a wide range of initial parameter estimates, from which they attempt to recover the true parameter L = 1 (dashed line) by optimization via L-BFGS. RK is often unable to approximate the true parameter, whereas Fenrir accurately recovers the true parameter for a wide range of starting points. It has been demonstrated that the solution of an initial value problem can be approximated by a Gauss-Markov process, reducing the inference problem to Gauss-Markov regression. The method offers advantages such as O(N ) cost for inference, operability in the face of partial observations, regularised likelihoods, and moderate improvements in terms of estimation accuracy. But, perhaps more importantly, it has been shown that probabilistic numerics is a promising method for rigorously incorporating physics in Gaussian process regression. Dass, S. C., Lee, J., Lee, K., and Park, J. Laplace based approximate posterior inference for differential equation models. Statistics and Computing, 27 (3) Gugushvili, S. and Klaassen, C. A. J. √ n-consistent parameter estimation for systems of ordinary differential equations: bypassing numerical integration via smoothing. Bernoulli, 18 (3) In this appendix, the probabilistic solver is described in detail. Further details on the prior are given in Appendix A.1. In Appendix A.2, it is explained how to compute the marginal moments and the parameters of the backward Markov representation of the posterior when the vector field is linear (affine). In Appendix A.3 some linearisation methods for approximate inference when the vector field is non-linear are reviewed. Recall that the prior in state-space form is given by where y (m) models the mth derivative of the solution. By Itô's formula this implies that and for this to be consistent with the asserted derivative relations it must hold that This in turn implies that it must hold that while E T ν A and E T ν B are free parameters. Letting e m be the mth canonical basis vector in R ν+1 , I d be the d by d identity matrix, and fixing E m = e m ⊗ I d then gives the model where A ν,m = E T ν AE m and B ν = E T ν B. Any other state-space model of dimension d(ν + 1) modelling a vector valued function of dimension d and its ν first derivatives must be related to this via similarity transform. The canonical model in probabilistic numerics is the ν-times integrated Wiener process (Schober et al., 2019; Tronarp et al., 2019b; Krämer & Hennig, 2020; Bosch et al., 2021a; Kersting et al., 2020b) , where the parameters are given by A ν,m = 0, m = 0, 1, . . . , ν − 1. Though other priors are of course possible (Magnani et al., 2017; Tronarp et al., 2021; Kersting et al., 2020b) . Usually, the diffusion matrix B ν is set to identity as well, yielding the following prior which is used throughout the article. Suppose the vector field is linear: then the probabilistic IVP solver reduces to inference in the following model: subject to the data The posterior is Gaussian because the prior is Gaussian and the measurement functionals are linear, and it can be computed with the well-known forward / backward recursions (Kalman, 1960; Rauch et al., 1965) . More specifically, denote the numerics data up to time t by and up to just before time t by The forward recursion then computes the filtering densities which agree with the prediction densities unless t ∈ T PN . The filtering moments are then post-processed in the backwards recursion to produce the smoothing densities (time marginals of the posterior) For a more thorough exposition on filtering and smoothing refer to Särkkä (2013) ; Särkkä & Solin (2019) . Furthermore, the fact that the scaling κ is retained throughout the recursion follows from the fact that the initial covariance and all transition covariances are scaled by κ (Tronarp et al., 2019a). Forward recursion The forward recursion starts by initialising the filter mean and covariance according to whereafter the algorithm alternates between prediction and update. The prediction equations are given by where G θ and P θ are parameters associated with the subsequent backward recursion. The update relations are given by Backward recursion The backwards recursion starts by setting the smoother mean and covariance to the filter mean and covariance at the terminal point according to The backwards recursion is then given by Backward Markov process representation Lastly, the posterior may be represented, on the grid, by the following backwards Markov process This follows from the fact that When the vector field is non-linear, the posterior is in most cases intractable. However, approximate posteriors may be obtained by linearising the data relation in (7). Due to the structure of the information operator, there are multiple choices for doing this, namely 1. Zeroth order linearisation (Schober et al., 2019) :L 2. First order linearisation (Tronarp et al., 2019b) : The linearisation point is typicaly chosen as the predictive mean: However, other choices are possible as well, such as the smoothing mean (Tronarp et al., 2021) which leads to the fixed-point equations for the Gauss-Newton algorithm (Bell, 1994) . Using the probabilistic numerics posterior as a surrogate for the solution of the initial value problem leads to the following inference problem This is again, a problem of Gauss-Markov regresion and can be solved by the usual forward / backward recursions. What is unusual is that the latent process is specified in terms of a terminal distribution and backward transition densities. Therefore, the equations required for implementation are given in detail. The backward recursion is implemented by a forward recursion with flipped time axis. That is, start by initialising the filter moments:μ whereafter the algorithm alternates between a backward prediction and update. If t n ∈ T D , then an update is performed according toH The prediction step is given byμ Finally, the marginal likelihood approximation is given by the prediction error decomposition (Schweppe, 1965 ) The smoothing parameters for the forward recursion are given by and the forward smoothing recursion is given by In all experiments, Fenrir uses a 5-times integrated Wiener process prior and a first-order linearisation of the vector field during the probabilistic numerical ODE solve when computing its physics-enhanced prior. Optimization Throughout all experiments, the L-BFGS method has been used for optimization with both Fenrir and RK (Nocedal & Wright, 2006) ; L-BFGS is also the optimizer of choice in the official ODIN code by Wenk et al. (2020) . The specific L-BFGS implementation is provided by the Optim.jl software package (Mogensen & Riseth, 2018) . In all experiment, the observation noise σ 2 and the diffusion κ are optimised in log-space. Parameter Initialization As done in ODIN, ODE parameters are initialised with a folded normal distribution, i.e. as the absolute value of a sample from standard normal Gaussian, and initial values are initialised with their noisy observation u(t 0 ), unless specified otherwise. Observation noise is always initialised as σ 2 = 1. Given data D = {u(t)} on the grid t ∈ T D , the considered "RK" baseline method minimizes the loss whereŷ(t) is computed with a classical Runge-Kutta initial value solver and H is the measurement matrix as introduced in Equation (2). In most experiments, the Tsit5 (Tsitouras, 2011) solver is used, with adaptive step-size selection for absolute and relative tolerances τ abs = 10 −8 , τ rel = 10 −6 . Only on the FitzHugh-Nagumo system we use the implicit RadauIIA5 (Hairer & Wanner, 1999) method, since we observed it to be more robust as some parameter settings can lead to stiff dynamics. Both solvers are provided by DifferentialEquations.jl (Rackauckas & Nie, 2017) . C.2. Additional Details on Section 5.1: "Parameter Inference from Fully Observed States" Definition C.1 (Trajectory RMSE). Letθ be the parameters estimated by an inference algorithm, and let T D be the set of measurement nodes. Then, letŷ(t), t ∈ T D , be the estimated system trajectory, computed by numerically integrating the ODE with initial values and parameters as given by the estimatedθ. The trajectory RMSE (tRMSE) is then defined as Lotka-Volterra The Lotka-Volterra model describes the dynamics of biological systems in which two species interact, one as a predator and the other as prey (Lotka, 1925; Volterra, 1928) . It is described by the ODEṡ y 1 = αy 1 − βy 1 y 2 , (52a) y 2 = −γy 1 + δy 1 y 2 . (52b) As ground truth, we assume an initial value y 0 = [5, 3] T and parameters α = 2, β = 1, γ = 4, δ = 1. The experimental data is generated on the equi-spaced time grid is computed via accurate, numerical simulation, and with noise v(t) ∼ N (0, σ 2 · I). We further consider two different noise levels σ 2 low = 0.01 and σ 2 high = 0.25. Thus, the full set of parameters to be estimated is θ = {y 0 , α, β, γ, δ, σ}, as well as the diffusion κ. In this system, we found it helpful to first optimize the noise and diffusion parameters σ, κ individually until convergence, and only then optimize all parameters jointly; such an approach is also chosen by the gradient matching method ODIN (Wenk et al., 2020) . Furthermore, as in the original experimental setup by Wenk et al. (2020), we consider bounds y 0 ∈ [0, 100] 2 , α, β, γ, δ ∈ [0, 100], σ 2 ∈ [10 −6 , 10 2 ], and additionally κ ∈ [10 −20 , 10 50 ]. Finally, a step-size of ∆ = 5 · 10 −3 is chosen for Fenrir's probabilistic numerical integration. The FitzHugh-Nagumo neuronal model (FitzHugh, 1955; Nagumo et al., 1962) is given by the ODĖ We consider ground-truth parameters a = 0.2, b = 0.2, c = 3.0, and a true initial value y 0 = [−1, 1] T . The experimental data is generated on the grid t i ∈ T D = {0.0, 0.5, . . . , 10.0}, by disturbing a high-confidence numerical simulation of the true trajectory with Gaussian noise v(t) ∼ N (0, σ 2 · I), for two noise levels σ 2 low = 0.005 and σ 2 high = 0.05. The full set of (hyper)parameters to be estimated by Fenrir is then θ = {y 0 , α, β, γ, δ, σ}, as well as the diffusion κ. All of which are jointly optimised via L-BFGS, while assuming bounds y 0 ∈ [−100, 100] 2 , a, b, c ∈ [0, 100], σ 2 ∈ [10 −6 , 10 2 ], and κ ∈ [10 −20 , 10 50 ]. Fenrir's physics-enhanced prior is computed with a step size ∆ = 10 −2 . C.3. Additional Details on Section 5.2: "Model Selection" The Lotka-Volterra model with ground-truth parameters as described in Appendix C.2 is extended to a set of four candidate models, via the following additional ODEs:ẏ 1 = αy 2 1 − βy 2 , (54a) y 2 = −γy 2 . (54b) By combining these two wrong equations with the true ODEs, we obtain for models M ij , with i, j ∈ {0, 1} indicating if the correct (1) or incorrect equation (0) has been used; for instance, M 01 contains Equation (54a) and Equation (52b). The experimental data is generated as described in Appendix C.2, with a "low" noise setting of σ 2 low = 0.01. All parameters are optimised jointly by Fenrir via L-BFGS, with bounds for parameters and initial values chosen as in Appendix C.2. The compartmental SEIR model (Hethcote, 2000) describes the fractions of a population that are susceptible (S), exposed (E), infected (I; i.e. diagnosed by a positive test), and recovered (R). It is given in as differential equationṡ S = −(β E · S · E + β I · S · I), (55a) with infection rates β E and β I , transition rate γ from exposure to infection, and recovery / death rate λ. Following Menda et al. (2021) , which used an extension of the SEIR model to explain COVID-19 outbreaks, we consider ground-truth parameters β I = 0, β E = 0.5, γ = 1/5, and λ = 1/21 (the latter two correspond to realistic estimates of transition and recovery rate in COVID-19, given by Lauer et al. (2020) ; Bi et al. (2020) ). Furthermore, we generate data on the time grid such that only the infected and recovered population is measured, and disturbed by Gaussian noise v(t) ∼ N (0, σ 2 · I) with σ 2 = 5 · 10 −4 . Instead of estimating the full initial state, we parameterize it by the initial exposed and infected population count: Thus, the parameters to be estimated by Fenrir in this experiment are θ = {E 0 , I 0 , β E , γ, λ, σ}, as well as the diffusion κ. All parameters are jointly optimised via L-BFGS, with bounds E 0 , I 0 , β E , γ, λ ∈ [0, 1], σ 2 ∈ [10 −6 , 10 2 ], and κ ∈ [10 −20 , 10 20 ]. In each experiment, ODE parameters β E , γ, λ are initialised as uniformly random; the starting values for E 0 , I 0 are initialised as absolute values of samples from a Gaussian N (0, 10 −2 ). Fenrir's probabilistic numerical integration is performed with a step size ∆ = 0.2. C.5. Additional Details on Section 5.4: "Dynamical Systems with Fast Oscillations" The considered pendulum system is given by a second-order ODEÿ = − g L sin(y), which can be transformed to the following first-order equationsẏ 1 = y 2 , (58a) with the gravity constant g = 9.81. We assume a ground-truth parameter L = 1 and an initial value y 0 = [0, π/2]. The observation data is generated as u(t i ) = 0 1 ·ŷ(t i ) + v i , with observation noise v(t i ) ∼ N (0, σ 2 ), σ 2 = 0.1, on the grid t i ∈ T D = {0.01 · i} 1000 i=0 . In the corresponding experiment, we found it to be beneficial to first optimize the noise σ and diffusion parameter κ, before jointly optimizing all model parameters θ = {y 0 , L, σ} and the diffusion κ. while assuming bounds y 0 ∈ [−100, 100] 2 , L ∈ [0, 100], σ 2 ∈ [10 −8 , 10 4 ], and κ ∈ [10 −20 , 10 50 ]. Finally, Fenrir's physics-enhanced prior is computed with a fixed step size ∆ = 0.1. A new look at the statistical model identification A comparison of approximate versus exact techniques for Bayesian parameter inference in nonlinear ordinary differential equation models Ordinary Differential Equations System identification -a survey Gaussian processes for Bayesian estimation in ordinary differential equations Nonlinear parameter estimation The iterated Kalman smoother as a Gauss-Newton method Parameter fitting in dynamic models A fresh approach to numerical computing Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen Nonlinear parameter estimation: a case study comparison Calibrated adaptive probabilistic ODE solvers Pick-and-mix information operators for probabilistic ODE solvers Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes Robust estimation for ordinary differential equation models Bayesian solution uncertainty quantification for differential equations Bayesian probabilistic numerical methods Statistical analysis of differential equations: introducing probability measures on numerical solutions Solving Ordinary Differential Equations I: Nonstiff Problems Probabilistic numerics and uncertainty in computations The mathematics of infectious diseases A new approach to linear filtering and prediction problems Differentiable likelihoods for fast inversion of 'likelihood-free' dynamical systems Convergence rates of Gaussian ODE filters Unifying political methodology: The likelihood theory of statistical inference Stable implementation of probabilistic ODE solvers Probabilistic ODE solutions in millions of dimensions The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems Controversy in mechanistic modelling with Gaussian processes. Proceedings of Bayesian Filtering for ODEs with Bounded Derivatives Estimation of ordinary differential equation models with discretization error quantification Explaining COVID-19 outbreaks with reactive SEIRD models Gaussian smoothing and asymptotic convexity Optim: A mathematical optimization package for Julia An active pulse transmission line simulating nerve axon Numerical Optimization A modern retrospective on probabilistic numerics Stochastic Differential Equations -An Introduction with Applications jl-a performant and feature-rich ecosystem for solving differential equations in julia Parameter estimation for differential equations: a generalized smoothing approach Maximum likelihood estimates of linear dynamic system Bayesian Filtering and Smoothing Applied Stochastic Differential Equations A probabilistic state space model for joint inference from differential equations and data