key: cord-0109295-ps0tlr6e authors: Schmidt, Jonathan; Kramer, Nicholas; Hennig, Philipp title: A Probabilistic State Space Model for Joint Inference from Differential Equations and Data date: 2021-03-18 journal: nan DOI: nan sha: ac95aa381f112336b8aeaeafaeb43200b31eeaa3 doc_id: 109295 cord_uid: ps0tlr6e Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak. Only data Only ODE Both (ours) Data ; Truth Mechanistic models based on ordinary differential equations (ODEs) are popular across a wide range of scientific disciplines. To increase the descriptive power of such models, it is common to consider parametrized versions of ODEs and find a set of parameters such that the dynamics reproduce empirical observations as accurately as possible. Algorithms for this purpose typically involve repeated forward simulations in the context of, e.g., Markov-chain Monte Carlo or optimization. The need for iterated computation of ODE solutions may demand simplifications in the model to meet limits in the computational budget. This work describes an algorithm that merges mechanistic knowledge in the form of an ODE with a non-parametric model over the parameters controlling the ODE -a latent force that represents quantities of interest. The algorithm then infers a trajectory that is informed by the observations but also follows sensible dynamics, as defined by the ODE, in the absence of observations ( Figure 1 ). The main insight enabling this approach is that if probabilistic ODE solvers use the language of (extended) Kalman filters, conditioning on observations and solving the ODE itself is possible in one and the same process of Bayesian filtering and smoothing. Instead of iterated computation of ODE solutions, a posterior distribution arises from a single forward simulation, which has complexity equivalent to numerically computing an ODE solution, once, with a filtering-based, probabilistic ODE solver [38] . Intuitively, one can think of this as opening up the black box ODE solver and acknowledging that each task -solving the ODE and discovering a latent force -is probabilistic inference in a state-space model. The main contribution of this work is formalizing this intuition. Several experiments empirically prove the efficiency and the expressivity of the resulting algorithm. In particular, a practical model for the dynamics of the COVID-19 pandemic is considered, in which a non-parametric latent force captures the effect of policy measures that continuously change the contact rate among the population. Let x : [t 0 , t max ] → R d be a process that is observed at a discrete set of points T OBS N := (t OBS 0 , ..., t OBS N ) through a sequence of measurements y 0:N := (y 0 , ..., y N ) ∈ R (N +1)×k . Assume that these measurements are subject to additive i.i.d. Gaussian noise, according to the observation model for n = 0, ..., N and matrices H ∈ R k×d and R ∈ R k×k . Further suppose that x(t) solves the ODĖ and satisfies the initial condition x(t 0 ) = x 0 ∈ R d . The vector field f : R d × R → R d is assumed to be autonomous, which is no loss of generality (e.g. [21] ) but simplifies the notation. The latent force u : [t 0 , t max ] → R parametrizes f and shall be unknown. SIR-type models (e.g. [9] ) are the canonical choice of mechanistic model class to describe the evolution of the COVID-19 pandemic. In SIR-type models, a population partitions into a discrete set of compartments. The differential equation then describes the transition of counts of individuals between these compartments. For example, the SIRD model [13] formulates the transitions between susceptible, infectious, recovered, and deceased people aṡ governed by contact rate β(t) : [t 0 , t max ] → [0, 1], recovery rate γ ∈ [0, 1], and mortality rate η ∈ [0, 1] ( Figure 2 ). S, I, R, and D evolve over time, but the total population P (as the sum of the compartments) is assumed to remain constant. In this context, the contact rate β(t) is the latent force and varies over time (in the notation from Eq. (2), β is u). A time-varying contact rate provides a model for the impact of governmental measures on the dynamics of the pandemic. For simplicity, we will assume that γ and η are fixed and known. In this SIRD setting, the goal is to infer an (approximate) joint posterior over β(t) and the dynamics of S(t), I(t), R(t), and D(t) as well as to use the reconstructed dynamics to extrapolate into the future. Section 3 explains the conceptual details, Section 4 distinguishes the method from related work, and Section 5 evaluates the performance. This section explains how to infer the unknown process u(t) and the ODE solution x(t) in a single forward solve. Section 3.1 defines the prior model, Section 3.2 describes the probabilistic numerical ODE inference setup, and Section 3.3 describes approximate Gaussian filtering and smoothing in this context. Section 3.4 summarizes the resulting algorithm. The exposition of classic concepts here is necessarily compact. In-depth introductions can be found, e.g., in the book by Särkkä and Solin [33] . Let ν ∈ N. Define two independent Gauss-Markov processes U : [t 0 , t max ] → R and X : [t 0 , t max ] → R d(ν+1) that solve the linear, time-invariant stochastic differential equations [28] , with drift matrices F U ∈ R × and F X ∈ R d(ν+1)×d(ν+1) , as well as dispersion matrices L U ∈ R ×s and L X ∈ R d(ν+1)×d . W U : [t 0 , t max ] → R s and W X : [t 0 , t max ] → R d are Wiener processes. U and X satisfy the Gaussian initial conditions, defined by m U ∈ R , P U ∈ R × , m X ∈ R d(ν+1) , and P U ∈ R d(ν+1)×d(ν+1) . U(t) models the unknown function u(t) and can be any Gauss-Markov process that admits a representation as the solution of a linear SDE with Gaussian initial conditions. In other words, the first element in X(t) is an estimate for x(t), the second element is an estimate for d dt x(t), et cetera. Encoding that the state X consists of a model for x(t) as well as its first ν derivatives imposes structure on F X and L X (see e.g. [20] ). Examples include the Matérn, integrated Ornstein-Uhlenbeck, and integrated Wiener processes; the canonical choice for probabilistic ODE solvers would be integrated Wiener processes [34, 38, 5, 21] . The class of Gauss-Markov priors inherits its wide generalizability from Gaussian process models; recall that Gauss-Markov processes like U and X are Gaussian processes with the Markov property. While not every Gaussian process with one-dimensional input space is Markovian, a large number of descriptions of Gauss-Markov processes emerge by translating a covariance function into an (approximate) SDE representation [33, Chapter 12] . For example, this applies to (quasi-)periodic, squared-exponential, or rational quadratic kernels; in particular, sums and products of Gauss-Markov processes admit a state-space representation [35, 33] . Recent research has considered approximate SDE representations of general Gaussian processes in one dimension [23] . With these tools, prior knowledge over U or X can be encoded straightforwardly into the model. A functional relationship between the processes U(t), X(t) and the data y 0:N emerges by combining two likelihood functions: one for the observations y 0:N (recall Equation (1)), and one for the ordinary differential equation. The present section formalizes both. Let T = T OBS N ∪ T ODE M be the union of the observation-grid T OBS N , which has been introduced in Section 2, and an ODE-grid T ODE M := (t ODE 0 , ..., t ODE M ) . The name "ODE-grid" expresses that this grid contains the locations on which the ODE information will enter the inference problem, as described below. T OBS N contains the locations of y 0:N , in light of which the first of two observation models is for n = 0, . . . , N . This is a reformulation of the relationship between process x and observations y 0:N in Eq. (1) in terms of X (instead of x, which is modeled by X (0) ). Including this first measurement model ensures that the inferred solution remains close to the data points. T ODE M contains the locations on which U(t) connects to X(t) through the ODE. Specifically, the set of random variables Z 0:M ∈ R (M +1)×d , defined as where δ is the Dirac delta, describes the discrepancy between the current estimate of the derivative of the ODE solution (i.e. X (1) ) and its desired value (i.e. f (X (0) ; U)), as prescribed by the vector field f . If the random variables Z 0:M realize small values everywhere, X (0) solves the ODE as parametrized by U . This motivates introducing artificial data points z 0:M ∈ R (M +1)×d that are equal to zero, z m = 0 ∈ R d , m = 0, ..., M . Conditioning the stochastic processes X and U on attaining this (artificial) zero data ensures that the inferred solution follows ODE dynamics throughout the domain. Figure 3 shows the discretized state-space model. Figure 3 : Instance of the described state-space model, visualized as a directed graphical model. Shaded variables are observed. Either only data, only mechanistic knowledge, or both sources of information can be conditioned on during inference (recall Figure 1 ). Both X and U enter the likelihood in Eq. (7) through a possibly non-linear vector field f . Therefore, the posterior distribution (recall z 0: is intractable, but can be approximated efficiently. Even though the problem is discretized, the posterior distribution is continuous [33, Chapter 10] . There are mainly two approaches to computing a tractable approximation of the intractable posterior distribution in Eq. (8): approximate Gaussian filtering and smoothing [32] , which computes a cheap, Gaussian approximation of this posterior, and sequential Monte Carlo methods [27] , whose approximate posterior may be more descriptive, but also more expensive to compute. Like the literature on probabilistic ODE solvers [38, 5] , this work uses approximate Gaussian filtering and smoothing techniques for their low computational complexity. The continuous-discrete state-space model inherits its non-linearity from the ODE vector field f . Linearizing f with a first-order Taylor series expansion creates a tractable inference problem; more specifically, it gives rise to the extended Kalman filter (EKF) [16, 25] . Loosely speaking, if the random variable Z is large in magnitude, then X and U are poor estimates for the ODE and its parameter. An EKF update, based on the first-order linearization of f , approximately corrects this misalignment. If sufficiently many ODE measurements z 0:M are available, a sequence of such updates preserves sensible ODE dynamics over time. An alternative to a Taylor-series linearization is the unscented transform, which yields the unscented Kalman filter [40, 17] . The computational complexity of both algorithms is linear in the number of grid points and cubic in the dimension of the state-space. Detailed implementation schemes can be found, for instance, in the book by Särkkä [32] . The EKF approximates the filtering distribution It describes the current state of the system given all the previous measurements and allows updates in an online fashion as soon as new observations emerge. If desired, the Rauch-Tung-Striebel smoother turns the filtering distribution into an approximation of the full (smoothing) posterior (in Eq. (8)). In doing so, all observations -that is, measurements according to both Eq. (6) and Eq. (7) -are taken into account for the posterior distribution at each location t. As special cases, this setup recovers: (i) a Kalman filter/Rauch-Tung-Striebel smoother [18] if the ODE likelihood (Eq. (7)) is omitted; (ii) an ODE solver [38] , if the data likelihood (Eq. (6)) is omitted. In the present setting, however, both likelihoods play an important role. The procedure is summarized in Algorithm 1. The prediction step is determined by the prior and is available in closed-form (Appendix A.2). At times at which data is observed according to the linear Gaussian measurement model in Eq. (6), the update step follows the rules of the standard Kalman filter. Before updating on pseudo-observations according to the ODE likelihood (Eq. (7)), the non-linear measurement model is linearized at the predicted mean. More details are provided in Algorithm 1 Compute the filtering distribution by conditioning on both y 0:N and z 0:M . M then linearize measurement model and update X j and U j on z j end if [Eq. (7)] end for the supplementary material. The filtering distribution can be turned into a smoothing posterior by running a backwards-pass with a Rauch-Tung-Striebel smoother (e.g. [32] ). The computational cost of obtaining either, the filtering or the smoothing posterior, are both linear in the number of grid points and cubic in the dimension of the state- Only a single forward-backward pass is required. If desired, the approximate Gaussian posterior can be refined iteratively by means of posterior linearization and iterated Gaussian filtering and smoothing, which yields the maximum-a-posteriori (MAP) estimate [3, 37] . The experiments presented in Section 5 show how a single forward-backward pass already approximates the MAP estimate accurately. Latent forces and ODE solvers: The explained method closely relates to probabilistic ODE solvers and latent force models [42] , especially the kind of latent force model that exploits the state-space formulation of the prior [12] . The difference is that, in the spirit of probabilistic numerical algorithms, the mechanistic knowledge in the form of an ODE is injected through the likelihood function instead of the prior. A similar approach of linking derivative observations to mechanistic constraints has previously been used in gradient matching [6, 41] . Probabilistic ODE solvers have been used by Kersting et al. [19] for efficient ODE inverse problem algorithms, but their approach is different to the present algorithm, in which the need for iterated optimization or sampling is avoided altogether. Monte Carlo methods: (Markov-chain) Monte Carlo methods are also able to infer a time-dependent ODE latent force from a set of state observations. Options that are compatible with a setup similar to the present work would include sequential Monte Carlo techniques [27] , elliptical slice sampling [26] , or Hamiltonian Monte Carlo [4] (for instance realized as the No-U-Turn sampler [15] ). The shared disadvantage of Monte Carlo methods applied to the resulting ODE inverse problem is that the complexity of obtaining a single Monte Carlo sample is of the same order of magnitude as computing the full Gaussian approximation of the posterior distribution. In Appendix B we show results from a parametric version of the SIRD-latent force model (using the No-U-Turn sampler as provided by NumPyro [29] ). This sampler requires thousands of numerical ODE solutions, compared to the single solve of our method. In other words, the algorithm in the present work poses an efficient yet expressive alternative to Monte Carlo methods for approximate inference with dynamical systems. This section describes three blocks of experiments. 1 All experiments use a conventional, consumerlevel CPU. First, a range of artificial datasets is generated by sampling ODE parameters from a prior state-space model and simulating a solution of the corresponding ODE. Inference in such a controlled environment allows comparing to the ground truth, thereby assessing the quality of the approximate inference. We consider three ODE models to this end. Second, a COVID-19 dataset will probe the predictive performance of the probabilistic model and the resulting approximate posterior distribution. Third, some changes to the model from the COVID-19 experiments, for instance, ensuring that the number of case counts must be positive, will improve the interpretability (for example, of the credible intervals). Controlling the range of values that the prior state-space can realize introduces additional non-linearity into the model -which can also be locally approximated by the EKF -and makes the solution more physically meaningful. As a first test for the capabilities of the proposed method, we consider three simulated environments. To this end, the training data is generated as follows. The starting point is always an initial value problem with dynamics defined by a vector field f and a Gauss-Markov prior over the dynamics x and the unknown parameters u of the vector field. Then, (i) we sample the time-varying parameter trajectories from the Gauss-Markov prior; (ii) we solve the ODE, as parametrized by the sampled trajectories from (i), using LSODA [14] with adaptive step sizes using SciPy [39] ; (iii) we subsample the ground-truth solution on a uniform grid (which will become T OBS N ) to generate artificial state observations y 0:N ; (iv) we add Gaussian i.i.d. noise to the observations. The procedure described above generates both a ground truth to compare to and a noisy, artificially observed data set. Given such a set of observations, Algorithm 1 computes a posterior distribution over the true trajectories under appropriate model assumptions. In this posterior, we look for the proximity of the mean estimate to the underlying ground truth; the closer, the better. We measure this proximity in the root-mean-square error. Furthermore, the width of the posterior (expressed by the posterior covariance) should deliver an appropriate quantification of the mismatch. We report the χ 2 -statistic [2] , which suggests that the posterior distribution is well-calibrated if the χ 2 -statistic is close to the dimension d of the ground truth. Three mechanistic models serve as examples. Van-der-Pol: The first of three test problems is the van-der-Pol oscillator [11] . It has one parameter µ (sometimes referred to as a stiffness constant, because for large µ, the van-der-Pol system is stiff). As a prior over the dynamics we choose a twice-integrated Wiener process with diffusion intensity σ 2 X = 300. The stiffness parameter µ is modeled as a Matérn-3 /2 process with lengthscale U = 10 and diffusion intensity σ 2 U = 0.3. The posterior is computed on a grid from t 0 = 0 to t max = 25 units of time with step size ∆t = 0.025. Lotka-Volterra: The Lotka-Volterra equations [24] describe the change in the size of two populations, predators and prey. There are four parameters, which we call a, b, c, and d, which describe the interaction and death/reproduction rates of the populations. As a prior over the dynamics we choose a twice-integrated Wiener process with diffusion intensity σ 2 X = 10. The four parameters are modeled as Matérn-3 /2 processes with lengthscales Ua = U b = Uc = U d = 40. The diffusion intensities are σ 2 Ua = σ 2 Uc = 0.01 and σ 2 U b = σ 2 U d = 0.001. The posterior is computed on a grid from t 0 = 0 to t max = 60 units of time with step size ∆t = 0.1. As detailed in Section 2, the SIRD model is governed by a contact rate β(t). Recall that we assume a time-dependent β to account for governmental measures in reaction to the spread of COVID-19. The recovery rate γ and fatality rate η are fixed at γ = 0.06 and η = 0.002, like they will be in the experiments with real data in Sections 5.2 and 5.3 below. As a prior over the dynamics we choose a twice-integrated Wiener process with diffusion intensity σ 2 X = 50. The contact rate β is modeled as a Matérn-3 /2 process with lengthscale U = 14 and diffusion intensity σ 2 U = 0.1. The posterior is computed on a grid from t 0 = 0 to t max = 100 units of time with step size ∆t = 0.1. The model allows for straightforward restriction of parameter values by using link functions. The natural support for the SIRD-contact rate is the interval [0, 1], but U(t), as a Gauss-Markov process, takes values on the real line. A change in the basis of β(t) with a logistic sigmoid function ϑ before it enters the likelihood fixes this misspecification. Similarly, the Lotka-Volterra parameters are inferred in log-space to ensure positivity. It is an appealing aspect of the EKF that these non-linear transformations do not require significant adaptation of the algorithm. Instead, the EKF treats it as merely another level of linearization of Eq. (7). Section 5.3 extends this to the state dynamics. The results are shown in Figure 4 . On all test problems, the algorithm recovers the true states and the true latent force accurately. The recovery is not exact, which shows how the Gaussian posterior is only an approximation of the true posterior. The χ 2 -statistic for the van-der-Pol stiffness parameter µ is 1.11, which lies in (0.0039, 3.8415), the 90% confidence interval of the χ 2 distribution with 1 degree of freedom. The root-mean-square error (RMSE) to the truth is 0.14. The χ 2 -statistic for the Lotka-Volterra parameters is 8.06, which lies in (0.7107, 9.4877), the 90% confidence interval of the χ 2 distribution with 4 degrees of freedom. The RMSE to the truth is 0.04 in log space and 0.018 in linear space. The χ 2 -statistic for the contact rate β is 0.91, which lies in (0.0039, 3.8415), the 90% confidence interval of the χ 2 distribution with 1 degree of freedom. The RMSE to the truth is 0.2 in logit space and 0.033 in linear space. We continue with the SIRD model introduced in Eq. (3), now using data collected in Germany over the period from January 22, 2020, to May 27, 2021. Throughout the pandemic, the German government has imposed mitigation measures of varying severity. Together with seasonal effects, summer vacations, etc., they caused a continual change in the contact rate. The next experiments aim to recover said contact rate (and the SIRD counts) from the German dataset. The Center for Systems Science and Engineering at the Johns Hopkins University publishes daily counts of confirmed (y confirmed n ), recovered (y recovered n ), and deceased (y deceased n ) individuals [7] . One can transform this data to suit the SIRD model The counts I n , R n , and D n are available for each day, starting with January 22, 2020. Assuming a constant population over time, the numbers of susceptible individuals S n are always evident from the other quantities, thus left out of the visualizations. We fix the population at P = 83 783 945, based on public record. We rescale the data to cases per one thousand people (CPT). As a prior over X(t), due to its popularity in constructing probabilistic ODE solvers [38] , we assume a twice-integrated Wiener process. β(t) is modelled as a Matérn-3 /2 process with length scale q = 75 and diffusion intensity σ 2 q = 0.05. The state-space model is straightforwardly extendable to sums and products of (more) processes [35, 33] . As described in Section 5.1, the contact rate is inferred in logit space. We shift the logistic sigmoid function such that it fulfills ϑ(0) = 0.1 in which case the stationary mean U = 0 translates to a stationary mean ϑ(U) = β = 0.1 of the Matérn process that models the contact rate. The recovery rate and mortality rate are considered known and fixed at γ = 0.06 and η = 0.002 to isolate the effect of the inference procedure on recovering the evolution of the contact rate U(t) = β(t). We set the mean of the Gaussian initial conditions to the first data point that is available. The diffusion intensity of the prior process X(t) is set to σ 2 X = 10. The latent process U and all derivatives are Figure 5 : Estimated counts of infectious cases and contact rate based on real COVID-19 data. The case counts of infectious people are scaled to cases per thousand (cpt). The uncertainty over the contact rate increases when the case counts are low. After a single forward solve, the inferred mean is already close to the MAP estimate. The shaded areas show the 95 % credible interval. Hard lockdown, stringent contact restrictions 6 First nationwide decree of restrictions, increased intensification of measures initialized at zero. Note that due to the logistic sigmoid transform, an initial value U 0 = 0 amounts to an initial contact rate β 0 = 0.1. In the present scenario, we cannot take the SIRD model as an accurate description of the underlying data but merely as a tool that aids the inference engine in recovering physically meaningful states and forces. In order to account for this model mismatch, the Dirac likelihood from Eq. (7) is relaxed towards a Gaussian likelihood with measurement noise λ 2 = 0.01. This equals the data observation noise and thus balances the respective impact of either (misspecified) source of information. Intuitively, adding ODE measurement noise reduces how strictly the vector field dynamics are enforced during inference and therefore avoids overconfident estimates of β(t). The mesh-size of the ODE is ∆t = 1 /24 days, i.e. ODE updates are computed on an hourly basis. The final 14 observations are excluded from the training set to serve as validation data for evaluating the extrapolation behavior of the proposed method. Figure 5 shows the results. The mean of the state X estimates the case counts accurately in both interpolation and extrapolation tasks. The estimated contact rate rapidly decreases around late March, remains low until fall, increases momentarily, and is dampened again soon after. This aligns with a set of political measures imposed by the government (compare Figure 5 to Table 1 ). The uncertainty over the estimated contact rate is high in the early beginning when the case counts are still low. It then increases again in summer and with the beginning of the extrapolation phase. If the experiment is taken as-is, the credibility intervals of the posterior over X(t) include negative numbers (mostly where the case counts are low and the uncertainty high, and when extrapolating). Of course, in a system that models counts of people in different stages of a disease, negative numbers should be excluded altogether. The proposed method provides straightforward means to address this issue. Section 5.3 explains the details. The following experiment evaluates how the proposed method performs in combination with a statespace model that constrains the support of the dynamics. Concretely, let X(t) model the logarithm of the SIRD dynamics and the respective derivatives. With a slight abuse of notation, we will continue writing "X" even though it lives in a different space than in the previous sections. The structure of the dynamic model is the same. The diffusion intensity of the prior process X(t) is σ 2 X = 0.05. The diffusion is not comparable to the value in the previous section because the state dynamics moved to log-space. Using d dt exp(x(t)) = exp(x(t))ẋ(t), the ODE likelihood becomes with auxiliary quantities (recall the logistic sigmoid ϑ) The exponential function introduces an additional non-linearity into the state-space model, which necessitates smaller step-sizes for the ODE measurements (see below). The observed case count data y 0:N is transformed into the log-space, too, in which we assume additive, i.i.d. Gaussian noise. On the one hand, transforming the measurements into log-space implies that the measurement model for the counts remains linear; on the other hand, it imposes a log-normal noise model (if viewed back in "linear space"). Log-normal noise underlines how the estimated states cannot be negative. Again, we scale the counts to cases per thousand. As depicted in Figure 6 , the reconstruction of the driving processes in this setting yields results that at first glance, look similar to the previous experiment. The states match the data points well. However, the extrapolation is more realistic in that the credible intervals encode that negative values are impossible (which is due to the log-transform). The mean of the recovered contact rate closely resembles the estimate of the previous experiment. Again, upon implementation of strict governmental measures, the uncertainty decreases, whereas in the context of relaxations, the uncertainty is high. By coupling mechanistic and data-driven inference so directly, the algorithm builds on the core premise of probabilistic numerics -that computation itself is a data source that does not differ, formally, from observational data. Information from observations and mechanistic knowledge (in the form of an ODE) can thus be described in the same language of Bayesian filtering and smoothing. This removes the need for an outer loop over multiple forward solves and thus drastically reduces the computational cost. Our experimental evaluation corroborates that the resulting approximate posterior is close to the ground truth and drastically reduces computational cost over Monte Carlo alternatives. It faithfully captures multiple sources of uncertainty from the data, numerical (discretization) error, and epistemic uncertainty about the mechanism. We hope this framework helps empower practitioners, not just by reducing computational burden but also by providing a more flexible modelling platform. This section provides detailed information about the state-space model and approximate Gaussian inference therein. Appendix A.1 defines the augmented state-space model that formalizes the dynamics of the Gauss-Markov processes introduced in Section 3.1. Appendix A.2 provides the equations for prediction and update steps of the extended Kalman filter in such a setup, which is described in Section 3.4 (in particular, Algorithm 1). Section 3 describes the joint inference of both a latent process u(t) : [t 0 , t max ] → R l that parametrizes an ODE and x(t) : [t 0 , t max ] → R d , the solution of said ODE. The dynamics of the processes are modeled by the stochastic differential equation with Gaussian initial conditions The block-diagonal structure is due to the independent dynamics of the prior processes. The drift matrices F U and F X , as well as the dispersion matrices L U and L X depend on the choice of the respective processes U and X. The measurement models are given in Eq. (6) (for observed data) and in Eq. (7) (for ODE measurements). In the experiments presented in Sections 5.2 and 5.3 we model the latent contact rate β(t) as a Matérn-3 /2 process with characteristic length scale q . Hence, The SIRD counts are modeled as the twice-integrated Wiener process such that X = X (0) , X (1) , X (2) models the SIRD counts and the first two derivatives. Notice that F X ∈ R d(ν+1)×d(ν+1) and L X ∈ R d(ν+1)×d are block matrices. I d denotes the d × d identity matrix. In the context of the experiments, d = 4 (S, I, R, and D) and ν = 2 (twice-integrated Wiener process). More details on the use of integrated Wiener processes in probabilistic ODE solvers can be found in, for instance, the work by Kersting et al. [20] . This section is concerned with the exact steps that make up the algorithm summarized in Section 3.4. The stochastic differential equation defined in Eq. (A.1) formalizes the dynamics of the processes U(t) and X(t) that model u(t) and x(t), respectively. Define ∆t := t j − t j−1 > 0 for all t j = t 1 , ..., t max . The transition densities of U and X are [10] U(t + ∆t) | U(t) ∼ N (Φ U (∆t)U(t), Q U (∆t)), (A.5a) X(t + ∆t) | X(t) ∼ N (Φ X (∆t)X(t), Q X (∆t)), where transition matrices Φ U (∆t) ∈ R × and Φ X (∆t) ∈ R d(ν+1)×d(ν+1) , as well as the process noise covariances Q U (∆t) ∈ R × and Q X (∆t) ∈ R d(ν+1)×d(ν+1) are available in closed form and can be computed, for instance, with matrix fraction decomposition [36, 1] . Define the transition matrix and process noise covariance of the process in Eq. (A.1) as for time points t j ∈ T = T OBS ∪ T ODE . The predicted mean and covariance m − j and P − j are for given initial conditions m 0 , P 0 . The prediction step is the same, for both t j ∈ T OBS and t j ∈ T ODE . As detailed in Section 3 , two different update steps are defined for two kinds of observations. When observing data y 0:N , i.e. t n ∈ T OBS , the update step follows the rules of a standard Kalman filter. The updated mean m n and covariance P n at time t n are computed as v n = y n − Hm − n , (A.10) The matrices H and R are defined as in Eq. (6) in the paper. Recall the ODE measurement model from Eq. (7), which we here denote as h, as At locations t m ∈ T ODE , we condition on the ODE measurements z 0:M . Recall that these pseudoobservations are all zero. According to Eq. (10.79) in the book by Särkkä and Solin [33] , where [Dh(m − m )] denotes the Jacobian of h at m − m . In the case of a Dirac likelihood (see Eq. (7)), λ 2 = 0 holds. For numerical stability (especially for λ 2 = 0) one can instead implement square-root filtering (see, e.g., [10, 21] ). All experiments in Section 5 use square-root filtering. This section first introduces a functional form for β(t) that connects to the non-parametric model introduced in Section 3. Then, a generative model for Markov-chain Monte Carlo (MCMC) inference over the unknown parameters of β(t) is set up. We establish a parametric model for the latent, time-varying contact rate in an SIRD model in terms of Fourier features. In light of Mercer's theorem and the fact that stationary covariance functions have complex-exponential eigenfunctions [31, Chapter 4.3] , this closely connects to the Matérn-3 /2 process used in Sections 5.2 and 5.3 (see also [30] ). Concretely, we proceed as follows. Let T denote a dense time grid. First, (i) compute the kernel Gram matrix K on T, such that (K) ij = k(x i , x j ) with x i , x j ∈ T. k is the Matérn-3 /2 covariance function. Figure 7 : Estimated counts of infectious cases and contact rate. The estimates are obtained from MCMC sampling in an SIRD model with a parametric function for the contact rate β(t). The case counts of infectious people are scaled to cases per thousand (cpt). The shaded areas show the 95 % credible interval. Compared to the non-parametric approach presented in the paper, the estimate over β(t) is very confident in general. The posterior mean closely resembles the results obtained in Sections 5.2 and 5.3. The numbered markers in the right plot are explained in Table 1 in the paper. As in the experiments before, we set the characteristic lengthscale to = 75. Then, (ii) compute the eigendecomposition of K. In order to keep the dimensionality of the inference problem feasible, select r |T| eigenvectors that correspond to the r largest eigenvalues of K. In this experiment, we choose r = 25. (iii) For each eigenvector, the strongest frequency component ω is determined by the discrete Fourier decomposition. This yields a set of frequencies {ω i : i = 1, . . . , r}. Finally, the parametric model is defined as the sum of parametrized Fourier features of the form where ϑ is the logistic sigmoid function as described in Section 5. We aim to compute a posterior contact rate β(t) by MCMC inference over the coefficients a i and b i , i = 1, . . . , r. To this end, we define a prior over the parameter vector θ := (a 1 , b 1 , . . . , a r , b r ) and a likelihood for the COVID-19 case counts y 0:N with respect to θ. In order to ensure non-negative case counts, as in Section 5.3, we assume log-normally distributed measurements with i.i.d. noise LogNormal y n ; log x (θ) (t n ) , σ 2 I 2r , where σ 2 is inferred from the data along with θ. x (θ) (t n ) denotes the solution of the SIRD system at time t n , parametrized by the vector of coefficients θ through the contact rate from Eq. (B.1). Notably, each evaluation of the likelihood involves numerically integrating the SIRD system, which significantly increases the computational cost entailed by the inference algorithm. This is done by NumPyro's DOPRI-5 implementation [29, 8] . The prior distributions over the Fourier-feature coefficients and over σ 2 are chosen as p(θ) = N (θ; µ θ , Σ θ ) , p(σ 2 ) = HalfCauchy(σ 2 ; 0.01). The mean µ θ of the prior over θ is set to a maximum-likelihood estimate by minimizing the negative logarithm of Eq. (B.2) with SciPy's L-BFGS optimization algorithm [39, 22] . The covariance is chosen as Σ θ = 0.1 · I 2r . The goal of the experiment is to compute a posterior over the coefficients θ (and the measurement covariance σ 2 ) that is comparable to the results obtained in Sections 5.2 and 5.3. Like before, recovery rate and fatality rate are assumed fixed and known at γ = 0.06 and η = 0.002. We compute the posterior p(θ | y 0:N ) using NumPyro's implementation of the No-U-Turn sampler [15] . Discrete-time solutions to the continuous-time differential Lyapunov equation with applications to Kalman filtering Estimation With Applications to Tracking and Navigation: Theory Algorithms and Software The iterated Kalman smoother as a Gauss-Newton method A conceptual introduction to Hamiltonian Monte Carlo Calibrated adaptive probabilistic ODE solvers Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes An interactive web-based dashboard to track COVID-19 in real time A family of embedded Runge-Kutta formulae Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy Kalman Filtering: Theory and Practice Using MATLAB Dynamics of the van der Pol equation State-space inference for non-linear latent force models with application to satellite orbit prediction The mathematics of infectious diseases LSODA, ordinary differential equation solver for stiff or non-stiff system The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo Stochastic Processes and Filtering Theory Unscented filtering and nonlinear estimation A new approach to linear filtering and prediction problems Differentiable likelihoods for fast inversion of 'likelihood-free' dynamical systems Convergence rates of Gaussian ODE filters Stable implementation of probabilistic ODE solvers On the limited memory BFGS method for large scale optimization General linear-time inference for Gaussian processes on one dimension The growth of mixed populations: two species competing for a common food supply Stochastic Models, Estimation, and Control Elliptical slice sampling Elements of sequential Monte Carlo. Foundations and Trends® in Machine Learning Stochastic Differential Equations Composable effects for flexible and accelerated probabilistic programming in NumPyro Random features for large-scale kernel machines Gaussian Processes for Machine Learning Bayesian Filtering and Smoothing Applied Stochastic Differential Equations A probabilistic model for the numerical solution of initial value problems Explicit link between periodic covariance functions and state space models Optimal Control and Estimation Iterative filtering and smoothing in nonlinear and non-Gaussian systems using conditional moments Probabilistic solutions to ordinary differential equations as nonlinear Bayesian filtering: a new perspective SciPy 1.0: fundamental algorithms for scientific computing in Python The unscented Kalman filter for nonlinear estimation ODIN: ODEinformed regression for parameter and state inference in time-continuous dynamical systems provides the sources used to list the governmental measures in Table 1. In order to provide reliable sources, we refer to the official press releases, as published by the German government. For each policy change, we provide a very brief idea of the imposed measures and official sources by the German government Citizens are urged to restrict social contacts as much as possible and the formation of groups is sanctioned in public spaces as well as at home The government puts the federal states in charge of appropriately relaxing the imposed measures. Different states handle the situation differently, according to the respective incidences The population is again urged to restrict contacts if possible One week later, new light restrictions are imposed. The number of people allowed in social gatherings is limited, according to local incidences Across the country, the number of people allowed in social gatherings is limited to ten, where the number of households present must not exceed two. Most of public services are closed or offered only virtually Across the country, the number of people allowed in social gatherings is limited to five, where the number of households present must not exceed two. Except for stores of systemic importance, the retail sector is mostly shut down the German government decides on a nationwide decree of measures to come into effect on the following day The authors gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence "Machine Learning -New Perspectives for Science", EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting N. Krämer. Moreover, the authors thank Nathanael Bosch and Marvin Pförtner for valuable discussions.