key: cord-0522474-xsb1flfj
authors: Marcondes, Diego
title: Robust parameter estimation in dynamical systems via Statistical Learning with an application to epidemiological models
date: 2020-07-28
journal: nan
DOI: nan
sha: 4ec9ffd0e2c54676b04a631523792afffd6f8ce0
doc_id: 522474
cord_uid: xsb1flfj

We propose a robust parameter estimation method for dynamical systems based on Statistical Learning techniques, which aims to estimate a set of parameters that well fit the dynamics, in order to obtain robust evidence about the qualitative behaviour of its trajectory. The method is quite general and flexible, since it does not rely on any specific property of the dynamical system, and represents a mathematical formalisation of the procedure of sampling and testing parameters, in which evolutions generated by candidate parameters are tested against observed data to assess goodness-of-fit. The Statistical Learning framework introduces a mathematically rigorous scheme for this general approach to parameter estimation, adding to the great field of parameter estimation in dynamical systems. The method is especially useful for estimating parameters in epidemiological compartmental models. We illustrate it on simulated and real data about the spread of COVID-19 in the US, in order to qualitatively assess the peak of deaths by the disease.

Dynamical systems are important tools in Applied Mathematics for modelling phenomena studied by many branches of science, such as physics, cosmology, biology, epidemiology, medicine, chemistry and engineering [6, 15, 16, 19, 21], being applied to describe the deterministic evolution in time of certain processes starting from a fixed initial condition. The equations which govern this deterministic evolution are specific to each application, and depend on some unknown parameters related to the phenomena being modelled.
For example, dynamical systems applied to physics and cosmology usually depend on universal constants, while compartmental models in epidemiology depend on rates related to the spread of infectious diseases. In our framework, a parameter is good if the evolution it generates well fits the dynamics, and is bad otherwise. The method aims to find good parameters among candidate ones by sampling and testing parameters. To this end, we apply Statistical Learning techniques to determine an upper bound for the sample size needed to find a proportion of at least 1 − c, 0 < c < 1, of the good parameters with high probability, which is a measure of the computational complexity of the approach. The method is quite general, since it does not rely on any specific property of the dynamical system, being in principle applicable to multidimensional systems with real-valued parameters. Observe that the agenda the method seeks to carry out is much more modest than trying to identify exactly the value of the optimal parameter, since its aim is only to find good parameters in a pre-defined set. This is more likely to be achieved in practice, especially for processes which are only approximated by dynamical systems or whose parameters change from time to time. Also, there is no need to fix a likelihood function for the process, nor prior distributions for the parameters, as is done in common Bayesian approaches to parameter estimation [5, 23]. The Statistical Learning approach is especially suitable when modelling disease spread by compartmental epidemiological models. We illustrate, in an example with real data about the spread of COVID-19 in the US, how this method can be quite useful in obtaining qualitative information about the evolution of the process, which in the case of an infectious disease may be its peak of deaths, in quite complex models.
Indeed, we will see how the parameters of a SEIR model change from time to time due to the enforcement of measures to slow the disease spread, demanding that they be constantly refitted, which in itself can provide evidence about the disease spread and the effectiveness of measures to overcome it. In Section 2 we discuss the problem of robustly estimating parameters in dynamical systems from the evolution of the system in an interval. In Section 3 we propose a method based on Statistical Learning techniques to perform such robust estimation. In Section 4 we present a couple of examples of the proposed method: a SIR model on a simulated dataset, and a SEIR model for the spread of COVID-19 in the US. In Section 5 we give our final remarks. Let {{X_θ(t)}_{t∈N} : θ ∈ Θ} be a family of discrete time dynamical systems, with a common metric phase space (Ω, d), indexed by parameters θ ∈ Θ. The parametric space Θ may be such that Θ ⊆ R^d, d ≥ 1, when the parameters are time independent, or Θ ⊆ N × R^d, when the parameters are time dependent. We assume that the initial condition is the same for all systems in the family: X_θ(0) = x_0 ∈ Ω for all θ ∈ Θ. The problem of parameter estimation in dynamical systems considered in this paper arises when one observes the time evolution of a process X_{θ*}(t) for t ∈ {1, . . . , T}, with a fixed T ≥ 1, and wants to identify the parameter θ* ∈ Θ which generated such evolution. On the one hand, for T ∈ N, the map φ_T, which maps each parameter θ ∈ Θ to the evolution {X_θ(t)}_{t=1}^T, is in general not invertible, i.e., θ* is not identifiable, so the evolution until a time T does not uniquely define the system. Furthermore, the set φ_T^{−1}({X_{θ*}(t)}_{t=1}^T) may not be computable in real problems, so one cannot identify a subset of candidate parameters by inverting such a map.
On the other hand, it is not computationally feasible to test all parameters in Θ, when it has infinitely many elements, or to perform an efficient grid search of Θ, when it is multidimensional, to find candidate parameters. Precisely estimating parameters of a dynamical system from the time evolution in an interval may suffer from an intrinsic lack of robustness due to the sensitivity of the evolution to the parameter θ (see [33] for an example of sensitivity analysis for compartmental models). This implies that, even if we estimate θ* by a θ̂ close to θ*, d(X_{θ*}(t), X_{θ̂}(t)) may be too great for t > T, rendering the estimate useless for the problem at hand, especially when the aim of estimating the parameters of a dynamical system is to predict its trajectory exactly. A manner of increasing the robustness of the estimation is, rather than estimating θ* precisely by a θ̂, to identify a set Θ̂ ⊆ Θ of possible values for θ*, which will generate a set of evolutions of the system: {{X_θ(t)}_{t∈N} : θ ∈ Θ̂}. By establishing common properties of these evolutions, one may obtain evidence about the behaviour of X_{θ*}(t) for t > T. Even though the chosen models are still sensitive to the parameters, there may be some qualitative behaviour, common to all of them, which may also be shared with the evolution generated by θ*, so this method adds to one's understanding of such a trajectory. The chosen models should well fit the observed evolution until time T. The definition of well fit is problem dependent and shall be given by a fitness map F which, for a θ ∈ Θ and an observed evolution in Ω^T, attributes 1 if the system generated by θ well fits the observed evolution, and attributes 0 otherwise. An example of fitness map is

F(θ, {X_{θ*}(t)}_{t=1}^T) = 1, if d(X_θ(t), X_{θ*}(t)) < δ(t, X_{θ*}(t)) for all 1 ≤ t ≤ T, and 0 otherwise,   (1)

for a tolerance δ(t, X_{θ*}(t)) > 0, dependent on t and X_{θ*}(t). By (1), a model well fits {X_{θ*}(t)}_{t=1}^T if its time evolution is relatively close to the observed one for all t ≤ T. The fitness map is problem dependent and should be chosen according to the purpose of the dynamical system.
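For concreteness, a fitness map of type (1) can be sketched in a few lines; the Euclidean metric and the tolerance function `delta` below are illustrative choices of our own, not prescribed by the text:

```python
import math

def fitness(candidate, observed, delta):
    """Fitness map of type (1): return 1 if the candidate evolution stays
    within delta(t, observed state) of the observed one for t = 1, ..., T,
    and 0 otherwise."""
    for t, (x_cand, x_obs) in enumerate(zip(candidate, observed), start=1):
        dist = math.dist(x_cand, x_obs)  # Euclidean metric d on the phase space
        if dist >= delta(t, x_obs):
            return 0
    return 1

# Illustrative tolerance: 5% of the norm of the observed state at each t.
delta = lambda t, x: 0.05 * math.hypot(*x)
```

A candidate identical to the observed evolution gets fitness 1; one that drifts beyond the tolerance at any time step gets 0.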
The estimation method for dynamical system parameters proposed in this paper aims to find a subset Θ̂ ⊆ Θ of parameters such that F(θ, {X_{θ*}(t)}_{t=1}^T) = 1 for all θ ∈ Θ̂. As an alternative to inverting the map φ_T or performing a grid search on Θ, we will propose a probabilistically consistent method to randomly select parameters from Θ and take Θ̂ as the ones with goodness-of-fit according to a fitness map F. This method is based on Statistical Learning techniques [8, 25, 26]. Remark 2.1. Goodness-of-fit, as defined above, is a dichotomous concept: a model either does, or does not, fit the observed evolution. Even if we considered some likelihood function or distance between the observed and generated evolutions as a numerical measure of goodness-of-fit, we would have to choose a threshold for it in order to generate Θ̂. Hence, the class of fitness maps contemplates these numerical goodness-of-fit measures, since fitness maps may be composed with such measures, as in example (1). Therefore, dichotomous goodness-of-fit measures are enough for our purposes. 3.1. Statistical Learning. The Statistical Learning method under the Empirical Risk Minimization (ERM) paradigm, restricted to classification problems, may be stated as follows. Let Z be a random vector, and Y a random variable, defined on a same probability space (Λ, S, P), with ranges Z ⊂ R^d, d ≥ 1, and {0, 1}, respectively. Denote by P(z, y) := P(Z ≤ z, Y ≤ y) the joint probability distribution of (Z, Y) at point (z, y) ∈ Z × {0, 1}, which we assume unknown, but fixed. Define a sample D_N = {(Z_1, Y_1), . . . , (Z_N, Y_N)} as a sequence of independent and identically distributed random vectors, defined on (Λ, S, P), with joint distribution P. Let H be a set of functions with domain Z and image {0, 1}, whose typical element we denote by h : Z → {0, 1}. We call H the Hypotheses Space.
For each hypothesis h in H, we assign a value indicating the error incurred by the use of such hypothesis to predict Y from the value of Z, i.e., how well h(Z) predicts Y. To obtain such an error measure, we consider the simple loss function ℓ : Z × {0, 1} × H → R given by

ℓ(z, y, h) = 1{h(z) ≠ y}.

In this framework, the loss of predicting y by h(z) is either zero, if h(z) = y, or one, if h(z) ≠ y. The prediction error, known in the literature as out-of-sample error, risk or loss [1, 8, 25, 26], of a hypothesis h ∈ H is defined as

L(h) := E[ℓ(Z, Y, h)] = P(h(Z) ≠ Y),

in which E means expectation under P. This is the probability of error incurred when hypothesis h is used to predict Y from Z. The out-of-sample error is fixed, but unknown, as is P. Therefore, in order to assess the out-of-sample error of a hypothesis, one needs to estimate it. An estimator for L may be obtained by the empirical error on sample D_N, called in-sample error and defined as

L_{D_N}(h) := (1/N) Σ_{i=1}^{N} 1{h(Z_i) ≠ Y_i},

for h ∈ H. This is simply the classification error of h when applied to classify the points in sample D_N. The main goal of Statistical Learning in this context is to approximate target hypotheses h*, which are minimizers of the out-of-sample error in H and satisfy L(h*) ≤ L(h), ∀h ∈ H. We assume throughout this paper that L(h*) = 0, i.e., there is a hypothesis in H which predicts Y from the values of Z with probability one. Under the ERM principle, which proposes the minimization of the in-sample error as a method to approximate target hypotheses, we estimate them by

ĥ ∈ argmin_{h∈H} L_{D_N}(h),

the hypotheses which minimize the classification error on sample D_N. Observe that ĥ = ĥ_{D_N} actually depends on D_N, but we drop the subscript to ease notation. In the ERM paradigm we are interested in considering Hypotheses Spaces H which are Probably Approximately Correct (PAC)-Learnable [24]. We say that a Hypotheses Space H is PAC-Learnable if

lim_{N→∞} P(L(ĥ) > ε) = 0   (2)

for all ε > 0.
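As a toy illustration of the ERM principle on a finite Hypotheses Space, consider the sketch below; the space of threshold classifiers and the sampling scheme are hypothetical choices of ours, arranged so that L(h*) = 0 holds as assumed in the text:

```python
import random

def in_sample_error(h, sample):
    """In-sample error: fraction of sample points misclassified by h."""
    return sum(h(z) != y for z, y in sample) / len(sample)

def erm(hypotheses, sample):
    """ERM principle: return a hypothesis minimizing the in-sample error."""
    return min(hypotheses, key=lambda h: in_sample_error(h, sample))

# Toy setup: Z = {0, ..., 9}, Y = 1{Z >= 5} with probability one,
# H = the 10 threshold classifiers h_k(z) = 1{z >= k}; the target h* is h_5.
target = lambda z: int(z >= 5)
H = [lambda z, k=k: int(z >= k) for k in range(10)]
random.seed(0)
sample = [(z, target(z)) for z in (random.randrange(10) for _ in range(50))]
h_hat = erm(H, sample)  # attains zero in-sample error, since h_5 is in H
```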
This means that, if the sample size N tends to infinity, the out-of-sample error of the estimated hypotheses ĥ is arbitrarily close to the out-of-sample error of the target hypotheses, i.e., zero, with high probability. In Probability Theory [2], it means that L(ĥ) converges in probability to zero. In a PAC-Learnable Hypotheses Space we may estimate h* arbitrarily well, with great confidence, if the sample size is great enough. In this paper, we will be interested in Hypotheses Spaces with a finite number of elements, i.e., |H| < ∞. Such Hypotheses Spaces are not only PAC-Learnable, but we can also establish a distribution-free rate of convergence to zero of the limit (2). Distribution-free in this context means that such a rate holds for any possible joint distribution P of (Z, Y). Theorem 3.1 presents such a convergence rate. Its elementary proof, deduced for the first time by [27] and presented in [8, Theorem 12.1], is in the Appendix. Theorem 3.1. Assume that |H| < ∞ and L(h*) = 0. Then, for every N and ε > 0,

P(L(ĥ) > ε) ≤ |H| e^{−Nε}.   (3)

From Theorem 3.1 we obtain as a corollary an upper bound for the sample size needed to estimate a hypothesis ĥ such that L(ĥ) < ε with probability at least 1 − δ. This bound is obtained by solving |H| e^{−Nε} = δ for N, with ε and δ fixed, and is stated below. Corollary 3.2. Assume that |H| < ∞ and L(h*) = 0. If N ≥ (1/ε) ln(|H|/δ), then L(ĥ) < ε with probability at least 1 − δ. Remark 3.3. The concept of PAC-Learnability is related to Vapnik-Chervonenkis (VC) Theory [25, 26, 27, 28, 29, 30]. Bounds analogous to (3) may be obtained from VC Theory for Hypotheses Spaces with infinite cardinality or such that L(h*) > 0. For more details see [8, 20, 25, 26]. Robust parameter estimation in dynamical systems may be achieved via Statistical Learning within the framework above. Let Z ⊆ Θ be a finite subset of candidate parameters. It can be, for example, a grid on Θ ⊆ R^d. The joint distribution of (Z, Y) is such that, for θ ∈ Z and y ∈ {0, 1},

P(Z = θ, Y = y) = q(θ) 1{y = F(θ, {X_{θ*}(t)}_{t=1}^T)},

in which 0 < q(θ) < 1 and Σ_{θ∈Z} q(θ) = 1, i.e., Z has a discrete probability distribution q on Z, and Y is equal to F(Z, {X_{θ*}(t)}_{t=1}^T) with probability one.
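The sample size of Corollary 3.2 can be computed directly; the figures below (|H| = 2^20, ε = 0.01, δ = 0.05) are chosen only for the example:

```python
import math

def pac_sample_size(H_card, eps, delta):
    """Smallest integer N with H_card * exp(-N * eps) <= delta, i.e.
    N >= (1/eps) * ln(H_card / delta), as in Corollary 3.2."""
    return math.ceil(math.log(H_card / delta) / eps)

# Sample size guaranteeing L(h_hat) < 0.01 with probability at least 0.95
# over a hypothetical space of 2**20 hypotheses.
N = pac_sample_size(2 ** 20, 0.01, 0.05)
```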
Hence, Z represents a parameter chosen randomly from Z according to the probability distribution q, and Y is equal to one if the evolution generated by this parameter well fits the observed evolution, and zero otherwise. Therefore, independently of the Hypotheses Space H, the loss function is given by

ℓ(θ, y, h) = 1{h(θ) ≠ F(θ, {X_{θ*}(t)}_{t=1}^T)}.

The loss function tells us what the hypothesis h should do: point out which are the good (well fitting) and which are the bad parameters. Let p ∈ N be the number of good parameters in Z:

p := |{θ ∈ Z : F(θ, {X_{θ*}(t)}_{t=1}^T) = 1}|,

and let H be formed by the hypotheses h : Z → {0, 1} such that {θ ∈ Z : h(θ) = 1} is a subset of the good parameters, for all h ∈ H. In this scenario, h* is the indicator of the set of the p good parameters in Z and L(h*) = 0. Observe that

|H| = 2^p ≤ 2^{|Z|},

so H has finite cardinality. The number p of good parameters in Z is unknown, but strictly related to the choice of the fitness map F. On the one hand, if the concept of well fit is too restrictive, then fewer parameters will be good, hence one has a small p. In this case, one may not be able to obtain robust evidence about X_{θ*}(t) due to the small number of candidate evolutions. On the other hand, if the concept of well fit is too loose, then more parameters may meet the criteria of a good one, so p is greater. As we want a robust estimation, p should be much greater than one, so there is a handful of evolutions available from which to obtain evidence about the evolution of X_{θ*}(t). Nevertheless, if p is too great, then probably our definition of well fit is too loose, so the candidate systems may not be homogeneously related to X_{θ*}, rendering useless evidence about X_{θ*}(t). Hence, both Z and F should be chosen based on all prior information available about the problem at hand, so that p ≫ 1, but the good parameters actually generate meaningful trajectories. After one fixes Z and F, the ERM principle is then applied to estimate a set of at most p good parameters.
The estimated sets are

ĥ ∈ argmin_{h∈H} L_{D_N}(h),

i.e., the subsets of good parameters which contain all the good parameters observed in the sample, which generate the following set of good parameters:

Θ̂ = {Z_i : Y_i = 1, 1 ≤ i ≤ N}.

Therefore, the set de facto estimated by the ERM principle is the ĥ such that

{θ ∈ Z : ĥ(θ) = 1} = Θ̂,

that is, the set containing only the good parameters present in the sample. From now on, when we denote ĥ we mean the set above. Observe that L(ĥ) is the measure of the good parameters not present in the sample. Since

G({X_{θ*}(t)}_{t=1}^T) := q({θ ∈ Z : F(θ, {X_{θ*}(t)}_{t=1}^T) = 1}),

an upper bound for the quantity above, may be very small, simply having L(ĥ) < ε is not meaningful if such probability is smaller than ε. Hence, in order to assess the quality of ĥ, we should compare L(ĥ) with G({X_{θ*}(t)}_{t=1}^T), requiring that

P(L(ĥ) ≤ c G({X_{θ*}(t)}_{t=1}^T)) ≥ 1 − δ.   (4)

In order to perform such a comparison, we apply Theorem 3.1, which states that the probability of having an out-of-sample error greater than c G({X_{θ*}(t)}_{t=1}^T), which means that a proportion of at least 0 < c < 1 of the measure of the set of all good parameters is not observed in the sample, is such that

P(L(ĥ) > c G({X_{θ*}(t)}_{t=1}^T)) ≤ 2^p e^{−N c G({X_{θ*}(t)}_{t=1}^T)},   (5)

so it follows from Corollary 3.2 that, in order to obtain a proportion of at least (1 − c) of the good parameters in the sample, with probability greater than 1 − δ, one needs a sample of size at least

N ≥ (p ln 2 + ln(1/δ)) / (c G({X_{θ*}(t)}_{t=1}^T)).   (6)

Assuming q is the uniform distribution on Z, so that G({X_{θ*}(t)}_{t=1}^T) = p/|Z|, the sample size (6) is of order |Z| ln 2 / c, which is greater than |Z|. This result is reasonable since, as N → |Z|, the cardinality of {Z_1, . . . , Z_N} does not converge to |Z| with probability one, because there will be points which are sampled more than once. Hence, to observe a proportion of the good parameters one needs to sample more models than there are in Z, so this method is better than an exhaustive search of Z only if one wants a small number of good parameters when compared to the number of good parameters in Z. The quantities on the right-hand side of (5) and (6) are not known, since p and G({X_{θ*}(t)}_{t=1}^T) are unknown. However, the sample size above is an increasing function of p, therefore, given an upper bound on p, one can obtain a greater sample size that guarantees (4). Also, G({X_{θ*}(t)}_{t=1}^T) may be estimated from a pre-sample as the proportion of good parameters in it. If q conditioned on the good parameters is the uniform distribution, the bound (5) and the sample size (6) may be improved.
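Assuming the sample-size bound takes the form N ≥ (p ln 2 + ln(1/δ)) / (c G), which follows from Corollary 3.2 with |H| = 2^p, it can be computed directly; the values below match those quoted in the text's figure discussion (δ = 0.1, G = 0.001, |Z| = 10^6, p = 1,000) with c = 0.9:

```python
import math

def robust_sample_size(p, G, c, delta):
    """Sample size guaranteeing a proportion of at least 1 - c of the measure
    of the good parameters in the sample with probability >= 1 - delta:
    N >= (p * ln 2 + ln(1/delta)) / (c * G)."""
    return math.ceil((p * math.log(2) + math.log(1 / delta)) / (c * G))

# Illustrative values: delta = 0.1, G = 0.001, |Z| = 10**6, p = 1000, c = 0.9.
N = robust_sample_size(1000, 0.001, 0.9, 0.1)
proportion = N / 10 ** 6  # fraction of Z to sample, roughly 0.77
```

This recovers the "around 0.8" sampling proportion mentioned in the text for the general (distribution-free) bound.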
The bound below is much better than (6), especially for c close to 1.

Figure 1. Sample size (6) (dashed) and its upper bound when q conditioned on the good parameters is uniform (solid line), as a function of c, with δ = 0.1, G({X_{θ*}(t)}_{t=1}^T) = 0.001, |Z| = 10^6 and p = 1,000.

We see that, for example, to obtain 10% of the good parameters one has to sample a proportion of Z of around 0.35 and 0.8 when q conditioned on the good parameters is uniform and in general, respectively. Therefore, taking q close to the uniform distribution when conditioned on the good parameters is better, since fewer samples are needed to find a given proportion of the good parameters. Since the sample size above is distribution-free, i.e., valid for any distribution, it will be an overestimated number in most cases. Even the sample size when q conditioned on the good parameters is uniform tends to be overestimated. For example, if q is the uniform distribution on Z, in order to obtain a proportion 1 − c of the good parameters one should have a sample of the order of (1 − c)|Z|, which is smaller than the one calculated above. However, when q is far from the uniform distribution, when the researcher does not have control over q, i.e., he cannot choose it, or when it is unknown, the sample size above may be a tool to assess beforehand the computational complexity of the estimation process, rather than a number that must be achieved. 4.1. SIR. The spread of infectious diseases within a population is generally studied via compartmental epidemiological models [4, 31], in which an individual is in one of a handful of states related to the disease (Susceptible, Exposed, Infected, Recovered, etc.), and changes states according to some rates. In these models, the rates, which have epidemiological meaning, are the parameters, which should be estimated for each disease and population.
We consider the family of Susceptible-Infected-Recovered (SIR) dynamical systems generated by the following difference equations:

S(t + 1) = S(t) − β S(t) I(t)
I(t + 1) = I(t) + β S(t) I(t) − γ I(t)
R(t + 1) = R(t) + γ I(t),

with θ = (β, γ) and (S(0), I(0), R(0)) = (0.95, 0.05, 0). In this case, Ω = {x ∈ [0, 1]^3 : x_1 + x_2 + x_3 = 1}. The observed evolution will be that generated by θ* = (0.25, 1/21), for t ≤ 10, and the fitness map will be

F(θ, {X_{θ*}(t)}_{t=1}^{10}) = 1, if |X_θ(t) − X_{θ*}(t)| ≤ r X_{θ*}(t) for all 1 ≤ t ≤ 10, and 0 otherwise,

for all three compartments, 0 < r < 1. In the simulations, we consider as candidate models a grid of squares with side 0.001 of three subsets of Θ which represent distinct prior knowledge about the parameters β and γ. If one does not know much about the parameters, he may assume that β is less than 1 and γ less than 0.5, considering Z_1. If one has prior information about γ, that it is no more than 0.2, then he may consider Z_2. Finally, if one also has prior knowledge about β, that it is between 0.1 and 0.5, then he may consider Z_3. The cardinalities of these sets are, respectively, 5 × 10^5, 2 × 10^5 and 80,200. We take q(θ) as the uniform measure on the respective set Z_i. Two values of r, namely 0.05 and 0.1, are considered, and c = 0.9 is chosen, to obtain 10% of the good models. To illustrate the method, an exhaustive search of all sets was performed in order to calculate G({X_{θ*}(t)}_{t=1}^{10}) and p, even though this would not be performed when applying the method in practice. The value of p is the same in all sets, although G({X_{θ*}(t)}_{t=1}^{10}) is greater in the smallest set Z_3, the one which incorporates more prior information about the problem. Also, the upper bound for the sample size needed to obtain a given proportion of these good parameters, 10% in this case, is much smaller in the smallest set, illustrating that, as more prior information about the problem at hand is incorporated into the estimation process, fewer samples are needed. Observe that the fitness map with r = 0.1 is a looser definition of goodness-of-fit, hence there are almost four times as many good models when r = 0.1 as when r = 0.05.
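The whole sample-and-test pipeline for this SIR example fits in a few lines. This is a sketch: the simulation and sampling follow the setup described above, with the implementation details (grid construction, sample size of 5,000) being our own illustrative choices:

```python
import random

def sir(beta, gamma, T, x0=(0.95, 0.05, 0.0)):
    """Discrete-time SIR evolution (S, I, R) for t = 0, ..., T."""
    S, I, R = x0
    traj = [(S, I, R)]
    for _ in range(T):
        # Tuple assignment evaluates the right-hand side with the old values.
        S, I, R = S - beta * S * I, I + beta * S * I - gamma * I, R + gamma * I
        traj.append((S, I, R))
    return traj

def well_fits(theta, observed, r):
    """Fitness map: 1 if every compartment is within relative error r of the
    observed evolution for t = 1, ..., T, else 0."""
    cand = sir(*theta, T=len(observed) - 1)
    return int(all(abs(c - o) <= r * o
                   for ct, ot in zip(cand[1:], observed[1:])
                   for c, o in zip(ct, ot)))

# Observed evolution generated by theta* = (0.25, 1/21) for t <= 10.
observed = sir(0.25, 1 / 21, T=10)
# Grid Z_3 (step 0.001): beta in [0.1, 0.5], gamma in (0, 0.2] -- 80,200 points.
Z3 = [(0.1 + 0.001 * i, 0.001 * j) for i in range(401) for j in range(1, 201)]
random.seed(1)
good = [th for th in random.sample(Z3, 5000) if well_fits(th, observed, r=0.05)]
```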
Figure 2 shows the evolution of the good models obtained in each scenario, and Table 2 presents descriptive statistics of the good parameters and of the peak of infected, i.e., the day with the most simultaneously infected individuals, as simulated by the good parameters. The evolution generated by the good parameters is close to the evolution of the SIR for all times, not only the first ten, which were used in the estimation. Also, Table 2 shows that the values of β and γ of the good parameters are close to the real ones. Finally, the peak of infected individuals, which for the observed model occurs at day 24, was predicted within an error of no more than two days by the evolutions generated by the good parameters in all scenarios. We conclude that, by applying the method developed in this paper, we may obtain information about the parameters and the qualitative behaviour of the SIR model even when we do not have much prior information about the parameters (Z_1) or when we choose a looser fitness map (r = 0.1). In order to mitigate the effects of the COVID-19 pandemic, the greatest world health crisis in a century, governments and the population have sought reliable information about its spread. More than quantitative predictions about case and death tolls, the main questions regarding the disease spread concern certain milestones, such as its peak, when it will be over, and if or when the available intensive care units will be full. In this section we illustrate, with data about its spread in the US, how the method developed in this paper may be useful in answering such questions about COVID-19.

Table 2. Sample size, number of good parameters in the sample and descriptive statistics of β, γ and the peak of the disease for the evolutions generated by the good parameters.

In order to account for features of COVID-19, we add a compartment to the usual SEIR model (see [14] for a review of compartmental models).
After being exposed to the disease, an individual becomes Infected at rate γ_I and is able to infect others. However, once infected, an individual either recovers at rate ν_R or is accounted by the official statistics, becoming Infected in Statistics at rate γ_S, and is then not able to infect others. Hence, we are assuming that a portion of the infected is never accounted by official statistics and that, once diagnosed with COVID-19, an individual will not infect others, as he will be either admitted to a hospital or in quarantine. These assumptions account for features of the disease such as the under-notification of cases. We also add a compartment for Deaths by the disease, and assume that natural deaths and births balance out, so they are not considered, to simplify the model, which is illustrated in the diagram of Figure 3.

Figure 3. Diagram of the SEIR model for COVID-19.

The evolution of the Susceptible (S), Exposed (E), Infected (I), Infected in Statistics (I_s), Recovered (R) and Deaths (D) by COVID-19 is ruled by the difference equations

S(t + 1) = S(t) − β S(t) I(t)/N
E(t + 1) = E(t) + β S(t) I(t)/N − γ_I E(t)
I(t + 1) = I(t) + γ_I E(t) − (γ_S + ν_R) I(t)
I_s(t + 1) = I_s(t) + γ_S I(t) − (μ_D + ν_RS) I_s(t)
R(t + 1) = R(t) + ν_R I(t) + ν_RS I_s(t)
D(t + 1) = D(t) + μ_D I_s(t),

in which N is the population size, and μ_D and ν_RS denote the death and recovery rates of the Infected in Statistics. We assume that the time scale is that of days, so t + 1 is the day after t, and consider the absolute number of individuals in each compartment, rather than the proportion of the population. This model has a nice biological meaning and its parameters are related to characteristics of the disease.
Indeed, apart from the force of infection β, they are functions of the disease's average exposed time τ_E, average time infected before getting into the statistics τ_S, average time infected before recovering without being accounted by the statistics τ_R, average time before recovering after being accounted by the statistics τ_RS, average time until death once accounted by the statistics τ_D, and the proportions of infected individuals accounted by the statistics, p_S, and of deaths among them, p_D:

γ_I = 1/τ_E,  γ_S = p_S/τ_S,  ν_R = (1 − p_S)/τ_R,  ν_RS = (1 − p_D)/τ_RS,  μ_D = p_D/τ_D,

in which μ_D and ν_RS denote the death and recovery rates of the Infected in Statistics. Hence, after being exposed, an individual becomes infected after a time with mean τ_E. Among the infected, a proportion p_S will be accounted by official statistics after a time with mean τ_S, and will not be able to infect any more, while a proportion 1 − p_S will recover after a time with mean τ_R without ever being accounted by official statistics. Then, a proportion p_D of the Infected in Statistics will die after a time with mean τ_D, and a proportion 1 − p_D will recover after a time with mean τ_RS. Once recovered, an individual cannot become infected again. Apart from p_D, which can be estimated directly from data as the proportion of deaths among the infected accounted by statistics, the other parameters need to be estimated. Since the rates of the model may be calculated from the average time in each compartment and the proportion of infected accounted by statistics, we consider them as the free parameters to be estimated. The candidate values for each parameter are presented in Table 3. The number of candidate models is |Z| = 116,121,600. In order to illustrate the method, we will estimate the parameters multiple times, considering goodness-of-fit during consecutive periods of 7 days starting on March 20th 2020, i.e., we will apply the method for multiple t_0, 7 days apart from one to the next, starting on March 20th until mid-July. Goodness-of-fit will be tested against data about the COVID-19 spread in the US compiled by the Johns Hopkins Coronavirus Resource Center [9].
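One day of this SEIR variant can be sketched as below. The difference equations are our reconstruction from the compartment description above, and the names of the I_s → D and I_s → R rates (`mu_D`, `nu_RS`) are ours, so treat this as a sketch rather than the paper's exact implementation:

```python
def seir_step(state, theta, p_D, N):
    """One day of the SEIR variant; state = (S, E, I, I_s, R, D),
    theta = (beta, tau_E, tau_S, tau_R, tau_RS, tau_D, p_S), and p_D is
    estimated directly from data."""
    S, E, I, Is, R, D = state
    beta, tau_E, tau_S, tau_R, tau_RS, tau_D, p_S = theta
    gamma_I = 1.0 / tau_E           # E -> I
    gamma_S = p_S / tau_S           # I -> I_s (accounted by statistics)
    nu_R = (1.0 - p_S) / tau_R      # I -> R, never accounted by statistics
    mu_D = p_D / tau_D              # I_s -> D
    nu_RS = (1.0 - p_D) / tau_RS    # I_s -> R
    new_exposed = beta * S * I / N  # only I infects; I_s is isolated
    return (S - new_exposed,
            E + new_exposed - gamma_I * E,
            I + gamma_I * E - (gamma_S + nu_R) * I,
            Is + gamma_S * I - (mu_D + nu_RS) * Is,
            R + nu_R * I + nu_RS * Is,
            D + mu_D * Is)
```

Since births and natural deaths are assumed to balance out, the total population is conserved at every step.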
The observed data is smoothed by taking a centred seven-day average of the incidence of confirmed cases, deaths and recovered, and the prevalences are calculated by summing the smoothed incidences. Since only two compartments, namely Infected in Statistics and Deaths, are fully observed, the goodness-of-fit of a candidate model should be defined by comparing the predicted with the observed values in these compartments. Denote by X(t) = (I_s(t), D(t)) and X_θ(t) = (I_s(t; θ), D(t; θ)) the observed and the predicted by the model with parameter θ, respectively, Infected in Statistics and Deaths. With this notation, the fitness map considered is

F(θ) = 1, if |X_θ(t) − X(t)|/X(t) < r(t_0) for all t ∈ {t_0, . . . , t_0 + 6}, and 0 otherwise,

for each t_0 considered and r(t_0) ∈ (0, 1), in which the division by X(t) is componentwise. Hence, a parameter θ well fits the observed data if the predicted values of Infected in Statistics and Deaths are within 100 r(t_0)% of the respective observed values for every day in the week starting on t_0. The value of r must depend on t_0 since there are moments when the disease spread is better explained by a SEIR model, as when few measures are implemented to slow its evolution, as will be seen below. We estimated the model every seven days in order to assess its peak if the disease were to keep spreading as in the week starting on t_0. Due to the enforcement of measures to flatten the epidemiological curve, the parameters may actually be time-dependent, especially the force of infection β. Hence, estimating the peak by assessing goodness-of-fit only against the first weeks of the disease spread is doomed to fail, since the force of infection changes from time to time. On the other hand, by fitting the model every week, one can see changes in the peak estimate over time, which will reflect the enforcement or loosening of measures to slow the spread, providing a robust and better picture of the future.
Observe that the method provides evidence about the disease spread if it were to continue as in the week when goodness-of-fit was assessed; it does not provide forecasts of distinct scenarios. To estimate the parameters and then simulate the chosen models starting on t_0, we need to estimate the initial condition for compartments E and I, which are not observed, and R, which is only partially observed. Assume that we fixed a vector θ of candidate parameters. Denoting by R_s the recovered among those accounted by statistics, we take

I(t) = ((1 − p_S)/p_S) I_s(t) and R(t) = R_s(t)/p_S,

for t ∈ {t_0, . . . , t_0 + 9}. Proceeding in this manner, we have that, among all infected (I + I_s) and recovered (R), a proportion p_S is accounted by statistics. To estimate the exposed, we obtain from the equations ruling the evolution of the disease that

E(t_0) = (I(t_0 + 1) − I(t_0) + (γ_S + ν_R) I(t_0)) / γ_I,

so we calculate E(t_0) from I(t_0) and I(t_0 + 1). Observe that each candidate model has a distinct initial condition on the unobserved compartments, dependent on its parameters. We take p_D as the moving death rate considering only the confirmed cases and deaths according to the smoothed data of the week ending on t_0, so it expresses the death rate around t_0. The error r(t_0) was calculated by sampling 100,000 models and observing the minimum difference over the sampled θ, which gives an estimate of how well the SEIR model approximates the evolution in the week starting on t_0. The error r(t_0) was then taken as 1.1 times this minimum difference. We could not choose an absolute r beforehand since there is no model approximating the evolution arbitrarily well for all t_0, as it does not follow exactly a SEIR model. The values of r(t_0) were around 0.03 until April 10th, when they dropped to around 0.01. In May and until mid-June the values of r(t_0) were more or less 0.07; they dropped to 0.03 in the second half of June and then to 0.01 in the first half of July. Having fixed r(t_0), for each t_0, 500,000 models were sampled.
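The initial condition for the unobserved Exposed compartment follows by solving the I equation for E(t_0); below is a sketch under the reconstructed relation I(t_0 + 1) = I(t_0) + γ_I E(t_0) − (γ_S + ν_R) I(t_0), with hypothetical input numbers:

```python
def initial_exposed(I0, I1, tau_E, tau_S, tau_R, p_S):
    """Solve I(t0 + 1) = I(t0) + gamma_I * E(t0) - (gamma_S + nu_R) * I(t0)
    for the unobserved E(t0)."""
    gamma_I = 1.0 / tau_E
    gamma_S = p_S / tau_S
    nu_R = (1.0 - p_S) / tau_R
    return (I1 - I0 + (gamma_S + nu_R) * I0) / gamma_I

# Hypothetical inputs: I(t0) = 1000, I(t0 + 1) = 1100, tau_E = 4,
# tau_S = 11, tau_R = 10, p_S = 0.3.
E0 = initial_exposed(1000.0, 1100.0, 4.0, 11.0, 10.0, 0.3)
```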
Figure 4 shows the box-plots of the parameters of the good models for each t_0, and the number of good models (m) among the sampled ones. Figure 5 shows the Infected in Statistics and Deaths according to the smoothed data and as simulated by the good models for the seven days starting on their t_0. We first see that the distributions of τ_E, τ_R and τ_RS, which are mainly related to features of the disease, do not vary much over time. Also, we see that the good models better fit the observed data when the implemented measures to slow the spread were not effective, namely, until the end of April, when the disease was spreading mainly in the Northeast region, and from mid-June, when it started spreading rapidly in other states, especially in the South. This is reasonable, since SEIR models assume that every individual who has not ever been infected is susceptible to the disease and that any contact between an infected and a susceptible has the same probability of infection, which is not true if part of the population is on lockdown or if social distancing and other measures, such as mask wearing, are implemented. Hence, a poor performance of the model is expected when such measures are implemented, at least in what regards the number of confirmed cases and deaths. The behaviour of the estimated parameters over time evidences interesting qualitative features of the disease spread in the US. The moving death rate p_D started low, but more than tripled in the following weeks, slowly decreasing over time, which may be due to the increase in testing or to the fact that deaths among the newly infected were yet to occur [7]. The number of good models was large when the disease was spreading more slowly, which is due to the employment of a looser fitness map, since a SEIR model could not well fit the data, in an absolute sense, in this period, as can be seen in the simulated and observed curves of Figure 5.
The values of τ_S and τ_D started small and then increased until an equilibrium around 11 and 27 days, respectively. This may be due to the fact that early on a lot of people were dying at home, especially in nursing homes [22]: they may have died rapidly due to underlying health problems, without having time to seek health assistance, so their time infected before getting into the statistics was small, and they were accounted as Infected in Statistics and Deaths at the same time, so their time until death after being accounted by the statistics was zero; these cases cause the mean times τ_S and τ_D to be small.

Figure 4. Number of good parameters among the 500,000 sampled (m) and box-plots of the set of good parameters sampled for each t_0. The value of p_D is calculated directly from data, hence it is the same for all models.

Observe that, when the model was not fitting the data well, the range of most estimated parameters covered all the candidates, which tells us that, with the right combination of parameters, any given value of a parameter can generate an evolution that fits the data, according to the loose definition of well fit. On the other hand, when the model was fitting the data well in an absolute sense, until the end of April and from mid-June, the parameters are more meaningful. This is especially true for p_D, which stayed low until mid-April; when the disease spread increased again in mid-June, it was again low, but greater than in March and April, illustrating the increase in testing. Finally, the parameter β tended to be greater in mid-June than in the period until the end of April, evidencing that the force of infection in the second moment of fast spread was greater; the values when the model was not fitting the data are not meaningful, since they vary over the whole range of candidate parameters. Figure 6 presents the median and selected percentiles of the peak of deaths, i.e., the day with the most deaths after t_0, as estimated by the good models for each t_0.
The models fitted between March 20th and April 3rd predicted that the peak could happen around April 15th, which was the peak observed in the smoothed data, since their 2.5% percentile fell on this day. Also, for t_0 equal to April 10th, the median of the models pointed to a close peak, although it missed it by one week. After the first peak, for some t_0 the predicted good scenario was no further peak (the peak being at t_0 itself), but then, for the months in which the disease was spreading more slowly, the peak was predicted, for a sequence of t_0, to be around one month after t_0. This was the behaviour until the end of June, when the peak started to be predicted as more distant, evidencing the coming increase in the incidence of deaths in the mid-term before it starts decreasing. This example illustrates how this simple estimation method may offer many insights into qualitative properties of the disease spread and aid in predicting its future spread.

Figure 6. Median estimate of the peak of deaths among the good models sampled for each t_0 (dots), with the 2.5% and 97.5% percentiles represented in red. The horizontal line represents the first peak of deaths according to the smoothed data. The solid line is the identity function.

By fitting this model every week during an outbreak, one may analyse its predictions and estimated parameters over time to assess the effectiveness of measures implemented to slow the spread, as well as qualitative features of the present and future evolution. The estimation method for dynamical systems parameters proposed in this paper aims to find a subset Θ̂ ⊆ Θ of parameters that fit the observed evolution well in a time interval.
This is a more humble agenda when compared with methods that try to find exactly the parameter which generated the evolution, and it is quite robust for evidencing the qualitative behaviour of the evolution, especially for processes that are only approximated by dynamical systems, as was illustrated by the study of the COVID-19 spread in the US. In essence, the proposed method is a mathematical formalisation of the trial-and-error procedure for parameter estimation, in which the evolutions generated by candidate parameters are tested against observed data to determine which of them fit it well. The Statistical Learning framework introduces a mathematically rigorous scheme for this general approach to parameter estimation, adding to the great field of parameter estimation in dynamical systems. The method is quite general, as it does not rely on any specific property of the system, and highly flexible, since one may define the set of candidate parameters Z, the fitness map F, and any sampling algorithm on Z to obtain distinct methods for parameter estimation. An interesting topic for future research is to compare the method's results and computational complexity with established ones in the literature, such as Bayesian methods, which can also be computationally demanding. Another topic for future research would be to consider Z with infinite cardinality, but to choose H and F in such a way that H has finite VC dimension, so that results analogous to Theorem 3.1 remain true. With such a result, one could sample parameters from a set with infinite cardinality. The method may be quite useful for disease spread models, which have the properties it was developed to address: disease spread does not exactly follow a compartmental model, and there is usually more interest in the qualitative behaviour of the evolution than in its exact trajectory.
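Stripped of the learning-theoretic guarantees, the sample-and-test scheme amounts to a short loop: draw parameters from Z, simulate the evolution, and keep those the fitness map accepts. A hedged sketch, in which the toy dynamics, candidate set and threshold are all placeholders for the user's choices of Z and F:

```python
import random

def sample_and_test(Z, simulate, fitness, threshold, n_samples, seed=0):
    """Uniformly sample candidates from the finite set Z and keep those
    whose simulated evolution the fitness map judges a good fit."""
    rng = random.Random(seed)
    good = []
    for _ in range(n_samples):
        theta = rng.choice(Z)
        if fitness(simulate(theta)) <= threshold:
            good.append(theta)
    return good

# Toy setup: linear dynamics x_t = theta * t, data observed with theta = 2,
# and fitness = maximum absolute deviation over five time steps.
observed = [2.0 * t for t in range(5)]
simulate = lambda theta: [theta * t for t in range(5)]
fitness = lambda traj: max(abs(a - b) for a, b in zip(traj, observed))
good = sample_and_test([1.0, 1.5, 2.0, 2.5, 3.0], simulate, fitness,
                       threshold=2.1, n_samples=200)
```

For a small finite Z one would enumerate the candidates instead of sampling with replacement; the sampling formulation is what the learning-theoretic sample-size bounds of the paper address when Z is large.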
Such qualitative properties may aid the population and government officials in the decision-making process during an outbreak; hence, as illustrated with the COVID-19 spread in the US, fitting a compartmental model weekly may be a rich source of information about the disease spread, even if it cannot give reliable predictions of the number of cases and deaths in the long term. The shortcomings of the method are the need for high computational power to simulate the evolutions generated by thousands of parameters, and the need to carefully define Z beforehand, properly identifying the set of candidate parameters. Nevertheless, it may be a good option for robustly estimating parameters in complex models, especially when the interest is in the qualitative behaviour of the trajectories.

References

[1] Learning from data.
[2] Probability and measure.
[3] Estimation of parameters in a structured SIR model.
[4] Mathematical structures of epidemic systems.
[5] Bayesian state and parameter estimation of uncertain dynamical systems.
[6] Dynamical systems and cosmology.
[7] The rise in testing is not driving the rise in U.S. virus cases. The New York Times.
[8] A probabilistic theory of pattern recognition.
[9] An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases.
[10] Parameter estimation for linear dynamical systems.
[11] Learning nonlinear dynamical systems using an EM algorithm.
[12] Bayesian and Markov chain Monte Carlo methods for identifying nonlinear systems in the presence of uncertainty.
[13] A least squares identification algorithm for a state space model with multi-state delays.
[14] The mathematics of infectious diseases.
[15] Applications of Dynamical Systems in Biology and Medicine.
[16] Dynamical modeling and analysis of epidemics.
[17] Modeling future spread of infections via mobile geolocation data and population dynamics. An application to COVID-19 in Brazil.
[18] Data2Dynamics: a modeling environment tailored to parameter estimation in dynamical systems.
[19] Dynamical system theory in biology, Volume I: Stability theory and its applications.
[20] Understanding machine learning: From theory to algorithms.
[21] Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering (Studies in Nonlinearity).
[22] More than 40% of U.S. coronavirus deaths are linked to nursing homes. The New York Times.
[23] Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems.
[24] A theory of the learnable.
[25] Statistical learning theory.
[26] The nature of statistical learning theory.
[27] Theory of pattern recognition.
[28] Ordered risk minimization II. Automation and Remote Control.
[29] Ordered risk minimization I. Automation and Remote Control.
[30] On the uniform convergence of relative frequencies of events to their probabilities.
[31] An introduction to infectious disease modelling.
[32] How generation intervals shape the relationship between growth rates and reproductive numbers.
[33] Sensitivity analysis of infectious disease models: methods, advances and their application.
[34] Application of the Newton iteration algorithm to the parameter estimation for dynamical systems.

The author has received financial support from CNPq during the development of this paper. I thank C. Peixoto, P. Peixoto and S. Oliva for fruitful discussions about the modelling of disease spread, especially COVID-19, and about the development of compartmental models to address it.

L(h) > ε, hence P(L(ĥ) > ε) is less than or equal to

Applying the union bound to the last expectation above, we obtain that P(L(ĥ) > ε)

then at least cp of the good parameters are not in the subset generated by h, since the measure G({X_θ}_{t=1}^T) is uniformly spread among the good parameters. As there are subsets with no more than (1−c)p good parameters, the first assertion follows. To show the second assertion, by applying Corollary 3.2 we have

Since the left-hand side of (9) is less than