key: cord-0899427-u3bqkhur
authors: Lal, Rajnesh; Huang, Weidong; Li, Zhenquan
title: An application of the ensemble Kalman filter in epidemiological modelling
date: 2021-08-19
journal: PLoS One
DOI: 10.1371/journal.pone.0256227
sha: 5e8a3c9de1bf4403297d5809e32cc46bd74844b1
doc_id: 899427
cord_uid: u3bqkhur

Since the novel coronavirus (COVID-19) outbreak in China, and due to the open accessibility of COVID-19 data, several researchers and modellers revisited the classical epidemiological models to evaluate their practical applicability. While mathematical compartmental models can predict various contagious viruses’ dynamics, their efficiency depends on the model parameters. Recently, several parameter estimation methods have been proposed for different models. In this study, we evaluated the Ensemble Kalman filter’s performance (EnKF) in the estimation of time-varying model parameters with synthetic data and the real COVID-19 data of Hubei province, China. Contrary to the previous works, in the current study, the effect of damping factors on an augmented EnKF is studied. An augmented EnKF algorithm is provided, and we present how the filter performs in estimating models using uncertain observational (reported) data. Results obtained confirm that the augumented-EnKF approach can provide reliable model parameter estimates. Additionally, there was a good fit of profiles between model simulation and the reported COVID-19 data confirming the possibility of using the augmented-EnKF approach for reliable model parameter estimation.

The outbreak of the novel coronavirus disease in early December 2019 in Wuhan, China, attracted many researchers to evaluate the dynamics of infectious COVID-19 virus using various mathematical models [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] . Mathematical compartmental models, such as SIR (Susceptible-Infectious-Recovered) [18, 19] , in epidemiology, are generally expressed by a system of ordinary differential equations (ODE). Recent studies on COVID-19 modelling includes using the basic SIR model [12, 18, 19] or its extension (modified) versions such as SEIR (Susceptible-Exposed-Infectious-Recovered) [7, 10, 11, [19] [20] [21] , SIRD (Susceptible-Infectious-Recovered-Dead) [1-4, 16, 17, 22] and SEIRD (Susceptible-Exposed-Infected -Recovered-Dead) [13] [14] [15] .

There are many significant practical importance of epidemiological modelling via computer simulation, which includes understanding the disease and dynamics of the infectious virus, e.g. predicting the number of new cases [23] . Also, the outcome of modelling and simulations can provide vital information to governments and decision-makers [23] . However, a mathematical model's performance and reliability depend on uncertainties in the model and the model parameters [24] . While identifying accurate parameters of models, e.g. infection rate and mortality rates, is an essential exercise in mathematical modelling, precise prediction of parameters is a difficult task [23] . Model parameters can be estimated by solving inverse problems using observational data [24] . Some recent works on inverse COVID-19 modelling with different models include works of Wan et al. [7] , Sun et al. [10] , Libotte et al. [18] , Lobato et al. [2] , Li et al. [11] , Anastassopoulou et al. [3] , Loli Piccolomini and Zama [14] , Korolev [15] , Ndaïrou et al. [9] , Ifguis et al. [12] , Engbert et al. [21] , Yang et al. [20] , Arroyo-Marioli et al. [25] , and Ghostine et al. [26] . Recent methods of inverse modelling for parameter estimation include the least-square techniques and optimization algorithms [3, 10, 12, 14, 15] , Differential Evolution method [18] , Stochastic and Multiobjective Fractal Search algorithm [2] , and data assimilation methods [11, 20, 21, 25, 26] .

Introduced by Geir Evenson [27] , the ensemble Kalman filter (EnKF) is a data assimilation technique that can be employed to update both model parameters and states variables with their associated uncertainties [28] . Yang et al. [20] used the EnKF for joint state-parameter estimation, where one of the SEIR model parameters, the time-varying rate of infection, was estimated. Similarly, Li et al. [11] used the Ensemble Adjustment Kalman Filter (EAKF) on the augmented state-parameter space to estimate a modified deterministic SEIR model's parameters. The augmented state-parameter can cause the EnKF to fail due to a strong nonlinear relation between the model parameters and its state [29, 30] . To overcome this, Engbert et al. [21] adopted a two-stage approach where the EnKF was initially used to estimate the states, followed by the likelihood-based inference of one of the SEIR model parameters.

Arroyo-Marioli et al. [25] applied the Kalman filter to estimate the time-varying growth rate of the COVID-19 cases. This was followed by an estimation of the time-varying effective reproduction number of the coronavirus disease. Finally, the time-varying effective reproduction number and disease transmission rate were employed by the SIR model in tracking the dynamics of COVID-19. Furthermore, Ghostine et al. [26] demonstrated the effectiveness of a joint-EnKF based assimilation scheme in estimating eight constant parameters of an extended SEIR model using the COVID-19 data.

The entire population is vulnerable to the disease at the first level of the outbreak. However, fewer individuals of size S are susceptible through control measures such as restriction of movements, self-isolation, and social distancing [31, 32] . While the initial number of susceptible individuals, S(0), is required for inverse modelling, estimating the actual population size under study can be challenging [31, 33, 34] . In the recent studies considering inverse COVID-19 modelling, several assumptions or methods were used to choose the total population size, N. These include assuming a fixed value for N, e.g. population of a city or a country [3, 7, 9-12, 14, 15, 21] or using a normalised version of a compartment model [2, 18, 20] .

During the COVID-19 pandemic, there was more control of individual movements due to restrictions imposed by various governments. The restrictions included lockdown of cities, social distancing and quarantine measures. Other preventive measures were hand sanitation and wearing face masks. Considering the implementation of restrictions and preventive measures, in this study, we assumed that the infection rate, recovery rate and death rates of COVID-19 cases were all time-dependent. Estimating time-varying parameters of a model can be challenging, and the inverse problem may demand richer models [25] .

Our contribution in this work includes evaluating the ensemble Kalman filter's capability to estimate the time-varying parameters of the SIRD model. To overcome any challenges with the estimation of time-varying parameters, we used the EnKF with an augmented state-parameter scheme. To mitigate the problem associated with the nonlinearity between parameters and state, we tested the efficiency of the EnKF with different values of the damping factor [30, 35] . Additionally, we provide the EnKF algorithm to estimate the time-varying parameters.

The proposed method is demonstrated with test cases using synthetic data and the real COVID-19 data of Hubei province, China. There were some outliers in the reported number of cumulative cases of COVID-19 in the Hubei due to the change in diagnosing and revision of the definition of COVID-19 cases by the National Health Commission of the People's Republic of China [3, 25, [36] [37] [38] . The time-varying model parameters were estimated using both the reported data of cumulative cases of COVID-19 in the Hubei province and for data consistency, using systematically modified data after removing the outliers.

The rest of the paper is structured as follows. We review the SIRD model and the ensemble Kalman filter algorithm for inverse modelling. Firstly, we demonstrate the use of EnKF using synthetic data. The effect on the estimated parameters using EnKF with different damping constant is illustrated with numerical simulations results using the synthetic case. Secondly, we show the usefulness of EnKF using the real COVID-19 data of Hubei province, China. Finally, we discuss the test cases results and end with the conclusion.

SIRD (Susceptible, Infectious, Recovered, and Dead) is a four-compartment model that has been widely used as a forecasting method of infectious disease [1-4, 16, 17, 22, 34] . In the SIRD model, the number of susceptible individuals (S), infected individuals (I), recovered individuals (R), and dead individuals (D) vary with time (t) as follows [2, 16, 17] :

where β is the transmission rate (infection), γ is the recovery rate, and δ is the death rate. The model estimation using COVID-19 data employs data on diagnosed cases. Hence, in that empirical context, δ can be better considered as the case fatality rate. Following Calafiore et al. [16] and Ianni and Rossi [34] , N is defined as the fraction of the total population size that is affected by the contagion. The model assumes that each individual who has already been infected can transmit the virus to those susceptible. Furthermore, the time length is considered short so that births and deaths not related to the virus are neglected. The SIRD model does not consider the effects of exposure, quarantine, confinement, or an asymptomatic population. This model is suitable for the case without any protection measures and restriction on activities, e.g. wearing masks and lockdown measures [22] . The capability of the simple SIRD model to capture the dynamics of COVID-19 has been demonstrated by Fanelli and Piazza [39] and Anastassopoulou et al. [3] .

In the model, N(t) = S(t) + I(t) + R(t) + D(t) which is assumed constant [1, 2, 34] . The solution of the system of ODE depends on the initial conditions S 0 = S(0), I 0 = I(0), R 0 = R(0), and D 0 = D(0) for the susceptible, infected, recovered, and death populations, respectively. However, the initial number of susceptible population, S(0) = N(0) − I(0) − R(0) − D(0), is usually unknown since N 0 = N(0) is typically unknown [16, 34] . This study assumes that the entire population, i.e. Hubei province, is vulnerable to the disease at the first level of the outbreak, and we let N = N(t) = N(0). However, it is noted that this number can be influenced by several factors such as geographical, social, and economic characteristics of the region under study.

Factors such as restricted movements of individuals, lockdown of cities, social distancing, quarantine measures, and preventive measures such as hand sanitation and wearing face masks allow the transmission rate to vary over time [25, 40, 41] . Hence, the transmission, recovery and death rates were all allowed to be time-dependent in this study. Similar to Avila-Ponce de León et al. [40] and Gupta et al. [41] , the three time-varying parameters were defined as follows:

The infection rate: the time-varying infection rate before and after lockdown is described by

β(t) is a function of three characteristic constants β 0 , β 1 and τ β . Before lockdown, β(t) = β 0 is a constant. When the lockdown is imposed at time t = t lockdown , β(t) decreases exponentially from β 0 +β 1 to the final value β 1 with a characteristic time of decrease τ β . The recovery rate: With a new disease such as COVID-19, the health care system and medical staff have to learn and adopt new therapeutic procedures, including treatment of patients with new symptoms [40, 41] . Hence, the recovery time for patients may change with time. In this study, γ(t) is described by the function

where γ 0 is the initial rate of recovery, and after t = γ τ the final recovery rate becomes γ 0 +γ 1 . The death rate: the death rate may also decrease with time due to factors such as adaptation of the pathogen and development of advanced treatments and vaccinations, including nonpharmaceutical interventions such as social distancing, lockdown of cities and increase in public awareness about the disease [40, 41] . The death rate δ(t) is described using the function

where at time t = t lockdown , δ 0 + δ 1 is the initial death rate that decreases exponentially to the final value δ 1 with a characteristic time of decrease τ δ . For the SIRD model to simulate a particular epidemic with the three time-varying parameters, the nine characteristic constants (β 0 , β 1 , τ β , γ 0 , γ 1 , τ γ , δ 0 , δ 1 , τ δ ), need to be estimated via inverse modelling. Evensen [27] introduced the ensemble Kalman filter (EnKF), an algorithm for sequential data assimilation problems. Several papers are available for the derivation of the ensemble Kalman Filter (EnKF), including its algorithm, e.g. [42] [43] [44] [45] . An ensemble of states is employed to approximate forecast states statistical information, including the model covariance matrix. The states are estimated by assimilating observations into the model in accordance with the Kalman filter formula [45] . The EnKF can be further adapted to estimate both model states and the unknown parameters using an augmented state-parameter scheme [26, 44, 46] . The steps of the augmented EnKF are summarized below [26, 44, 46] .

Consider a discrete nonlinear model:

where s k = [S k I k R k D k ] is the vector of the state variables at time t = k, S is the nonlinear operator (SIRD model (1)

is the vector of parameters that are assumed to remain constant in time, w k+1 is the model noise that is assumed to follow zero-mean Gaussian noise with covariance matrix Q k+1 , y k+1 is the vector of observation (active number of infected cases, cumulative number of recovered cases, and cumulative number of death cases), M is the observation operator which connects the observed values to the state values of the model and e k+1 is the observation noise that is assumed to follow zero-mean Gaussian noise with covariance matrix R k+1 . At the forecast step, state variables, s k+1 , and parameters, θ k+1 , are augmented to form a vector

For an ensemble of size n, an initial forecast ensemble of augmented vectors X kþ1 ¼

is assumed known. The superscript f i for i = 1, 2, . . ., n is the i th forecast member of the ensemble X. Each i th member of the ensemble is used to generate i th realization of the model state vector using the forward model S. A set of corresponding measurement vector,

p denotes the number of observations at t = k + 1, and in this study p = 3.

In the analysis (assimilation) step, a perturbed observation vector,ŷ i kþ1 ¼ŷ kþ1 þ e i kþ1 , for each i th member of the ensemble is obtained using the current available observed datâ y kþ1 2 R p . The random perturbations e i kþ1 � N ð0; R kþ1 Þ [46] . The i th forecast member of X k+1 is then updated using the difference between perturbed observations and measurements according to:

where x a i kþ1 represents the i th analyzed (updated) member of X k+1 and K k+1 is the Kalman gain matrix calculated as

In Eq (10), R k+1 is the observation covariance matrix, and the covariance matrices C f xy kþ1

and C f yy kþ1 are defined as [45] :

where �

At each j th EnKF iteration, each member of X k+1 , i.e x f i kþ1 for i = 1, . . ., n, is updated by assimilating perturbed observations using Eq (9) . Henceforth, one EnKF iteration corresponds to one assimilation cycle. The procedure is iterated with the updated ensemble until a userdefined stop criterion is met, e.g. stopping criteria based on the maximum number of iterations or setting a threshold of the change of parameter values between two consecutive EnKF iterations. After the final iteration, the average of the ensemble is taken as the best estimate of the states and the unknown parameters, and the spread of the ensemble as the error variance [47] .

The nonlinear relations between the model parameters and the measurements can cause the ensemble variance of parameters to collapse after a few cycles during the update step, leading to filter inbreeding (divergence) [30, 48] . In previous studies, Hendricks Franssen and Kinzelbach [30] and Rasmussen et al. [35, 49] showed that using a damping factor mitigates filter inbreeding and improves the parameter update in the assimilation step. The damping factor in the update step reduces spurious covariance resulting from an abrupt update of parameters [28, 30] . By applying a damping factor, α, in Eq (9), the i th member of the ensemble is updated using

0 � α � 1 where α = 0 means no update of parameters during the assimilation step, and α = 1 means the basic scenario without any damping effect. In this study, the damping factor is only applied on parameter updates keeping the states updates undamped. To evaluate the influence of the damping factor on the performance of EnKF, different scenarios with α = 0.1 to 1.0, in step size of 0.1, were explored in this study.

In this study, the EnKF iterations (assimilation steps) are repeated until the following convergence criterion is satisfied:

The complete procedure for estimating uncertain and unknown model parameters using the ensemble Kalman filter is summarized in Algorithm 1. 2: while r > tol do 3: for k = 0 to t obs do 4: get observations:

for i = 1 to n do 7: measurements:

end for 10:

compute cross-covariance:

for i = 1 to n do 14:

assimilate (update): 

The proposed damped-EnKF-based parameter estimation technique was applied to two test cases considering synthetic and real data. Firstly, we use the synthetic data to study the effect of different damping factors on the quality of the estimated parameters by the filter. This is followed by studying the sensitivity of the filter with different ensemble size. Finally, the EnKF with the selected damping factor and the ensemble size is used in the second test case using the real data.

In the first test case, the performance of the EnKF was assessed using simulated data with synthetic observations. Table 1 shows the model parameters, Table 1 are referred as "true" (target) values. The system of ODE (Eq (1)) was solved numerically for, 0 � t � 100, with initial values I(0) = 350, R(0) = 1, 

The ensemble Kalman filter and epidemiological models D(0) = 7, and S(0) = N − 350 − 7 − 1, using MATLAB's (version R2016a) ode45 solver. The population size N and t lockdown were taken as 60M and 15, respectively. Synthetic observations were then recorded by extracting state values I(t), R(t) and D(t) representing the active number of infected cases, the cumulative number of recovered cases and the cumulative number of death cases. The inverse problem involves employing EnKF to estimate the initially assumed true values of the parameters using synthetically generated observed values of I(t), R(t) and D(t). To study the effect of different damping factor on the filter's performance, an ensemble of size n = 200 was chosen. The use of the EnKF with a 200-member ensemble has recently been shown to produce desirable results [26] . We use the state-parameter augmented EnKF, where θ = [β 0 , β 1 , τ β , γ 0 , γ 1 , τ γ , δ 0 , δ 1 , τ δ ] and s = [SIRD], and the augmented vector is

Each state ensemble is initialised, for i = 1, 2 . . ., n, using normal distributions as follows:

where s � N ð0; 1Þ and μ is set to 20%. Initial ensemble for parameter values, for i = 1, 2 . . ., n, were randomly drawn from uniform distribution: The time span between two EnKF assimilation steps was taken as dt = 1day. Hence, the observationsŷ t 2 R 3 , i.e. values of I(t), R(t) and D(t), were assumed to be known at t = 0, 1, . . ., 100.

In the assimilation step, perturbed observations are generated usingŷ t . Therefore, the perturbed observationsŷ i k ¼ŷ k þ e i k , for i = 1, 2, . . ., n. The additive noise e i k � N ð0; R k Þ where R k ¼ diag½s 2 I ; s 2 R ; s 2 D �. σ I , σ R and σ D are the observation errors taken as 10% of the observed data values at time t = k. For convergence criteria, we set the tolerance value in Eq (13) as tol = 0.001. With the above setting, the proposed method was applied to retrieve the "target" (true) value of the model parameters using ten different damping constants. The accuracy of the proposed method was assessed by computing the Relative Mean Absolute Error (RMAE) of the simulated model state as

where y(j) is the observed state, s(j) is the simulated state using the estimated parameters, and N is the sample size of the observed data. Fig 1 shows the percentage error in the estimated parameters using EnKF (ensemble size, n = 200) with different damping factors. In this synthetic case, the best performance of EnKF is achieved by using the basic scenario without any damping effect, i.e. with α = 1. The most difficult parameter to estimate was τ β that even with the basic scenario had an error of around 13% in the estimated value. Fig 2 shows the RMAE of the model states simulated using the EnKF estimated parameters with different damping factors. With the basic scenario, the computed RMAE's was the least, with a value of less than 1%. The variability of the estimated parameters and RMAE with different damping factors led us to choose the EnKF with the basic scenario for the remaining test cases in this study.

The sensitivity of the filter with different ensemble size is also studied. The augmented basic EnKF assimilation system was executed using six different ensemble sizes: n = 50, 100, 200, 300, 400, and 500. Fig 3 compares the RMAE of the simulated model states (infected, recovered Table 2 presents the parameters estimated together with their associated uncertainties. All parameter estimates either converged to their target values or close to them. The parameter τ β had the largest error (13%) in its estimated value. The results show that an ensemble of size n = 200 is sufficient to capture the true parameters. Fig 5 shows the best-fit parameters of the time-varying infection, recovery and death rates. There is a good fit between synthetically generated (truth), and the model estimated variation of β(t), γ(t) and δ(t). Fig 6 shows the curve fitting accuracy between the observations and the simulated results of the SIRD model using the estimated parameters. There is a good fit of profiles indicating that the true model states are captured well through data assimilation using synthetic observations. In addition to computing RMAE to numerically quantify the accuracy and agreement between the observations and model-simulated results, the coefficient of determination, R 2 , values are computed using

where the variables y, s and N are as defined in Eq (16) . Table 3 lists the RMAE and R 2 values of the model states simulated using the EnKF estimated parameters. The RMAE values are less 

The ensemble Kalman filter and epidemiological models 

The ensemble Kalman filter and epidemiological models classification rule and revision of the definition of COVID-19 cases by the National Health Commission of the PRC [3, 25, [36] [37] [38] .

To study the effect of data quality on the filter's performance, the model parameters were estimated using both the reported data of cumulative cases and for data consistency, using systematically modified data. The modified data was obtained after removing the outliers and creating a new data series for the cumulative number of cases from the daily new cases. The reported 14,840 confirmed cases on February 12, 2020, included 13,332 clinical cases confirmed by the new diagnosis classification rule [51] . Recently, attempts have been made to remove such outliers from COVID-19 data and reconstruct a new time series from the number of new cases, e.g. Arroyo-Marioli et al. [25] , Fu et al. [52] , and Liu et al. [53] .

We reconstructed the time-series data for the cumulative number of cases following the methods similar to that presented in Fu et al. [52] and Liu et al. [53] . It is noted that 4,823 new cases were reported on February 13, 2020. Compared to other days, this huge number of cases may also include cases that failed to meet the earlier diagnosis classification rule. For the number of new cases on February 12 and 13, we set it to 14,480 − 13,332 = 1,508. The extra 13,332 + 4,82 − 1,508 = 16,647 cases were added to the reported number of new cases (New-Case) from January 22 to February 14 in proportion to the original daily increment of the new cases. The new time-series data for the cumulative number of cases was obtained using C(t) = C(t − 1) + NewCase(t). The modified cumulative number of cases is shown with black dots in Fig 7. The number of infected (active) cases were then obtained using

Time-varying parameter estimation. As in the synthetic case, nine parameters describing the three time-varying model parameters, (β(t), γ(t) and δ(t)), were estimated using the observed time-series data of I(t), R(t) and D(t). To study the effect of the quality of the reported data on the filter's performance, two models were estimated using different observed timeseries data. Firstly, we used the values of I(t) obtained using Eq (18) , where C(t) is the reported values. We refer to this as case_orig. Secondly, I(t) was obtained using the modified values of C(t), and we refer to this as case_mod.

The population size of the Hubei province was taken as N = 59M (https://data.stats.gov.cn/ english/easyquery.htm?cn=E0103). For both, case_orig and case_mod, the initial state ensemble is generated using Eq (15) Table 4 . All other EnKF parameters and settings were the same as for the synthetic case.

The observed reported data of I(t), R(t) and R(t) are assimilated until the stopping criterion, Eq (13), is met. If the convergence criterion is not met once all observations are assimilated, the EnKF assimilation process is repeated with a different initial state ensemble. Figs 8 and 9 show the estimated parameter evolutions for case_orig and case_mod, respectively. The shaded areas show the uncertainties in the final estimate around the mean values. For case_orig, it took a long time to achieve convergence (333 assimilation cycles) compared to case_mod, which met convergence with 250 assimilation cycles. With case_orig and case_mod, the parameter β 1 had the largest uncertainty in its final estimate. On the other hand, both cases had a small uncertainty in the estimation of β 0 . Table 5 presents the parameters estimated together with their associated uncertainties for case_orig and case_mod. The EnKF estimated different model parameters for case_orig and case_mod. Fig 10 compares the best-fit parameters of the estimated time-varying infection, recovery and death rates of case_orig and case_mod. Even though there are some differences between the estimated parameters from case_orig and case_mod, the estimated β(t) from the two cases show similar profiles. However, we notice some differences in the estimated γ(t) and δ(t) in the two cases between t = 0 and t = 25. A possible cause for this can be attributed to modifying time-series data in the same time range.

Finally, in Fig 11, we show the curve fitting accuracy between the observations, i.e. reported I(t), R(t) and D(t), and the simulated results of the estimated model for case_orig and case_mod. We see a good fit for the infected population (active cases) for case_mod. However, there is a misfit with case_orig between t = 5 to t = 30. After t = 30, both the cases show similar profiles with a good fit. Likewise, in comparison with case_orig, case_mod shows a slightly better fit of the recovered population. However, both cases underestimate the recovered population between t = 20 to t = 30. Both cases well estimate the dead population. Overall, case_orig and case_mod show a good fit of the recovered and death populations, while case 1 shows an improvement in the estimation of the infected population. This is confirmed by the R 2 values of the simulated states for the two cases, as presented in Table 6 .

Recent works on the COVID-19 modelling using COVID-19 data of China include the works of Libotte et al. [18] , Lobato et al. [2] , Li et al. [11] and Cooper et al. [54] . To estimate the parameters using Stochastic Fractal Search (SFS) and Multiobjective Stochastic Fractal Search (MOSFS) algorithm, Lobato et al. [2] used a normalized version of the SIRD model using data of China. Similarly, Libotte et al. [18] used the normalized version of the SIR model and employed the Differential Evolution (DE) method to estimate the model parameters. Cooper et al. [54] estimated the SIR model via data fitting with a nonlinear function using COVID-19 data of China. Similarly, Li et al. [11] used an SEIR model based on deterministic assumptions and applied the EAKF to estimate model parameters using the data of China.

The transmission rate, β, is considered as an important parameter that needs to be estimated for epidemic modelling [29] . Table 7 presents the comparison between the EnKF estimated infection rate, β(0), with the estimated β from the recent works mentioned above. The values in Table 7 are directly taken from the reported results of five different methods (SFS, MOSFS, DE, EAKF and data fitting) presented in the reference literature [2, 11, 18, 54] . We observe that the EnKF estimated value of β = 0.3848 is very close to the values estimated from other methods in [2, 11, 18, 54] . This means that the infection rate was similar irrespective of the population size. In Fig 10, we see that the recovery rate from case_orig and case_mod is initially slower and later reaches a constant value of β � 0.074, corresponding to a recovery time 

The ensemble Kalman filter and epidemiological models 

The ensemble Kalman filter and epidemiological models of �14 days. The estimated value of β agrees with the median recovery time of 2 weeks for mild COVID-19 cases as reported by the World Health Organisation [55] .

In this study, the best performance was achieved with a damping factor of α = 1. The EnKF method presented may not be the best method for estimating a basic SIR model and thus should be considered as an alternative for inverse modelling. Other more straightforward methods and optimization techniques such as least-square techniques [3, 10, 12, 14, 15] , Differential Evolution method [18] , and Stochastic and Multiobjective Fractal Search algorithm [2] can be employed. An advantage of using EnKF lies in the fact that it can provide a reliable uncertainty in the estimated parameter values. Hence, the EnKF makes it easier to quantify estimation uncertainty. Moreover, in comparison to other optimization methods, the observed data is assimilated in real-time with EnKF. EnKF is a derivative-free method in the sense that it does not require derivatives of the model function. This gives the EnKF an advantage over other optimization techniques that require derivatives, such as an extended Kalman filter. Hence, the EnKF can be used with any forward model, including complex and high dimensional models. However, the EnkF can be computationally demanding, especially with a larger ensemble size.

Our study's obvious limitation is determining an optimal value of the damping factor for inverse modelling using the EnKF. It is important to emphasize that COVID-19 data of only Hubei province, China, was used in this study. Also, the inverse modelling was performed using the data after January 22, 2020, when there was more control of individual movements due to the Chinese government's various restrictions. Further studies are warranted to find an optimal value of the damping factor. The one straightforward recommendation is to apply the proposed method using COVID-19 data of other countries to identify any similarities in the damping factor.

Two cases considering the real data, case_orig and case_mod, were used to study the performance of EnKF in terms of model estimation. Even though there was a slight difference between the estimated models from the two cases, one can apply a different procedure to remove the outliers from the reported data and obtain another time-series data for the infected (active) number of cases, e.g. one may adopt the method presented in Arroyo-Marioli et al. [25] . Moreover, different forms of time-varying parameters can be used in this study, e.g. Yang et al. [20] estimated the transmission (infection) rate, β(t), as a piecewise constant function in time while Arroyo-Marioli et al. [25] used a random walk model for β(t).

However, the results obtained from the use of the EnKF demonstrate its usefulness in estimating the unknown and uncertain parameters of an epidemic model. Moreover, the EnKF 

In this study, we evaluated an augmented Ensemble Kalman Filter's capability to estimate time-varying model parameters using two types of observational data, i.e., synthetic data and with COVID-19 data of Hubei province, China. Furthermore, we investigated the effect of the damping factor on the performance of the EnKF. Three time-varying SIRD model parameters were determined by estimating nine constant parameters. The best performance of EnKF was obtained using the basic EnKF scheme. Good performance was achieved with a small ensemble size of 200. The results presented in this study shows that epidemiological models can be estimated using EnKF even from imperfect data that can result from missing, incomplete or incorrect data. As an alternative to existing 

The ensemble Kalman filter and epidemiological models optimization techniques, one can use the EnKF algorithm presented in this paper to estimate uncertain and unknown model parameters with their associated uncertainties.

Conceptualization: Rajnesh Lal, Weidong Huang, Zhenquan Li. 

COVID-19: The unreasonable effectiveness of simple models

Identification of an Epidemiological Model to Simulate the COVID-19

Epidemic Using Robust Multiobjective Optimization and Stochastic Fractal Search

Data-based analysis, modelling and forecasting of the COVID-19 outbreak

Mathematical analysis of SIRD model of COVID-19 with Caputo fractional derivative based on real data

A conceptual model for the coronavirus disease 2019 (COVID-19) outbreak in Wuhan, China with individual reaction and governmental action

Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia

When will the battle against novel coronavirus end in Wuhan: A SEIR modeling analysis

Estimating the prevalence and risk of COVID-19 among international travelers and evacuees of Wuhan through modeling and case reports

Mathematical modeling of COVID-19 transmission dynamics with a case study of Wuhan

Modeling and forecasting the spread tendency of the COVID-19 in China

Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2)

Simulation of the Final Size of the Evolution Curve of Coronavirus Epidemic in Morocco using the SIR Model

A fractional-order model for the novel coronavirus (COVID-19) outbreak

Monitoring Italian COVID-19 spread by a forced SEIRD model

Identification and Estimation of the SEIRD Epidemic Model for COVID-19

A time-varying SIRD model for the COVID-19 contagion in Italy

Modeling and forecasting the early evolution of the COVID-19 pandemic in Brazil

Determination of an optimal control strategy for vaccine administration in COVID-19 pandemic treatment

Modeling the progression of COVID-19 deaths using Kalman Filter and AutoML. Soft Computing

Short-term forecasts and long-term mitigation evaluations for the COVID-19 epidemic in Hubei province

Sequential data assimilation of the stochastic SEIR epidemic model for regional COVID-19 dynamics

Use of a Modified SIRD Model to Analyze COVID-19 Data. Industrial & Engineering Chemistry Research

How simulation modelling can help reduce the impact of COVID-19

Multi-mission satellite remote sensing data for improving land hydrological models via data assimilation

Tracking r of COVID-19: A new real-time estimation using the Kalman filter

An extended SEIR model with vaccination for forecasting the COVID-19 pandemic in Saudi Arabia using an ensemble Kalman filter

Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics

Improving parameter and state estimation of a hydrological model with the ensemble square root filter

Probabilistic forecasting and Bayesian data assimilation

Real-time groundwater flow modeling with the ensemble Kalman filter: Joint estimation of states and parameters and the filter inbreeding problem

A note on the estimation of the initial number of susceptible individuals in the general epidemic model

Improved estimation of the initial number of susceptible individuals in the general stochastic epidemic model using penalized likelihood

The applications of capture-recapture models to epidemiological data

Describing the COVID-19 outbreak during the lockdown: fitting modified SIR models to data

Data assimilation in integrated hydrological modeling using ensemble Kalman filtering: evaluating the effect of ensemble size and localization on filter performance

Official news on the change of diagnosed approaches issued by the Chinese govenment

Diagnosis and treatment of Corona Virus Disease 2019 (trial version 5)

Summary of the COVID-19 epidemic and estimating the effects of emergency responses in China

Analysis and forecast of COVID-19 spreading in China

Data analytics and mathematical modeling for simulating the dynamics of COVID-19 epidemic-a case study of India

An SEIARD epidemic model for COVID-19 in Mexico: mathematical analysis and state-level forecast

Nonlinear measurement function in the ensemble Kalman filter

What is the ensemble Kalman filter and how well does it work?

Ensemble Kalman filtering

A modified ensemble Kalman particle filter for non-Gaussian systems with nonlinear measurement functions

Comparing seven variants of the ensemble kalman filter: How many synthetic experiments are needed?

Analysis scheme in the ensemble Kalman filter

Joint inference of groundwater-recharge and hydraulic-conductivity fields from head data using the ensemble Kalman filter

Data assimilation in integrated hydrological modelling in the presence of observation bias

COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. 2020. Available online

National Health Commission of China

Simulating and forecasting the cumulative confirmed cases of SARS-Cov-2 in China by Boltzmann function-based regression analyses

Estimation of infection density and epidemic size of COVID-19 using the back-calculation algorithm. Health Information Science and Systems

A SIR model assumption for the spread of COVID-19 in different communities

Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19)

The authors have declared than 1% for recovered and death cases and less than 2% for active cases. Also, all R 2 values are