key: cord-0682381-cqo1xn3d
authors: Tat Dat, Tô; Frédéric, Protin; Hang, Nguyen T. T.; Jules, Martel; Duc Thang, Nguyen; Piffault, Charles; Willy, Rodríguez; Susely, Figueroa; Lê, Hông Vân; Tuschmann, Wilderich; Tien Zung, Nguyen
title: Epidemic Dynamics via Wavelet Theory and Machine Learning with Applications to Covid-19
date: 2020-12-18
journal: Biology (Basel)
DOI: 10.3390/biology9120477
sha: 302a8cea677a60e3c534e5e3cd73828647403c97
doc_id: 682381
cord_uid: cqo1xn3d

SIMPLE SUMMARY: Using tools from both mathematics (especially wavelet theory) and computer science (machine learning), we present a general new method for modelling the evolution of epidemics which is not restricted to human populations. A crucial novel feature of our approach is that it significantly takes into account that an epidemic may take place in certain types of waves which cannot only be of a global as well as local nature, but can also occur at multiple different times and locations. In the particular case of the current Covid-19 pandemic, based on recent figures from the Johns Hopkins database we apply our model to France, Germany, Italy, the Czech Republic, as well as the US federal states New York and Florida, and compare it and its predictions to established as well as other recently developed forecasting methods and techniques. ABSTRACT: We introduce the concept of epidemic-fitted wavelets which comprise, in particular, as special cases the number [Formula: see text] of infectious individuals at time t in classical SIR models and their derivatives. We present a novel method for modelling epidemic dynamics by a model selection method using wavelet theory and, for its applications, machine learning-based curve fitting techniques. Our universal models are functions that are finite linear combinations of epidemic-fitted wavelets. We apply our method by modelling and forecasting, based on the Johns Hopkins University dataset, the spread of the current Covid-19 (SARS-CoV-2) epidemic in France, Germany, Italy and the Czech Republic, as well as in the US federal states New York and Florida.

The present work proposes a novel method for modelling epidemic dynamics by combining wavelet theory and data-driven model section techniques in machine learning.

In understanding epidemic diffusion and the growth rate of an infectious disease at population level, the actual number of reported cases of infections always plays a (if not the) crucial role and, far beyond that, at least in the case of diseases afflicting human societies, directly influences government and health care system decisions and measures regarding, e.g., protection, containment and hospital capacities. However, due to both the manifold practical as well as conceptual issues involved, a rigorous and accurate detection of this number turns out to be a rather difficult and complex problem.

To illustrate at least some of the theoretical difficulties involved here by a prominent and important case which calls the entire world now to action, let us note that most current mathematical modelling and forecasting techniques for the spread of the Covid-19 disease are based on classical Susceptible-Infectious-Recovered/Removed (SIR) and Susceptible-Exposed-Infectious-Recovered/Removed (SEIR) compartmental epidemiological models [1] [2] [3] . Yet, with regard to predicting the number of infectious cases I(t) at time t, they suffer from severe and model-inherent principal limitations:

All these models, as well as all their derivatives, are not suitable to build a model for the function I(t) which is compatible with any given population. This is because these models are based on the assumption that the population is homogeneously composed and distributed (i.e., the chance that an arbitrary infected person will infect an arbitrary susceptible person is taken to be constant throughout the epidemic, and, moreover, it is assumed that at any given time every infected person has one and the same constant chance to recover).

In real life, however, there are actually many and rather diverse waves of outbreaks, stemming from different times or locations. One faces here not only drastically varying growth rates, but also hot spots versus no-cluster locations, infection rates depending on age or other parameters, etc. which altogether entails that the homogeneity assumption approach taken in SIR models and their variations is oversimplified and cannot give realistic forecasts.

To overcome the drawbacks caused by homogeneity assumptions, the new approach presented in this work is based on the following idea: we shall decompose the growth curve of infection numbers into several basic "waves", where each basic wave is considered as a representation of the epidemic, and localised both in time and position.

This point of view naturally calls for the use of wavelet theory. Wavelets as such are special families of functions which came up in the 1980s by combining older concepts from mathematics, computer science, electrical engineering and physics, having since found fruitful applications in many other disciplines. In particular, some precursor, wave-based approaches to modelling epidemic growth appeared already a long time before wavelets emerged in both deterministic and stochastic models, compare, among others, the works in [4] [5] [6] [7] , and only very recently, Krantz et al. (compare with the work in [8] ). Moreover, the latter work has also proposed building epidemic growth models by combining wavelet with discrete graph theory (see also below).

In this article, we propose an approach to epidemic dynamics by modelling the number of daily reported cases using specially designed wavelets, called epidemic-fitted (EF) wavelets. For instance, the number I(t) of infectious individuals at time t in the classical SIR and SEIR models is an EF wavelet, see Section 3.4. Another example of an EF wavelet is the log-normal one, which we will use in our Covid-19 spread forecasting applications, see Sections 3.4 and 4.1 for more details.

In our approach, the number of daily reported cases is the value of a function that is a positive linear combination of N EF wavelets at the given day. We fix the number N of summands of EF wavelets entering in our modelling function (and in our applications N is usually taken to be 3 or 5). The wavelet series coefficients themselves are then obtained by machine learning-based curve fitting methods with square loss function, see Sections 2.2 and 4.1.

We then proceed with specific applications to Covid-19 scenarios. Here, we present, now using in addition data-driven machine learning-based curve fitting, some of our model's predictions to selected countries and US federal states, which are based on the currently existing respective data for these locations provided by the most recent numbers supplied by the Johns Hopkins University Covid-19 database.

Before mentioning and commenting upon other related works, let us adopt from now on, and throughout all following parts of the present work, the following convention: as we shall consider only reported cases in our paper, we will omit the adjective "reported" from "reported cases of infected". In [9] , the authors present three basic "macroscopic" models to fit data emerging from local and national governments: exponential growth, self-exciting branching process and compartmental models. The compartmental models are the classical SIR and SEIR models; the self-exciting branching process has been used before with regard to treating Ebola disease outbreaks and other dynamics of social interaction. In the exponential growth model, the number I(t) of infectious individuals at time t is expressed as I(t) = I 0 e αt , where α is the rate constant. The exponential growth model is related to our approach, in which the exponential function is modelling the reported infections. However, as this is a one-parameter model, it works only well for fitting the data at the beginning of an outbreak.

In [10] , the authors use a log-normal density function with three parameters to fit the daily reported cases. However, as they tried to fit the data with only one function, the curve of reported cases may not be well fitted, as there are usually several waves of the epidemic for a period while one function presents only one wave. As explained above, our wavelet approach does overcome this difficulty. In [11] , the authors use the function f (x) = kγβα β x −1−β exp(−γ(α/x) β ) with parameters α, β, γ, k to fit all data. This method, too, can fit the data only for one wave. In [12] , the authors fit the data of daily reported cases with a two-wave model, using the sum of two Gaussian functions.

In [13] , the authors introduce an epidemic model composed of overlapping sub-epidemic waves, where each wave is a generalised logistic growth model given by solution of differential equations. A short-term forecast of the Covid-19 epidemic in China from 5 to 24 February 2020 was given in [14] using three phenomenological models (generalised logistic growth model, the Richards growth model and sub-epidemic wave model in [13] ) and ensemble methods (see also [15] for the ensemble approach in forecasting epidemic trajectories). In [16] , a multi-wave model combining several SIR models, namely, a Multiple-Wave Forced-SIR model, was introduced to fit the data of daily cases.

Recently, Krantz et al. [8] have proposed an approach to construct epidemic growth models using fractional wavelets. These are built from the number of reported cases to construct wavelets that model the dynamics of the number of completed cases [8] . In their paper, the number of completed cases is the sum of the number of reported cases and the number of unreported cases. Furthermore, the proposed approach there is to update their models assuming the availability of the reporting error which improves over time and tends to zero eventually. This assumption appears to us, however, as a too idealistic one.

Those two last approaches are the ones which are most closely related to our own. However, while those use single waves coming from solutions differential equations, we use general wavelet functions such as Gaussian functions, log-normal functions, Gompertz density functions and Beta prime density functions, which all satisfy our general condition of being epidemic-fitted in the sense of Definition 2. We also refer to the works in for other approaches on modelling and forecasting the spread of Covid-19 epidemic using deep learning, machine learning, time series analysis, network model, stochastic model and deterministic compartmental framework.

The remaining parts of the present paper are organised as follows. In Section 2, we first recall the notion of a wavelet (Definition 1) and the fundamental theorem of wavelet theory (Theorem 1), which we are going to put to use in the sequel. We proceed by introducing the notion of an epidemic-fitted (EF) wavelet (Definition 2) and propose our method for modelling epidemic dynamics (Proposition 1), justified by the fundamental Theorem 1. In Section 3, we consider several important examples of EF wavelets and impose constraints on an EF wavelet to be suitable as a basic EF wavelet in epidemic dynamics. In Section 4, we present applications of our method to modelling and forecasting the current spread of Covid-19 in France, Germany, Italy, the Czech Republic and several US federal states, all based on the most recent JHU data.

In this subsection, we recall and collect some basic concepts and facts from Wavelet Theory (cf. [42] [43] [44] ), which will be needed in our approach for modelling epidemic dynamics.

A wavelet or mother wavelet is a function ψ ∈ L 1 (R) such that the following admissibility condition holds:

whereψ is the Fourier transform of ψ, i.e.,ψ(ξ) = R ψ(x)e −iξx dx.

Notice that condition (1) is only satisfied ifψ(0) = 0 or ψ(x)dx = 0. Conversely, we have the following sufficient condition for (1).

A basic example of a wavelet is the function

From a mother wavelet one can generate other wavelets (called children wavelets), using affine transformations (i.e., dilations and translations):

These wavelets provide us with the following decomposition of L 2 (R). [25] [26] ). Let ψ be a mother wavelet. Then, any f ∈ L 2 (R) decomposes as

Any function f ∈ L 2 (R) can then be written as a superposition of ψ a k ,b , i.e.,

We refer to the work in [42] for more details on the analysis of discrete wavelet decomposition and, especially, for precise formulas for the coefficients α k, .

Notice that from a machine learning point of view, finding the α k, , a k , b can be thought of as a curve fitting problem, and this is how we will combine wavelet theory and machine learning techniques in our approach to modelling epidemic dynamics.

As we already explained in the introduction, the time development of an epidemic features local as well as global wave-type phenomena. This leads us to the concept of epidemic-fitted wavelets. Informally speaking, such a wavelet is given by a positive real function W : R → R >0 , whose value W(t) at a given time t describes the number of new infected cases in a homogeneous population with respect to an epidemic that occurs in one wave only, and thus will satisfy some sort of homogeneous compartmental model (without network structure).

As we are interested in the daily infected cases, we can assume that W(t) is strictly positive but tends to 0 when t tends to ±∞. Setting w(t) = ln W(t) so that W(t) = e w(t) , the (multiplicative) growth rate of W is its log-derivative:

We wish W(t) to "start" at t = a, (reach its) "peak" at t = χ, and "stop"

and ψ admits its maximum at some point in (a, b).

We can interpret ψ as a waveletψ in the sense of Definition 1 by simply settingψ(

Rψ (x)dx = 0 and R |ψ(x)|(1 + |x|)dx < ∞, therefore Cψ < ∞ by Lemma 1 andψ is a wavelet.

The first examples of EF wavelets which come to mind are polynomial functions of degree 3 (restricted to some finite interval). Other examples of functions with start-peak-stop behaviour are Gaussian functions, log-normal functions, Gompertz density functions

and, in SIR models, the solution function giving the number of I(t), the number of infectious individuals (cf. the work in [45] , etc.). In our applications to real data (see Section 4), we will employ log-normal functions as epidemic-fitted (EF) wavelets. For treating an epidemic, we will concentrate on the curve of daily (reported) infected cases, denoted by RC(t), and try to understand the epidemic growth based on this information. Theorem 1 implies that our following ansatz is "asymptotically" correct, as the number N grows to infinity. In particular, numerical simulations involving bigger and bigger numbers N will lead to better and better accuracy. Proposition 1 (Ansatz). A positive function (or curve) whose value is the number of infected cases at time t is representable as a finite linear combination of epidemic-fitted wavelets:

where each such wavelet W i can be obtained from a basic (mother) EF wavelet ψ by adding some parameters

Using this ansatz, we shall model epidemic dynamics by finding the wavelet series coefficients α i and θ i in the decomposition (5), when given the number of infected cases over a sufficient long time frame. This amounts to solving a curve fitting problem in machine learning.

In this section, we introduce some epidemic models with different basic (mother) epidemic-fitted (EF) wavelets. In Section 4, we show by fitting the Covid-19 data that log-normal EF wavelet models are highly compatible with the data and lead to very good forecast projections.

The standard Gaussian function is a fundamental example of a function which has start-peak-stop behaviour and exponential growth:

x → exp(−x 2 /2).

After dilating and translating, we obtain a general Gaussian function

We remark that, in general, lim x→−∞ ψ b,c (x) = 0, but for certain b, c > 0 we have ψ b,c (0) 1. In this case, we can simply setψ(x) = max(ψ b,c (x) − ψ b,c (0), 0) as the corresponding Gaussian EF wavelet.

In [12] , the authors fitted the data of daily reported cases with a two-wave model using the sum of two Gaussian functions. However, as these are symmetric with respect to the the vertical line x = b, this model may be not compatible with the curve of daily cases. We will explain this point in further detail in the next section.

We define here the log-normal function, which is a Gaussian function in which the variable x is interchanged by log x:

We then define the corresponding log-normal wavelet by extending

Thus, we can rewrite it as

By dilating and translating, we obtain a general log-normal EF wavelet 

Based on probability distributions, we can also choose many other functions to build a basic EF wavelet. For example, one can start here from Gompertz density functions

or Beta prime density functions

where B is the Beta function. For appropriately chosen parameters b, c, they all satisfy the epidemic-fitted condition in Definition 2.

Another important class of EF wavelets is given by the function reporting the number of infectious individuals I(t) in compartmental SIR models and their variations (such as SEIR and SIRD models, etc.). The SIR (compartmental) model was introduced by W. O. Kermack and A. G. McKendrick [2] , in which they considered a fixed population with only three compartments, and the numbers S(t) (for "susceptible"), I(t) (for "infectious"), and R(t) (for "recovered" (or "removed")).

In Figure 2 , these curves show the number of infectious individuals I(t).

In general, I(t) is an implicit function defined by a system of differential equations, which can lead to difficulties when trying to fit the data. However, we can use here the implicit solutions for simple SIR models which were deduced recently in [45] . 

We explain here how to choose good EF wavelets for building an epidemic model. The first criterion to meet is the start-peak-stop behaviour as discussed in Section 2. Our second criterion is based on the following analysis of the number I(t) of infectious individuals in the SIR model:

A closer look at SIR models reveals that the number S(t) of susceptible individuals is decreasing in time. Therefore, the number I(t) of infectious also grows less and the rate of infectious, i.e., dI/dt, before the peak is always less than the one after the peak. This is an important criterion when choosing EF wavelets.

Log-normal EF wavelets actually turn out to be very good candidates in this regard. Indeed, the first advantage here is the start-peak-stop behaviour, where the start for a log-normal wavelet is at x = 0 (or near 0), the peak is achieved at x = e b and the stop depends on the constant c. The second advantage is that at the same value of ψ, the rate of the curve before the peak is less than the one after the peak. This can be easily seen as follows. The derivative of ψ b,c is

Now, suppose that ψ(

These are the main reasons why we first chose log normal functions as basic EF wavelets for our numerical simulations (see Section 4) . We also remark that in [10] the authors used the log-normal density function, i.e., f a,b,c (x) = a √ 2πcx ψ b,c (x), to fit the number of daily reported cases. However, as they used only one single function, and as there are in general many waves of the epidemic, the data may not be well-fitted enough to produce realistic projections.

In this section, using log-normal EF wavelets we provide numerical results on the fitting and forecasting of daily new cases of Covid-19 epidemic for some European countries and US federal states.

Our EF wavelet model for the curve of daily new cases is a finite representation by log-normal EF wavelest introduced in Section 4.1:

where a i , b i , c i are parameters, N is the number of log-normal EF wavelets and t is the time variable.

We intend to find the parameters a i , b i , c i such that W(t) is close to the number of daily infections RC(t) by a suitable loss function L(·, ·). In other words, we want to find parameters which minimise L(W, RC). For our numerical simulations presented in the next section of this work, we shall use the Levenberg-Marquardt algorithm (cf. [46, 47] ) for the least squares loss function. The main advantage of this approach is that the loss function helps us to force the peaks of EF wavelets close to the peaks of real data.

The number of log-normal wavelets N depends on the data of each population level, since it presents the numbers sub-epidemic. In our numerical simulations, we first try with N = 3, 5. It would be interesting to estimate N before fitting the model. Otherwise, we will need to choose N sufficiently large, and redundant wavelets will have very small coefficients and, correspondingly, very little effect.

We will be using the data supplied by the Johns Hopkins University Center [48] , noting, however, that almost all data from countries or US federal states are subject to (high) noise. One of the main reason for this is the reporting delay (cf. [49, 50] ). As explained in [50] , "there will be two main sources of delay in monitoring trends. First of all, there will be a testing delay between the actual date when an individual becomes infected and the date when that individual is ultimately tested. Second, unless test samples are very rapidly processed, there will be a further reporting delay between the date of testing and the date the test results are communicated by the reporting entity."

In order to reduce noise, we do smooth out the real data using a (two-sided) moving average method (cf. [51] Chapter 3, cf. [52, 53] ). A moving average is a time series constructed by taking averages of several sequential values of another time series which is a type of mathematical convolution. In statistics, two-sided moving averages are used to smooth a time series in order to estimate or highlight the underlying trend. If we represent the original time series by x 1 , . . . , x n , then a (simple) two-sided moving average of the time series will be given bȳ

If the data are showing a periodic fluctuation, moving averages of periods of equal length will eliminate the periodic variations (cf. [51, 52] ). Observing various population levels indicates that there is periodic fluctuation of 7 days on the data, and thus we will take the average of 7 days

In Figures 3-11 , the green curve shows the approximate number of daily confirmed new cases and also a possible scenario with a 60-day projection for the Czech Republic (or, in short: Czechia), France, Germany and Italy. Other curves present log-normal EF wavelets where each one can be seen as a sub-epidemic, localised both in time and location. These EF wavelets then give us the nowcasting for the epidemic situation for each population level, i.e., forecasts present sub-epidemics, recent sub-epidemics and the combination of sub-epidemics.

For validation, we use the metric relative percentage difference:

where y i is the real data at day i smoothed by a 7-days moving average andŷ i is the prediction of our model. We fit our model with the data of daily cases until 19 October and keep the last 6 days (20-25 October) for the validation set, then obtain the average error of 4.17% for Czechia, 7.48% for Germany and 3.25% for Italy (see Table 1 ). However, we obtain an average error of 32.61% for France (see Figure 6 ) on the validation set from 20-25 October. We remark here that in some periods of 3 consecutive days the total cases of France remain constant in the Johns Hopkins University data [48] , and the total cases are updated by summing up for the day after these 3 days. For example, the periods 9-11 October and 16-18 October show 732,434 and 876,342 total cases, respectively. This makes the daily reported cases equal to zero in some 2 consecutive days. Using a moving average of 7 days we overcome this situation and then use the smoothing data for the projections shown in Figures 5-7 . 

In Figures 16 and 17 , the green curve shows the projections for Florida, New York from 25 October 2020. 

In this section, we compare our approach to other methods in statistical analysis for forecasting: simple moving average (SMA), autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA). We chose two situations: before the first epidemic peak and after the first epidemic peak. We take the average of 7 days for SMA. The parameters for ARMA are p = 7, q = 7 and the ones for ARIMA are p = 7, d = 2, q = 7.

In Figure 20 , we compare the forecastings of 20 days from 20 March. We can see that our model can give a good prediction for the peak. In Figure 21 , we compare the forecastings of 20 days from 06 April. This shows that our model also gives good results here. 

The numerical results in the last section of our paper suggest that our models are actually able to predict the number of daily infected Covid-19 individuals many days ahead in many different countries. In particular, our approach also gives reasonable results for the epidemic situation on population levels by precising sub-epidemics corresponding to EF wavelets.

For solving the curve fitting problem in our model selection, we only have to use relatively few parameters. The model can be seen as a neural network containing only one hidden layer with a log-normal function activation, entailing that we do not have to deal with overfitting problems and that the estimation error of our model is low [54] Our method for modelling the number of daily reported cases of infectious individuals also applies to other epidemics characteristics, e.g., to the number of active cases, and thus is also important for health care system decisions.

In future work, we will present refinements of our approach as well as refinements of the curve fitting techniques employed here. We will also extend our approach based on the epidemic-fitted wavelet approach to situations where EF wavelets are multivariate functions of time variables, measurement levels, or other variables such as death rate, recovery rate, etc. 

Mathematical epidemiology

A Contribution to the Mathematical Theory of Epidemics

Mathematical models for COVID-19: Applications, limitations, and potentials

Deterministic and stochastic models for recurrent epidemics

Measles periodicity and community size

Modeling Infectious Diseases in Humans and Animals

The interpretation of periodicity in disease prevalence

True epidemic growth construction through harmonic analysis

The challenges of modeling and forecasting the spread of COVID-19

Curve-fitting approach for COVID-19 data and its physical background

Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing

A two-wave epidemiological model of COVID-19 outbreaks using MS-Excel

A novel sub-epidemic modeling framework for short-term forecasting epidemic waves

Real-time forecasts of the COVID-19 epidemic in China from

Real-time forecasting of epidemic trajectories using computational dynamic ensembles

Multiple Epidemic Wave Model of the COVID-19 Pandemic: Modeling Study

Modeling behavioral change and COVID-19 containment in Mexico: A trade-off between lockdown and compliance

A Spatial-Temporal Model for the Evolution of the COVID-19

A mathematical model for the spatiotemporal epidemic spreading of COVID19

Mathematical Parameters of the COVID-19 Epidemic in Brazil and Evaluation of the Impact of Different Public Health Measures

SI epidemic model applied to COVID-19 data in mainland China

Host and infectivity prediction of Wuhan 2019 novel coronavirus using deep learning algorithm

Prediction and Analysis of Corona Virus Disease

Forecasting of COVID19 per regions using ARIMA models and polynomial functions

In-host Mathematical Modelling of COVID-19 in Humans

Multiple-Input Deep Convolutional Neural Network 2 Model for COVID-19 Forecasting in China

Mathematical Modeling and Analysis of COVID-19 pandemic in Nigeria

Examining COVID-19 Forecasting using Spatio-Temporal Graph Neural Networks. arXiv 2020

Early dynamics of transmission and control of COVID-19: A mathematical modelling study

Predicting the cumulative number of cases for the COVID-19 epidemic in China from early data

A COVID-19 epidemic model with latency period

Predicting the number of reported and unreported cases for the COVID-19 epidemics in China

Modeling COVID-19 pandemic using Bayesian analysis with application to Slovene data

Modeling COVID-19 scenarios for the United States

Forecasting COVID-19 outbreak progression using hybrid polynomial-Bayesian ridge regression model

Towards unified and real-time analyses of outbreaks at country-level during pandemics. One Health

Summer COVID-19 third wave: Faster high altitude spread suggests high UV adaptation

Examining Deep Learning Models with Multiple Data Sources for COVID-19 Forecasting

A data-driven network model for the emerging COVID-19 epidemics in Wuhan, Toronto and Italy

Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions

Series Attention Model for COVID-19 Forecasting. arXiv 2020

Algorithms and Applications

Vibrations and Scalings

Exact solution to a dynamic SIR model

A Method for the Solution of Certain Non-Linear Problems in Least Squares

An Algorithm for Least-Squares Estimation of Nonlinear Parameters

Covid-19 Data

Interpreting SARS-CoV-2 fatality rate estimates-A case for introducing standardized reporting to improve communication

Overcoming Reporting Delays Is Critical to Timely Epidemic Monitoring: The Case of COVID-19 in

Methods and Applications

Moving Averages

Smoothing Methods in Statistics

Understanding Machine Learning: From Theory to Algorithms

Acknowledgments: It is our pleasure to thank Tat Dat Tran and Vit Fojtik for useful suggestions and comments on an earlier version of this paper.

The authors declare no conflict of interest.