key: cord-0777673-so3gam2t
authors: Maleki, Mohsen; Mahmoudi, Mohammad Reza; Heydari, Mohammad Hossein; Pho, Kim-Hung
title: Modeling and forecasting the spread and death rate of coronavirus (COVID-19) in the world using time series models
date: 2020-07-25
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110151
sha: 10a86e02703bbf468c8a909394b6774c1ba1c934
doc_id: 777673
cord_uid: so3gam2t

Coronaviruses are a large family of viruses that affect the neurological, gastrointestinal, hepatic and respiratory systems. The number of confirmed cases is increasing daily in different countries, especially in the United States of America, Spain, Italy, Germany, China, Iran, South Korea and others. The spread of COVID-19 poses many dangers and calls for strict, dedicated plans and policies. To design such plans and policies, predicting and forecasting future confirmed cases is critical. Time series models are useful for modeling data that are gathered and indexed by time. Symmetry of the error distribution is an essential condition in classical time series models, but there are real-world cases in which the assumption of symmetrically distributed error terms is not satisfactory. In our methodology, the error distribution is taken to be a two-piece scale mixture of normal (TP-SMN) distribution. The proposed time series models, which perform better than ordinary Gaussian and symmetric models (especially for COVID-19 datasets), were first fitted to the historical COVID-19 datasets. Then, the time series model with the best fit to each dataset was selected. Finally, the selected models were applied to predict the number of confirmed cases and the death rate of COVID-19 in the world.

Coronaviruses are a large family of viruses that affect the neurological, gastrointestinal, hepatic and respiratory systems. Members of this family can infect humans, bats, mice, livestock, birds and other animals [1] [2] [3] . In 2003, a type of coronavirus called SARS coronavirus (SARS-CoV) was transmitted from animal to animal [4] . In 2012, another type of coronavirus, named MERS coronavirus (MERS-CoV), was transmitted to a significant extent from human to human [4] . Late in 2019, the World Health Organization (WHO) reported many cases of respiratory disease in China. It was verified that most of the reported cases had been in contact with persons who had gone to a seafood market in Wuhan [5] . A new type of coronavirus, named COVID-19 (also called 2019-nCoV), then began spreading in Wuhan [6] . Scientists believe that COVID-19 behaves in humans similarly to coronaviruses found in bats. However, more scientific studies are needed to identify the main source of COVID-19. According to the reports, COVID-19 has been observed in other cities in China and in about 198 other countries (up to 06 February 2020). The Centers for Disease Control and Prevention (CDC) verified that COVID-19 is transmitted from human to human. According to the CDC's reports, COVID-19 is spread through close contact, through the air, and by touching surfaces or objects that carry viral particles. COVID-19 is dangerous because its incubation period can be as long as 14 days [7] , and it can spread to others during the incubation period. Recent research indicates that the median incubation period and the median age of confirmed cases are 3 days and 47.0 years, respectively [8] .

The number of confirmed cases has increased daily in different countries, especially in the United States of America, Italy, Spain, Germany, Iran, China and other countries. The spread of COVID-19 poses many dangers and calls for strict, dedicated plans and policies. To design such plans and policies, predicting and forecasting future confirmed cases is critical. The number of unreported COVID-19 cases in China was estimated mathematically in [9] . Using a data-driven analysis, the authors estimated that there were 469 unreported COVID-19 cases in China during 1-15 January 2020. Based on information about Japanese passengers evacuated from Wuhan, Nishiura et al. [10] estimated the infection rate of COVID-19 in Wuhan. The results indicated an infection rate of 9.5% and a death rate between 0.3% and 0.6%. Since the population considered was very small, the accuracy of the estimated rates is in doubt. Based on a mathematical model, Tang et al. [11] concluded that the transmission risk of COVID-19, in terms of the basic reproduction number, is about 6.47 on average and predicted the time at which the peak of COVID-19 would be reached. Using information on 47 patients, Thompson [12] estimated a probability of sustained human-to-human transmission of 0.4 for COVID-19. Based on two different scenarios, Jung et al. [13] concluded that the risk of death is 5.1% and 8.4%, respectively. Al-qaness et al. [14] proposed an optimization method, named FPASSA-ANFIS, to model the number of confirmed COVID-19 cases and to predict its future values using previously recorded data from China. Their technique combines an adaptive neuro-fuzzy inference system (ANFIS), the flower pollination algorithm (FPA) and the salp swarm algorithm (SSA). The salp swarm algorithm is used to improve the flower pollination algorithm and avoid its drawbacks, such as becoming trapped at a local optimum. The FPASSA-ANFIS model aims to improve the ability and accuracy of the neuro-fuzzy system by determining the parameters of the adaptive neuro-fuzzy inference system with the salp swarm and flower pollination algorithms. The ability and applicability of the FPASSA-ANFIS technique were studied using the real dataset on the COVID-19 outbreak provided by WHO, and the technique was then applied to forecast the confirmed cases in future days.

Modeling, forecasting, predicting and estimating the characteristics of epidemiological problems have been considered in previous studies. Examples include forecasting the cases and transmission risk of West Nile virus (WNV) [15] , forecasting the infection of hepatitis A virus [16] , forecasting the seasonal outbreaks of influenza [17, 18] , forecasting the outbreaks of Ebola [19] , estimating the infection rate of SARS [20] , modeling influenza A (H1N1-2009) [21] , and predicting the outbreaks of MERS [22] .

Time series models are useful for modeling data that are gathered and indexed by time. Time series analysis has been used effectively to model, estimate, forecast and predict real practical problems; see refs. [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] . Symmetry of the error distribution is an essential condition in classical time series models. However, there are many real-world cases in which the assumption of symmetrically distributed error terms is not satisfactory (see, e.g., refs. [25-32]), so in our methodology we consider time series models based on two-piece distributions, especially the two-piece scale mixture of normal (TP-SMN) distributions introduced in refs. [32] [33] [34] [35] [36] [37] [38] . The proposed time series models include the symmetric Gaussian model as well as symmetric/asymmetric and light/heavy-tailed non-Gaussian time series models, and were first fitted to the historical COVID-19 datasets. Then, the time series model with the best fit to each dataset was selected. Finally, the selected models were used to predict the number of confirmed cases and the death rate of COVID-19 in the world. The contributions of this study are as follows:

1. An improved time series model is introduced applying TP-SMN distributions.

2. The new efficient predictive model is applied to predict and estimate the confirmed cases and death rate of COVID-19 in the world, using past and current datasets.

The autoregressive moving-average (ARMA) processes are a useful and accurate class of time series models for modeling and forecasting real datasets. An ARMA model describes a time series through two linear components: one is a linear combination of past values of the series, called the autoregressive (AR) part, and the other is a linear combination of uncorrelated errors, called the moving-average (MA) part. This model was first introduced by Peter Whittle (ref. [39]) and then used in refs. [40] [41] . A process $\{Y_t\}$ is an ARMA$(p, q)$ process if

$$Y_t = \varphi_1 Y_{t-1} + \dots + \varphi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}, \qquad \varepsilon_t \sim \mathrm{WN}(0, \sigma^2), \tag{1}$$

where $\mathrm{WN}(0, \sigma^2)$ refers to a set of uncorrelated and identically distributed zero-mean random variables with variance $\sigma^2$. The cases $q = 0$ and $p = 0$ are called the AR$(p)$ and the MA$(q)$ models, respectively.
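To make the ARMA recursion concrete, the following minimal sketch simulates a series in which each value is a linear combination of past values (AR part) plus the current error and a linear combination of past errors (MA part). The function name `simulate_arma`, the Gaussian choice for the white noise, the burn-in length and the coefficient values are assumptions made for illustration only; they are not taken from the paper.

```python
import random


def simulate_arma(phi, theta, n, sigma=1.0, burn_in=200, seed=0):
    """Simulate n values of an ARMA(p, q) process with p = len(phi), q = len(theta).

    Y_t = sum_i phi[i] * Y_{t-1-i} + eps_t + sum_j theta[j] * eps_{t-1-j}
    """
    rng = random.Random(seed)
    y, eps = [], []
    for t in range(n + burn_in):
        e = rng.gauss(0.0, sigma)  # white noise; Gaussian is an illustrative choice
        ar = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
        y.append(ar + e + ma)
        eps.append(e)
    return y[burn_in:]  # drop the burn-in so the start-up values have little effect


# Hypothetical coefficients: an ARMA(2, 1) series and, with p = q = 0, pure white noise.
series = simulate_arma(phi=[0.5, -0.2], theta=[0.4], n=500)
pure_noise = simulate_arma(phi=[], theta=[], n=500)
```

Setting `theta=[]` gives a pure AR model and `phi=[]` a pure MA model, which mirrors the two special cases named in the text.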

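Following the general two-piece distributions of ref. [33] based on the scale mixtures of normal (SMN) family, the probability density function (pdf) of the TP-SMN family for $y \in \mathbb{R}$, denoted by $Y \sim \text{TP-SMN}(\mu, \sigma, \gamma, \nu)$, is

$$g(y \mid \mu, \sigma, \gamma, \nu) = \begin{cases} 2(1-\gamma)\, f_{\mathrm{SMN}}(y \mid \mu, \sigma(1-\gamma), \nu), & y \le \mu, \\ 2\gamma\, f_{\mathrm{SMN}}(y \mid \mu, \sigma\gamma, \nu), & y > \mu, \end{cases} \tag{2}$$

such that $0 < \gamma < 1$ is the slant coefficient and $f_{\mathrm{SMN}}(\cdot \mid \mu, \sigma, \nu)$ is the pdf of the SMN family.

Lemma 2.1. If $Y \sim \text{TP-SMN}(\mu, \sigma, \gamma, \nu)$, then $Y$ has a stochastic representation given by

$$Y = Z_1 Y^{-} + Z_2 Y^{+}, \tag{3}$$

where $Y^{-} \sim \mathrm{SMN}(\mu, \sigma_1, \nu)\, I_{A}(y)$ and $Y^{+} \sim \mathrm{SMN}(\mu, \sigma_2, \nu)\, I_{A^c}(y)$, for which $\sigma_1 = \sigma(1 - \gamma)$, $\sigma_2 = \sigma\gamma$, $A = (-\infty, \mu)$ and $A^c = [\mu, +\infty)$; $\mathrm{SMN}(\cdot)\, I_{A}(\cdot)$ is the truncated SMN distribution on $A$, and $\mathbf{Z} = (Z_1, Z_2)^\top$ with $Z_1 + Z_2 = 1$ has the following probability mass function (pmf):

$$P(\mathbf{Z} = \mathbf{z}) = \left(\frac{\sigma_1}{\sigma_1 + \sigma_2}\right)^{z_1} \left(\frac{\sigma_2}{\sigma_1 + \sigma_2}\right)^{z_2}, \qquad z_j \in \{0, 1\},\ z_1 + z_2 = 1. \tag{4}$$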
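The mixture form of Lemma 2.1 suggests a simple way to draw from a two-piece distribution: pick the left or right piece with probabilities $1 - \gamma$ and $\gamma$, then draw from the corresponding truncated distribution. The sketch below does this for the two-piece normal only, i.e. the Gaussian member of the family, under the assumption that the density is $2(1-\gamma)$ times a normal density with scale $\sigma(1-\gamma)$ to the left of $\mu$ and $2\gamma$ times a normal density with scale $\sigma\gamma$ to the right. The function names and parameter values are hypothetical, and heavy-tailed members such as the TP-T would additionally need the SMN mixing variable.

```python
import math
import random


def tp_normal_pdf(y, mu, sigma, gamma):
    """Two-piece normal density with location mu, scale sigma and slant gamma in (0, 1)."""
    if y <= mu:
        weight, scale = 1.0 - gamma, sigma * (1.0 - gamma)  # left piece, scale sigma_1
    else:
        weight, scale = gamma, sigma * gamma                 # right piece, scale sigma_2
    z = (y - mu) / scale
    return 2.0 * weight * math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))


def tp_normal_sample(mu, sigma, gamma, rng):
    """Draw one value: choose a side, then draw a half-normal deviation on that side."""
    half = abs(rng.gauss(0.0, 1.0))
    if rng.random() < 1.0 - gamma:                # Z_1 = 1 with probability 1 - gamma
        return mu - sigma * (1.0 - gamma) * half
    return mu + sigma * gamma * half              # Z_2 = 1 with probability gamma


rng = random.Random(42)
draws = [tp_normal_sample(0.0, 2.0, 0.3, rng) for _ in range(20000)]
# Share of draws below mu; it should be close to 1 - gamma = 0.7.
left_share = sum(d < 0.0 for d in draws) / len(draws)
```

With $\gamma = 0.5$ the two pieces have equal scale and the density reduces to an ordinary normal density with standard deviation $\sigma/2$, which is one way to see that the symmetric Gaussian model is a special case.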

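Consider the ARMA$(p, q)$ model (1) with independent and identically distributed (i.i.d.) noises from the TP-SMN family,

$$\varepsilon_t \overset{\text{i.i.d.}}{\sim} \text{TP-SMN}(\mu, \sigma, \gamma, \nu), \tag{5}$$

and assume that $\boldsymbol{\varphi} = (\varphi_1, \dots, \varphi_p)^\top$ and $\boldsymbol{\theta} = (\theta_1, \dots, \theta_q)^\top$ are the AR and MA coefficients of the TP-SMN-ARMA model, respectively. In this work, we denote this model by $\{Y_t\} \sim \text{TP-SMN-ARMA}(p, q)$ with the model parameter $\boldsymbol{\Theta} = (\boldsymbol{\varphi}, \boldsymbol{\theta}, \mu, \sigma_1, \sigma_2, \nu)^\top$ (based on the TP-SMN representation from Lemma 2.1).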

$< \infty$ is satisfied, then $Y_t$ converges in the mean, and this process is strictly stationary with the following mean and covariance functions:

Also, the covariance function at lag $h$ tends to $0$ as $h \to \infty$ (see ref. [42] ).

where $L(\boldsymbol{\Theta})$ is the likelihood function conditional on the initial values (see ref. [40] for details on choosing the initial values and constructing the conditional likelihood function). The log-conditional likelihood function is therefore given by

such that g(⋅) refers to TP-SMN pdf given in (2).
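As a rough illustration of how a conditional log-likelihood of this kind can be evaluated, the sketch below recovers the ARMA errors recursively and sums the log-density of each error. It uses the two-piece normal density (the Gaussian member of the TP-SMN family) written in terms of the two scales $\sigma_1$ and $\sigma_2$. The function names, the zero pre-sample values and the choice to drop the first $\max(p, q)$ terms are assumptions made for this example; the paper defers the actual choice of initial values to ref. [40].

```python
import math


def tp_normal_logpdf(e, mu, sigma1, sigma2):
    """Log-density of a two-piece normal with left scale sigma1 and right scale sigma2."""
    scale = sigma1 if e <= mu else sigma2
    z = (e - mu) / scale
    return (math.log(2.0) - math.log(sigma1 + sigma2)
            - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z)


def arma_residuals(y, phi, theta):
    """Recover eps_t = y_t - AR part - MA part, treating pre-sample values as zero."""
    eps = []
    for t in range(len(y)):
        ar = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
        eps.append(y[t] - ar - ma)
    return eps


def conditional_loglik(y, phi, theta, mu, sigma1, sigma2):
    """Sum of error log-densities, skipping the first max(p, q) start-up terms."""
    start = max(len(phi), len(theta))
    eps = arma_residuals(y, phi, theta)
    return sum(tp_normal_logpdf(e, mu, sigma1, sigma2) for e in eps[start:])


# Toy series with hypothetical AR(1) parameters; equal scales give the Gaussian case.
y = [0.3, -0.1, 0.8, 0.2, -0.5, 0.4, 0.1, -0.2]
ll = conditional_loglik(y, phi=[0.5], theta=[], mu=0.0, sigma1=0.5, sigma2=0.5)
```

When `sigma1 == sigma2` the value coincides with the Gaussian AR(1) conditional log-likelihood with error standard deviation `sigma1`, which gives a quick sanity check of the implementation.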

The SMN densities in the pdf (2) are complex, so finding the maximum likelihood (ML) estimates of the parameters of model (7) directly is not tractable. However, Lemma 2.1 yields a convenient hierarchical form of the TP-SMN family which, together with the proposed ARMA model, allows an EM-type algorithm to be used to estimate the parameters.

Considering Lemma 2.1 and the stochastic representation of the SMN family (ref. [43] ), let $\mathbf{Y}_c = (\mathbf{Y}, \mathbf{U}, \mathbf{Z})^\top$ be the complete data, where $\mathbf{Y}$ denotes the observations, and $\mathbf{U} = (U_1, \dots, U_n)^\top$ and $\mathbf{Z}_t = (Z_{t1}, Z_{t2})^\top$, $t = 1, \dots, n$, are the missing (latent) data. The TP-SMN-ARMA model given by (1) and (5) then has the following hierarchical representation:

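for $t = 1, \dots, n$ and $j = 1, 2$, where $A_t = (-\infty, \boldsymbol{\varphi}^\top \mathbf{Y}_{t-1} + \boldsymbol{\theta}^\top \boldsymbol{\varepsilon}_{t-1} + \mu)$ and $N(\cdot)\, I_{A}(\cdot)$ is the truncated normal distribution on $A$.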

The hierarchical form of the TP-SMN-ARMA process given in (11) and the ECME algorithm, a generalization of the EM algorithm [44] , are used to find the ML estimates. Considering the proposed TP-SMN-ARMA$(p, q)$ model and (11), and ignoring constants, the conditional log-likelihood function is

where $\boldsymbol{\Theta} = (\boldsymbol{\varphi}, \boldsymbol{\theta}, \mu, \sigma_1, \sigma_2, \nu)^\top$. The CM-steps of the ECME algorithm are as follows:

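Next in the CM-steps, solving the depressed cubic equations $\sigma_j^3 + b_j \sigma_j + c_j = 0$, $j = 1, 2$, yields the updates $\hat{\sigma}_j^{(k+1)}$, $j = 1, 2$. The coefficients $b_j$ and $c_j$ depend on the other scale parameter, $\sigma_2 I(j = 1) + \sigma_1 I(j = 2)$. Since $b_j < 0$ and $c_j < 0$, the coefficient sequence of each cubic changes sign exactly once, so by Descartes' rule of signs each equation has a unique root in $(0, +\infty)$.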
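A depressed cubic $x^3 + bx + c = 0$ with $b < 0$ and $c < 0$ is negative at $x = 0$, stays negative up to its single positive root and is positive afterwards, so the root can be bracketed and found by bisection. The sketch below shows one way to do this; the function name and the numerical values of `b` and `c` are hypothetical, and the paper does not state which root-finding method the authors used.

```python
def positive_root_depressed_cubic(b, c, tol=1e-12):
    """Return the unique positive root of x**3 + b*x + c = 0 for b < 0 and c < 0."""
    if not (b < 0 and c < 0):
        raise ValueError("this sketch expects b < 0 and c < 0")

    def f(x):
        return x ** 3 + b * x + c

    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:          # f(0) = c < 0, so expand until the sign changes
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0.0:        # root lies to the right of mid
            lo = mid
        else:                    # root lies to the left of mid
            hi = mid
    return 0.5 * (lo + hi)


# Check on x^3 - 7x - 6 = (x - 3)(x + 1)(x + 2), whose only positive root is 3.
root = positive_root_depressed_cubic(b=-7.0, c=-6.0)
```

Bisection is slower than a closed-form or Newton update but cannot leave the bracket, which is convenient inside an iterative estimation loop where a scale parameter must stay positive.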

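Finally, the CML-step of the ECME algorithm is as follows:

$$\nu^{(k+1)} = \arg\max_{\nu}\, \ell\big(\hat{\boldsymbol{\varphi}}^{(k+1)\top}, \hat{\boldsymbol{\theta}}^{(k+1)\top}, \hat{\mu}^{(k+1)}, \hat{\sigma}_1^{(k+1)}, \hat{\sigma}_2^{(k+1)}, \nu\big).$$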

The proposed algorithm is iterated until a convergence condition is met, i.e., $\left|\ell(\hat{\boldsymbol{\Theta}}^{(k+1)}) / \ell(\hat{\boldsymbol{\Theta}}^{(k)}) - 1\right| \le \epsilon$, where $\epsilon$ is a known, fixed tolerance.
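The stopping rule compares the log-likelihood at two successive iterations and stops once their ratio is within a tolerance of 1. A minimal sketch of such a loop is given below. The names `has_converged` and `run_until_converged`, the tolerance value, the iteration cap and the toy objective are assumptions for illustration; the real update would be the E-, CM- and CML-steps of the ECME algorithm. Note that the ratio is undefined when the previous log-likelihood is exactly zero.

```python
def has_converged(loglik_new, loglik_old, tol=1e-6):
    """Relative-change rule |l_new / l_old - 1| <= tol (requires l_old != 0)."""
    return abs(loglik_new / loglik_old - 1.0) <= tol


def run_until_converged(update, theta0, loglik, tol=1e-6, max_iter=500):
    """Apply `update` repeatedly until the log-likelihood ratio stabilises."""
    theta, ll_old = theta0, loglik(theta0)
    for iteration in range(1, max_iter + 1):
        theta = update(theta)
        ll_new = loglik(theta)
        if has_converged(ll_new, ll_old, tol):
            return theta, iteration
        ll_old = ll_new
    return theta, max_iter  # iteration cap reached without meeting the rule


def toy_loglik(theta):
    """Toy concave objective with maximum at theta = 3 (stand-in for the real one)."""
    return -(theta - 3.0) ** 2 - 1.0


def toy_update(theta):
    """Toy update that moves halfway towards the maximiser at each step."""
    return theta + 0.5 * (3.0 - theta)


estimate, n_iter = run_until_converged(toy_update, 0.0, toy_loglik)
```

The iteration cap is a practical safeguard that the text does not mention; it only prevents an endless loop if the tolerance is never reached.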

The coronavirus (COVID-19) has spread to about 203 countries of the world. Daily data on COVID-19 in the world are reported by the China National Health Commission (NHC) and the World Health Organization (WHO). In this part, we fit the aforementioned time series models to the total confirmed cases in the world, including and excluding China, from 22-Jan-2020 to 08-Apr-2020.

Time series plots of the total and daily confirmed cases in the world from 22-Jan-2020 to 08-Apr-2020, and of the stationary series obtained by differencing of order 3 (i.e., $\nabla^3 Y_t = Y_t - 3Y_{t-1} + 3Y_{t-2} - Y_{t-3}$), are given in Fig. 1 and Fig. 2 , respectively. The Dickey-Fuller test, whose alternative hypothesis is stationarity, gives a p-value of 0.01 for the differenced series.
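Differencing of order 3 applies the first-difference operator three times, and expanding $(1 - B)^3$ in the backshift operator $B$ gives the binomial coefficients $1, -3, 3, -1$, so $\nabla^3 Y_t = Y_t - 3Y_{t-1} + 3Y_{t-2} - Y_{t-3}$. The short sketch below checks that the repeated and closed-form versions agree. The function names are hypothetical and the input is a toy cubic sequence, not the COVID-19 data.

```python
def difference(series, order=1):
    """Apply the first-difference operator `order` times."""
    out = list(series)
    for _ in range(order):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def third_difference_closed_form(series):
    """Closed form Y_t - 3*Y_{t-1} + 3*Y_{t-2} - Y_{t-3} for t >= 3."""
    return [
        series[t] - 3 * series[t - 1] + 3 * series[t - 2] - series[t - 3]
        for t in range(3, len(series))
    ]


# Toy input: cubes of 1..7. A cubic trend is removed by third-order differencing,
# leaving a constant series.
cubes = [1, 8, 27, 64, 125, 216, 343]
d3 = difference(cubes, order=3)
```

Each round of differencing shortens the series by one observation, so a series of length $n$ leaves $n - 3$ values for model fitting after third-order differencing.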

Clearly, the number of cases (total and daily) on any day depends on the numbers on the previous day(s), so an ARMA model can be a suitable model for the COVID-19 case data. The histogram of the estimated errors (residuals), with the estimated TP-T density (nearly symmetric but heavy-tailed) superimposed on it, shows the suitable performance of the estimated model for the COVID-19 data (Fig. 4) . To further demonstrate the good fit of the model, we removed the last 10 observations (2020-Mar-30 to 2020-Apr-08), fitted the TP-SMN-ARMA model to the remaining data, and forecast the removed observations. Fig. 5, Fig. 6 and Table 1 show that the forecasts and the real values of the world COVID-19 data are close. Table 1 contains the predictions and their 98% confidence intervals. The mean absolute percentage error (MAPE) index, given by

where $\hat{Y}_{t+1} = E(Y_{t+1} \mid Y_t, \dots, Y_1)$, is then used to evaluate the accuracy of the predictions. For the proposed predictions the MAPE is 0.60%, which shows the suitability of the proposed model for prediction. For comparison, the MAPE obtained with the ordinary Gaussian-ARMA model (which is also the simplest TP-SMN-ARMA member) is 0.89%. The AIC and BIC values of the best fitted TP-SMN-ARMA model are 1290.49 and 1298.02, and those of the best fitted Gaussian-ARMA model are 1524.14 and 1544.12, respectively. Finally, the p-values of 0.972 from the Box-Pierce test and 0.931 from the Ljung-Box test give no evidence of autocorrelation in the residuals. The autocorrelation function (ACF) plot of the residuals presented in Fig. 7 also shows the suitability of the TP-SMN-ARMA(7,0) model for the total confirmed cases of the COVID-19 dataset.
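For readers who want to reproduce this kind of hold-out check, the sketch below computes the MAPE in its usual form, the average of $|Y_t - \hat{Y}_t| / |Y_t|$ over the held-out observations, expressed as a percentage. This standard definition is an assumption here, since the paper's exact formula is not reproduced above, and the numbers in the example are made up for illustration rather than taken from Table 1.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent, over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical held-out values and forecasts, each off by exactly 1 percent.
held_out = [100.0, 200.0, 400.0]
forecast = [101.0, 198.0, 404.0]
score = mape(held_out, forecast)  # about 1.0
```

Because each error is scaled by the actual value, the MAPE is comparable across series of very different magnitude, such as cumulative case counts and death rates.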

In this section, we model the death rate of COVID-19 in the world from 02-Feb-2020 to 08-Apr-2020. These daily data have also been reported by the China National Health Commission (NHC) and the World Health Organization (WHO).

Time series plots of the death rate of coronavirus in the world from 02-Feb-2020 to 08-Apr-2020, and of the stationary series obtained by differencing of order 3 (i.e., $\nabla^3 Y_t = Y_t - 3Y_{t-1} + 3Y_{t-2} - Y_{t-3}$), are given in Fig. 8 and Fig. 9 , respectively. The Dickey-Fuller test gives a p-value of 0.01, which demonstrates the stationarity of the differenced data.

Applying the model selection criteria and methodology used for the previous dataset shows that the best TP-SMN-ARMA model, with the best fitted orders, is TP-SMN-ARMA(7,1). The PACF given in Fig. 10 also supports this choice. Therefore, this TP-SMN-ARMA model is selected as the best model. The histogram of the estimated errors (residuals), with the estimated TP-T density (heavy-tailed and asymmetric) superimposed on it, shows the suitable performance of the estimated model for the death rate of COVID-19 in the world (Fig. 11 ). As for the previous dataset, we removed the last 10 observations (2020-Mar-30 to 2020-Apr-08), fitted the TP-SMN-ARMA model to the remaining data, and forecast the removed observations. Fig. 13 and Table 2 show that the forecasts and the real values of the death rate of COVID-19 in the world are close. Table 2 contains the predictions and their 98% confidence intervals.

The MAPE for this second set of predictions is 1.30%, demonstrating the suitability of the proposed model for prediction. For comparison, the MAPE obtained with the ordinary Gaussian-ARMA model (which is also the simplest TP-SMN-ARMA member) is 1.70%. The AIC and BIC values of the best fitted TP-SMN-ARMA model are -4.42 and 18.05, and those of the best fitted Gaussian-ARMA model are 76.68 and 95.07, respectively.
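The AIC and BIC comparisons above follow the usual logic that lower values indicate a better trade-off between fit and model size. The sketch below uses the standard definitions $\mathrm{AIC} = 2k - 2\ell$ and $\mathrm{BIC} = k \log n - 2\ell$, where $\ell$ is the maximised log-likelihood, $k$ the number of estimated parameters and $n$ the number of observations. These textbook forms are assumed here because the paper does not spell out its formulas, and the log-likelihoods, parameter counts and sample size in the example are hypothetical.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k * log(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Hypothetical fits on the same 60 observations: the richer model has two more
# parameters but a clearly higher log-likelihood.
simple_model = {"loglik": -40.0, "n_params": 9}
richer_model = {"loglik": -30.0, "n_params": 11}
n_obs = 60

aic_simple = aic(simple_model["loglik"], simple_model["n_params"])
aic_richer = aic(richer_model["loglik"], richer_model["n_params"])
bic_simple = bic(simple_model["loglik"], simple_model["n_params"], n_obs)
bic_richer = bic(richer_model["loglik"], richer_model["n_params"], n_obs)
```

BIC penalises each extra parameter by $\log n$ rather than 2, so it favours smaller models more strongly than AIC once $n$ exceeds about 7.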

Finally, the p-value=0.974 from the Box-Pierce and p-value=0.873 from the Ljung-Box tests indicate the independence of residuals. Also, the ACF plot of the residuals presented in Fig. 14 

Coronaviruses are a large family of viruses that affect the neurological, gastrointestinal, hepatic, and respiratory systems. The number of confirmed cases is increasing daily in different countries, especially in China, Iran, South Korea, Italy and others. The spread of COVID-19 poses many dangers and calls for strict, dedicated plans and policies. To design such plans and policies, predicting and forecasting future confirmed cases is critical. Time series models are useful for modeling data that are gathered and indexed by time. Classical time series models are based on the symmetry of the error distribution, but there are many real-world situations in which the assumption of symmetrically distributed error terms is not satisfactory. In our methodology, we considered time series models based on the two-piece scale mixture of normal (TP-SMN) distributions. The proposed time series models were first fitted to the historical COVID-19 datasets. Then, the time series model with the best fit to each dataset was selected. Finally, the selected models were applied to forecast the number of confirmed COVID-19 cases. The results indicate that the introduced approach performs well in forecasting future confirmed COVID-19 cases. In addition, all of the criteria show that the proposed models are more suitable than the ordinary Gaussian time series model (which is also the simplest member of the proposed family). A sample copy of the code is available from the authors upon request.

Emerging coronaviruses: Genome structure, replication, and pathogenesis

Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor

Transmission scenarios for Middle East Respiratory Syndrome Coronavirus (MERS-CoV) and how to tell them apart

Genomic characterisation and epidemiology of 2019 novel coronavirus: Implications for virus origins and receptor binding

Novel Coronavirus: Where We are and What We Know

Clinical characteristics of 2019 novel coronavirus infection in China

Estimating the Unreported Number of Novel Coronavirus (2019-nCoV) Cases in China in the First Half of January 2020: A Data-Driven Modelling Analysis of the Early Outbreak

The Rate of Underascertainment of Novel Coronavirus (2019-nCoV) Infection: Estimation Using Japanese Passengers Data on Evacuation Flights

Estimation of the Transmission Risk of the 2019-nCoV and Its Implication for Public Health Interventions

Novel Coronavirus Outbreak in Wuhan, China, 2020: Intense Surveillance Is Vital for Preventing Sustained Transmission in New Locations

Real time estimation of the risk of death from novel coronavirus (2019-nCoV) infection: Inference using exported cases

Optimization Method for Forecasting Confirmed Cases of COVID-19 in China

Ensemble forecast of human West Nile virus cases and mosquito infection rates

Comparison of four different time series methods to forecast hepatitis A virus infection

Forecasting seasonal outbreaks of influenza

Real-time influenza forecasts during the 2012-2013 season

Inference and forecast of the current West African Ebola outbreak in Guinea, Sierra Leone and Liberia

Forecasting versus projection models in epidemiology: The case of the SARS epidemics

Real-time epidemic monitoring and forecasting of H1N1-2009 using influenza-like illness from general practice and family doctor clinics in Singapore

Predicting the international spread of Middle East respiratory syndrome (MERS)

Testing the Difference between Two Independent Time Series Models

Maximum a-posteriori estimation of autoregressive processes based on finite mixtures of scale-mixtures of skew-normal distributions

Autoregressive Models with Mixture of Scale Mixtures of Gaussian innovations

Time series process based on the unrestricted skew normal process

A Bayesian approach to robust skewed Autoregressive process

Nonlinear semiparametric autoregressive model with finite mixtures of scale mixtures of skew normal innovations

Asymmetric heavy-tailed vector auto-regressive processes with application to financial data

Autoregressive processes with generalized hyperbolic innovations

Leptokurtic and Platykurtic class of Robust Symmetrical and Asymmetrical Time Series Models

Two-Piece Location-Scale Distributions based on Scale Mixtures of Normal family

A Bayesian Analysis of Two-Piece Distributions Based on the Scale Mixtures of Normal Family

Robust mixture modeling based on two-piece scale mixtures of normal family

A robust class of homoscedastic nonlinear regression models

The Skew-Reflected-Gompertz distribution for analyzing symmetric and asymmetric data

Time Series Analysis: Forecasting and Control

Time Series: Theory and Methods

Introduction to Time Series and Forecasting. Springer Science & Business Media

Scale mixture of normal distribution

Maximum likelihood from incomplete data via the EM algorithm

A new look at the statistical model identification

Funding: No fund.

The authors declare no conflict of interest.