key: cord-0956047-5dour8wo
authors: Fatimah, Binish; Aggarwal, Priya; Singh, Pushpendra; Gupta, Anubha
title: A comparative study for predictive monitoring of COVID-19 pandemic
date: 2022-04-07
journal: Appl Soft Comput
DOI: 10.1016/j.asoc.2022.108806
sha: c438cf7f05840b539ba8f527b2d2b89ae9b99ae8
doc_id: 956047
cord_uid: 5dour8wo

The COVID-19 pandemic, caused by the novel coronavirus (SARS-CoV-2), crippled the world economy and caused irreparable damage to the lives and health of millions. To control the spread of the disease, it is important to make appropriate policy decisions at the right time. This can be facilitated by a robust mathematical model that can forecast the prevalence and incidence of COVID-19 accurately. This study presents an optimized ARIMA model to forecast COVID-19 cases. The proposed method first obtains a trend of the COVID-19 data using a low-pass Gaussian filter and then forecasts the data using the ARIMA model. We benchmarked the optimized ARIMA model for 7-day and 14-day forecasting against five forecasting strategies used recently on the COVID-19 data. These include the auto-regressive integrated moving average (ARIMA) model, susceptible-infected-removed (SIR) model, composite Gaussian growth model, composite Logistic growth model, and dictionary learning-based model. We have considered the daily infected cases, cumulative death cases, and cumulative recovered cases of the COVID-19 data of the ten most affected countries in the world, namely India, USA, UK, Russia, Brazil, Germany, France, Italy, Turkey, and Colombia. The proposed algorithm outperforms the existing models on the data of most of the countries considered in this study.

In early December 2019, cases of the coronavirus disease (COVID-19), caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), emerged in Wuhan city [1, 2]. Within a short span of time, the virus spread to a large population all over the world. It was declared a pandemic by the World Health Organization (WHO) on 11th March 2020. This disease is highly contagious and has infected millions of people globally. The number of deaths reported globally as of 15th May 2021 is more than 3.5 million. It has hugely affected economic activities [3] and plunged millions into poverty. In countries such as the USA, Brazil, Italy, and India, the rapid increase in the number of cases caused tremendous stress on the health care system. Its spread has been contained to some extent in various countries by using partial and complete lockdowns, maintaining social distancing, and imposing quarantine on infected people. For the timely implementation of these measures, a mathematical understanding of the future trend of the spread of the disease is required. This can help the authorities announce control measures at an appropriate time. Thus, an accurate forecast of COVID-19 cases is extremely important to control its rapid spread and hence ensure the safety of the general public.

Researchers across the world have proposed various data-driven methods to forecast COVID-19 data, which has been a challenging task [4, 5]. Forecasting refers to estimating future cases on the basis of present and past data. It is carried out mainly using two popular approaches. The first approach includes compartmental models such as the SIR, SIRD, and SEIR models [6], and the second is based on time-series learning methods such as curve fitting [7, 8], autoregression [9, 10], and deep learning on time-series data [11, 12].

Compartmental models are the traditional methods of forecasting infectious diseases [13]. In these models, the spread of infectious diseases is simulated by stochastic differential equations that describe interactions between different compartments of the population (e.g., susceptible, infectious, and recovered). This approach mainly includes Susceptible-Infected-Removed (SIR) [14], Susceptible-Infected-Removed-Death (SIRD) [15, 16], and Susceptible-Exposed-Infected-Removed (SEIR) models [17]. Hybrid models designed using compartmental models and deep learning frameworks have also been proposed recently [18].

expressed as a sparse linear combination of the dictionary atoms. By recursively and progressively adapting dictionary atoms to the most recent COVID-19 data, forecasting of cases is done using partial fitting [23]. Another widely used time-series approach is Auto-Regressive Integrated Moving Average (ARIMA) modelling [27], because the data of many countries has an inherent non-stationary trend. Thus, the ARIMA model has been used widely by various authors for modeling and forecasting COVID-19 data [28, 29, 30]. To improve the performance of traditional ARIMA, Sharma et al. [31] used eigenvalue decomposition of the Hankel matrix to decompose the time series into various stationary and non-stationary components. The decomposed signals were then modeled using ARIMA.

So far, several types of methods have been proposed for describing the time evolution of the COVID-19 epidemic. Despite the considerable progress in proposing methods for COVID-19 prediction, this research area is still nascent and requires a detailed comparison of the various prediction methods. Our literature search did not reveal any review of the available models; thus, this work reviews the approaches briefly mentioned above. In addition, we compare the performance of these models by assessing the root mean squared error (RMSE) obtained when predicting the cases 7 days and 14 days ahead for the data of highly infected countries including India, USA, UK, Russia, Brazil, Germany, France, Italy, Turkey, and Colombia. We considered five different models: ARIMA, SIRD, the composite Gaussian growth model, the composite Logistic growth model, and the ONMF model. The performance of the ARIMA model in predicting the short-term future is better than

• We have reported 7-day and 14-day forecasts of the number of infected, recovered, and death cases of the COVID-19 data for the ten most affected countries. These models provide good prediction accuracy for the upcoming three weeks. However, the prediction accuracy declines gradually as the prediction horizon increases.

This paper is organized as follows. In Section II, we provide a brief overview of various forecasting methods available in the literature on COVID-19 data modeling. In Section III, we provide a description of our proposed method. In Section IV, we present results. Finally, we discuss various results and conclude in Sections V and VI, respectively.

In this section, we discuss some of the existing modeling methods popularly used in the literature for modeling and forecasting COVID-19 data.

The SIR model [32] and its variants, such as the Susceptible-Infected-Recovered-Dead (SIRD) model [33] and the Susceptible-Exposed-Infectious-Removed (SEIR) model [17], have been used to model the spread of diseases like dengue fever and malaria [34, 35]. Recently, various authors [36, 15, 16, 37] have used these methods for modeling COVID-19 prevalence data. The traditional SIRD model [33] can be described using the following equations:
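A standard textbook form of the SIRD system is sketched below. It is an assumption here rather than a transcription of [33]; in particular, µ is taken to be the death rate, consistent with its role in the reproduction number discussed in this section, and the last line is the usual conservation constraint on the total population N.

```latex
\begin{align}
\frac{dS(t)}{dt} &= -\frac{\beta\, S(t)\, I(t)}{N}, \\
\frac{dI(t)}{dt} &= \frac{\beta\, S(t)\, I(t)}{N} - \gamma\, I(t) - \mu\, I(t), \\
\frac{dR(t)}{dt} &= \gamma\, I(t), \\
\frac{dD(t)}{dt} &= \mu\, I(t), \\
S(t) + I(t) + R(t) + D(t) &= N .
\end{align}
```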

where S(t), I(t), R(t), and D(t) are the numbers of susceptible, infected, recovered, and death cases, respectively, β is the contact/infection rate (i.e., the average number of contacts per person per unit time) and γ is the recovery rate, i.e., 1/γ represents the av- 

where N is a constant and refers to the total population size. An important feature of the SIRD model is the estimated reproduction number, $R_0 = \frac{\beta S}{N(\gamma + \mu)}$. This number indicates the spread of the disease as the number of susceptible persons infected by one infected person. If $R_0 > 1$, the number of cases is increasing, as at the start of an epidemic; $R_0 = 1$ indicates that the disease is endemic; and $R_0 < 1$ indicates a decline in the number of cases.
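To make the role of the reproduction number concrete, the following is a minimal sketch that integrates the standard SIRD dynamics with a forward-Euler step and evaluates βS/(N(γ+µ)). The function names, the step size, and all numerical values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def simulate_sird(beta, gamma, mu, s0, i0, days, dt=0.1, rec0=0.0, dead0=0.0):
    """Forward-Euler integration of the standard SIRD equations (assumed form).

    Returns one (S, I, R, D) tuple per day, starting with day 0.
    """
    n = s0 + i0 + rec0 + dead0          # total population, assumed constant
    s, i, r, d = float(s0), float(i0), float(rec0), float(dead0)
    steps_per_day = int(round(1.0 / dt))
    trajectory = [(s, i, r, d)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infected = beta * s * i / n * dt
            new_recovered = gamma * i * dt
            new_dead = mu * i * dt
            s -= new_infected
            i += new_infected - new_recovered - new_dead
            r += new_recovered
            d += new_dead
        trajectory.append((s, i, r, d))
    return trajectory


def reproduction_number(beta, gamma, mu, s, n):
    """Reproduction number beta * S / (N * (gamma + mu))."""
    return beta * s / (n * (gamma + mu))


if __name__ == "__main__":
    # Hypothetical parameters: the infected count grows while the number is above 1.
    traj = simulate_sird(beta=0.3, gamma=0.09, mu=0.01, s0=9990, i0=10, days=60)
    n = 10000.0
    for day in (0, 30, 60):
        s, i, _, _ = traj[day]
        print(day, round(i, 1), round(reproduction_number(0.3, 0.09, 0.01, s, n), 2))
```

In this sketch the infected compartment grows exactly while the computed number exceeds 1, since dI/dt = I (βS/N − γ − µ).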

In [36], the SIR approach has been used to model the prevalence of COVID-19 in China. In [15], a modified SIRD model is proposed to estimate COVID-19 data for five countries: India, USA, China, Italy, and France. This model considers the active, dead, and recovered cases simultaneously. It also considers the effect of quarantine and asymptomatic cases, which is not present in the traditional SIRD model. In [16], the SIRD model is used on the COVID-19 data of Italy. The parameters of the proposed model were considered to be time-varying and were expressed as linear combinations of basis functions. Sparse identification methodologies were used to obtain these functions from the given COVID-19 data. The non-convex identification problem of estimating the model parameters was handled by a one-dimensional grid search in the outer loop and Lasso optimization in the inner step. In [38], the Susceptible-Infected-Recovered for Asymptomatic-Symptomatic and Dead (SIRASD) model is used for Brazil COVID-19 data from 25th February 2020 to 30th March 2020, considering the long- and short-term effects of social distancing. In [37], the best data-fitted curves have been obtained using the Gaussian mixture model (GMM) and composite logistic growth model (CLGM) to find the optimum SIRD model for COVID-19. The SIRD model parameters are derived as time-varying quantities, which is closer to the real-life scenario and can capture the inherent changes in the characteristics of the pandemic with time. These changes can be due to various government policies, restrictions on domestic and international travel, quarantine rules imposed, and the medical facilities available in each country. The number of Gaussian (or LGM) waves and the parameters for each wave are estimated by minimizing the objective function given by the sum of squared residuals [7, 39].
The minimization process uses the simplex search method in order to estimate the optimal values of the unknown parameters. Finally, the time-varying parameters of the SIRD model are computed from (1) as

Tuberculosis, Pertussis, Hepatitis, SFTS, HBV, Influenza, Human Brucellosis, Infectious Diarrhea, and Dengue Fever [40, 41, 42, 43, 44, 45] . It has been used by several authors to predict the cumulative COVID-19 infections, the number of deaths reported, and the recovered cases for different countries.

In [29], the ARIMA model is used to forecast the prevalence and incidence of COVID-19 for the next two days using the data from 20th January 2020 to 10th February 2020. Results were presented with 95% confidence. Data in the considered time range did not present any seasonality. ARIMA(1,0,4) and ARIMA(1,0,3) models were selected as the best-fit models. In [30], the ARIMA model is used to capture the daily confirmed cases in Italy from 20th February 2020 to 4th April 2020. The stationarity of the data was tested using the Augmented Dickey-Fuller (ADF) test and the modified ADF-GLS (or ERS) test for unit root. The order of ARIMA was determined using Akaike's information criterion (AIC) and the mean absolute error (MAE). Perone further performed diagnostic tests on the residuals obtained using the selected ARIMA model, including the Doornik and Hansen test for normality, Engle's Lagrange Multiplier test for the ARCH (autoregressive conditional heteroskedasticity) effect, and the Ljung-Box test for autocorrelation.

The ARIMA model has been used in [46] to forecast the next ten days' cases using the data available from 31st January 2020 to 25th March 2020. Also, a nonlinear autoregressive neural network was used to forecast the next 50 days' data. The Bayesian Information Criterion (BIC) was used to select the ARIMA(1,1,0) model. It was mentioned that the autocorrelation function (ACF) and the partial autocorrelation function (PACF) can be used to choose the best fit, and that autocorrelation can be used to perform a diagnostic test on the residual signal. It was also mentioned in this work that BCI criteria is another method employed for model selection. The authors predicted that the number of new infections would reach 1,500 by 24th May 2020. However, the actual number reached was 7,113. In [47], data of 15 countries was considered from 21st January 2020 to 24th April 2020. The countries included were: United States, United Kingdom, Turkey, China, Russia, Netherlands, Switzerland, Germany, Iran, Brazil, Spain, Italy, France, Canada, and Belgium. Confirmed cases, recovered cases, and reported deaths were modeled using the ARIMA model. The authors estimated that by 7th July 2020, the confirmed cases, deaths, and recoveries would be doubled in all countries considered in the study except China, Switzerland, and Germany. For the United States, the cumulative confirmed cases on 7th July 2020 were 3.33 times the cases on 24th April 2020, for the United Kingdom the data became 2. 21

In [28], the ARIMA(2,1,1) model was used to obtain a four-week prediction of new infections per day in Saudi Arabia. It was estimated that the per-day cases would reach 7,668 by 21st May 2020. However, the cases reported were 2,532. In [48], ARIMA(0,2,1), ARIMA(1,2,0), and ARIMA(0,2,1) were used to model the prevalence of COVID-19 in Italy, Spain, and France, respectively. The models were selected based on the lowest MAPE values. Data from 21st February 2020 to 15th April 2020 was used and the total confirmed cases for the next ten days were predicted. The data of France was predicted with an RMSE of 9.1762e3, Italy with an RMSE of 2.1004e3, and Spain with an RMSE of 3.2774e4. In [49], the spatial distribution of COVID-19 in Indian districts was analyzed, and the prevalence and incidence of the disease were predicted using the ARIMA(2,2,2) model. Data from 30th January 2020 to 26th April 2020 was used to predict the data from 27th April 2020 to 11th May 2020.

In [9], a relationship between the number of COVID-19 cases and the population of the country is illustrated. Data of 145 countries have been modeled and the countries are grouped based on their proximity to each other. The study assumes that the spread of the disease is affected by various measurable and non-measurable factors that will remain similar in countries closer to each other. The average RMSE obtained in this case was 144.8. In [10], outbreaks of COVID-19 in Japan and South Korea were modeled using ARIMA(6,1,7) and ARIMA(2,1,3), respectively, for the period from 20th January 2020 to 26th April 2020. The number of new infections per day for the next seven days was forecast.

In general, the evolution of the reported cases is modeled as a single wave (single-peak wave). However, fitting the data with only one wave may not always be correct, since an epidemic usually has several waves with multiple peaks, and a single wave captures very few of the fluctuations present in the data [19, 8, 14, 7].

Some recent works are based on the assumption that multiple waves of different peaks, amplitudes, and shapes emerge and vanish over time during the epidemic [8, 14, 7]. These works decompose the evolution of reported cases into several basic 'waves', where each basic wave is considered a representation of the epidemic, localized in both time and position. Every single wave is modeled by one of the known growth models, such as the logistic or Gaussian. We briefly explain both of these models below.

First, we present the modeling framework with a single wave, i.e., with P = 1 for the logistic growth model. The logistic growth model is often used in epidemiology to model the spread of an infection [50, 51, 7]. Here, the number of infections initially grows exponentially, but later declines as the numbers approach the population's carrying capacity, where the carrying capacity is defined as the number of people that can eventually be infected in a population. The cumulative number of infections on the t-th day, denoted as C(t), using the logistic growth model can be written as

where K is the carrying capacity, A denotes the number of persons initially infected, and r is the growth rate. Corresponding to this model, the number of infected persons on the t-th day, I(t), is given by

For any country, the numbers reported on day 0 (day of reference) are those that are active on that day. Hence, these are the cumulative numbers until that day and are equal to C(0). Substituting t = 0 in (3) and (4), we obtain

The values of K, A, r, $C_0$, and $I_0$ are determined from curve fitting of the available data.

The composite logistic growth model can be written as [52] 

where the number of waves $P$ and the four parameters $(K_i, A_i, r_i, \tau_i)$ for each wave are estimated by minimization of the objective function, which is the sum of squares of residuals [53, 7, 39]. The minimization uses the simplex search method [54] to estimate optimal values of these unknown model parameters.
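The logistic building blocks can be sketched as follows. Because the exact parameterization is not reproduced here, this sketch assumes one common form in which A equals the cumulative count at day 0, C(t) = K·A / (A + (K − A)·e^(−rt)), so that the daily count is the derivative I(t) = r·C(t)·(1 − C(t)/K). Treating τ_i as a time shift of the i-th wave is likewise an assumption, and all function names are hypothetical.

```python
import math


def logistic_cumulative(t, K, A, r):
    """Cumulative cases C(t), assuming C(0) = A and C(t) -> K as t grows."""
    return K * A / (A + (K - A) * math.exp(-r * t))


def logistic_daily(t, K, A, r):
    """Daily cases I(t) = dC/dt = r * C(t) * (1 - C(t) / K)."""
    c = logistic_cumulative(t, K, A, r)
    return r * c * (1.0 - c / K)


def composite_logistic_daily(t, waves):
    """Sum of P logistic waves; each wave is (K_i, A_i, r_i, tau_i).

    tau_i is treated as a time shift of wave i (an assumption of this sketch).
    """
    return sum(logistic_daily(t - tau, K, A, r) for K, A, r, tau in waves)


def sum_squared_residuals(observed, waves):
    """Objective to be minimized: sum of squared residuals over all days."""
    return sum(
        (y - composite_logistic_daily(t, waves)) ** 2
        for t, y in enumerate(observed)
    )


if __name__ == "__main__":
    # Two hypothetical waves, the second starting 60 days after the first.
    waves = [(1000.0, 10.0, 0.2, 0.0), (3000.0, 10.0, 0.15, 60.0)]
    observed = [composite_logistic_daily(t, waves) for t in range(120)]
    print(sum_squared_residuals(observed, waves))
```

The objective above could be handed to any simplex (Nelder-Mead) optimizer, for example `scipy.optimize.minimize` with `method="Nelder-Mead"`, to estimate the wave parameters.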

Next, we model I(t) using the Gaussian function. Here, the number of infected persons I(t) on the t-th day is given by

where µ denotes the mean and $\sigma^2$ denotes the variance of the Gaussian function, while $I_0 = \alpha e^{-\mu^2/(2\sigma^2)}$. Thus, the composite Gaussian model can be written as

where the regression parameters $\alpha_i$, $\mu_i$, and $\sigma_i$ are the amplitude, mean, and standard deviation, respectively. The model is utilized for a maximum of five epidemic waves. The sum of all the waves should predict the main reported cases.
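A minimal sketch of the composite Gaussian model is given below. It assumes the single-wave form I(t) = α·exp(−(t − µ)²/(2σ²)), which is consistent with the stated value of I_0 at t = 0; the function names and the example parameters are hypothetical.

```python
import math


def gaussian_wave(t, alpha, mu, sigma):
    """Single wave: amplitude alpha, mean (peak day) mu, standard deviation sigma."""
    return alpha * math.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2))


def composite_gaussian(t, waves):
    """Sum of Gaussian waves; each wave is (alpha_i, mu_i, sigma_i)."""
    return sum(gaussian_wave(t, alpha, mu, sigma) for alpha, mu, sigma in waves)


if __name__ == "__main__":
    # Two hypothetical waves peaking on day 40 and day 150.
    waves = [(500.0, 40.0, 10.0), (1200.0, 150.0, 20.0)]
    for day in (0, 40, 100, 150):
        print(day, round(composite_gaussian(day, waves), 2))
```

Each wave peaks at its own mean with height equal to its amplitude, so well-separated waves can be read off the fitted parameters directly.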

In one of the recent works [23], forecasting of COVID-19 is done using dictionary learning and online nonnegative matrix factorization (ONMF). This approach mainly consists of four steps. First, the dictionary is learned by minibatch learning from the entire duration of COVID-19 data, followed by progressively adapting and improving the learned dictionary via ONMF. Later, a one-step prediction is made by partially fitting the learned dictionary to the known data so as to get a forecasted value one day ahead. Lastly, by recursively applying the one-step predictions, the predictions are extrapolated into the near future. We refer to this method as ONMF from here onwards.

The pseudo-code of the method is as follows:

• Dictionary learning: The first step deals with dictionary learning. Consider the number of days T for which the data $x \equiv (x_1, x_2, \ldots, x_T)$ is available. Random patches of length N are extracted from this data and stacked as columns of the matrix X.

where $x_i \in \mathbb{R}^{N \times 1}$. Given a data matrix X, the goal is to find a nonnegative dictionary $W \in \mathbb{R}^{N \times r}$ and a nonnegative code matrix $H \in \mathbb{R}^{r \times d}$ by solving the following optimization problem:

where $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$ denotes the squared matrix Frobenius norm and λ ≥ 0 is the regularization parameter. W represents the learned dictionary with r atoms and H represents the code matrix. The above optimization problem is also known as the nonnegative matrix factorization (NMF) problem.

• Refining the learned dictionary: In the second stage, the learned dictionary is further updated using online NMF. Here, online implies learning the sequence of dictionary matrices from the sequence of data matrices X, generated by moving one day ahead and considering all previous data points.

• Forecasting: Further, the learned dictionary is used to predict the data one day ahead by partial fitting and updating H. For more details of this framework, please refer to the work by [23]. By recursively using the one-step predictions, extrapolation of the future values is carried out.
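The two purely mechanical parts of this procedure, building the patch matrix X and extending a one-step predictor recursively, can be sketched as below. The dictionary learning and partial-fitting steps of [23] are not implemented; `one_step` is a hypothetical placeholder for any predictor that maps the known history to the next value, and the function names and the seed are illustrative assumptions.

```python
import random


def extract_patches(series, patch_len, num_patches, seed=0):
    """Stack randomly positioned patches of length N as the columns of X.

    X is returned as a list of rows, so X[row][col] is sample `row` of patch `col`.
    """
    last_start = len(series) - patch_len
    if last_start < 0:
        raise ValueError("series is shorter than the patch length")
    rng = random.Random(seed)
    starts = [rng.randint(0, last_start) for _ in range(num_patches)]
    patches = [series[s:s + patch_len] for s in starts]
    return [[patches[c][r] for c in range(num_patches)] for r in range(patch_len)]


def recursive_forecast(history, one_step, horizon):
    """Apply a one-day-ahead predictor repeatedly, feeding back its own outputs."""
    extended = list(history)
    for _ in range(horizon):
        extended.append(one_step(extended))
    return extended[len(history):]


if __name__ == "__main__":
    data = [float(v) for v in range(30)]
    X = extract_patches(data, patch_len=6, num_patches=4)
    print(len(X), len(X[0]))  # rows = patch length, columns = number of patches

    # Placeholder predictor: continue the last observed increment.
    def linear_step(values):
        return 2 * values[-1] - values[-2]

    print(recursive_forecast(data, linear_step, horizon=3))
```

Because each forecast is fed back as input to the next step, errors accumulate with the horizon, which is one reason recursive one-step schemes are usually restricted to the near future.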

To summarise, several approaches have been proposed by researchers to predict the COVID-19 outbreak, including SIR models and their variants, ARIMA modeling, multi-wave curve-fitting modeling, and dictionary learning modeling. Among all, compartmental models (SIR and its variants) are the most frequently used approach so far [55], and dictionary learning modeling is the least used method in the literature for COVID-19 forecasting. In essence, all these methods exhibit pros and cons, which are described in Table 1.

Table 1

Method: Compartmental models (SIR and its variants)
Cons: These models make two assumptions: (i) the chance of any infected person infecting other susceptible persons is constant during the epidemic duration, and (ii) every infected person has a constant chance of recovering at any given time. Both of these may not be true. Moreover, a precise, closed-form solution for all the system parameters is difficult to obtain.

Method: ARIMA [29, 30, 46, 47, 28, 48, 49, 9, 10]
Pros: It is a parametric model which can fit non-stationary data.
Cons: It does not perform well if the correlation in data samples is negligible.

Method: Fitting (Gaussian and Logistic) [8, 14, 7]
Pros: These are parametric models which can capture multiple emerging waves of the epidemic.
Cons: Require a predefined shape of the waves and estimation of parameters for fitting.

Method: Dictionary Learning [23]
Pros: It is a non-parametric model which can capture any shape of epidemic wave.
Cons: Requires large amounts of data. The training process is expensive, involving selection of a large number of hyperparameters.

In this work, we propose modeling based on the ARIMA model to forecast the daily infections, cumulative deaths, and cumulative recovered cases of COVID-19. Since the COVID-19 data for various countries is non-stationary, traditional ARMA methods may not be sufficient to capture the data efficiently. In such cases, ARIMA performs better by capturing the trend or seasonality in the data. To further improve the performance of ARIMA models, we propose to add a pre-processing step to the ARIMA model and estimate the trend in the data using a low-pass Gaussian filter, as shown in Fig. 1.

In [27], a method is proposed to estimate an ARIMA model. ARIMA models are considered a generalization of ARMA models to predict a given time series using its past values. While ARMA models are used to fit time-series data with the stationarity property, ARIMA has been developed for data with inherent non-stationarity or seasonal trends. The ARIMA model can be understood in two steps, where the first step removes the non-stationary trend from the data, and the second step models the output obtained from the first step using an ARMA model. The model is denoted as ARIMA(p, d, q), where p denotes the order of AR, q is the order of MA, and d represents the degree of differencing used in the model, expressed mathematically as follows:
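A common way of writing the two steps is sketched below. The numbering (10)-(11) follows the references in the surrounding text, while the summation limits and the sign convention of the coefficients are assumptions, since conventions differ between textbooks.

```latex
\begin{align}
y(t) &= (1 - B)^{d}\, x(t), \tag{10} \\
y(t) &= \sum_{k=1}^{p} a(k)\, y(t-k) + \sum_{k=1}^{q} b(k)\, e(t-k) + e(t). \tag{11}
\end{align}
```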

where B is the back-shift operator, $B^p$ represents a back-shift by p steps, and x(t) is the non-stationary time series to be modeled. In the first step, (10), the signal x(t) is converted to a stationary signal y(t).

Step 2 models the signal y(t) using an ARMA model as depicted in (11), where a(k) are the auto-regressive model coefficients, b(k) are the moving average coefficients, and e(t) is the error signal. The Box-Jenkins method used for modeling data with the best ARIMA model involves three steps: model selection, parameter estimation, and model validation. For a given time series, the order of the model can be estimated using the sample autocorrelation function (ACF) and partial autocorrelation function (PACF). If the data has an inherent seasonal trend, the initial differencing step is applied one or more times to convert the data to a stationary series. The output of the differencing step is checked after every iteration using the ACF plot. For a stationary series, the values of the ACF should rapidly converge to zero. If they do not, the differencing step should be repeated. The number of times the differencing step is repeated to obtain stationarity provides the order d. The ACF is also used to estimate the MA order, q, with the assumption that for pure MA processes the ACF values become zero after lag q. Similarly, for a p-th order pure AR process, the values of the PACF become zero after lag p. The Akaike information criterion (AIC) or the Bayesian information criterion (BIC) can be used to obtain the optimum order of ARMA, where the objective is to minimize the AIC or BIC value.
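The differencing step and the sample ACF used to check it can be sketched in a few lines. This is a plain-Python illustration with hypothetical function names, not the implementation used in the study; the ACF uses the usual biased estimator normalized by the lag-0 term.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y(t) = x(t) - x(t-1)."""
    out = list(series)
    for _ in range(d):
        out = [out[k] - out[k - 1] for k in range(1, len(out))]
    return out


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(v * v for v in dev)
    if denom == 0:
        raise ValueError("ACF is undefined for a constant series")
    return [
        sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom
        for lag in range(max_lag + 1)
    ]


if __name__ == "__main__":
    # A quadratic trend becomes constant after two differences (d = 2).
    x = [t * t for t in range(1, 8)]
    print(difference(x, 1))
    print(difference(x, 2))
    # The ACF of the trending series decays slowly, hinting at non-stationarity.
    print([round(v, 2) for v in sample_acf(x, 3)])
```

In practice the differencing is repeated until the ACF of the output decays quickly, and the number of repetitions gives d.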

Once the values for p, q, and d are obtained, the model can be estimated using either the maximum likelihood estimation or the least-squares estimation.

In this work, we first filter the given time series using a Gaussian filter to obtain its low-pass version, which is then modeled using ARIMA. Here, the stationarity of the low-pass signal is checked, and the parameter value for d is obtained accordingly. The stationary signal obtained after differencing is estimated using ARMA, and the optimum p and q parameters are selected. The cut-off frequency of the low-pass filter is varied and the best ARIMA model is obtained for each case. It is pertinent to mention that, since the ARIMA model is developed for the low-pass version whereas our main goal is to estimate the given time series and forecast its future values, metrics proposed in the Box-Jenkins ARIMA modeling method, such as BIC, may not be a correct choice. BIC and AIC values will prefer the model that can estimate the low-pass signal efficiently, but not the original 
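The pre-processing step can be illustrated with a time-domain Gaussian smoother. In this minimal sketch the filter is parameterized by a standard deviation `sigma` in days rather than by a cut-off frequency (a larger `sigma` corresponds to a lower cut-off); the truncation at three standard deviations and the renormalization at the series boundaries are assumptions of the sketch, not details taken from the study.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Discrete Gaussian weights on [-radius, radius], normalized to sum to 1."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_lowpass(series, sigma):
    """Smooth a series with a Gaussian kernel, renormalizing near the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(series)
    smoothed = []
    for t in range(n):
        acc = 0.0
        weight_sum = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - radius
            if 0 <= idx < n:          # ignore samples outside the series
                acc += w * series[idx]
                weight_sum += w
        smoothed.append(acc / weight_sum)
    return smoothed


if __name__ == "__main__":
    # Hypothetical daily counts: a linear trend plus a weekly reporting dip.
    raw = [100 + 5 * t - (30 if t % 7 == 6 else 0) for t in range(28)]
    trend = gaussian_lowpass(raw, sigma=3.0)
    print([round(v, 1) for v in trend[:7]])
```

The smoothed output would then be passed to the ARIMA fitting stage, with `sigma` varied in place of the cut-off frequency.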

In this section, we present the simulation results for modeling and predicting the COVID-19 data for ten countries: India, the USA, the UK, Russia, Brazil, Germany, France, Turkey, Italy, and Colombia. The data used in this paper comprises the cumulative recovered cases, cumulative death cases, and number of new infections as collected from Worldometer [56] and the WHO daily situation report [57]. The data from February 15, 2020 to April 14, 2021 has been used for modeling, and the data for the next 14 days is used for validating the prediction performance of the model. We compare the performance of the proposed method with the ARIMA model, SIRD model, composite Gaussian model, composite logistic growth model, and ONMF model. These models are used to forecast the next 7 days' and 14 days' data, and the RMSE values obtained are compared in Table 2. Results for the SIRD model have been obtained using the model discussed in [37], and results for the ARIMA model are obtained using the Box-Jenkins approach [27]. Fig. 2 and Fig. 3 show the normalized RMSE values obtained when different metrics are used to select the optimum ARIMA model to estimate the cumulative recovered data and death data for the ten countries considered. Using this, we observed that models selected on the basis of BIC and RMSRE values provide the best estimates in most of the cases for cumulative recovered and death cases, respectively, and thus these metrics were selected for the given data series. Furthermore, Fig. 4 Fig. 5-7 show the actual data, estimated data, and forecasted data for the next 14 days as obtained using the proposed model for India, USA, Brazil, UK, Russia, and Germany, and Fig. 8-9 include the graphs for France, Turkey, Italy, and Colombia. Fig. 12-15 show the performance of the proposed algorithm in forecasting the short-term future data of the ten countries considered. The figures include the actual data, the output of the Gaussian filter, and the forecasted data with a 95% confidence interval. Table 3 provides the order of the ARIMA models obtained with the proposed method.

The relative RMSE of a model is defined here as the ratio of the RMSE obtained using that model to the RMSE obtained with the proposed method. This gives a metric to compare the RMSEs obtained using different models and to assess the performance of the proposed method against existing modeling techniques. The relative RMSEs so obtained are plotted as histograms in Fig. 16-17 for all ten countries, along with the average obtained after removing outliers.
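The relative RMSE defined above can be computed as in the following minimal sketch. The function names and the numbers in the usage example are hypothetical and serve only to show the direction of the ratio.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def relative_rmse(actual, model_forecast, proposed_forecast):
    """RMSE of a competing model divided by the RMSE of the proposed method."""
    return rmse(actual, model_forecast) / rmse(actual, proposed_forecast)


if __name__ == "__main__":
    # Hypothetical 7-day validation window.
    actual = [100, 110, 120, 130, 140, 150, 160]
    proposed = [101, 109, 121, 131, 139, 151, 159]
    competing = [104, 106, 124, 134, 136, 154, 156]
    print(relative_rmse(actual, competing, proposed))
```

A value above 1 means the competing model has a larger error than the proposed method, and a value below 1 means it performs better.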

Figure 9: Actual COVID-19 data and the estimated values as obtained using the proposed modeling scheme for (9a) daily infections, (9b) death cases, and (9c) recovered cases for Italy, (9d) daily infections, (9e) death cases, and (9f) recovered cases for Colombia.

Figure 12: Actual COVID-19 data and the forecasted values as obtained using the proposed modeling scheme for (10a) daily infections, (10b) death cases, and (10c) recovered cases for India, (10d) daily infections, (10e) death cases, and (10f) recovered cases for USA, (11a) daily infections, (11b) death cases, and (11c) recovered cases for Brazil, (11d) daily infections, (11e) death cases, and (11f) recovered cases for UK, (12a) daily infections, (12b) death cases, and (12c) recovered cases for Russia.

Compared to the existing methodologies, the models obtained using the proposed methodology predict the future cases far better for India, UK, Russia, and Colombia, as shown in Table 2 and Fig. 16a, 16d, 16e, 17d. For India, the performance of ARIMA for both 7-day and 14-day forecasts is very poor compared to the proposed methodology. The RMSE obtained with ARIMA is 4.19 times that of the proposed method for recovered cases, 3.95 times for death cases, and 3.3 times for infected cases. In these cases, we observe that the proposed method estimates the low-pass version of the actual data efficiently, as shown in Fig. 5-9. Also, the future cases follow this low-pass trajectory, as shown in Fig. 12 and Fig. 15. The reported cases depend on various external factors such as socio-economic activities, policy changes, festivals, holidays, local weather, etc., and also on the number of tests being conducted. These changes may cause sudden fluctuations in the time series. In the case of India, the future cases were not dependent on these fluctuations and therefore estimation using a low-pass trend of the time series produced better prediction results. For the number of new infections, the performance of Gaussian curve fitting is closer to that of the proposed method, with an RMSE 1.58 times that of the proposed method.

The proposed method predicts the recovered cases for Germany with the least RMSE and provides considerable improvement over ARIMA. For daily infections and death cases, ARIMA works better, as shown in Fig. 16f. However, the RMSE values are very close: 0.8 times that of the proposed method for infected cases and 0.83 times for death cases. For France, the proposed algorithm works better than ARIMA for recovered cases and equivalently for daily infected cases. However, for recovered cases SIRD gives the best performance with 0.04 relative RMSE, and ONMF gives 0.56 relative RMSE for infected cases. For the COVID-19 cases of the USA, the proposed methodology forecasts the recovered cases and death cases far better than the existing methods, by a factor of $10^1$ and $10^3$, respectively.

However, for daily infected cases, ARIMA estimates the future cases better than the proposed method by a very small margin, with a relative RMSE of 0.74. This is also observed for the daily infected cases of Brazil, where the performance of ARIMA is slightly better than that of the proposed method, with a relative RMSE of 0.81. As observed from Fig. 10d and Fig. 11a, the actual cases have fluctuations and thus the low-pass model obtained using the proposed method is not able to predict these high-frequency changes. For Turkey, the performance of the proposed prediction algorithm is superior for cumulative recovered cases and death cases. For daily infection cases, it performs better than ARIMA but falls short when compared to ONMF. The ARIMA model predicts the future cases for Italy with the least RMSE. However, the performance is very close to that of the proposed method.

Considering an average over all ten countries (Figure 17e), the proposed method predicts the recovered cases with an RMSE 0.32 times that of ARIMA, 0.07 times that of SIRD, 0.04 times that of the composite Gaussian growth model, 0.05 times that of the composite Logistic growth model, and ONMF model. For prediction of death cases, the proposed method predicts the 14 days' data with an RMSE 0.40 times that of ARIMA, 0.02 times that of SIRD, 0.04 times that of the composite Gaussian growth model, 0.05 times that of the composite Logistic growth model, and 0.03 times the RMSE obtained with the ONMF model. For daily infected cases, the RMSE of the proposed method is 0.38 times the RMSE obtained when the ARIMA model is used, 0.27 times that of SIRD, 0.56 times that of the composite Gaussian growth model, 0.2 times that of the composite Logistic growth model, and 0.14 times the RMSE of the ONMF model.

In this work, we reviewed and benchmarked the most popular modeling techniques of COVID-19 data estimation and continuous prediction. These models are ARIMA, SIRD, the composite Gaussian growth model, the composite Logistic growth model, and the dictionary learning model (i.e., ONMF). Composite Gaussian and Logistic methods model the COVID-19 data by a number of overlapping Gaussian and Logistic distribution waves, where each basic wave is localised in time and considered as a representation of the epidemic. However, the assumption of having similar waves throughout the epidemic duration may not give realistic forecasts. Therefore, to overcome this drawback, we also reviewed the recently proposed model-free approach of dictionary learning. This method learns the waves directly from the data without assuming any predefined shape. We also proposed a new data modeling strategy by estimating a trend of the data and then using an optimized ARIMA model. The trend is obtained using a low-pass Gaussian filter. The performance of these models was compared based on the RMSE values obtained for the 7-days and 14-days ahead prediction, and it was shown that for most of the cases the performance of the proposed methodology is far superior to the other existing methodologies. The number of daily COVID-19 infections, the cumulative number of recovered cases, and the cumulative deaths reported for India, the USA, the UK, Russia, Brazil, Germany, France, Italy, Turkey, and Colombia have been considered for modeling and continuous prediction.
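The two-stage idea summarised above (extract a trend with a low-pass Gaussian filter, then forecast the trend with an ARIMA-type model) can be sketched as follows. This is an illustrative simplification and not the optimized ARIMA of this study: the filter width `sigma`, the replicated-edge padding, and the fixed ARIMA(1,1,0) structure without a constant, fitted by conditional least squares, are all assumptions made to keep the example dependency-free.

```python
import math

def gaussian_lowpass(series, sigma=2.0):
    """Smooth `series` with a normalised Gaussian kernel (edges replicated)."""
    radius = int(math.ceil(3 * sigma))
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(series)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * series[j]
        smoothed.append(acc)
    return smoothed

def forecast_arima_110(trend, steps):
    """Forecast `steps` values with ARIMA(1,1,0), no constant term.

    The AR(1) coefficient of the first differences is estimated by
    conditional least squares. Requires at least three trend samples.
    """
    diffs = [b - a for a, b in zip(trend, trend[1:])]
    num = sum(x * y for x, y in zip(diffs[1:], diffs[:-1]))
    den = sum(x * x for x in diffs[:-1])
    phi = num / den if den > 0 else 0.0
    level, step = trend[-1], diffs[-1]
    out = []
    for _ in range(steps):
        step = phi * step      # predicted next difference
        level += step          # integrate back to the original scale
        out.append(level)
    return out

# Hypothetical data: a linear rise with alternating day-to-day fluctuation.
noisy = [10.0 * t + (3.0 if t % 2 else -3.0) for t in range(40)]
trend = gaussian_lowpass(noisy, sigma=2.0)
forecast = forecast_arima_110(trend, steps=7)
print([round(v, 1) for v in forecast])
```

Because the filter removes the alternating component, the forecast follows only the smooth trend. This mirrors the limitation noted earlier for countries whose daily counts fluctuate strongly. The replicated-edge padding also flattens the trend near the last samples, so a practical implementation would need to treat the series boundary more carefully.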

Although we have considered only the above-mentioned countries in our study, the proposed methodology can be used for the continuous monitoring and prediction of COVID-19 in any country, state, or region. So far, a number of methods have been proposed for forecasting COVID-19 data. The utility of these methods is still limited for long-term forecasting, for predicting the onsets of COVID-19 waves, and for exploring the geographical correlation of the data. Furthermore, it is also imperative to provide insight into any model with respect to assessment, planning, and policy-making for combating the spread of COVID-19. Policy data integration with forecasting models is another important work that is worth exploring in the future as policies and public health guidelines issued at the state and local level could aid in the advance-


The authors would also like to thank the anonymous reviewers and editors for providing constructive review comments to improve the manuscript.

Highlights: Present a comprehensive review, comparison, and benchmarking of popular data modeling methods for continuous forecasting and monitoring of COVID-19 pandemic  Propose O-ARIMA model for predicting COVID-19 cases that yields minimum prediction error compared to the existing methods  Report one-week and two-week forecasting of the number of COVID-19 infected, recovered, and deaths for the ten most affected counties  O-ARIMA can be used for the continuous monitoring and prediction of COVID- 19 pandemic of any country, state, and regionThe authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: None Y Declaration of Interest Statement