key: cord-0817151-oe2ws489
authors: Kalantari, Mahdi
title: Forecasting COVID-19 Pandemic Using Optimal Singular Spectrum Analysis
date: 2020-12-05
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110547
sha: 9c25fa77070a89c84251f28ea115f7be9993b0dd
doc_id: 817151
cord_uid: oe2ws489

Coronavirus disease 2019 (COVID-19) is a pandemic that has affected all countries in the world. The aim of this study is to examine the potential advantages of Singular Spectrum Analysis (SSA) for forecasting the number of daily confirmed cases, deaths, and recoveries caused by COVID-19, which are the three main variables of interest. This paper contributes to the literature on forecasting COVID-19 pandemic in several ways. Firstly, an algorithm is proposed to calculate the optimal parameters of SSA including window length and the number of leading components. Secondly, the results of two forecasting approaches in the SSA, namely vector and recurrent forecasting, are compared to those from other commonly used time series forecasting techniques. These include Autoregressive Integrated Moving Average (ARIMA), Fractional ARIMA (ARFIMA), Exponential Smoothing, TBATS, and Neural Network Autoregression (NNAR). Thirdly, the best forecasting model is chosen based on the accuracy measure Root Mean Squared Error (RMSE), and it is applied to forecast 40 days ahead. These forecasts can help us to predict the future behaviour of this disease and make better decisions. The dataset of Center for Systems Science and Engineering (CSSE) at Johns Hopkins University is adopted to forecast the number of daily confirmed cases, deaths, and recoveries for top ten affected countries until October 29, 2020. The findings of this investigation show that no single model can provide the best model for any of the countries and forecasting horizons considered here. However, the SSA technique is found to be viable option for forecasting the number of daily confirmed cases, deaths, and recoveries caused by COVID-19 based on the number of times that it outperforms the competing models.

until March 31, 2020 . A new hybrid model of discrete wavelet decomposition and ARIMA models have been developed in [23] to make one month ahead pre- 55 diction of death cases in Italy, Spain, France, the United Kingdom (UK), and the United States of America (USA). More information about other time series models used for forecasting COVID-19 disease can be found in [24] [25] [26] [27] [28] [29] .

Despite many attempts to model COVID-19 pandemic, few researches to the best of our knowledge have utilized Singular Spectrum Analysis (SSA) tech- 60 nique to forecast COVID-19. We found that a modified SSA approach has been used in [30] to predict COVID-19 pandemic in Saudi Arabia. Also, the recurrent forecasting method of SSA has been applied in [31] to provide predictive modelling of COVID-19 cases in Malaysia. The SSA has been a rapidly developing method of time series analysis. This non-parametric technique is widely 65 used in a variety of fields such as signal processing, finance, economics, image processing, meteorology, engineering, medicine, biology and genetics. The main characteristics of SSA are neither a parametric model nor stationary condition have to be assumed for a time series. Whilst the review of all applications of SSA is beyond the scope of this paper, we refer interested readers to [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] . 70 For a whole and detailed information on the theory and applications of SSA, see [50, 51] . A comprehensive review of SSA and description of its modifications and extensions can be found in [52] . Due to the great potential of SSA to forecast future data, we believe that this method can provide a reliable forecast for COVID-19 time series data and therefore, this motivates us to apply the SSA. 75 The number of confirmed cases, deaths, and recoveries caused by COVID-19 are the three main variables of interest that have been reported every day.

Accurate forecast of these variables is crucial and it can allow us to better understand the global impact of corona virus and correct planning in the future, such as estimating the required number of hospital beds or changing the so- 

The SSA technique has various modifications and extensions, which some of them are explained in [53] . The most fundamental version of the SSA is called Basic SSA. Here, we briefly explain the theory underlying Basic SSA and in doing so we mainly follow [51, 53] . Also, two types of SSA forecasting meth-115 ods namely Recurrent forecasting (R-forecasting) and Vector forecasting (Vforecasting) are briefly reviewed. It is noteworthy that there are many software applications which are applied in SSA such as Caterpillar-SSA and SAS/ETS.

In this research, we apply the free available R package Rssa to conduct SSA stages and to obtain recurrent and vector forecasting. More details on this 120 package can be found in [54] [55] [56] .

The SSA technique consists of two complementary stages: Decomposition and Reconstruction. Each of these stages includes two separate steps. At the decomposition stage, a time series is decomposed into several interpretable com-125 ponents such as trend, seasonal and cyclical components, which enables us to signal extraction and noise reduction. At the reconstruction stage, interpretable components are reconstructed, which can be used to forecast new data points. In embedding step, the observed time series {x 1 , . . . , x N } is transformed 130 into the L × K matrix X, whose columns comprise X 1 , . . . , X K , where X i = (x i , . . . , x i+L−1 ) T ∈ R L and K = N −L+1. The matrix X is called the trajectory matrix. This matrix is a Hankel matrix in the sense that all the elements on the anti-diagonals i + j = const. are equal. This step has only one parameter L, which is called the window length. The window length is commonly chosen 135 such that 2 ≤ L ≤ N/2, where N is the length of the time series {x 1 , . . . , x N }.

In Singular Value Decomposition (SVD) step, the trajectory matrix X is decomposed into X = UΣV T , where U and V are orthogonal and Σ is a diagonal matrix. The diagonal entries of the matrix Σ are called the singular values of X and denoted by σ 1 , . . . , σ L in decreasing order of magnitude (σ 1 ≥ · · · ≥ σ L ≥ 0).

The columns of U are called left singular vectors and those of V are called right singular vectors. If d = max{i, such that σ i > 0} = rank(X) then the SVD of the trajectory matrix X can be written as follows: is defined as X I = X i1 + · · · + X ip , that is, summing the matrices within each group. With the SVD of X, the split of the set of indices {1, . . . , d} into the m disjoint subsets I 1 , . . . , I m corresponds to the following decomposition:

The main goal of diagonal averaging is to transform each matrix X Ij of the grouped matrix decomposition (2) into a Hankel matrix, which can subsequently be converted into a new time series of length N . Let A be an L × K matrix with elements a ij , 1 ≤ i ≤ L, 1 ≤ j ≤ K. By diagonal averaging, the matrix A is transferred into the Hankel matrix HA with the elements a s over the anti-diagonals (1 ≤ s ≤ N ) using the following formula:

where

the number of elements in the set A s . By applying diagonal averaging (3) to all the matrix components of (2), the following expansion is obtained: X = X I1 + · · · + X Im , where X Ij = HX Ij , j = 1, . . . , m. This is equivalent to 145 the decomposition of the initial series {x 1 , . . . , x N } into a sum of m series:

N } corresponds to the matrix X I k . In this paper, we denote the number of leading eigentriples corresponding to the signal (noise-free time series) by r. 3. The numbers z N +1 , . . . , z N +h are the h step ahead vector forecasts.

The window length (L), which is used in the embedding step of SSA, plays a pivotal role in the SSA technique; because the whole procedure of SSA depends 170 upon this parameter. Another important parameter is the number of leading eigentriples (r) that is required to reconstruct and forecast the signal (noisefree time series). In order to find the optimal values of L and r, we apply a cross-validation procedure. This method of parameter choice is based on the minimization of Root Mean Squared Error (RMSE) within the validation 175 (test) period for a given forecasting horizon h (i.e. the number of periods for forecasting). In Algorithm 1, the details of finding optimal (L, r) are described:

Algorithm 1 Calculation of optimal (L, r) Require: Time series {x 1 , . . . , x N }, forecasting horizon h;

where N is equal to the time series length and x is the greatest integer less than or equal to x. 

where x j denotes the forecast of time series at time j. RMSEs across all test sets as follows: 

In this section, the other commonly used time series forecasting methods applied in this investigation are briefly explained. The ARIMA technique is one of the most established and widely used time series forecasting methods. A non-seasonal ARIMA(p, d, q) model is given by

where {x t } is a time series, B is the backshift operator defined as Bx t = x t−1 , [57] . Also, the seasonal ARIMA(p, d, q)(P, D, Q) m model is written

where m is equal to the number of observations per year, Φ(B m ) = 1 − Φ 1 B m − · · · − Φ P B P m and Θ(B) = 1 + Θ 1 B + · · · + Θ Q B Qm .

Selecting an appropriate model order, that is the values p, d, q, P, D and Q, is a major task in ARIMA modelling. In this paper, we use the auto.arima 185 function from the forecast package of R software to find the best ARIMA model automatically and estimate its parameters. For more information on how this function works and examples of applications, see [58] .

If the time series {x t } exhibits a long-range dependence, then the parameter 190 d can be allowed to have non-integer values in an ARIMA model, which is also called an ARFIMA model. We apply the arfima function from the forecast package to find automatically the best ARFIMA model. This function selects p and q and estimates the parameters of model using an algorithm proposed in [58] , whilst the algorithm provided in [59] is applied to estimate the parameters 195 including d.

Exponential smoothing methods are among the most widely used forecasting procedures in practice. These were originally classified by Pegels' taxonomy [60] and later extended by Gardner [61] , modified by Hyndman et al. [62] , and 200 extended again by Taylor [63] , giving a total of fifteen methods. It has shown that the exponential smoothing family has good forecast accuracy over several forecasting competitions [64] [65] [66] and is especially suitable for short time series.

Some of well-known methods such as simple (or single) exponential smoothing, Holt's linear method, additive and multiplicative Holt-Winters' methods 205 are special cases of exponential smoothing techniques. In order to refer to the three components error, trend, and seasonality in exponential smoothing methods; the notation ETS is proposed in [58] and we also use this notation. The ETS models can capture a variety of trend and seasonal structures (additive or multiplicative) and combinations of those. A detailed description of ETS can 210 be found in [67] and is therefore not repeated here. We apply the ets function from the forecast package to find automatically the best ETS model. This function implement the innovations state space modelling framework described in [67] for parameter estimation and forecasting.

An innovations state space modelling framework has been introduced in [68] for forecasting complex seasonal time series such as those with multiple seasonal periods, high-frequency seasonality, non-integer seasonality, and dual-calendar effects. This model, which is called BATS, is an exponential smoothing state space model with BoxCox transformation, ARMA errors, trend and seasonal 220 components. This model is a generalization of the traditional seasonal innovations models to allow for multiple seasonal periods. The notation BATS is an acronym for BoxCox transform, ARMA errors, Trend, and Seasonal components. In TBATS model, the trigonometric representation of seasonal components based on Fourier transform is used and the initial T in the notation 225 TBATS stands for trigonometric. For more information on the theory and applications of TBATS, see [68] . The tbats function is made available through the forecast package to fit TBATS model to a time series.

There has been an increasing interest in using neural networks to model and This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. Let us here briefly present some details of Neural Network Autoregression (NNAR) model and in doing so we mainly follow [57] . In the NNAR model, the lagged values of the time series can be 240 used as inputs to a neural network. The notation NNAR(p, k) is used in [57] to indicate feed-forward networks with one hidden layer, p lagged inputs and k nodes in the hidden layer. In addition, a seasonal NNAR model has the notation

. , x t−P m as inputs with k neurons in the hidden layer. The nnetar function in the forecast 245 package fits an NNAR(p, P, k) m model to time series data. In this function, the values of p and P are selected automatically if they are not specified. More details on NNAR model and its applications can be found in [57] .

The accuracy of forecasting largely depends on the quality of data and re- This data has developed to a standard resource for researchers and the general audience interested in assessing the global spreading of the virus.

The data is provided at country and sub-country levels.

• European Centre for Disease Prevention and Control (ECDC) 270 The data is updated daily and contains the latest available public data on the number of new COVID-19 cases reported per day and per country. An alternative data source for governmental interventions.

The data is provided by Apple at country and sub-country levels.

• Google COVID-19 Community Mobility Reports data This data is available at the country, regional and U.S. county level.

It presents data on the search volume for the term coronavirus. This data can be used to assess the public attention to COVID-19 across countries and over time within a given country. The data is available at the country, regional and city level but availability varies across countries.

• World Bank

These data contain country level information provided by the World Bank and allow researchers to calculate per capita measures of the virus spread.

Also, these data can help researchers to assess the association of macroeconomic variables with the development of the virus. It is apparent from Figure 6 that the number of recovered cases in Russia In order to provide a better understanding on the nature of the confirmed cases data, some descriptive statistics of the number of confirmed cases are reported for ten countries in Table 1 

In this section, the performance of R-SSA, V-SSA and other time series forecasting methods reviewed in Section 3 are evaluated by applying them 355 to confirmed cases, deaths, and recoveries described in Section 4. The accuracy of forecasting results are measured using RMSE. In order to compute the RMSE of each forecasting method corresponding to the forecasting horizon h (i.e. RM SE h defined in (6)), Steps 1-5 of Algorithm 1 are used. It is noteworthy that the optimal (L, r) is applied to produce R-SSA and V-SSA forecasting.

In Table 5 , the optimal (L, r) are reported for confirmed, deaths, and recovered series of the ten countries. shown in Figure 7 indicate that there will be a dramatic increase in the number of confirmed cases of France, Spain, and UK. However, the rate of growth will be slower in Russia and Argentina. This increase will happen slowly in India, Brazil, Colombia, and Mexico. Also, this results reveal that the number of confirmed cases will be decreasing in USA. According to the forecasting results depicted in Figure 9 , the number of 405 recovered cases will rise in USA, Russia, France, and Argentina; however, there will be a decline in Brazil, and especially in India. This quantity will tend to a constant in Colombia and Mexico. In addition, the number of recovered cases in UK will fluctuate, but the trend will upward. 

In this study, we have evaluated the potential advantages of SSA for forecasting the number of daily confirmed cases, deaths, and recoveries caused by COVID-19. In order to calculate the optimal parameters of SSA including win- University has been adopted to forecast the number of daily confirmed cases, deaths, and recoveries for top ten affected countries until 29 October 2020. It should be noted that the dataset of CSSE has a considerable disadvantage. If 420 the cumulative data of confirmed cases, deaths, and recoveries are transformed into daily data, some negative data are obtained that are apparently irrational.

In order to deal with this issue, first, we considered the negative values and outliers as missing data. Then, these missing values were imputed by Kalman

Smoothing method. It is worth mentioning that the present study is unique with 425 regard to using optimal version of V-SSA and R-SSA, and comparing the results to those from ARIMA, ARFIMA, ETS, TBATS, and NNAR. The findings of this study can be summarised as follows:

• No single model can provide the best forecast of the number of confirmed cases, deaths, and recoveries for all ten countries considered here.

• Based on the number of times that R-SSA and V-SSA forecasting techniques outperform the other models across all horizons, these methods are viable options for forecasting the number of daily confirmed cases, deaths, and recoveries caused by COVID-19.

• There will be a rapid rise in the number of confirmed cases of France, 435 Spain, and UK. However, the rate of increase will be slower in Russia and

Argentina. This growth will occur slowly in India, Brazil, Colombia, and

Mexico. The results of point forecasts reveal that the number of confirmed cases will be decreasing in USA.

• The number of deaths of Russia and Argentina will increase dramatically; 440 however, it will be decreasing in India, France, Colombia, and UK. The number of deaths will fluctuate around an almost constant value in USA, Brazil, and Mexico.

• There will be an increase in the number of recovered cases of USA, Russia, France, and Argentina; however, there will be a decline in Brazil, and 445 especially in India. This quantity will tend to a constant in Colombia and

Mexico. Also, the number of recovered cases in UK will fluctuate, but the trend will upward.

In this paper, we have used the optimal version of two forecasting techniques 450 of SSA, namely R-SSA and V-SSA, for forecasting the number of daily confirmed cases, deaths, and recoveries caused by COVID-19. In order to evaluate the performance of these approaches based on the RMSE criterion, the forecasting results have been compared to those from other commonly used time series forecasting methods including ARIMA, ARFIMA, ETS, TBATS, and NNAR. 455 We considered only the first ten countries in terms of the number of cumulative confirmed cases. These countries include USA, India, Brazil, Russia, France, Spain, Argentina, Colombia, UK, and Mexico. The evidence from this investigation shows that there is not a single model to provide the best model for any of the countries and forecasting horizons considered in this study. However, 460 we have found that the optimal SSA technique can provide a powerful tool for forecasting the number of daily confirmed cases, deaths, and recoveries caused by COVID-19 based on the number of times that it outperforms the competing models.

Our study has an obvious shortcoming. The forecasting methods used in 465 this investigation may produce some negative point forecasts that are clearly meaningless as the number of confirmed cases, deaths, and recoveries. In order to make positive point forecasts, we suggest using count time series models.

This work has gone some way towards enhancing our understanding of SSA capabilities for forecasting the COVID-19 pandemic. The results of this study 470 enable forecasters to choose the most appropriate model (from those considered here) based on the country and horizon for forecasting the number of confirmed cases, deaths, and recoveries caused by COVID-19. We hope that our forecasts will be a useful tool for governments towards making appropriate decisions to control the disease and prevent further damages. In terms of future research, 475 we will apply the multivariate version of SSA that employs the time-dependent correlations between several time series.

Forecasting covid-19 dynamics 480 in Brazil: A data driven approach

Prediction and analysis of COVID-19 positive cases using deep learning models: A descriptive case study of

Forecasting COVID-19

A model for COVID-19 prediction in Iran based on China parameters

Deep learning methods for fore-495 casting COVID-19 time-series data: A comparative study

Recognition of COVID-19 disease from X-ray images by hybrid model consisting of 2D curvelet transform, chaotic salp swarm 500 algorithm and deep learning technique

Digital currency forecasting with chaotic meta-heuristic bio-inspired signal processing techniques

A new forecasting model with wrapper-based feature selection approach using multi-objective optimization technique for chaotic crude oil time series

Real-time estimation and prediction of mortality caused by COVID-19 with patient information based algorithm

Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions

Why is it difficult to accurately predict the COVID-19 epidemic?

Modeling and forecasting the COVID-530 19 pandemic in India

On forecasting the spread of the COVID-19 in Iran: The second wave

Forecast and evaluation of COVID-19 spreading in USA with reduced-space Gaussian process regression

COVID-19) using machine learning methods

Hassanien, Forecasting models for coronavirus disease (COVID19): A survey 545 of the state-of-the-art

Forecasting the novel coronavirus COVID-19

Forecasting COVID-19 spread in Malaysia, Thailand, and Singapore

ARIMA modelling & fore-555 casting of COVID-19 in top five affected countries

Modelling and forecasting of new cases, deaths and recover cases of COVID-19 by using Vector Autoregressive model in

Was school closure effective in mitigating coronavirus disease 2019 (COVID-19)? Time series analysis using Bayesian inference

Development of new hybrid model of discrete wavelet decomposition and autoregressive integrated moving average (ARIMA) models in application to one month forecast the casualties cases of COVID-19

Time series modelling to forecast the confirmed and recovered cases of COVID-19

Statistical analysis of forecasting COVID-19 for upcoming month in Pakistan

Forecasting the spread of 580 the COVID-19 pandemic in Saudi Arabia using ARIMA prediction model under current public health interventions

Time series forecasting of COVID-19 trans-585 mission in Canada using LSTM networks

Time series analysis and forecast of the COVID-19 pandemic in India using genetic programming

Time series prediction of COVID-19 by mutation rate analysis using recurrent neural network-based LSTM model

Predicting COVID-19 pandemic in Saudi Arabia using modified Singular Spectrum Analysis

Predictive modelling of covid-19 cases in Malaysia based on recur-600 rent forecasting-Singular Spectrum Analysis approach

Singular spectrum analysis for modeling seasonal signals from GPS time 605 series

Periodicity of carbon element distribution along casting direction in continuous-casting billet by using Singular Spectrum Analysis

Singular spectrum analysis for enhancing the sensitivity in structural damage detection

Quantifying the correlation between the MEI and LOD variations by decomposing LOD with Singular Spectrum Analysis

Singular Spectrum Analysis for signal extraction in stochastic volatility models

Predicting the Brexit outcome using Singular Spectrum Analysis

Minute-ahead stock price forecasting based on singular spectrum analysis and support vector regression

Forecasting tourist arrivals using multivariate singular spectrum analysis

Monthly forecasting of 635 GDP with mixed-frequency multivariate singular spectrum analysis

Singular Spectrum Analysis of Biomedical Signals

Evaluating the analytical distribution of bicoid gene expression profile

Embedding dimension selection for adaptive singular spectrum analysis of EEG signal

A new signal processing approach for discrimination of EEG recordings

Improving the performance 650 of the SSVEP-based BCI system using optimized singular spectrum analysis (OSSA)

Bicoid signal extraction: Another powerful approach

Removal of EMG artifacts from multichannel EEG signals using combined singular spectrum analysis and canonical correlation analysis

Risk management, signal processing and econometrics: A new tool for forecasting the risk of disease outbreaks

Entropy-based pattern learning based on singular spectrum analysis components for assessment of physiological signals

Analysis of Time Series Structure: SSA and related techniques

Singular Spectrum Analysis for Time Series

Particularities and commonalities of singular spectrum analysis as a method of time series analysis and signal processing

Singular Spectrum Analysis with R

Computation-and space-efficient implementation of 680 SSA

Basic singular spectrum analysis and forecasting with R

Multivariate and 2D extensions of singular spectrum analysis with the Rssa package

Forecasting: Principles and practice

Automatic time series forecasting: The forecast package for R

Space-time modelling with long-memory dependence: Assessing ireland's wind power resource

Exponential forecasting: Some new variations

Exponential smoothing: The state of the art

A state space framework for automatic forecasting using exponential smoothing methods

Exponential smoothing with a damped multiplicative trend

The accuracy of extrapolation (time series) methods: Results of a forecasting competition

The M3-Competition: results, conclusions and implications

The M4 Competition: 100,000 time series and 61 forecasting methods

Forecasting with 725 exponential smoothing: the state space approach

Forecasting time series with complex seasonal patterns using exponential smoothing

Top 100 R resources on novel COVID-19 coronavirus

Tidy and Visualize Covid-19 Related Data

{tidycovid19}: New data and documentation

imputeTS: Time series missing value imputation in R

Testing for unit roots in autoregressive-moving average models of unknown order

tseries: Time Series Analysis and Computational Finance, R package version 0

Draft preparation, Visualization, Validation, Reviewing and Editing

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work