key: cord-0326304-ictzdlr9 authors: Prajapati, Samyak; Swaraj, Aman; Lalwani, Ronak; Narwal, Akhil; Verma, Karan title: Comparison of Traditional and Hybrid Time Series Models for Forecasting COVID-19 Cases date: 2021-05-05 journal: nan DOI: 10.21203/rs.3.rs-493195/v1 sha: 4502646e4b913c7f65a015a7af1c203c62bb2385 doc_id: 326304 cord_uid: ictzdlr9 Time series forecasting methods play a critical role in estimating the spread of an epidemic. The coronavirus outbreak of December 2019 has already infected millions all over the world and continues to spread. Just when the curve of the outbreak had started to flatten, many countries again started to witness a rise in cases, which is now being referred to as the second wave of the pandemic. A thorough analysis of time-series forecasting models is therefore required to equip state authorities and health officials with immediate strategies for future times. The aims of this study are three-fold: (a) to model the overall trend of the spread; (b) to generate a short-term forecast of 10 days in the countries with the highest incidence of confirmed cases (USA, India and Brazil); (c) to quantitatively determine the algorithm that is best suited for precise modelling of the linear and non-linear features of the time series. The forecasting models are compared on the total cumulative cases of each country by comparing the reported data with the predicted values, and then ranking the algorithms (Prophet, Holt-Winters, LSTM, ARIMA, and ARIMA-NARNN) based on their RMSE, MAE and MAPE values. The hybrid combination of ARIMA and NARNN (Nonlinear Auto-Regression Neural Network) gave the best result among the selected models, with an RMSE almost 35.3% better than that of ARIMA, one of the most prevalent methods of time-series prediction.
The results demonstrated the efficacy of the hybrid ARIMA-NARNN model over other forecasting methods such as Prophet, Holt-Winters, LSTM, and ARIMA in capturing the linear as well as non-linear patterns of epidemic datasets.

The novel coronavirus, which first appeared in Wuhan, China in late 2019, has already infected over 257 million people and caused over 5.1 million deaths worldwide [1]. The ground zero for the zoonotic spillover has been triangulated to the live-food markets of Wuhan, where the virus spread proximally due to direct exposure to animal shedding, bodily fluids, blood, and secretions [2]. In the absence of any tangible treatment, the pandemic has ruptured the concept of normal life while spreading with a rate of 1.8 (in India) [3]. To flatten the pandemic curve, several intervention policies have been implemented in countries all over the world. However, these policies, which include mobility and transportation restrictions, have provided only temporary relief and have not flattened the curve, which has seen multiple waves of COVID-19 cases, as exhibited in Figure 1. The situation has degraded even further in densely populated countries like India and Brazil, which cannot afford the luxury of lockdown due to socio-economic reasons. Therefore, rapid and predictable up-scaling of the healthcare framework is now critical to ensuring the availability of appropriate facilities during these demanding times. In our earlier work, we also presented a model for classifying COVID-infected X-rays [4]. Each nation now aims at vaccinating its citizens against the virus, but multiple studies have claimed that vaccine-elicited immunity is short-term and that the majority of the populace would require booster shots in the near future. Thus, there remains a sense of uncertainty about the ongoing pandemic and the spread of its contagion.
This section elaborates on the data collection segment of our work, followed by a short description of the forecasting models that were used. The metrics used to assess the performance of the models are given at the end of this section. The time-series data was fed to all the stated models and their results were compared based on their performance.

The COVID-19 Data Repository by Johns Hopkins University's Center for Systems Science and Engineering contains the time-series dataset for the cumulative count of confirmed cases, reported deaths and recovered cases worldwide [1]. For our study, we chose three countries that were severely affected by COVID-19, namely the United States, India and Brazil. Since the dataset contained cumulative counts, each data point was differenced with its preceding data point to generate a daily incidence time series as well. All the models were trained on three different intervals. The train-test split for each interval is shown in Table 1. The last ten days of each interval were used for testing. Ten days were selected as the test set because the target of this work was to analyze the short-term performance of forecasting models during the growing peak of the 'First Wave' of COVID-19 cases; short-term forecasting would be helpful in predicting the spread of COVID-19 and in focusing the attention of the state authorities on a particular region of the country. The models were then ranked based on Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). In order to support the results on India, the models were analyzed on the USA and Brazil as well (the top 3 countries with the highest incidence of COVID-19).
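The data preparation described above (differencing cumulative counts into daily incidence and holding out the last ten days of an interval) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `daily_incidence` and `train_test_split` and the sample counts are hypothetical, chosen only to show the mechanics.

```python
from typing import List, Tuple


def daily_incidence(cumulative: List[int]) -> List[int]:
    """Difference each cumulative count with its preceding data point."""
    return [curr - prev for prev, curr in zip(cumulative, cumulative[1:])]


def train_test_split(series: List[int], test_days: int = 10) -> Tuple[List[int], List[int]]:
    """Hold out the last `test_days` observations of an interval for testing."""
    if not 0 < test_days < len(series):
        raise ValueError("test_days must be positive and shorter than the series")
    return series[:-test_days], series[-test_days:]


# Hypothetical cumulative case counts (illustrative only, not real data).
cumulative = [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91]

daily = daily_incidence(cumulative)            # one element shorter than the input
train, test = train_test_split(cumulative, test_days=10)
```

Differencing shortens the series by one point, so the first day of an interval has no daily value unless the count preceding the interval is also available.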
Auto-Regressive Integrated Moving Average (ARIMA) was proposed by Box and Jenkins in the 1970s [42] as a model which takes varying trends, seasonal changes and random disturbances into account to predict the future values of a series. For these reasons, it is today one of the most popular models used for forecasting time series. It is denoted as ARIMA(p, d, q), where p and q are the orders of the AR and MA terms of the model respectively, and d represents the level of differencing used in the model to achieve stationarity. Put more simply, p is the number of lagged elements that have an influence on the current element, d is the number of times the series is differenced to achieve constant mean and variance, and q is the number of lagged error terms. It can mathematically be represented as,

$$\hat{y}_t = \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}$$

where $\hat{y}_t$ denotes the computed value of the forecast at the given time $t$, $y_{t-i}$ are past values of the (differenced) series, $\phi_i$ and $\theta_j$ are the coefficients of the AR and MA models respectively, and $\varepsilon_t$ is the random error occurring at time $t$.

LSTMs are a form of recurrent neural network (RNN) and, as suggested by their name, Long Short-Term Memory, they allow the model to retain information about the data that was previously computed. While most forms of RNNs can utilize the previous data in some form, LSTMs have the intrinsic ability to "store" the data for a short duration. This is achieved by the use of multiple "gates" and by modifying the cell state (Figure 2). Each gate is essentially a function which computes an output determining the way the cell state has to be modified.
Each gate can be attributed to an activation function. With $x_t$ the feature vector, $h_{t-1}$ the output of cell $t-1$, $c_{t-1}$ the cell state after cell $t-1$, $c_t$ the cell state after cell $t$ and $h_t$ the output of cell $t$, the resulting computations in the standard LSTM formulation are,

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)$$
$$\tilde{c}_t = \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $W$ and $b$ are the weights and biases of the forget ($f$), input ($i$), candidate ($c$) and output ($o$) gates.

For the Holt-Winters model, $m$ is the period length of the seasonal variation and $k$ is the number of steps ahead from any arbitrary step $i$.

Prophet is an open-source time-series forecasting library developed by Facebook. It runs on Stan, a platform for statistical modeling and high-performance statistical computation. It is based on a decomposable additive model comprising three major components: trend, seasonality and holidays. The model can be written as,

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$

where $g(t)$ represents the piecewise linear or logistic growth curve for modelling the non-periodic changes in the time series, $s(t)$ represents the periodic changes that occur with seasonality, $h(t)$ includes the effects of holidays (which can be provided by the user) along with schedules that may be irregular in nature, and $\varepsilon_t$ is the error term, which takes into consideration any irregular changes that may not be accommodated by the model.

In our previous work, we illustrated that a hybrid model of ARIMA and a NARNN [44] can selectively work on a time series by isolating and working on individual areas of strength (summarized in Figure 3). NARNNs are generally known for their ability to model the non-linear features of time-series data [45] [46] [47]. A NARNN employs the architecture of recurrent neural networks and uses its embedded memory with feedback connections. This can be exhibited by the mathematical description of the NARNN model,

$$y_t = f(y_{t-1}, y_{t-2}, \ldots, y_{t-n}) + \varepsilon_t$$

In the hybrid model, the series is treated as the sum of a linear component $L_t$ and a non-linear component $N_t$, and the residual left after the linear fit is

$$y_t = L_t + N_t, \qquad e_t = y_t - \hat{L}_t$$

where $\hat{L}_t$ is the value forecasted by the ARIMA model at time $t$.
By modelling the residuals using ANNs, the non-linear segments can be captured. The residuals are therefore fed into a NARNN model comprising $n$ input nodes, which models them as,

$$e_t = f(e_{t-1}, e_{t-2}, \ldots, e_{t-n}) + \varepsilon_t \qquad (16)$$

where $f$ is the non-linear function evaluated by the NARNN model, and the error generated in doing so is represented by $\varepsilon_t$. The final forecast is then given by the equation below, where $\hat{y}_t$ indicates the final forecast of the time series at time $t$, $\hat{L}_t$ is the ARIMA forecast and $\hat{N}_t$ is the residual forecast.

$$\hat{y}_t = \hat{L}_t + \hat{N}_t$$

The accuracy of a time series model can be evaluated by comparing the predicted values with the actual values. Several performance measures exist for this purpose; this study employs RMSE, MAE and MAPE. Their mathematical definitions are shown below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|$$
$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$

Here, $n$ stands for the number of data points available, $y_t$ is the original value at time $t$ and $\hat{y}_t$ denotes the estimated value at time $t$. Lower values of RMSE, MAE and MAPE indicate a better fit of the model to the data.

As described earlier in section 2.1, the models were analyzed on three sets of intervals (22nd Jan - 15th May and so on, where the last ten days were used for testing), prioritizing India and using the incidence counts of the other countries to confirm the observations. In order to attain the best fit of the ARIMA model, the parameters p, d and q have to be selected appropriately. First, the Augmented Dickey-Fuller (ADF) unit root test was conducted to check the stationarity of the time series. At a significance level of 0.05, it was found that the time-series data was not stationary and needed differencing to achieve stationarity. After differencing the time series, the ADF test was repeated to check whether stationarity had been achieved. After this, ACF and PACF plots of the time series data were generated to identify appropriate AR and MA parameters for the ARIMA model. With the p, d, q parameters identified, the model was fit to the data.
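The three error measures defined earlier in this section (RMSE, MAE and MAPE) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation; the function names and the sample values are hypothetical. MAPE is assumed to be reported in percent and is undefined when an actual value is zero.

```python
import math
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> int:
    """Return n after checking that both series are non-empty and aligned."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("series must be non-empty and of equal length")
    return len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error."""
    n = _check(actual, predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error."""
    n = _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    n = _check(actual, predicted)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


# Hypothetical values, used only to exercise the functions.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]
scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

Because RMSE and MAE are expressed in case counts while MAPE is relative, rankings across countries with very different case totals are easier to compare on MAPE.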
AIC and BIC values were used to verify the fit of the model to the data; after verification, we achieved the results shown in Table 2(a) for India. The actual incidences and the forecasts from each of the models are plotted in Figure 4. [Figure 4 caption, partially recovered: ... the 15th of May, 21st July and the 30th of July, and 1st Aug and the 10th of Aug, 2020.]

From the data in Table 2(a), it is evident that ARIMA performs the best among the popular standalone time-series forecasting methods. Although ARIMA performs well, the hybrid model performs better still, as it is able to map the non-linear components of the forecast as well. To further substantiate the superiority of our hybrid model over ARIMA alone, we compared the performance of ARIMA and our hybrid model on the same intervals for the USA and Brazil; the results are tabulated in Tables 2(b) and 2(c) and plotted in Figure 5. It is evident from these figures and tables that the hybrid ARIMA-NARNN model is able to outperform the standard ARIMA by predicting the residuals and working upon them to generate a more accurate forecast. In order to establish a deeper analysis on India, the hybrid model was further subjected to forecasting daily incidences of COVID-19. Since our prime motive was to compare the efficacy of popular forecasting models in their short-term prediction ability during the first wave of COVID-19 cases, predictions of daily observed cases, daily reported deaths and daily recovered cases were analyzed; they are tabulated in Table 3 and plotted in Figure 6.

Our study highlighted the importance of analyzing both linear and non-linear patterns in a time series forecasting model. From Tables 2(a), 2(b) and 2(c), we see clearly that the RMSE value of the hybrid model is the lowest among the models considered. This is attributed to the hybrid model having the ability to train itself on the non-linear features of the data.
In the Indian time series of total cases, there is a notable rise in the non-linear features along with the linear features. While ARIMA is able to perform well on just the linear features, the hybrid model is able to analyze the non-linear features as well and substantially improves its performance, thus exhibiting the lowest RMSE among the models. Most countries are now experiencing their share of the surge of COVID cases, aptly named the "second wave". This surge is the result of multiple factors, chiefly the softening of the perceived threat in the mindset of the common folk and the relaxation of government-imposed policies due to economic slowdowns [48] in several sectors. In such times, when governing bodies are struggling to stabilize economic recessions, true and precise forecasting of seasonal diseases and the spread of contagions is an ever-growing priority. Among the chosen time-series forecasting models, ARIMA performed very well, and its performance was further improved by our implementation of the ARIMA-NARNN hybrid model. While it is important to model the long-term spread of contagions, short-term forecasts also play a vital role in the rapid deployment of resources and manpower, especially in developing countries like India and Brazil, which have dense populations and dynamic demographics. The ever-increasing loss of wildlife habitat drives animals in search of a new home; this search brings them closer to us and, as a consequence, also exposes us to them. Keeping this in mind, it is plausible that we may be exposed to many more zoonotic pathogens in the future, and the only way to circumvent another pandemic is to prepare ourselves to monitor and curb the spread of infectious diseases.
Practical insights of how the spread of diseases may transpire would lead to the development of an understanding between the policymakers, and hence preferable allocation and management of crucial resources under tight time constraints.

[1] An interactive web-based dashboard to track COVID-19 in real time
[2] The Man Who Saw the Pandemic Coming
[3] Projections for novel coronavirus (COVID-19) and evaluation of epidemic response strategies for India
[4] Classification of COVID-19 on chest X-Ray images using Deep Learning model with Histogram Equalization and Lungs Segmentation
[5] Predicting the outbreak of the hand-foot-mouth diseases in china using recurrent neural network
[6] Application of time series methods for dengue cases in North India (Chandigarh)
[7] Case fatality ratio estimates for the 2013-2016 west African Ebola epidemic: application of boosted regression trees for imputation
[8] Comparison of time series models predicting trends in typhoid cases in northern India
[9] Trend prediction of influenza and the associated pneumonia in Taiwan using machine learning
[10] Identifying outbreaks of porcine epidemic diarrhoea virus through animal movements and spatial neighbourhoods
[11] Predicting malarial outbreak using Machine Learning and Deep Learning approach: A review and analysis
[12] Prediction of the Epidemic Peak of Coronavirus Disease in Japan, 2020
[13] SEIR and Regression Model based COVID-19 outbreak predictions in India
[14] The Framework for the Prediction of the Critical Turning Period for Outbreak of COVID-19 Spread in China based on the iSEIR Model
[15] Data-Based Analysis, Modelling and Forecasting of the novel Coronavirus (2019-nCoV) outbreak. medRxiv
[16] Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China
[17] Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
[18] The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study
[19] Modelling the situation of COVID-19 and effects of different containment strategies in China with dynamic differential equations and parameters estimation
[20] A conceptual model for the coronavirus disease 2019 (COVID-19) outbreak in Wuhan, China with individual reaction and governmental action
[21] Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator days and deaths by U.S. state in the next 4 months
[22] Day Level Forecasting for Coronavirus Disease (COVID-19) Spread: Analysis, Modelling and Recommendations
[23] Neural network-based country wise risk prediction of COVID-19
[24] Machine Learning Approach for Confirmation of COVID-19 Cases: Positive, Negative, Death and Release
[25] COVID-19 Epidemic Analysis using Machine Learning and Deep Learning Algorithms
[26] Application of the ARIMA model on the COVID-2019 epidemic dataset
[27] Brief Analysis of the ARIMA model on the COVID-19 in Italy
[28] An ARIMA model to forecast the spread and the final size of COVID-2019 epidemic in Italy
[29] Forecasting of COVID-19 Confirmed Cases in Different Countries with ARIMA Models
[30] Trend Analysis and Forecasting of COVID-19 outbreak in India
[31] Coronavirus (COVID-19): ARIMA based time-series analysis to forecast near future
[32] Forecasting the dynamics of COVID-19 Pandemic in Top 15 countries in April 2020: ARIMA Model with Machine Learning Approach
[33] Temporal relationship between outbound traffic from Wuhan and the 2019 coronavirus disease (COVID-19) incidence in China
[34] Optimization method for forecasting confirmed cases of COVID-19 in China
[35] Estimation of COVID-19 prevalence in Italy
[36] Exponentially Increasing Trend of Infected Patients with COVID-19 in Iran: A Comparison of Neural Network and ARIMA Forecasting Models
[37] An Eye on the Future of COVID'19: Prediction of Likely Positive Cases and Fatality in India over A 30 Days Horizon using Prophet Model
[38] Trend analysis and forecast of daily reported incidence of hand, foot and mouth disease in Hubei
[39] Comparison of the Ability of ARIMA, WNN and SVM Models for Drought Forecasting in the Sanjiang Plain
[40] Applications and comparisons of four time series models in epidemiological surveillance data
[41] Application of an autoregressive integrated moving average model for predicting the incidence of haemorrhagic fever with renal syndrome
[42] Time Series Analysis: Forecasting and Control
[43] Understanding LSTM and its diagrams
[44] Implementation of stacking based ARIMA model for prediction of Covid-19 cases in India
[45] Time series forecasting using a hybrid ARIMA and neural network model
[46] Hybrid methodology for tuberculosis incidence time-series forecasting based on ARIMA and a NAR neural network
[47] Small-scale solar radiation forecasting using ARMA and nonlinear autoregressive neural network models
[48] A second wave of coronavirus may force renewed lockdowns. Pharmaceutical-technology.com

The authors declare that they have no conflict of interest. Further, the authors have no relevant financial or non-financial interests to disclose. No funds, grants, or other support was received.