key: cord-1002567-f8fpn0qz authors: Panda, M. title: Application of ARIMA and Holt-Winters forecasting model to predict the spreading of COVID-19 for India and its states date: 2020-07-16 journal: nan DOI: 10.1101/2020.07.14.20153908 sha: ea2e15c38ddb8287e074ba1d1711920962b8f71f doc_id: 1002567 cord_uid: f8fpn0qz

The novel coronavirus (COVID-19) epidemic has posed a global threat to human life and society. The whole world is working relentlessly to find solutions to fight this deadly virus and reduce the number of deaths. Strategic planning with predictive modelling and short-term forecasting, based on the data available worldwide, allows us to anticipate the future exponential behaviour of the COVID-19 disease. Time series forecasting plays a vital role in developing an efficient model for predicting the spread of this contagious disease. In this paper, ARIMA (autoregressive integrated moving average) and Holt-Winters exponential smoothing are used to develop an efficient 20-days-ahead short-term forecast model to predict the effect of the COVID-19 epidemic. The modelling and forecasting are done with the publicly available dataset from Kaggle, from the perspective of India and its states such as Odisha, Delhi, Maharashtra, Karnataka, Andhra Pradesh and West Bengal. The model is assessed with the correlogram, ADF test, AIC and MAPE to understand the accuracy of the proposed forecasting model.

Ever since the COVID-19 epidemic, which originated in Wuhan, China in late 2019, was discovered, many clinical investigations have been under way at the World Health Organization (WHO) and among medical practitioners and companies across the globe to develop vaccines and countermeasures against this disease. The disease has tremendously affected human life through illness, economic slowdown and death.
Considering the impact of this deadly disease on more than 210 countries, the WHO recently declared the COVID-19 pandemic [1] a global threat to public health and social life and advised preventive strategies to be followed strictly, including wearing masks, maintaining a social distance of 3 feet, washing hands repeatedly with sanitizer and avoiding large public gatherings, until well-tested drugs are discovered to counter this virus. COVID-19, caused by SARS-CoV-2, leads to severe respiratory problems for which the patient needs emergency care with ICU facilities, and in many cases carries a high mortality rate. Present clinical experience with swab tests of infected patients and abnormal chest CT imaging suggests that lung CT imaging may be a better diagnostic method than nucleic acid testing for early detection of such a contagious disease [2]. It is observed that SARS-CoV-2 (COVID-19) spreads more rapidly among humans through personal

Following the introduction in Section 1, Section 2 presents some of the COVID-19 related work available in the literature to date. Section 3 presents the materials and methods used in this research. Experimental results, performance evaluation and discussion are provided in Section 4, followed by the conclusion in Section 5.

Machine learning has found wide application across public health, including disease prediction and the development of relevant drugs [3]. Rough set theory is considered to be an effective method for dealing with health care data containing inconsistent and imprecise information [8]. Machine learning and deep learning have been applied successfully in medical imaging applications, cancer tumour classification, and tuberculosis (TB) disease prediction and analysis [9] [10] [11] [12]. In a recent review, Gamboa [13] discusses the usefulness of various stochastic models such as AR, ARIMA and GARCH
and then notes the scarcity of deep learning applications in time series forecasting, even though the use of neural networks for financial prediction is not new [14]. Deep learning has been applied to computerized tomography scans and radiograms of COVID-19 patients to detect the presence of viral infection [15] with good accuracy. Ribeiro et al. [16] applied short-term forecasting techniques to ten Brazilian states, using ARIMA, random forest, support vector regression (SVR), ridge regression and an ensemble approach to produce one- to six-days-ahead forecasts of cumulative confirmed COVID-19 cases, and ranked the models by lowest mean absolute error. Tuli et al. [17] use a robust iterative weighting model to statistically predict the severity of the COVID-19 spread, compare it with a baseline Gaussian model, and argue that a poorly fitting model could worsen the public health situation. Ghosal et al. [18] use autoregression, multiple regression and linear regression techniques for trend analysis of COVID-19 deaths at the 5th and 6th week in India and favour autoregression for its better prediction performance. Kumar, Gupta and Srivastava [19] present a review of the latest modern approaches that might be suitable for fighting the COVID-19 pandemic and call for further research on technological contributions to reduce the impact of this outbreak. Deb and Majumdar [20] explored the usefulness of time series forecasting for trend analysis at an early stage of the COVID-19 epidemic in different countries, to support countermeasure policies for dealing with the epidemic. Kucharski et al. [21] discuss the early spread pattern of COVID-19 across China using different available datasets and scientific methods, and explore its possible spread in other parts of China. Dey et al.
[22] use a visual method to understand COVID-19 disease patterns worldwide, for better containment planning.

The publicly available time-series data are obtained from Kaggle [23], which collects the data as and when they are released on the official website of the Ministry of Health and Family Welfare (MoHFW), Government of India. The MoHFW updates the national-level as well as state-level confirmed, death and recovered data. The dataset consists of three parts: (i) the complete list of all states arranged in day-wise details, (ii) national daily and cumulative details and (iii) state-wise daily details of confirmed, death and recovered cases. In all three parts, data are collected from 30th January to 29th June 2020. The complete list contains a total of 3511 instances, the national part, containing daily and cumulative cases, has 153 instances and the state-level data have 109 instances. The dataset has 10 attributes: time-stamp, state, latitude, longitude, confirmed cases, recovered cases, death cases and the newly reported cases for each of these, on a day-to-day basis. In India, the first case was reported on 30th January 2020 and the first death on 11th March 2020.

The Holt-Winters forecasting algorithm, developed by Charles Holt and Peter Winters, is useful for time series forecasting: it allows users to smooth the time series data and then use the smoothed series to forecast the quantities of interest. Exponential smoothing smooths a time series by assigning exponentially decreasing weights to older observations, so that past data count for progressively less. Exponential smoothing can be classified into three types. Simple (single) exponential smoothing is intended for univariate data without systematic structure, that is, with no trend and no seasonality. In this case, only a single parameter, α, is used as a smoothing factor, and it lies between 0 and 1.
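The single-parameter recursion just described can be illustrated with a minimal pure-Python sketch. The function name `simple_exponential_smoothing`, the initialisation S[0] = y[0] and the toy counts are assumptions made for illustration; they are not taken from the paper or from the Kaggle dataset.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series S for a univariate series y.

    Assumed initialisation: S[0] = y[0].
    Recursion: S[t] = alpha * y[t] + (1 - alpha) * S[t-1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if not series:
        return []
    smoothed = [float(series[0])]
    for y in series[1:]:
        # New level = weighted mix of the new observation and the old level.
        smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical daily counts, for illustration only (not from the dataset).
counts = [10, 12, 15, 14, 20]
slow = simple_exponential_smoothing(counts, alpha=0.2)
fast = simple_exponential_smoothing(counts, alpha=0.8)
```

With no trend or seasonality in the model, the last smoothed value serves as the forecast for every future step.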
A smaller value means slower learning, with more past observations contributing to the forecast, whereas a larger value means faster learning, with the forecast driven by the most recent observations. The next type is double exponential smoothing where, apart from α, another smoothing parameter, β, is used for the change in trend. There are two types of trend: an additive trend, which gives a linear trend, and a multiplicative trend, which gives an exponential trend. In multi-step, long-range forecasting the trend may become unrealistic; damping is then useful, as it shrinks the trend over the forecast horizon towards a straight line (no trend). Finally, triple exponential smoothing adds a seasonality parameter (γ) to α and β. This most recent variant of exponential smoothing, named after its inventors Charles Holt and Peter Winters, is useful for finding the changing pattern of level, trend and seasonality over time, using either additive or multiplicative seasonality. In this paper, Holt-Winters triple exponential smoothing [24] is used for forecasting, where α is the smoothing factor for the level; β is the smoothing factor for the trend; γ is the smoothing factor for the seasonality; y and S are the actual and smoothed observations; b is the trend factor; I is the seasonal index; F is the forecast m steps ahead; L is the cycle length; and t is the time period. The values of α, β and γ are to be chosen carefully so that the error is minimized.

The autoregressive integrated moving average (ARIMA) is a very popular and efficient time series forecasting model that can predict a value in a time series by

The working principle of ARIMA is laid out in three stages.

Step-1: Identification stage, where the input time series data are used to compute the autocorrelations (ACF) and partial autocorrelations (PACF) from the correlogram and partial correlogram, respectively, to check whether the data are stationary or non-stationary.
Next, if the time series data are non-stationary, they are converted to a stationary series using differencing. With differencing, one models the change from one time period to the next rather than the series itself. It is worth noting that over-differencing may increase the standard deviation. Finally, the white noise residual test allows hypothesis testing on the autocorrelations of the series for statistical significance.

Step-2: Estimation stage, where statistical analysis helps us to understand the adequacy of the estimated model and any changes not yet incorporated in it. This can be a cumbersome task for a long series, and hence a post-hoc outlier analysis is advised in such a case.

Step-3: Forecasting stage, which builds on the estimation stage to generate the best prediction of future time series values, along with the upper and lower bounds of the confidence interval for these forecasts, using the best ARIMA model.

Generally, ARIMA is written as ARIMA(p,d,q), where p is the order of the AR (autoregressive) process, d is the degree of differencing and q is the order of the MA (moving average) part. If d=0, then ARIMA(p,d,q) reduces to an ARMA(p,q) model.

Predictive modelling with statistical and machine learning methods uses past data to make better predictions about the future. To obtain a good prediction model, selecting the most appropriate methodology is of paramount importance; failing to do so may not only lead to poor decision making but also carries a greater chance of harming public health and social life. A forecasting model is a popular choice when dealing with past numerical data in order to predict new values based on those data.
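The role of the differencing order d in ARIMA(p,d,q) can be made concrete with a small pure-Python sketch. The helper name `difference` and the toy cumulative series are illustrative assumptions, not the paper's code or data; in practice a statistics library would normally handle this step.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'[t] = y[t] - y[t-1].

    Each pass shortens the series by one observation; d = 0 returns
    the series unchanged (the ARMA(p, q) case).
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


# Hypothetical cumulative counts with a quadratic trend (illustration only).
cumulative = [t * t for t in range(1, 8)]   # 1, 4, 9, 16, 25, 36, 49
once = difference(cumulative, d=1)          # still trending: 3, 5, 7, ...
twice = difference(cumulative, d=2)         # constant: 2, 2, 2, ...
```

In this toy case one pass of differencing still leaves a linear trend and a second pass removes it, which shows why d is chosen as the smallest order that yields a stationary series rather than fixed in advance.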
A short-term forecasting model for time series data is considered in this paper using the Holt-Winters and ARIMA architectures, to explore further insights for dealing with the COVID-19 epidemic from the perspective of India and five Indian states: Odisha, Andhra Pradesh, Delhi, Maharashtra and West Bengal. The proposed methodology is shown in Figure 1.

To make the COVID-19 time-series data stationary, the ADF (Augmented Dickey-Fuller) test is carried out with differencing d=0, 1 and 2, with the results shown in Table 1 and Table 2. It is seen that d=1 is a proper choice to make the series stationary at the 5% significance level, in comparison to d=0, given its low p-value. Further, to estimate the other two parameters of the model (p and q), the ACF and PACF are examined: the correlogram and partial correlogram of the time series and of its first difference are used for lag 1 to lag 30, as shown in Figure 2.

medRxiv preprint, https://doi.org/10.1101/2020.07.14.20153908; this version posted July 16, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

possible models with low AIC values and well-behaved residuals with zero mean and constant variance. The potential ARIMA models for final selection are shown in Table 3.

[Figure: (a) ACF for confirmed cases; (b) PACF for confirmed cases; (c) ACF for death cases; (d) PACF for death cases]

Table 4. Further, to check the accuracy of the proposed forecasting model, a comparison has also been performed for the daily confirmed and death cases for India from 1st July 2020 to 10th July 2020, as shown in Table 5. Based on all the above estimations, short-term forecasting is performed till August 31, 2020, for both confirmed and death cases, as shown in Figure 3 and Figure 4.

Table 6. Finally, the actual COVID-19 spread in India from 1st July 2020 to 10th July 2020 is compared with the estimated values obtained from the Holt-Winters model to assess the accuracy of the prediction, as shown in Table 7. The proposed forecasts using the Holt-Winters model for confirmed and death cases from the India perspective are shown in Figure 6 and Figure 7, respectively.

health care facilities etc. to name a few. Five Indian states, Maharashtra, Delhi, Andhra Pradesh, West Bengal and Odisha, are selected for regional-level short-term forecasting using the ARIMA and Holt-Winters models.
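The forecast-versus-actual comparisons reported for 1st to 10th July can be summarised with the mean absolute percentage error (MAPE), which the paper lists among its assessment criteria. The sketch below shows the standard MAPE definition in pure Python; the function name `mape` and the numbers are hypothetical and are not taken from Table 5 or Table 7.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    MAPE = 100 / n * sum(|actual[t] - forecast[t]| / |actual[t]|).
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical values, for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
error = mape(actual, forecast)   # per-day errors of 10%, 5% and 0%
```

Because each error is scaled by the actual value, MAPE allows series of very different magnitude, such as confirmed and death counts, to be compared on the same footing.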
[Figure panel: D. HWA-0.3,0.3, death cases]

A procedure similar to that used for India as a whole is adopted for forecasting the Indian states. The AIC and RMSE values for the best chosen ARIMA model, together with the point forecast and the upper and lower bounds of the 95% confidence interval, for both confirmed and death cases, are presented in Table 8.

All these results indicate that the COVID-19 pandemic is likely to stay for a longer period, and hence appropriate measures are needed to fight it successfully until some remedy is found to counter it.

ARIMA and Holt-Winters exponential smoothing methods are employed in this paper to produce 20-days-ahead forecasts of the COVID-19 cumulative confirmed, recovered and death cases from an Indian perspective, from 30th June to 19th July 2020. It is observed from the experiments that the proposed ARIMA models (ARIMA(4,1,1) for confirmed cases and ARIMA(3,1,1) for death cases) are the most accurate in forecasting future cases, with accuracies of 99.8% and 99.3%, respectively, in comparison to Holt-Winters with 87.9% and 95.6%. The India forecast reveals that the spread of COVID-19 will grow further in the long run and needs special attention from the people and the government, who should take precautionary measures to counter the disease effectively. As can be seen from the forecasts for the Indian states, there is growing concern because the trend is increasing in both confirmed and death cases, even in states such as Odisha where the numbers were initially low.
The forecast for India shows that there will be total 1886403.8 number of

A novel coronavirus outbreak of global health concern
Automated classification of usual interstitial pneumonia using regional volumetric texture analysis in high-resolution CT
What the cruise-ship outbreaks reveal about COVID-19
Clinical features of COVID-19 in elderly patients: A comparison with young and middle-aged patients
Predicting hepatitis B virus-positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine learning
Controlling testing volume for respiratory viruses using machine learning and text mining
Volatile fingerprinting of human respiratory viruses from cell culture
Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak
Automated classification of usual interstitial pneumonia using regional volumetric texture analysis in high-resolution CT
AI-assisted CT imaging analysis for COVID-19 screening: Building and deploying a medical AI system in four weeks
Machine learning applications in genetics and genomics
HealthFog: An ensemble deep learning-based smart healthcare system for automatic diagnosis of heart diseases in integrated IoT and fog computing environments
Deep learning for time-series analysis
Neural networks in finance: gaining predictive edge in the market
Imaging profile of the COVID-19 infection: radiologic findings and literature review
Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil
Sukhpal Singh Gill, Predicting the Growth and Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing
Linear Regression Analysis to predict the number of deaths in India due to SARS-CoV-2 at 6 weeks from day 0 (100 cases
A review of modern technologies for tackling COVID-19 pandemic
A time series method to analyze incidence pattern and estimate reproduction number of COVID-19
Early dynamics of transmission and control of COVID-19: a mathematical modelling study
Analyzing the Epidemiological Outbreak of COVID-19: A Visual Exploratory Data Analysis (EDA) Approach