key: cord-0858157-pkod7r08 authors: Khan, Farhan Mohammad; Gupta, Rajiv title: ARIMA and NAR based Prediction Model for Time Series Analysis of COVID-19 cases in India date: 2020-06-29 journal: nan DOI: 10.1016/j.jnlssr.2020.06.007 sha: de0fc4107379df8cebe6c6f5f29beb2c4c4f0f41 doc_id: 858157 cord_uid: pkod7r08

Abstract In this paper, we apply a univariate time series model to predict the number of COVID-19 infected cases that can be expected in the upcoming days in India. We fitted an Auto-Regressive Integrated Moving Average (ARIMA) model to data collected from 31st January 2020 to 25th March 2020 and verified it using data collected from 26th March 2020 to 04th April 2020. A nonlinear autoregressive (NAR) neural network was developed to compare the accuracy of the predictive models. The models were used for daily prediction of COVID-19 cases for the next 50 days, assuming no additional intervention. Statistics from various sources, including the Ministry of Health and Family Welfare (MoHFW) and http://covid19india.org/, are used for the study. The results showed an increasing trend in the actual and forecasted numbers of COVID-19 cases, with approximately 1500 cases per day, based on the data available as of 04th April 2020. The ARIMA (1,1,0) model was selected based on the Bayesian Information Criterion (BIC) values and the overall highest R² value of 0.95. The NAR model architecture comprises ten neurons and is trained using the Levenberg-Marquardt (LM) optimization algorithm, with the overall highest R² value of 0.97.

The outbreak of person-to-person transmissible pneumonia caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for COVID-19, has sparked a global alarm. The virus spreads mainly through droplets of saliva or nasal discharge when an infected person coughs or sneezes.
Milan Batista [1] used the logistic model to predict the total number of cases and the peak time of the coronavirus epidemic in China, South Korea, and the rest of the world, which gave a reasonable description of the epidemic in those places. A time series is a sequence of observations recorded at regular time intervals. In this paper, we apply a univariate time series model to predict the number of COVID-19 infected cases that can be expected in the upcoming days in India.

The ARIMA method is also known as the Box-Jenkins method [2]. The Box-Jenkins method relates to fitting a mixed ARIMA model to a given data set. ARIMA, short for 'Auto-Regressive Integrated Moving Average', is a class of models that explains a given time series based on its past values, i.e., its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values. An ARIMA model is characterized by three terms, p, d, and q, where p is the order of the AR term, q is the order of the MA term, and d is the number of differencing steps required to make the time series stationary. The most common approach to making a series stationary is to difference it, i.e., to subtract the previous value from the current value. Sometimes more than one round of differencing is required, depending on the complexity of the series. The value of d is therefore the minimum number of differencing steps needed to render the series stationary; if the time series is already stationary, then d = 0. The order p of the 'Auto-Regressive' (AR) term is the number of lags of Y used as predictors. The order q of the 'Moving Average' (MA) term is the number of lagged forecast errors that enter the ARIMA model. The principal objective of fitting an ARIMA model is to correctly identify the stochastic mechanism of the time series and to forecast future values.
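To make the role of d concrete, the following minimal Python sketch applies first differencing repeatedly. It is an illustration only: the function name `difference` and the sample counts are assumptions introduced here, not the study's data or its MATLAB code.

```python
def difference(series, d=1):
    """Apply first differencing d times: y'_t = y_t - y_(t-1)."""
    out = list(series)
    for _ in range(d):
        # Each pass subtracts the previous value from the current value.
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# Hypothetical cumulative case counts (illustrative, not the paper's data).
cumulative = [1, 3, 6, 10, 15, 21]

once = difference(cumulative, d=1)   # still trending upward
twice = difference(cumulative, d=2)  # constant, so no trend remains
```

In this toy series one round of differencing leaves a trend and a second round removes it, which is the sense in which d is the minimum number of differencing steps needed.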
Such approaches have also proven useful in other scenarios in which models for discrete time series and dynamic systems are built. The method is, however, not appropriate for lead times or for seasonal series with a large random component [3;4]. This univariate time series model has been used in the present study to forecast the number of COVID-19 cases in India. The steps of the ARIMA model building methodology are presented in the flow chart in Figure 1.

Nonlinear autoregressive (NAR) neural networks are designed to forecast a time series from its past values [5]. In this study, the NAR based interface was developed in the App Designer programming environment of MATLAB [6] using standard commands. With it, NAR based networks can be built, trained, and used to predict future values [7]. Our objective is to enable users without any programming experience to predict future values of a time series. The steps of the NAR model building methodology are presented in the flow chart in Figure 2.

The steps of ARIMA model building are as follows:
• Identification of the model: This involves identifying the most suitable lags for the AR and MA components and deciding whether the variable needs first differencing to induce stationarity. The autocorrelation function (ACF) and the partial autocorrelation function (PACF) are used to determine the best model.
• Estimation: This usually involves a least-squares estimation process.
• Diagnostic testing: This is usually a test for autocorrelation. If the model fails this test, the process returns to the identification step and begins again, usually with the addition of extra variables.
• Forecasting: ARIMA models are particularly useful for forecasting because of their use of lagged variables.

Once the model has been identified and its parameters can be estimated, the model is fitted with different sets of parameters.
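The estimation step listed above can be sketched for the ARIMA (1,1,0) case as an ordinary least-squares regression of each first difference on the previous one. This is a simplified illustration under stated assumptions: the function name `fit_arima_110` and the synthetic series are hypothetical, and the estimator used by the study's MATLAB software may differ (for example, it may use maximum likelihood).

```python
def fit_arima_110(series):
    """Least-squares estimate of c and phi in dy_t = c + phi * dy_(t-1) + e_t,
    where dy is the first difference of `series`."""
    dy = [b - a for a, b in zip(series, series[1:])]
    x, y = dy[:-1], dy[1:]          # lagged difference and current difference
    n = len(x)
    if n < 2:
        raise ValueError("need at least four observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("differenced series is constant; phi is not identified")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx                 # slope of the regression
    c = mean_y - phi * mean_x       # intercept of the regression
    residuals = [yi - (c + phi * xi) for xi, yi in zip(x, y)]
    return c, phi, residuals


# Synthetic noise-free series generated with c = 1.0 and phi = 0.5 (illustrative).
series = [0.0]
step = 1.0
for _ in range(10):
    series.append(series[-1] + step)
    step = 1.0 + 0.5 * step

c_hat, phi_hat, residuals = fit_arima_110(series)
```

Because the synthetic series contains no noise, the fit should recover the generating parameters almost exactly; on real case counts the residuals returned here are what the diagnostic step would examine.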
Each fitted model is then checked to confirm that the assumptions about the random error are satisfied, using statistical diagnostic tests and residual plots, which can be used to assess the suitability of the various models to the historical data. The model can be selected based on the value of specific criteria such as the normalized Bayesian Information Criterion (BIC).

The data on confirmed COVID-19 cases are used for the time series analysis. The descriptive statistics computed for the raw data include the count, mean, standard deviation, maximum and minimum values, and mode. The dataset is divided into training (70%), testing (15%), and validation (15%) subsets. The NAR model optimization is performed using three training functions: TRAINBR, TRAINLM, and TRAINSCG. The performance of the model is measured using the coefficient of determination (R²), which reflects the proportion of variance in the dependent variable explained by the independent variable. A higher value of R² indicates that the model explains more of the variation in the dependent variable. Mean Squared Error (MSE) is one of several ways of quantifying the difference between the predicted values and the actual values of the measured quantity.

Data on the number of cases reported in India up to 04th April 2020 were collected from the Ministry of Health and Family Welfare (MoHFW) and covid19india.org [8]. These data were plotted to show the trend, as shown in Figure 3.

ACF stands for autocorrelation function and PACF for partial autocorrelation function. Autocorrelation is the correlation between observations of a time series separated by k time units, and the ACF plot shows these correlations for each lag up to and including a chosen maximum lag. In the ACF plot, the number of lags is shown on the x-axis and the correlation coefficient on the y-axis.
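As an illustration of the autocorrelation just described, the sketch below computes the sample ACF of a series at lags 0 to k, together with the usual large-sample 95% bound of ±1.96/√n for white noise. The function names and the toy series are assumptions made for this example and do not come from the study.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)    # lag-0 sum of squares, used to normalise
    if c0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def white_noise_bound(n):
    """Approximate 95% bound for the ACF of white noise with n observations."""
    return 1.96 / math.sqrt(n)


# Toy alternating series (illustrative only): strong negative lag-1 correlation.
toy = [1, -1, 1, -1, 1, -1]
acf = sample_acf(toy, max_lag=2)
bound = white_noise_bound(len(toy))
```

Lags whose autocorrelation falls outside the bound are the ones usually read as significant when inspecting an ACF plot.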
Similarly, partial autocorrelations measure the strength of the relationship between two observations after the other terms are accounted for. In this case, the other terms are the intervening lags in the model. A partial autocorrelation is thus the correlation between an observation in a time series and an observation at an earlier lag, with the relationships of the intervening observations removed.

The autocorrelation function (ACF) given in Figure 4 shows that the series has positive autocorrelations out to a large number of lags, i.e., 10, so a higher order of differencing is required. Figure 5 shows that the lag-1 autocorrelation is small and patternless, so the series does not need a higher order of differencing. If the lag-1 autocorrelation is zero or negative, the series may be over-differenced. The partial autocorrelation function (PACF) of the differenced series shows a sharp cutoff and the lag-1 autocorrelation is positive, which suggests that the series is slightly under-differenced, so one or more AR terms should be added to the model. The lag beyond which the PACF cuts off indicates the number of AR terms. The ACF and PACF showed an irregular increasing pattern in the number of cases of COVID-19. Hence, ARIMA (p, d, q) models, which are apt for such a scenario, were applied.

When choosing a Box-Jenkins model, smaller values of the goodness-of-fit criterion are better. The most suitable Box-Jenkins model was selected based on the minimum Bayesian Information Criterion (BIC) value. In this study, the lowest BIC value is 354.4367, as given in Table 1, and the corresponding model is ARIMA (1,1,0), with the overall highest R² value of 0.95, as shown in Figure 6. The model is verified by checking its residuals, using the autocorrelations and partial autocorrelations of the residuals at different lags.
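The BIC-based selection described above can be sketched as follows, using the standard definition BIC = -2 ln L + k ln n, where L is the maximized likelihood, k the number of estimated parameters, and n the number of observations. The candidate log-likelihoods, parameter counts, and sample size below are hypothetical placeholders, not the values behind Table 1.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood implied by a set of residuals."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n   # ML estimate of the variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def bic(log_likelihood, num_params, num_obs):
    """Bayesian Information Criterion; smaller is better."""
    return -2 * log_likelihood + num_params * math.log(num_obs)


def select_by_bic(candidates, num_obs):
    """candidates maps a model label to (log_likelihood, num_params)."""
    scores = {
        label: bic(ll, k, num_obs) for label, (ll, k) in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fitted candidates (illustrative numbers only).
candidates = {
    "ARIMA(1,1,0)": (-170.0, 3),
    "ARIMA(2,1,1)": (-169.5, 5),
}
best_model, bic_scores = select_by_bic(candidates, num_obs=54)
```

In this made-up comparison the larger model fits slightly better but is penalized for its two extra parameters, so the smaller model attains the lower BIC.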
For a time series model, the residuals are the differences between the observed values and the corresponding fitted values. Residuals are useful for testing the model's suitability to capture the information in the data. The estimated autocorrelations and partial autocorrelations of the residuals at various lags are depicted in Figure 7. It is evident from the figure that the values at all lags are well within the 95% confidence limits. This implies that the residuals are random, i.e., white noise, indicating that the model is a good fit. None of the autocorrelation coefficients is statistically significant, implying that the residuals are not autocorrelated. The model with the lowest value of normalized BIC is ARIMA (1,1,0), which can be considered the best-fit model and can be used to generate the forecasts. Based on the ACF and PACF, the daily prediction of COVID-19 cases is calculated, as shown in Figure 8. The fitted model is used to generate minimum mean square error (MMSE) forecasts and the corresponding mean square errors over a 50-day horizon, as shown in Figure 9. The ARIMA forecast is compared with Monte Carlo forecasts for validation. The results show that the MMSE forecast and the mean of the simulations are practically indistinguishable, as shown in Figure 10. Slight differences occur between the theoretical 95% forecast intervals and the simulation-based 95% forecast intervals.

For the NAR model, a trial-and-error method is used to find the optimal network structure, in which a rigorous analysis with one hidden layer is carried out. Figure 11 shows the architecture of the NAR neural network. The back-propagation of error is carried out by Bayesian regularization (BR), which is a standard second-order nonlinear least-squares technique using the back-propagation process to increase the speed and efficiency of the training.
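The comparison between MMSE and Monte Carlo forecasts reported above for the ARIMA model can be illustrated with a small sketch for an ARIMA (1,1,0) process. The MMSE path sets every future shock to its expected value of zero, while the Monte Carlo mean averages many simulated paths with Gaussian shocks. The parameter values, function names, and starting series are assumptions chosen for illustration; they are not the fitted values from the study.

```python
import random


def mmse_forecast(series, c, phi, horizon):
    """Point forecast of an ARIMA(1,1,0) model with future shocks set to zero."""
    level = series[-1]
    diff = series[-1] - series[-2]
    path = []
    for _ in range(horizon):
        diff = c + phi * diff       # expected next first difference
        level += diff               # integrate back to the original scale
        path.append(level)
    return path


def monte_carlo_mean(series, c, phi, sigma, horizon, n_paths=2000, seed=0):
    """Average of simulated future paths with Gaussian shocks of std sigma."""
    rng = random.Random(seed)
    totals = [0.0] * horizon
    for _ in range(n_paths):
        level = series[-1]
        diff = series[-1] - series[-2]
        for h in range(horizon):
            diff = c + phi * diff + rng.gauss(0.0, sigma)
            level += diff
            totals[h] += level
    return [total / n_paths for total in totals]


# Hypothetical history and parameters (illustrative only).
history = [0.0, 1.0]
point = mmse_forecast(history, c=0.0, phi=0.5, horizon=3)
simulated = monte_carlo_mean(history, c=0.0, phi=0.5, sigma=1.0, horizon=3)
```

Because the model is linear in the shocks, the simulated mean should approach the MMSE path as the number of paths grows, which is consistent with the near-identical curves the authors describe.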
Figure 12 shows the performance graph obtained for the model; the best validation performance was 168.8105, obtained at epoch 34. The closed-loop form of the trained neural network is shown in Figure 13. After the closed-loop training period, a one-step-ahead form is created using an open loop, as shown in Figure 14. The error histogram is plotted in Figure 15; it is the histogram of the errors between the target values and the predicted values after the neural network has been trained. These error values reflect how the predicted values differ from the target values, and they may be negative. The results show that the maximum correlation occurs at zero lag; the time-series response of the NAR model is shown in Figure 16.

The data on COVID-19 cases were collected from 30th January 2020. Subsequently, the probable variation over the next 50 days is forecasted using the two techniques discussed above. In this study, machine learning-based algorithms are proposed to observe the transmission pattern of COVID-19. The performance of the forecast is evaluated in terms of Root Mean Square Error (RMSE) and the coefficient of determination, and the statistics are found to be within satisfactory limits. The accuracy of the ARIMA and NAR models is compared for the daily predicted confirmed COVID-19 cases in India, as shown in Figure 17. The curve shows clearly that by the end of June, daily new confirmed cases will be almost 1500 per day if conditions in the country remain the same.

As mentioned earlier, ARIMA model building consists of four steps. The first step is the identification of the model, which is carried out using the ACF and PACF (Figure 4 and Figure 5). The time-series response of the NAR neural network model was observed using open-loop and closed-loop networks. Both models revealed an increasing pattern in the number of cases of COVID-19.
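The difference between the open-loop and closed-loop forms mentioned above can be sketched independently of any particular network. In the open-loop form each prediction uses observed past values as input; in the closed-loop form the predictions are fed back as inputs, which is what allows forecasting many days ahead. In this sketch the one-step predictor is a hypothetical stand-in function rather than a trained NAR network, and all names are assumptions made for illustration.

```python
def make_lagged_pairs(series, n_lags):
    """Open-loop pairs: n_lags observed values as input, the next value as target."""
    return [
        (series[t - n_lags:t], series[t])
        for t in range(n_lags, len(series))
    ]


def closed_loop_forecast(predict_one_step, history, n_lags, horizon):
    """Multi-step forecast that feeds each prediction back in as an input."""
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        nxt = predict_one_step(window)
        forecasts.append(nxt)
        window = window[1:] + [nxt]   # drop the oldest lag, append the prediction
    return forecasts


def linear_trend_predictor(window):
    """Hypothetical stand-in for a trained network: extrapolates the last step."""
    return window[-1] + (window[-1] - window[-2])


pairs = make_lagged_pairs([1, 2, 3, 4], n_lags=2)
future = closed_loop_forecast(linear_trend_predictor, [1, 2, 3], n_lags=2, horizon=3)
```

Any fitted one-step model can be substituted for `linear_trend_predictor`; errors made early in a closed-loop forecast propagate to later steps, which is one reason long horizons are less reliable.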
Model parameters were estimated using MATLAB 2019b. For the forthcoming days, the models predicted a steady increase. The first case of COVID-19 in India was reported on 30th January 2020, and the number of cases is expected to increase by approximately 1500 per day over the next 50 days. This implies that the existing interventions in the country will not be adequate to control the ongoing epidemic. Based on the forecasting model and assuming that the current conditions of the COVID-19 outbreak persist, a quick and exponential increase in the number of cases is expected.

This study considered data on daily confirmed cases of COVID-19 in India. Based on the data available as of 04th April 2020, the daily predicted number of COVID-19 cases in India using the ARIMA model and the NAR neural network is approximately 1500 cases per day for the next 50 days. This prediction assumes that existing conditions continue. However, the situation can be improved by taking a few preventive steps. We intend to further improve our model by collecting more data in the upcoming days. In future work, the methodology will be used to forecast confirmed, deceased, and recovered cases of COVID-19.

None.

[1] Estimation of the final size of the COVID-19 epidemic
[2] Forecasting incidence of dengue in Rajasthan, using time series analyses
[3] Autoregressive Integrated Moving Average (ARIMA) Modeling of Time Series of Local Telephone Triage Data for Syndromic Surveillance
[4] Forecasting the trend in cases of Ebola virus disease in west African countries using auto regressive integrated moving average models
[5] Nonlinear autoregressive neural network with exogenous inputs based solution for local minimum problem of agent tracking using quadrotor
[6] MathWorks: Neural Network Toolbox Release 2018b - MATLAB & Simulink - MathWorks India, available at, last access: 05 th
[7] NAR based forecasting interface for time series analysis: T-seer
[8] Ministry of Health and Family Welfare