title: Short-term forecasting of the COVID-19 outbreak in India
authors: Mangla, Sherry; Pathak, Ashok Kumar; Arshad, Mohd; Haque, Ubydul
date: 2021-06-05
journal: Int Health
DOI: 10.1093/inthealth/ihab031

As the outbreak of coronavirus disease 2019 (COVID-19) spreads rapidly in different parts of India, a reliable forecast of the cumulative confirmed cases and the number of deaths can help policymakers decide how to use the resources available in the country. Recently, various mathematical models have been used to predict the outbreak of COVID-19 worldwide and also in India. In this article we use exponential, logistic, Gompertz growth and autoregressive integrated moving average (ARIMA) models to predict the spread of COVID-19 in India after the announcement of various unlock phases. The mean absolute percentage error and root mean square error were used to assess the goodness-of-fit of the growth models, and the Akaike information criterion was used for ARIMA model selection. Using COVID-19 pandemic data up to 20 December 2020 from India and its five most affected states (Maharashtra, Karnataka, Andhra Pradesh, Tamil Nadu and Kerala), we report 15-days-ahead forecasts for cumulative confirmed cases and the number of deaths. Based on available data, we found that the ARIMA model is the best-fitting model for COVID-19 cases in India and its most affected states.

The coronavirus disease 2019 (COVID-19) pandemic is spreading around the world. 1 Human-to-human transmission has been confirmed and, worldwide, measures have been taken to mitigate the spread of the virus. 2 This pandemic has placed an unprecedented burden on the global economy, healthcare and globalization through its effects on travel, event cancellations, employment, the food chain, academia and healthcare capacity.
3 According to the Worldometer website (https://www.worldometers.info/coronavirus/), as of 20 December 2020 there were 77.75 million cases globally and around 10 million confirmed cases in India. The first case of COVID-19 in India was reported on 30 January 2020. As the number of COVID-19 cases had increased significantly since the first case was reported, the government of India imposed a complete lockdown on 25 March 2020. Due to the unavailability of drugs to cure COVID-19, most countries are implementing stringent laws for the isolation and quarantine of infected people. India is the second most populous country in the world and contained 18% of the world's population as of 2019. 4 Because of this, it is important to predict the cumulative number of infected cases and associated deaths in India.

In the current situation of the COVID-19 pandemic, decision making and strategy planning rely on accurate forecasts of the disease. Numerous researchers have used various modelling techniques to forecast COVID-19 cases in different countries. These include short-term forecasting methods such as the autoregressive integrated moving average (ARIMA) model and Holt's exponential smoothing in India, 5 the simple mean-field model and the susceptible-infected-recovered-deaths model, 6 and the Gompertz, logistic and Bertalanffy models 7 in Italy, China and France. In the literature, researchers have used the Gompertz model to predict the growth of tumours, 8 bacteria 9 and birds, 10 whereas the logistic growth model has been used to model the outbreak of COVID-19 in 29 provinces of China and around the world 11 and to forecast the worldwide spread of COVID-19.
12 Similarly, the exponential growth model has been employed to model coal production in Nigeria 13 and population growth, 14 and the ARIMA model has been used to forecast the final size and spread of COVID-19 in Italy 15 and the cumulative confirmed cases of COVID-19 for Mainland China, Italy, South Korea, Iran and Thailand. 16

In this study, the cumulative number of infected cases and the total number of deaths in India after the announcement of different unlock phases are predicted using four different models: exponential, Gompertz, logistic growth and ARIMA. Mean absolute percentage error (MAPE) and root mean square error (RMSE) values were used to measure the goodness-of-fit of each model. The model with the smallest MAPE and RMSE values is considered best. Nejadettehad et al. 17 used MAPE and RMSE to compare the performance of the recurrent neural network, gated recurrent unit and long short-term memory neural network in short-term traffic flow prediction. Qian et al. 18 used the same metrics to compare an artificial neural network model (i.e. the Elman recurrent neural network) and a classical time series model (i.e. the seasonal autoregressive integrated moving average) for estimating and forecasting traffic death cases in China. A similar performance evaluation procedure was adopted by Huang and Hao 19 and Zhou et al. 20

India ranked second on the pandemic vulnerability index, and morbidity and mortality due to COVID-19 are increasing rapidly in India. The proposed models will provide a reliable forecast for outbreaks at the national and state level to implement interventions to curb the pandemic. 3

The daily reported cumulative numbers of infected cases and deaths from 30 January to 20 December 2020 were collected from the COVID19-India API website (https://api.covid19india.org/documentation/csv/). State-level data for the total number of confirmed cases were collected from 14 March to 20 December 2020.
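As a concrete reading of the MAPE and RMSE criteria used in this study, the two measures can be written out in a few lines. The sketch below is illustrative Python rather than the code used for the analysis; the function names `mape` and `rmse` and the toy numbers are assumptions made for this example.

```python
import math

def mape(actual, fitted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, fitted)) / len(actual)

def rmse(actual, fitted):
    """Root mean square error, in the same units as the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual))

# Toy numbers (not data from the study): reported values and model-fitted values.
actual = [100.0, 200.0, 400.0]
fitted = [110.0, 190.0, 400.0]

print("MAPE (%):", mape(actual, fitted))
print("RMSE:", rmse(actual, fitted))
```

MAPE is scale-free, which makes it convenient for comparing fits to case counts and death counts, whereas RMSE is in the units of the series and penalizes large misses more heavily.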
Irregularities in the daily reported cases affect the time series; hence the cumulative number of cases was analysed, which provides more stable and reliable results.

One of the most common applications of exponential functions involves growth and decay models. Exponential growth and decay appear in a range of physical processes, from population growth to radioactive decay. In infectious disease modelling, when a quantity $CC_t$ grows at a constant rate $r>0$, it has the form

$$CC_t = I_0 e^{rt},$$

where $CC_t$ is the cumulative number of infected cases at time $t$, $I_0$ is the initial number of cumulative infected cases and $r$ is the growth rate.

The logistic equation was introduced in the seminal work of Pierre-François Verhulst in 1844-1845. 21 The logistic growth model illustrates that population growth is limited by the carrying capacity, and the growth rate becomes smaller and smaller as the population size approaches the carrying capacity. Hence the logistic growth model assumes that the growth rate decreases linearly with size until it equals zero at the carrying capacity. Logistic growth models are mainly used in epidemiology, biology and environmental sciences. It is important to investigate the risk factors of a serious disease and to estimate the possibility of an outbreak based on those risk factors. The growth and transmission of an epidemic can be approximated by a logistic growth curve:

$$CC_t = \frac{M_c}{1 + e^{a - b(t - t_0)}},$$

where $CC_t$ is the cumulative number of confirmed cases at time $t$, $M_c$ is the predicted maximum of confirmed cases, $a$ and $b$ are fitting coefficients and $t_0$ is the time when the first case was reported.

The Gompertz model is a widely used and well-known technique for modelling population growth and has many applications in biology, epidemiology and environmental science. This model was introduced by Gompertz 22 as an animal population growth model to describe the extinction law of the population.
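To make the contrast between the exponential and logistic curves described above concrete, here is a minimal Python sketch. It assumes the common parametrizations $CC_t = I_0 e^{rt}$ and $CC_t = M_c / (1 + e^{a - b(t - t_0)})$; the exact form used in the analysis may differ, and the function names and all parameter values are hypothetical, chosen only for illustration.

```python
import math

def exponential_cases(t, i0, r):
    """Exponential growth: i0 cases at t = 0, growing at constant rate r."""
    return i0 * math.exp(r * t)

def logistic_cases(t, m_c, a, b, t0=0.0):
    """Logistic growth that saturates at the maximum m_c (assumed parametrization)."""
    return m_c / (1.0 + math.exp(a - b * (t - t0)))

# Hypothetical parameters, not estimates from the study.
m_c, a, b = 1_000_000.0, 8.0, 0.05
i0, r = 300.0, 0.05

for day in (0, 100, 160, 200, 400):
    exp_value = exponential_cases(day, i0, r)
    log_value = logistic_cases(day, m_c, a, b)
    print(f"day {day:3d}: exponential {exp_value:16.0f}   logistic {log_value:10.0f}")
```

With these values the logistic curve reaches half of `m_c` when `a - b * t = 0` (day 160) and then flattens, while the exponential curve keeps growing without bound. This is why an exponential model can fit an early outbreak reasonably well yet overshoot once growth slows.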
Also, the Gompertz model is a particular case of the Richards model. The growth of an epidemic is analogous to the growth of a population. In this article the Gompertz model was used to determine the cumulative number of COVID-19 infected cases in India. The mathematical form is:

$$CC_t = M_c \, e^{-a e^{-b(t - t_0)}},$$

where $CC_t$ is the cumulative number of confirmed cases at time $t$, $M_c$ is the predicted maximum of confirmed cases, $a$ and $b$ are fitting coefficients and $t_0$ is the time when the first case was reported.

ARIMA models are classical techniques of time series forecasting introduced by Box and Jenkins. 23 ARIMA $(p, d, q)$ models are a combination of autoregressive AR($p$) and moving average MA($q$) models, where $p$ represents the order of the autoregressive terms, $d$ is the degree of differencing and $q$ is the order of the moving average. The ARIMA $(p, d, q)$ model is given by:

$$y_t = a_1 y_{t-1} + \dots + a_p y_{t-p} + e_t + b_1 e_{t-1} + \dots + b_q e_{t-q},$$

where $y_t$ is the time series under consideration after differencing $d$ times, $e_t$ is the error at time $t$ and $a_i$ and $b_j$ are coefficients.

Estimation of parameters leads to a specific point estimate. In practice, point estimates frequently differ from the parameter's actual value. To account for this, the t-statistic was used to construct the CIs for the different model estimates in this article. The CI for the mean ($\mu$) was computed as:

$$\bar{x} \pm t_{1-\alpha/2} \, \frac{S}{\sqrt{n}},$$

where $\bar{x}$ is the sample mean, $n$ is the sample size, $t_{1-\alpha/2}$ is the quantile of Student's t-distribution with $n-1$ degrees of freedom and $S$ is the sample standard deviation.

The 15-days-ahead forecast of COVID-19 for India, from 21 December 2020 to 4 January 2021, was generated using four different methods: the exponential growth model, the logistic growth model, the Gompertz model and the ARIMA model. The cumulative numbers of confirmed cases and recovered cases in India from 30 January to 20 December 2020 are presented in Figure 1a and the cumulative number of deaths until 20 December 2020 is presented in Figure 1b. The 15-days-ahead forecast for the cumulative cases and deaths from each model is shown in Figure 2.
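The t-based CI for a mean described in the methods can be illustrated with a short Python sketch. This is an assumption-laden example rather than the study's procedure: the sample is made up, the function name `t_confidence_interval` is hypothetical, and the critical value is hard-coded from a standard t-table (0.975 quantile with 9 degrees of freedom, approximately 2.262) because the Python standard library does not provide Student's t quantiles.

```python
import math
import statistics

# Tabulated 0.975 quantile of Student's t with 9 degrees of freedom (for n = 10).
T_CRIT_975_DF9 = 2.262

def t_confidence_interval(sample, t_crit):
    """Two-sided CI for the mean: mean +/- t_crit * S / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width

# Made-up sample of 10 values (not data from the study).
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
lower, upper = t_confidence_interval(sample, T_CRIT_975_DF9)
print(f"95% CI for the mean: {lower:.3f} to {upper:.3f}")
```

For a different sample size the critical value has to be replaced with the quantile for `n - 1` degrees of freedom.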
Tables 1 and 2 show the expected number of cumulative confirmed cases from the four models with 95% CIs and Tables 3 and 4 show the expected number of deaths from the four models with their 95% CIs. Also, to improve the forecast, we fed the truncated time series to the models to generate a 15-days-ahead forecast from 21 December 2020 to 4 January 2021. R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria) was used for this analysis.

The exponential growth model was fitted to data from 30 January to 20 December 2020 for the cumulative number of infected cases and deaths. Using the logistic growth model, we expect 5.09 million (95% CI 5.06 to 5.11) cumulative infected cases and 0.098 million (95% CI 0.097 to 0.098) cumulative deaths in India on 4 January 2021, as shown in Tables 1 and 3. A log transformation was used for variance stabilization when applying this method to the cumulative number of cases from 30 January to 20 December 2020 and deaths from 12 March to 20 December 2020. Using the Gompertz model, it is expected that there will be 6.58 million (95% CI 6.56 to 6.60) cumulative infected cases and 0.108 million (95% CI 0.108 to 0.109) deaths in India on 4 January 2021. Tables 2 and 4 present the results of the forecasts for the model.

The non-linear least squares method was used to estimate the parameters of the three growth models: the exponential growth model, the logistic growth model and the Gompertz model. The $R^2$ values for the exponential, logistic and Gompertz models were 0.934, 0.974 and 0.988 for cumulative confirmed cases and 0.926, 0.961 and 0.976 for the total number of deaths, respectively. Hence we conclude that the Gompertz model provides the best fit among the three growth models for the considered dataset.

ARIMA models were fitted to daily infected cases from 30 January to 20 December 2020 to generate 15-days-ahead forecasts from 21 December 2020 to 4 January 2021.
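ARIMA models handle a non-stationary series by differencing it before the autoregressive and moving average terms are fitted; forecasts of the differenced series are then summed back to the original scale. The sketch below shows only this differencing step in plain Python, with hypothetical function names and toy numbers; it is not the fitting routine used in the study.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'd' in ARIMA(p, d, q))."""
    out = list(series)
    for _ in range(d):
        out = [out[i] - out[i - 1] for i in range(1, len(out))]
    return out

def undifference(first_value, diffs):
    """Invert one round of differencing, given the first value of the series."""
    out = [first_value]
    for step in diffs:
        out.append(out[-1] + step)
    return out

# Toy cumulative counts (not data from the study).
cumulative = [1, 3, 6, 10, 15]
daily = difference(cumulative, d=1)    # day-to-day increments
change = difference(cumulative, d=2)   # change in the increments
restored = undifference(cumulative[0], daily)

print("daily increments:", daily)
print("second differences:", change)
print("restored series:", restored)
```

Each round of differencing shortens the series by one observation, so a model with `d = 2` works with two fewer points than the original series.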
To check the stationarity of the time series, an augmented Dickey-Fuller (ADF) test and a Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test were performed, and the best model was selected based on the smallest Akaike information criterion (AIC) value. Since the time series was not stationary, one difference was taken to achieve stationarity. Meanwhile, two was added to every observation of the time series [...] (see Table 2). Figure 3a represents the residuals of the ARIMA (1, 2, 5) model. Residuals are randomly scattered around a zero mean with constant variance and follow an approximately normal distribution. Also, there is no serial correlation in the residuals.

A similar methodology was employed to predict the cumulative number of deaths using the ARIMA model, and ARIMA (0, 1, 1) was selected as the best model, with a corresponding AIC value of 1256.51. ARIMA (0, 1, 1) passed the Ljung-Box test with a p-value of 0.12. According to this model, there will be around 0.151 million (95% CI 0.148 to 0.155) expected deaths on 4 January 2021 (see Table 4). The residual plot for ARIMA (0, 1, 1) is given in Figure 3b. A fourth-root transformation was used to stabilize the variance of the residuals. A comparison of all four models is presented in Figure 4.

To estimate the parameters of the ARIMA model, the conditional sum of squares followed by maximum likelihood (CSS-ML) estimation method was used. First, the conditional sum of squares was minimized to find the starting values, then the maximum likelihood estimation method was applied.

Results of the comparative performance of the models using MAPE and RMSE are presented in Table 5. In Figure 2, it is seen that, of the four models, the ARIMA fitted values nearly coincide with the actual reported values (infections and deaths) from 30 January to 20 December 2020, indicating that the ARIMA model provides the better fit.
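The Ljung-Box check reported above asks whether the residual autocorrelations up to some lag are jointly zero. Its test statistic is $Q = n(n+2) \sum_{k=1}^{h} \hat{\rho}_k^2 / (n-k)$, where $\hat{\rho}_k$ is the lag-$k$ sample autocorrelation of the residuals. The Python sketch below computes only this statistic; the function names and the toy residuals are assumptions, and the final step of converting $Q$ into a p-value via a chi-square distribution is omitted.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom

def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    total = sum(autocorrelation(residuals, k) ** 2 / (n - k)
                for k in range(1, max_lag + 1))
    return n * (n + 2) * total

# Toy residuals with strong negative lag-1 correlation (not from the study).
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
q_stat = ljung_box_q(residuals, max_lag=1)
print("Ljung-Box Q:", q_stat)
```

A large `Q` relative to the chi-square reference distribution signals remaining serial correlation; a p-value above the chosen significance level, such as the 0.12 reported for ARIMA (0, 1, 1), is consistent with uncorrelated residuals.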
Thus the ARIMA model was employed for forecasting the cumulative number of infected cases at the regional level.

Forecasting at the regional level

In this study, five states (Maharashtra, Karnataka, Andhra Pradesh, Tamil Nadu and Kerala) were included for forecasting at the regional level. The time series of daily infected cases from 14 March to 20 December 2020 were used to provide the 15-days-ahead forecast. We found that Maharashtra will be the most affected state, with approximately 1.94 million cumulative cases, and Kerala will be the least affected among these states, with approximately 0.80 million cumulative cases. ARIMA models were found to be suitable at the regional level and the results of the 15-days-ahead forecasts are given in Table 6. A graphical representation of the forecast from 21 December 2020 to 4 January 2021 for Maharashtra, Karnataka, Andhra Pradesh, Tamil Nadu and Kerala is shown in Figure 5.

To study the performance of the model on the varied time series, i.e. after eliminating the days with zero reported cases, we computed MAPE and RMSE for the ARIMA model, since it was the best-fitting model. The comparison is shown in Table 7. The MAPE and RMSE values for the full-length time series are lower than those for the truncated time series. Hence we used the full-length time series with the ARIMA model for forecasting in India and its five states.

Considering the present situation in India, Internet of Things-based smart disease surveillance systems have the potential to be a major breakthrough in efforts to control the current pandemic. With much of the infrastructure already in place (i.e. smartphones, wearable technologies, internet access), the role this technology can play in limiting the spread of the pandemic involves only the collection and analysis of data.
24 Another use can be in understanding the characteristics of spatiotemporal clustering of the COVID-19 epidemic, as $R_0$ is critical in effectively preventing and controlling the pandemic. 25

Limitations and intervention scenarios

The spread of COVID-19 has been affected by a number of factors. Some studies have revealed how multiple variables contribute to the spread of the virus, 26 but with the inclusion of proper interventions, the spread of COVID-19 can be monitored. 27 However, it should be mentioned that this forecast depends strongly on the past pattern. The current situation in India shows a declining trend in daily reported infections. Our aim in this article is to compare the considered models in forecasting this pandemic based on the dataset used. Also, since there may have been more infections and deaths in the country than have been reported, this study is limited to the cases that have been reported. Simulations are beyond the scope of this article.

In this article we adopted the exponential, logistic, Gompertz and ARIMA models for short-term forecasting of the COVID-19 outbreak in India and its five most affected states. The results of all the considered methods show that the cumulative numbers of infected cases and deaths due to COVID-19 are increasing day by day in India and its most affected states. As per the prediction, around 3.42 million additional infected cases and about 0.006 million new deaths will be reported in India in the 15 days from 21 December 2020 to 4 January 2021. Among the four models, we found that the ARIMA model provided a better fit and gave a more reliable forecast using epidemiological data for India. After the announcement of various unlock phases, Maharashtra remains a highly affected state in India due to COVID-19.
An increase in the number of infected cases is directly related to an increase in the number of testing facilities and the interstate movement of people. Through updating these data and applying the models at the regional level, some valuable and far more accurate predictions can be obtained.

Real-time forecasts of the COVID-19 epidemic in China from
The SARS, MERS and novel coronavirus (COVID-19) epidemics, the newest and biggest global health threats: what lessons have we learned?
The impact of COVID-19 on globalization
United Nations Department of Economic and Social Affairs
Short-term forecasts of COVID-19 spread across Indian States until 1
Analysis and forecast of COVID-19 spreading in China, Italy and France
Prediction and analysis of coronavirus disease 2019
Dynamics of tumour growth
Modeling of the bacterial growth curve
Comparison of three nonlinear and spline regression models for describing chicken growth curves
Generalized logistic growth modeling of the COVID-19 outbreak in 29 provinces in China and in the rest of the world
Forecasting the worldwide spread of COVID-19 based on logistic model and SEIR model
Modeling of linear and exponential growth and decay equations and testing them on pre- and post-war coal production in Nigeria: an operations research approach
Population projection model using exponential growth function with a birth and death diffusion growth rate processes
An ARIMA model to forecast the spread and the final size of COVID-2019 epidemic in Italy. Health, Econometrics and Data Group (HEDG) Working Papers 20/07
Forecasting of COVID-19 confirmed cases in different countries with ARIMA models
Short-term demand forecasting for online car-hailing services using recurrent neural networks
Forecasting deaths of road traffic injuries in China using an artificial neural network
A novel two-step procedure for tourism demand forecasting
Time series model for forecasting the number of new admission inpatients
On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies
The early origins of the logit model
Time series analysis: forecasting and control
Defending against the novel coronavirus (COVID-19) outbreak: how can the internet of things (IoT) help to save the world?
COVID-19 in China: risk factors and R0 revisited
Pre-to-post lockdown impact on air quality and the role of environmental factors in spreading the COVID-19 cases - a study from a worst-hit state of India
A computational modelling study of COVID-19 in Bangladesh

Funding: None.
Ethical approval: Not required.
Data availability: None.
Supplementary data are available at International Health online.
All the authors contributed equally in conducting this study.