key: cord-297343-e7slzb78 authors: Gola, A.; Arya, R. K.; Animesh, A.; Dugh, R.; Khan, Z. title: Fine-tuned Forecasting Techniques for COVID-19 Prediction in India date: 2020-08-13 journal: nan DOI: 10.1101/2020.08.10.20167247 sha: doc_id: 297343 cord_uid: e7slzb78 Estimation of statistical quantities plays a cardinal role in handling of convoluted situations such as COVID-19 pandemic and forecasting the number of affected people and fatalities is a major component for such estimations. Past researches have shown that simplistic numerical models fare much better than the complex stochastic and regression-based models when predicting for countries such as India, United States and Brazil where there is no indication of a peak anytime soon. In this research work, we present two models which give most accurate results when compared with other forecasting techniques. We performed both short-term and long-term forecasting based on these models and present the results for two discrete durations. In December 2019, some people in Wuhan, China were infected by the novel coronavirus, named 2019-nCoV and since then, this outbreak has spread to more than 200 countries all over the world. This has led the World Health Organization (WHO) to declare it as international public health emergency. Governments of the nations affected by this pandemic are running around to formulate provisions and provide resources to handle this epidemic. Forecasting the infection rate for a nation can act as a huge asset in planning and formulation of policies for such nations. While no model can accurately forecast the rates of infection and mortality, attempts have been made to consider and analyse the strengths and shortcomings of many studies and models presented regarding the coronavirus. Whereas the forecast models used by the health department or Government of India were not disclosed, we can definitely continue with existing models in separate publications. Each of these models took different approaches and techniques to predict the future rates. There has been a profusion of available mathematical techniques to predict the infection rate for the currently ongoing Covid-19 crisis. In past research [18] , researchers evaluated the performance of majority of these techniques and concluded with two models which can be used for further purposes of estimating the number of cases affected by the coronavirus as these models gave the best predictions. These two models, exponential curve fitting and least square fitted model, can be used for short-term and long-term forecasting respectively. In this study, we implement these two techniques on an updated dataset taken from the official website of Ministry of Health and Family Welfare, Government of India [17] . We estimate the number of affected, death and recovered cases for 2 different durations -one from August 5 to September 3 i.e. for 4 weeks, and the other from August 5 to September 23 i.e. for 7 weeks. We believe this forecast would assist the government and certain other official authorities in preparing and organizing necessary resources to deal with this pandemic. This study is organized into five main sections. The paper starts with the general information about history and information of the disease. Section 2 provides the survey of the previously employed forecasting models to predict the confirmed cases in Indian context. We present our methodology in section 3 and discuss our findings and results in section 4. We conclude this research work in section 5 alongside providing scope for future improvements. Research on estimation of infection rate of Covid-19 has been quite prolific. Majority of these revolve around traditional machine learning methods and neural network-based models. R. Sujath et. al [11] and Ajit Kumar Pasayat et. al [16] used linear regression models while Gaurav Pandey et, al [12] employed polynomial regression technique to predict the Coronavirus cases in its early months. R. Sujath et. al [11] also used multi-layer perceptron models alongside their stochastic vector autoregression (VAR) time-series model. Another case of using complex learning models is Anuradha Tomar et. al [14] applying a LSTM model to forecast the number of cases. In case of small epidemics, Meyers [1] studied the forecasted spread using a model of the Susceptible-Infected-Recuperated (SIR). In the simulation COVID-19 diffusion experiments, Wu et. al [2] applied the Susceptible-Exposed-Infectious-Recovered (SEIR) Model. Anastassopoulou Al. [3] performed a simulation study of situation COVID-19 at the very initial stage of pandemics, a model of susceptibleinfectious-recovered-dead (SIRD) was used. Ghosh et al [4] used a pandemic model of Susceptible Infectious Susceptible (SIS) to forecast spread of the COVID-19 in India. Kumar et. al [5] , in order to analyse the Indian scenario, has used the ARIMA time series analysis technique. Their predictions were very similar to the later reported actual values. Basu [8] has been researching time-based viral spread in India on his own basis. According to his predictions, in early June, total number of cases in India was estimated to cross 200,000 and that prediction was quite . CC-BY-NC 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 13, 2020. . accurate. Sudip Ghosh et. al [20] used linear square fitted modelling while Hemanta Kumar Baruah et. al [19] fitted an exponential curve for their predictions. Short-term forecasting can be done based on elementary analytical approaches instead of diving into complex architectures like disease modelling or neural networks. Previous research [18] has shown that for shorter durations, simplistic curve fitting models achieve better accuracy than regression and pandemic models. Observing the patterns of number cases in countries such as China, Spain and Italy we can infer that the natural infection rate curve will follow a non-linear path initially till it hits its peak and begins to subside. Nations such as India, United States of America and Brazil are still in the nonlinear portion of the plot and due to the uncertainty of their peak point, forecasting for such countries can only be done for short durations. Recovered Deaths Observing the pattern for India, we can discern a definite exponential trajectory, which can be exploited by studying the time series data in an inverted fashion and then instituting a numerical model established on the latter part of the data. Let P(t) be the total number of affected cases. Q(t) be the total number of death cases, and R(t) be the total number of recovered cases at a given time t. To verify out assumption of the curve being exponential, we took the natural logarithm of P(t), Q(t) and R(t). The resulting plots shown in Fig. 2 are linear for each curve, thus establishing our assumption as legitimate. Fitting against exponential functions is exceedingly fragile because tiny variations in the exponent can result in large differences in the result. Optimising is done across many orders of magnitude, and errors near the origin are not equally weighted compared to errors higher up the curve. The simplest way to handle this is to convert our exponential data to a linear form using a natural logarithm transformation: Considering the equations of curve to be: . CC-BY-NC 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 13, 2020. . where a and b are constants. Taking natural logarithm of both sides: This allows us to use the linear curve fitting method instead of the slower polynomial fitting method which when employed on large values is prone to result in overflow errors. We would later transform the data back into linear space for analysis. We used the polyfit() function of the numpy module placed in Python and got the coefficients' values as: The covariance matrices obtained for each case are shown in Fig. 3 . A methodology widely used to perform regression analysis is the least square regression method. This is a statistical technique to determine the best line of fit between an independent and a dependent variable. The 'least-square method' combines measurements in order to extract the parameter estimates that define the curve that best matches the results. Using the least square rule, given the set of N (noisy) measurements fi, i∈1, N, which are to be applied to the curve f(a), where 'a' is the vector of the parameter values, we seek to minimize the square of the difference between the measurements and the values of the curve to provide an approximation of the parameters 'a^' according to (7) When we fit our data to the polynomial function graph, the polynomial curve fit is. The same technique of smallest squares is used to identify a certain degree polynomial which has a minimum overall error: where M is the order of the polynomial is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 13, 2020. . https://doi.org/10.1101/2020.08.10.20167247 doi: medRxiv preprint We obtain a fit by minimizing an error function -sum of squares of the errors between the predictions y(xn,w) for each data point xn and target value tn. Here, polynomial of degree 6 is used for fitting the dataset. Observing the non-linear pattern in India's COVID-19 infection rate, we employed curve fitting techniques to predict the number of confirmed, death and recovered cases for both short-term and long-term durations. Due to the unpredictable nature of the exponential graph, small modifications in input can lead to abrupt changes in our output. Thus, we used exponential curve fitting for short-term forecasting for a duration of 4 weeks starting from August 8, 2020 to September 4, 2020. Polynomial regression modelling is used for long-term forecasting for a duration of 7 weeks starting from August 8, 2020 to September 24, 2020. Results for each case are presented in Tables 1 and 2 while their respective plots are demonstrated in Figs. 5 and 6. As per our forecasts, the total number of cases in India would cross 30,00,000 by August 15, 2020. By August 25, 2020 it would cross 40,00,000, and around September 1, it should exceed the 50,00,000 value. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 13, 2020. . https://doi.org/10.1101/2020.08.10.20167247 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 13, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 13, 2020. Building upon the previous research [18] , current study implemented two numerical models to forecast the number of cases related to COVID-19 in India, namely -exponential curve fitting and least square fitted model. Both of the models forecasted an upward of 30 lakhs cases and 40,000 deaths for the upcoming months. Unless there is a sudden peak in the graph and it begins to subside, we are going to face an enormous challenge to handle this pandemic. To prevent a dearth of required resources, government and official organisations should plan factoring in the forecasted cases. This study can be expanded to establish other mathematical and regression techniques for the forecasting of the COVID-19 cases in future. This would be essential in having a diverse assortment of prediction techniques to consider while developing new policies. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 13, 2020. . https://doi.org/10.1101/2020.08.10.20167247 doi: medRxiv preprint Contact network epidemiology bond percolation applied to infectious disease prediction and control Nowcasting and forecasting the potential domestic and international spread of the 2019 nCoV outbreak originating in Wuhan, China: A modelling study Data-based analysis, modelling and forecasting of the COVID-19 outbreak COVID-19 in India: State-wise analysis and prediction, medRxiv preprint Forecasting COVID-19 impact in India using pandemic waves nonlinear growth models Short term forecasts of COVID-19 spread across Indian States until Short term forecasts of COVID-19 spread across Indian States until 29 May, 2020, under the worst-case scenario Model based case studies in the UK, the USA and India Total coronavirus cases in India, Publishing Date The current COVID-19 spread pattern in India A machine learning forecasting model for COVID-19 pandemic in India SEIR and Regression Model based COVID-19 outbreak predictions in India Outbreak trends of CoronaVirus (COVID-19) in India: A Prediction Prediction for the spread of COVID-19 in India and effectiveness of preventive measures Time Series Analysis and Forecast of the COVID-19 Pandemic in India using Genetic Programming Predicting the COVID-19 positive cases in India with concern to Lockdown by using Mathematical and Machine Learning based Models Review of Forecasting Models for Coronavirus (COVID-19) Pandemic in India during Country-wise Lockdown \Nearly Perfect Forecasting of the Total COVID-19 Cases in India: A Numerical Approach Situation Assessment and Prediction of Corona Virus in India