key: cord-0905038-w7zkxaaf
authors: Hota, H.S.; Handa, Richa; Shrivas, A.K.
title: COVID-19 pandemic in India: Forecasting using machine learning techniques
date: 2021-05-21
journal: Data Science for COVID-19
DOI: 10.1016/b978-0-12-824536-1.00030-7
sha: 3672f3dd809d613788b41903206bd0fdc9a05fd6
doc_id: 905038
cord_uid: w7zkxaaf

Forecasting the novel coronavirus disease 2019 (COVID-19) pandemic involves high uncertainty and may be affected by measures taken by the government to fight the disease. This research explores machine learning (ML) techniques to forecast the epidemiological trend of COVID-19 in India. We used 22 ML algorithms to develop forecasting models and selected the four best ones on the basis of their mean absolute percentage error (MAPE). Feature extraction and feature selection techniques were also employed to improve performance with cumulative and daily data obtained from Mar. 2 to Apr. 25, 2020. Because of the linear nature of cumulative data, the model built with these time series data using the extra tree regressor performs best, with an MAPE of 0.498, 0.240, and 0.430 for confirmed cases, recovered cases, and deaths, respectively, compared with an MAPE of 1.377, 1.302, and 0.488, respectively, for the model built with daily data. Moreover, the study confirms that the models perform well at the validation stage, with an MAPE of 4.123, 5.411, and 4.553 for confirmed cases, recovered cases, and deaths, respectively, using the model built with cumulative data and an MAPE of 6.261, 7.576, and 6.273, respectively, using the model built with daily data. On the basis of the selected models, a 15-day forecast of confirmed cases, recovered cases, and deaths from COVID-19 was produced that can be validated in the near future. However, the outcome depends on the precautionary measures taken by the central and state governments as well as by individuals, including social distancing; self-isolation; restrictions on bus, rail, and air transport; the closing or opening of schools, colleges, and markets; extension of the lockdown period; privileges granted during lockdown; and other measures, as well as on whether guidelines issued by the government from time to time are followed.

Various statistical and mathematical analyses and studies are ongoing to forecast the future trend of COVID-19 in India, and models are being developed to predict the situation N days ahead. Because of the nonlinear behavior of COVID-19 data, various ML techniques could be useful to develop a robust forecasting model. Research shows that computational intelligence methods are a basis for constructing predictive models. Among them, the neural network is useful for predicting time series data because it can learn from data and capture the various dynamics of time series data [5]. Evolutionary computation, fuzzy logic, and other models are also important owing to their principal differences from existing mathematical approaches. Hybrid models of various intelligent techniques are also widely used in forecasting, because accuracy and efficiency are the criteria on which researchers focus most [6].

Most work has been based either on trends already experienced by other countries, such as China, or on statistical theory and analysis. Roosa et al. [7] worked on the real-time forecasting of the COVID-19 pandemic in China and developed models for 5-, 10-, and 15-days-ahead forecasting based on the cumulative number of confirmed cases in Hubei and other provinces of China using three different techniques: the Richards model, the subepidemic model, and the generalized logistic growth model (GLM). They concluded that each model predicts that the pandemic has reached saturation in Hubei and other provinces of China. Benvenuto et al. [8] applied an autoregressive integrated moving average (ARIMA) model to Johns Hopkins epidemiological data to predict the epidemiological trend of the prevalence and incidence of COVID-19. Abdulmajeed et al. [9] proposed an online forecasting mechanism that streams data from the Nigeria Center for Disease Control and provides updated COVID-19 forecasts every 24 h. The authors combined ARIMA, Prophet (an additive regression model developed by Facebook), and a Holt-Winters exponential smoothing model with generalized autoregressive conditional heteroscedasticity (GARCH). In other work, Tuli et al. [10] applied an improved mathematical model to analyze and predict the growth of the COVID-19 epidemic and deployed the model on a cloud computing platform for more accurate and real-time prediction of the growth behavior of the epidemic. Ardabili et al. [11] presented a comparative analysis of ML and soft computing models to predict the COVID-19 outbreak and found ML to be an effective tool to model the outbreak. Zhou et al. [12] explained the challenges to geographic information systems (GIS) with big data on COVID-19.
Other authors [13] analyzed and forecast the spread of COVID-19 in China, Italy, and France, and concluded that the infection rate needed to be cut drastically and quickly to observe an appreciable decrease in the pandemic peak and mortality rate.

Tobías et al. [14] analyzed the trends of incident cases, deaths, and intensive care unit admissions in Italy and Spain before and after their respective national lockdowns using an interrupted time-series design. Data were analyzed with quasi-Poisson regression using an interaction model to estimate the change in trends. Chintalapudi et al. [15] highlighted the importance of lockdown and isolation by forecasting registered and recovered cases of COVID-19 after 60 days of lockdown in Italy using a seasonal ARIMA forecasting package in R. Koczkodaj et al. [16] predicted the number of cases of COVID-19 outside China. Their approach was based on a heuristic solution and assumes that the current trend can continue for the next 17 days. Tomar and Gupta [17] used deep learning techniques to predict the spread of COVID-19 in India for the next 30 days and suggested prevention measures. Fong et al. [18] presented a case study using Composite Monte Carlo (CMC) simulation enhanced by a deep learning network and fuzzy rule induction to gain better stochastic insights about the development of the epidemic.

According to available resources and published articles, few researchers have worked on forecasting cases of COVID-19, especially for India. India is a diverse and highly populous country, so forecasting cases of COVID-19 is highly uncertain and is itself a nonlinear problem that depends on many factors. This research extends the scarce research already done in this direction, with the objective of estimating the future situation of COVID-19 in India in terms of confirmed cases, recovered cases, and deaths using an ML technique called regression [19]. A comparative analysis of various ML algorithms, called estimators, was also performed. This research will facilitate an assessment of the critical situation in the country and will help the government to make appropriate decisions to optimize and use available health care infrastructure as well as to manage and deploy health personnel in affected areas. The outcome of the proposed research work is a 15-day-ahead forecast of confirmed cases, recovered cases, and deaths, which can be extended to N-days-ahead forecasting. Results were compared between two models developed using cumulative data and daily data obtained elsewhere [20]. Empirical results show that the model built with cumulative data outperforms the model built with daily data, with a mean absolute percentage error (MAPE) in an acceptable range.

This section presents details about the dataset used for ML and the preprocessing stage, with a short description of the ML algorithms used to develop the ML-based forecasting model.

Data are an essential ingredient of ML algorithms. COVID-19-related data are time series data collected at a fixed interval of 1 day. An ML-based approach generally needs a large sample size to train the models, but in the current situation only limited data are available; it is nevertheless important to predict the future impact of COVID-19 so that the government can make appropriate decisions and manage resources in the various parts of the country. Details about the data used in this research are shown in Table 27.1; the data were collected from www.covid19india.org [20]. Two types of data, cumulative and daily, were collected from Mar. 2, 2020 to Apr. 25, 2020 to build forecasting models. Figs. 27.1 and 27.2 show bar graphs of the collected cumulative and daily data, respectively, for all three cases.

Preprocessing techniques are used to remove inconsistencies from the dataset, which improves the quality of the data. Preprocessing is an essential step, especially for an ML-based forecasting model, because it supports the smooth convergence of the learning curve [21]. The preprocessing techniques that were adopted are discussed next.

Normalization is the process of rescaling data to the range [0, 1]. The dataset is normalized by dividing each sample by the highest value of the sample data:

$$X_{new} = \frac{X}{X_{max}}$$

in which $X$ is the daily number of COVID-19 cases, $X_{max}$ is the highest number of cases, and $X_{new}$ is the normalized value.
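The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's code: the function name `normalize_by_max` and the sample counts are hypothetical, and the handling of an all-zero series is an added assumption.

```python
def normalize_by_max(values):
    """Scale non-negative values into [0, 1] by dividing by the series maximum."""
    x_max = max(values)
    if x_max == 0:
        # Assumption: an all-zero series has nothing to scale, so return zeros.
        return [0.0 for _ in values]
    return [v / x_max for v in values]


# Hypothetical daily counts, for illustration only (not data from the study).
daily_cases = [2, 5, 10, 20, 40]
scaled = normalize_by_max(daily_cases)
print(scaled)
```

Because every value is divided by the same maximum, the largest observation maps to 1 and the relative proportions between days are preserved.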

The sliding window approach works well when only a small number of samples is available for training and testing; it accumulates all values within the window. It is a temporary approximation of the actual values of the time series data [22, 23]. The size of the window and segment is increased until the lowest approximation error is reached. The sliding window accumulates the historical time series data [24, 25] to predict the next-day value. After the first segment is selected, the next segment is selected from the end of the first segment. The process is repeated until all the time series data are segmented. In this research, we use a window size of 7, based on previous work and experience in developing ML models.
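One possible reading of the sliding window step is sketched below. It assumes that the window advances one day at a time, so that each group of 7 consecutive values is paired with the value of the following day; the chapter's exact segmentation may differ. The helper name `make_windows` and the sample series are hypothetical.

```python
def make_windows(series, window_size=7):
    """Return (inputs, targets): each input holds `window_size` consecutive
    values and its target is the value that immediately follows them."""
    inputs, targets = [], []
    for start in range(len(series) - window_size):
        inputs.append(series[start:start + window_size])
        targets.append(series[start + window_size])
    return inputs, targets


# Hypothetical cumulative counts, for illustration only.
series = [1, 2, 4, 7, 11, 16, 22, 29, 37, 46]
X, y = make_windows(series, window_size=7)

for window, target in zip(X, y):
    print(window, "->", target)
```

A series of length 10 with a window of 7 yields three training pairs, which shows why this construction is attractive when only a few weeks of observations exist.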

Feature extraction and feature selection [26], a straightforward and important aspect of ML, are applied to the dataset [27] to extract features from the existing feature space and select the best features to develop a robust forecasting model. Because COVID-19 time series data consist of only two fields, namely, date and number of cases (confirmed, death, and recovered), we need to generate features based on well-known and tested statistical formulas to improve performance, as explained in Table 27.2.

In this research, a new feature space was generated with the help of feature extraction. Features are extracted using technical indicators [28] such as moving average, exponential moving average (EMA), weighted moving average, relative strength index, standard deviation, variance, and rate of change, as shown in Table 27.2. Feature selection, on the other hand, is the process of finding relevant features by removing irrelevant features from the original feature space. In this work, we selected the best features using a ranking-based feature selection technique. Of eight features, the four best ones (cases [original feature], EMA, standard deviation, and variance) were selected. These reduced feature space data were used to forecast COVID-19-related cases and also work well compared with the original feature space data.
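Several of the indicators named above can be computed with the Python standard library, as in the hedged sketch below. The window length of 3, the EMA smoothing factor 2 / (window + 1), the seeding of the EMA with the first observation, and the use of population (divide by n) standard deviation and variance are all assumptions made for illustration; the helper names and sample counts are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def rolling(series, window, func):
    """Apply `func` to each trailing window; positions without a full window are None."""
    return [
        func(series[i - window + 1:i + 1]) if i >= window - 1 else None
        for i in range(len(series))
    ]


def ema(series, window):
    """Exponential moving average, assuming smoothing factor 2 / (window + 1)
    and seeding with the first observation."""
    alpha = 2 / (window + 1)
    out = [float(series[0])]
    for value in series[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def rate_of_change(series, n):
    """Percent change relative to the value n periods earlier; None where undefined."""
    out = []
    for p in range(len(series)):
        if p < n or series[p - n] == 0:
            out.append(None)
        else:
            out.append((series[p] - series[p - n]) / series[p - n] * 100)
    return out


# Hypothetical case counts, for illustration only.
cases = [10, 12, 15, 20, 26, 30, 40, 50]
window = 3
features = {
    "cases": cases,
    "sma": rolling(cases, window, mean),
    "ema": ema(cases, window),
    "std": rolling(cases, window, pstdev),
    "var": rolling(cases, window, pvariance),
    "roc": rate_of_change(cases, 1),
}
```

Each entry of `features` is aligned with the original series, so the columns can be stacked into a feature matrix once the leading rows without a full window are dropped.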

PyCaret, a recent Python library, provides a large collection of ML techniques. Based on an exhaustive search of ML algorithms, 22 ML algorithms were selected automatically by feeding in COVID-19 time series data. Models were trained using 10-fold cross-validation so that all of the samples are used for training as well as testing, owing to the small sample size. Many ML techniques work well for forecasting time series data, including Bayesian networks, support vector machines, and perceptrons, to list a few [24, 29, 30]. A simple but widely used ML technique is regression [31, 32], a supervised ML technique that estimates the relationship between one dependent variable and one or more independent variables [30]. Regression analysis is a statistical methodology most often used for numeric prediction, although other methods also exist. Regression also encompasses the identification of distribution trends based on available data [27]. The objective of this technique in ML is to predict continuous values such as time series data. ML algorithms developed through statistical learning theory have been widely applied in nonlinear regression estimation. When the regression of Y on X is linear, the regression line sometimes does not pass through the origin. In such conditions, it is more appropriate to use a regression-type estimator to estimate the expected value; these estimators are various ML algorithms. Various regression estimators, along with a few basic algorithms, have been used here and are listed in Table 27.3.

Moving average: defined over the current value $x_i$ and the number of values $n$. Moving averages visualize the average value of time series data over a specific period of time.
Relative strength index: a momentum indicator that measures the magnitude of recent changes in data.
Standard deviation: defined over the current value $x_i$, the mean value $\bar{x}$ of $x$, and the number of values $n$. It measures the dispersion of a dataset relative to its mean and is the square root of the variance.
Variance: defined over the current value $x_i$, the mean value $\bar{x}$ of $x$, and the number of values $n$. It determines the spread of the data compared with the mean value.
Rate of change (ROC): $\mathrm{ROC} = \frac{C_p - C_{p-n}}{C_{p-n}} \times 100$, where $C_p$ is the value of the most recent period and $C_{p-n}$ is the value $n$ periods before it. It measures the percent change in data from one period to the next.

Experimental work was carried out using a Python library (PyCaret). The regression module of PyCaret is a supervised ML module used to forecast continuous values. It has over 22 ML algorithms and various plots to analyze the performance of models.

Ridge regression: A way to create a model when the number of predictor variables is more than the number of observations.
Lasso regression: A type of linear regression that uses shrinkage when data values are shrunk toward a midpoint, such as the mean.
Linear regression: A linear approach that shows the relationship between a dependent variable and one or more independent variables.
Least angle regression: A linear regression that selects the model to solve the problem of overfitting.
Bayesian ridge: Estimates a probabilistic model of the regression problem.
Theil-Sen regressor: Selects the median of the slopes of all lines for robustly fitting a line.
Random sample consensus: Works iteratively and estimates the parameter of a mathematical model from a set of observed data that contains outliers.
Huber regressor: A regression technique that is robust to outliers.
Passive aggressive regressor: Online learning algorithms for both classification and regression.
Orthogonal matching pursuit: A greedy compressed sensing recovery algorithm that selects the best-fitting column of the sensing matrix in every iteration.
AdaBoost regressor: A machine learning meta-algorithm that can be used in combination with many types of learning algorithms to improve performance.
K neighbors regressor: A simple algorithm to predict the target value based on a similarity measure on all available cases.
Elastic net: A linear regression model useful with multiple correlated features and likely to pick both.
Support vector machine: Supervised learning model associated with multiple learning algorithms for data analysis for classification and regression.

A flow diagram of the experimental design is depicted in Fig. 27.3, showing the seven main components: data collection, data normalization, feature extraction, feature selection, data partition, model development, and model selection and validation. As stated in Section 2.1, data for confirmed and recovered cases and deaths were collected. The data are integer valued; they were first normalized to the range [0, 1] and then partitioned into training and testing samples. Owing to the small size of the available data, 10-fold cross-validation [33] was used for the dynamic partitioning of data and to improve the performance of the model. As stated in Section 2.3, feature extraction and feature selection were also employed. Feature selection was performed by the Python tool, and four features (actual data, EMA, standard deviation, and variance) were selected as the best of the eight extracted features. ML models were then developed with 20 ML algorithms (estimators), as explained in Section 3, and performance was measured based on the MAPE value, from which the best four models were selected. These were extra tree regressor, extreme gradient boosting, linear regression, and AdaBoost regressor. Finally, 
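The evaluation loop described above, 10-fold cross-validation scored by MAPE, can be illustrated without any ML library. In this sketch a simple least-squares line stands in for the chapter's estimators, which is a deliberate simplification; the fold assignment (sample i goes to fold i mod k), the function names, and the synthetic series are assumptions made for illustration. MAPE is expressed here in percent.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cross_validated_mape(xs, ys, k=10):
    """Average MAPE over k folds; sample i is held out in fold i % k."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [intercept + slope * xs[i] for i in test]
        scores.append(mape([ys[i] for i in test], preds))
    return sum(scores) / k


# Synthetic, perfectly linear series over 55 days, for illustration only.
days = list(range(1, 56))
cases = [100 + 25 * d for d in days]
cv_score = cross_validated_mape(days, cases, k=10)
print(f"10-fold MAPE: {cv_score:.6f}%")
```

Every sample is used for testing exactly once and for training in the other nine folds, which is the property that makes k-fold cross-validation useful for small datasets. Ranking several candidate estimators by their cross-validated MAPE and keeping the lowest four would mirror the selection step described in the text.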

The performance of the model was analyzed across different aspects, as discussed next. 

According to the results at the validation stage shown in the tables in Section 4.2, a comparison was made, as shown in Tables 27.7-27.9, between the actual (official) reported data and the values predicted by the best model, along with the percent error. These are also shown graphically as bar graphs in Fig. 27.4.

The model was also analyzed graphically through residual plots and prediction error plots. The plotting function takes a trained model object and returns a plot based on the test dataset, as shown in Figs. 27.5-27.10. A prediction error plot depicts actual targets against the values predicted by the model, which shows the variance in the model. Using this plot, we can diagnose regression models by comparing the points against the 45-degree line, on which the prediction exactly matches the actual value. A residual plot is a graphical representation of the residuals plotted against a given independent variable or the predicted values. A residual is the difference between an observed value and the value predicted by the regression; it measures how far a data point lies from the fitted line, because a few data points will fit the line and others will miss it. In the residual plot, the Y axis represents the residual values and the X axis displays the independent variable (or the predicted values).
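The quantities behind both diagnostic plots are simple to compute. The sketch below uses hypothetical actual and predicted values and only prepares the coordinates that such plots would display; it does not reproduce the plotting routine used in the chapter.

```python
def residuals(actual, predicted):
    """Residual = observed value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]


# Hypothetical test-set values, for illustration only.
actual = [120, 150, 185, 230, 280]
predicted = [118, 155, 180, 236, 275]

res = residuals(actual, predicted)

# Prediction error plot: actual on the X axis, predicted on the Y axis.
# Points on the 45-degree line (x == y) are perfect predictions.
prediction_error_points = list(zip(actual, predicted))

# Residual plot: predicted value on the X axis, residual on the Y axis.
# Residuals scattered evenly around zero suggest no systematic bias.
residual_points = list(zip(predicted, res))

mean_residual = sum(res) / len(res)
print(res, mean_residual)
```

A mean residual close to zero together with no visible pattern in `residual_points` is the usual sign that a regression model is not systematically over- or under-predicting.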

The primary objective of the proposed research is to predict future values of confirmed and recovered cases and deaths in India. The accompanying figures show the 15-day-ahead forecast. The actual data almost overlap the predicted data at the testing and validation stages. The forecasted data are the 15-day-ahead values from Apr. 26 to May 10, 2020, which can be verified as the time arrives. These graphical results indicate that cases will increase in the near future and will reach 72,028 confirmed cases, 38,178 recovered cases, and 1768 deaths by May 10, 2020 (shown in Fig. 27.14 for all three cases). Fig. 27.14 depicts a pandemic graph for confirmed and recovered cases and deaths for the forecasting models developed with both datasets. Fig. 27.14A shows that the number of confirmed cases increases rapidly; on the other hand, because of available medical facilities and the special attention given to treating COVID-19 patients, recovered cases are subsequently increasing and deaths are almost stable for the period of forecasting (i.e., Apr. 26, 2020 to May 10, 2020). The same trend is observed for the model developed with daily data, as shown in Fig. 27.14B, but with somewhat more variation. The comparative results of both models are shown in Fig. 27.15A and B for confirmed and recovered cases and deaths.
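The text does not state how the multi-day forecast was generated from a next-day model. One common approach, assumed here purely for illustration, is recursive forecasting: each predicted value is appended to the history and becomes part of the next 7-day window. The predictor `average_increment_model` is a hypothetical placeholder, not one of the chapter's estimators, and the history values are invented.

```python
def forecast_recursive(history, predict_next, horizon=15, window_size=7):
    """Forecast `horizon` steps by feeding each prediction back into the window."""
    values = list(history)
    for _ in range(horizon):
        window = values[-window_size:]
        values.append(predict_next(window))
    return values[len(history):]


def average_increment_model(window):
    """Placeholder predictor: last value plus the mean day-to-day increase in the window."""
    steps = [b - a for a, b in zip(window, window[1:])]
    return window[-1] + sum(steps) / len(steps)


# Hypothetical cumulative counts for the last 7 observed days.
history = [100, 110, 120, 130, 140, 150, 160]
forecast = forecast_recursive(history, average_increment_model, horizon=15)
print(forecast)
```

Because later predictions are built on earlier ones, errors can accumulate across the horizon, which is one reason a 15-day forecast is less certain toward its end than at its start.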

Because the model was validated on data from the lockdown period, when sufficient precautions were being taken by the Indian government, the forecast is biased toward, and conditional on, the lockdown situation. More detailed data are needed to improve forecasting models for COVID-19 [8]. The forecast may change slightly if the government makes other decisions about the lockdown in the near future. Also, the growth of the pandemic will slow in the coming days only if people take precautions according to the advisories issued from time to time by the Ministry of Health and Family Welfare of the Government of India. At this point, it is advisable for the lockdown to continue or for limited privileges to be given in specific areas of the country to fight COVID-19.

Forecasting the future trend of the COVID-19 pandemic is a critically important job for researchers to assist the government in making appropriate decisions to protect human lives. This research explores the development of ML-based models for 15-day-ahead forecasting based on cumulative and daily data for confirmed and recovered cases and deaths reported in India. Although statistical and mathematical techniques have been applied to forecast future values of COVID-19, traditional statistical techniques are unable to incorporate variations in data of a nonlinear nature. This study used ML-based techniques to build forecasting models because these techniques are capable of producing comparatively better outputs. Regression techniques were used together with the sliding window, feature extraction, and feature selection to develop robust forecasting models. The models developed using cumulative and daily datasets for the three cases forecast that confirmed and recovered cases and deaths in India will increase in the near future. Model validation for data from Apr. 15 building ML models. Cumulative data are linear in nature whereas daily data are slightly nonlinear in nature; therefore, the model developed using cumulative data outperforms the model developed with daily data. Despite having limited data for training, the models perform well. It is expected that this research work will provide a basis for the development of a more accurate forecasting model and that it might also be used to arrive at a strategy to control the COVID-19 pandemic. In the future, ML models can be developed with a greater amount of data, and forecasting might be done for other countries with ML techniques such as deep learning. Also, long-term forecasting for 30-60 days might be performed.

Forecasting versus projection models in epidemiology: the case of the SARS epidemics

Prediction of SARS epidemic by BP neural networks with online prediction strategy

COVID 19 in India: strategies to combat from combination threat of life and livelihood

Ministry of Health and Family Welfare, GoI

Neural networks performance in exchange rate prediction

Improvement of time forecasting models using a novel hybridization of bootstrap and double bootstrap artificial neural networks

Real-time forecasts of the COVID-19 epidemic in China from

Application of the ARIMA model on the COVID-2019 epidemic dataset, Data Brief

Online Forecasting of Covid-19 Cases in Nigeria Using Limited Data, Data in Brief

Predicting the Growth and Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing

COVID-19 outbreak prediction with machine learning, SSRN Electron

COVID-19: challenges to GIS with big data

Analysis and forecast of COVID-19 spreading in China

Evaluation of the lockdowns for the SARS-CoV-2 epidemic in Italy and Spain after one month follow up

COVID-19 virus outbreak forecasting of registered and recovered cases after sixty day lockdown in Italy: a data driven model approach

1,000,000 cases of COVID-19 outside of China: the date predicted by a simple heuristic

Prediction for the spread of COVID-19 in India and effectiveness of preventive measures

Composite Monte Carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction

Predicting short-term stock prices using ensemble methods and online data sources

Coronavirus Outbreak in India

The effect of data pre-processing on optimized training of artificial neural networks

Adaptive sliding window algorithm for weather data segmentation

Application of sliding window technique for prediction of wind velocity time series

Vehicle speed prediction via a sliding-window time series analysis and an evolutionary least learning machine: a case study on San Francisco urban roads

Multiple-output support vector regression with a firefly algorithm for interval-valued stock price index forecasting, Knowledge-Based Syst

Depression episodes detection in unipolar and bipolar patients: a methodology with feature extraction and feature selection with genetic algorithms using activity motion signal as information source

Data Mining Concepts and Techniques, Third

Stock market prediction with various technical indicators using neural network techniques

Bagging predictors

Prediction of foreign exchange rate using regression techniques

Displacement prediction of landslide based on generalized regression neural networks with K-fold cross-validation

Regression techniques for the prediction of stock price trend

Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation

Predictive modelling for solar thermal energy systems: a comparison of support vector regression, random forest, extra trees and regression trees