title: Expectation-Based Probabilistic Naive Approach for Forecasting Involving Optimized Parameter Estimation
authors: Ahuja, Sahil; Kumar, Abhimanyu
date: 2022-04-23
journal: Arab J Sci Eng
DOI: 10.1007/s13369-022-06819-0

Abstract: This paper presents a forecasting technique based on the principle of the naïve approach imposed in a probabilistic sense, which allows the prediction to be expressed as the statistical expectation of known observations under a weight involving an unknown parameter. This parameter is learnt from the given data through minimization of error. The theoretical foundation is laid out, and the resulting algorithm is concisely summarized. Finally, the technique is validated on several test functions (and compared with ARIMA and Holt-Winters), special sequences and real-life covid-19 data. Favorable results are obtained in every case, and important insight into the functioning of the technique is gained.

Extrapolative forecasting methods are widely used in production and inventory decisions [1], tourism data forecasting [2], economic and financial risk management [3], wind forecasting [4], analyzing and forecasting traffic dynamics [5], forecasting road accidents [6], demographic and epidemiological projection [7], etc. One finds forecasting-related work employing both conventional techniques and machine learning (ML)-based techniques [8]: for instance, electricity load forecasting using a hybrid model based on IEMD, ARIMA, WNN and FOA [9]; short-term electricity price forecasting using an adaptive hybrid model based on VMD, SAPSO, SARIMA and DBN [10]; and electricity requirement forecasting for smart grids and buildings using ML models, ensemble-based techniques and ANNs [11]. The most recent application of forecasting techniques by various researchers is on covid-19 data (see for instance [12-16]). The future is related to the past; however, the knowledge of this relation is not available, which calls for developing innovative methods of prediction.

Before introducing our technique, let us look at the methods proposed and studied in the literature. There is a wide range of frequently used quantitative forecasting tools, but most of the focus is still on the moving average, simple linear regression based on covariance, and multiple linear regression. Dedicated forecasting algorithms like ARIMA, SARIMA, Holt-Winters, etc. largely rely on an integrated moving average after decomposing the series into trend, seasonality and white noise. According to Billah et al. [17], applications of exponential smoothing to time series forecasting usually rely on simple exponential smoothing, trend-corrected exponential smoothing and a seasonal variation. Their results indicate that information criterion approaches provide the best basis for automated method selection, with the Akaike information criterion having a slight edge over its counterparts. Exponential smoothing-based models like Holt-Winters are widely used for forecasting [18]. Interested readers may refer to the work of Chatfield et al. [19] and Armstrong et al. [20], who provide general guidelines for selecting forecasting methods. Carbonneau et al. [21] applied a representative set of traditional and ML-based forecasting techniques to demand data and compared the accuracy of the methods.
The average performance of the ML techniques did not outperform the traditional deterministic approaches, although a support vector machine (SVM) trained on multiple demand series produced the most accurate forecasts. Efficient training of ML models with ANN, CNN, LSTM and GRU backbones is now possible for deep representation learning. Siami et al. [22] concluded that the average reduction in error rates obtained by LSTM was between 84% and 87% when compared to ARIMA, indicating the superiority of LSTM over ARIMA; however, LSTM is only suitable where ample data and computational resources are available, and the problem of overfitting in ML must be tackled through regularization techniques. Myrtveit et al. [23] simulated a machine learning model and a regression model, based on which they suggested that more reliable research procedures need to be developed before one can have confidence in the conclusions of comparative studies of software prediction models. Zhang et al. [24] and Khashei et al. [25] created hybrid models combining ARIMA and ANN to capture linear and nonlinear behaviour simultaneously and obtain better accuracy. A hybrid forecasting system based on a dual decomposition strategy and multi-objective optimization was proposed by Yang et al. [26] for electricity price forecasting. In fact, optimization algorithms are widely used to obtain optimal parameters of forecasting models; this is important since Das et al. [27] concluded that the accuracy of a PV power forecasting model varies with the forecast horizon even when the forecast model parameters are identical. Zou et al. [28] combined forecasts of individual models with an appropriate weighting scheme to obtain a predictor with smaller variability, so that accuracy can be improved relative to the use of a selection criterion. Stekler et al. [29], working on sports forecasting, found that combining forecasts does improve accuracy; they also noted a pronounced need to adjust the weights given to new and old information, since a prediction may be independent of previous events. Makridakis et al. [30] showed that statistical methods are more accurate than ML methods and require less computation. Green et al. [31], after comparing 25 papers with quantitative comparisons, claimed that complexity (more equations, more complex functional forms and more complex interactions) increases forecast error by 27% on average. A transdisciplinary transition to distributional or probabilistic forecasts has been observed in the past few years: Gneiting et al. [32] formalized and studied notions of calibration in a prediction-space setting, and probabilistic forecasts serve as an essential ingredient of optimal decision making since they quantify uncertainty.

Overall, the need for advancement in forecasting methodology is immense; thus, researchers constantly look for opportunities to overcome challenges in the field. This paper proposes an unconventional forecasting approach in which the principles of the naïve method and the average method are modified and simultaneously employed in a probabilistic sense: a parameter is learnt from the available data by minimizing an error function and is then used to make a prediction at the unknown point. The paper is organized as follows: Section 2 develops the technique and formally summarizes the algorithm, and Section 3 applies it to several standard mathematical functions, special sequences and a real-life example (the covid-19 dataset).
Suppose an analyst is given n data points denoted by $(x_i, y_i)_{1 \le i \le n}$. The three most basic approaches in the literature are the average method, the naïve approach, and the drift method. In the average method, the non-observed value is estimated as the average of the past observed values. The naïve approach generates a prediction equal to the last observed value, mathematically $y_{n+1} = y_n$, where $y_n$ is the last datum. Modifying this idea to allow a slope gives the drift method, which amounts to using the first and last observations for linear extrapolation into the future.

Often the future is related to the past; however, the knowledge of this relation (or model) is either unavailable or inaccurate and carries uncertainty, which clearly indicates that forecasts should be probabilistic. In order to determine $y_{n+1}$ from $(y_i)_{1 \le i \le n}$, the proposed technique is an agglomeration of the naïve approach and the average method, but in a probabilistic sense, as described below.

Since $x_{n+1}$ is near $x_n$, the likelihood of $y_{n+1}$ being almost (or exactly) equal to $y_n$ is higher than for any other $y_i$: intuitively, for a continuous function, as $x_{n+1} \to x_n$ we have $y_{n+1} \to y_n$. Thus the value at the point $x_{n+1}$ must be most influenced by the value at the point $x_n$. Let $y_{n+1} = Y$ be treated as a random variable and let $P[Y = y_i]$ denote the probability that $y_{n+1}$ equals $y_i$. This probability falls as a point is chosen farther from $x_{n+1}$. Additionally, it should peak only at $x_{n+1}$, which is not actually attainable since that is the point at which the prediction is to be made, so the value there is not known beforehand. Keeping these properties in mind, $P$ can be chosen to obey a Gaussian distribution curve with its peak at $x_{n+1}$ and some parameter $\sigma$. Thus the predicted value of $y_{n+1}$ is given as the expectation over the past data:

$$y_{n+1} = E[Y] = \sum_{i=1}^{n} y_i \, P[Y = y_i], \qquad (1)$$

where

$$P[Y = y_i] = \frac{\exp\!\left(-\frac{(x_i - x_{n+1})^2}{2\sigma^2}\right)}{\sum_{j=1}^{n} \exp\!\left(-\frac{(x_j - x_{n+1})^2}{2\sigma^2}\right)}. \qquad (2)$$

Here, the parameter $\sigma$ depends on the available data and can be extracted by minimizing the following error function, built from one-step-ahead predictions over the known history:

$$E(\sigma) = \sum_{k=1}^{n-1} \left( \hat{y}_{k+1}(\sigma) - y_{k+1} \right)^2, \qquad (3)$$

where $\hat{y}_{k+1}(\sigma)$ is the prediction at $x_{k+1}$ obtained from (1)-(2) using only the first $k$ observations. This yields the optimal $\sigma^* = \arg\min_\sigma E(\sigma)$, which is substituted in (1) to predict $y_{n+1}$.

With (3), one gets good performance for certain standard functions; however, the technique does not perform satisfactorily for functions like $y = \log(x)$. This situation can be improved by employing the same operating principle on the error function as well, that is, for evaluation at $x_{n+1}$, emphasize the nearby points in the error function. But discarding data points is not a wise decision; instead, all data points must be kept, with higher importance assigned to the nearby ones. So we consider the error function given below:

$$E(\sigma) = \sum_{k=1}^{n-1} \exp\!\left(-\frac{(x_{k+1} - x_{n+1})^2}{2\sigma^2}\right) \left( \hat{y}_{k+1}(\sigma) - y_{k+1} \right)^2, \qquad (4)$$

to determine $\sigma^*$ and then predict $y_{n+1}$. The final technique is concisely given as Algorithm 1.

Algorithm 1: Predict $y_{n+1}$ at $x_{n+1}$

    /* Error function, Eq. (4) */
    def ErrorFunction(σ):
        E ← 0
        for k = 1 to n − 1:
            ŷ_{k+1} ← Predict(σ, k)
            E ← E + exp(−(x_{k+1} − x_{n+1})² / (2σ²)) · (ŷ_{k+1} − y_{k+1})²
        return E

    /* Gaussian-weighted expectation, Eqs. (1)-(2) */
    def Predict(σ, k):
        w_i ← exp(−(x_i − x_{k+1})² / (2σ²)) for i = 1, …, k
        return (Σ_i w_i y_i) / (Σ_i w_i)

    /* Main code */
    σ* ← arg min ErrorFunction
    y_{n+1} ← Predict(σ*, n)

The proposed method is applied to datasets sampled from certain standard functions, from special/popular sequences, and from real-life data such as that of covid-19. In this paper, the Nelder-Mead algorithm [33] is used to minimize the nonlinear loss function. Bottou et al. [34] explained that optimization algorithms such as stochastic gradient descent (SGD) show better performance for large-scale problems; in particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass over the training set. Applying the optimizer to the loss function gives the optimal parameter $\sigma$. The optimizer may be changed; for instance, one may try [35].
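For concreteness, the following is a minimal Python sketch of Algorithm 1. It assumes NumPy and SciPy (scipy.optimize.minimize with the Nelder-Mead method) as the numerical backbone; the helper names (predict, error_function, forecast) and the warm-up index m, which sets the minimum history before one-step-ahead errors are accumulated, are illustrative choices not specified in the paper.

    import numpy as np
    from scipy.optimize import minimize

    def predict(xs, ys, x_target, sigma):
        # Gaussian-weighted expectation of past observations, Eqs. (1)-(2).
        w = np.exp(-(xs - x_target) ** 2 / (2.0 * sigma ** 2))
        return np.dot(w, ys) / np.sum(w)

    def error_function(params, xs, ys, x_next, m=2):
        # Weighted sum of squared one-step-ahead errors, Eq. (4): each term
        # is down-weighted by the distance of its target from x_next.
        sigma = abs(float(params[0])) + 1e-12   # keep sigma positive for Nelder-Mead
        total = 0.0
        for k in range(m, len(xs) - 1):
            y_hat = predict(xs[:k + 1], ys[:k + 1], xs[k + 1], sigma)
            w = np.exp(-(xs[k + 1] - x_next) ** 2 / (2.0 * sigma ** 2))
            total += w * (y_hat - ys[k + 1]) ** 2
        return total

    def forecast(xs, ys, x_next, sigma0=1.0):
        # Learn sigma* by minimizing the error function, then predict y_{n+1}.
        res = minimize(error_function, x0=[sigma0], args=(xs, ys, x_next),
                       method="Nelder-Mead")
        sigma_star = abs(float(res.x[0])) + 1e-12
        return predict(xs, ys, x_next, sigma_star), sigma_star

Note that σ enters both the prediction weights and the error weights, mirroring the single-parameter nature of the model.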
The proposed technique is applied to certain functions having different shapes and rates of growth, as listed in Table 1. Each test function is sampled at a step size h in the given range, and a prediction is made at every point using the data prior to that point. The resulting actual-versus-predicted plots are shown in Fig. 1 for the considered test functions. For each test function, Table 1 gives the root mean squared error (RMSE) and mean absolute percentage error (MAPE) evaluated between the actual curve and the prediction. In Fig. 1 we observe that the proposed method successfully learns the trend and predicts very close to the actual data. The method also performs well on a periodic trigonometric function like sin x, which has alternating slopes.

An interesting revelation comes from Fig. 2, which depicts the prediction for sin(x) sampled at step sizes h = 0.25, 1.00 and 2.50, respectively. For data points relatively close to each other, the technique quickly learns and adapts, and thus performs better than in the other cases; those working in signal processing might see a resemblance to the Nyquist-Shannon sampling theorem. The output is a flat line for h = 2.5; it is distorted and phase-shifted (appearing non-differentiable) for h = 1; but for h = 0.25 the predicted curve initially overestimates, then corrects itself over two wavelengths, eventually giving an almost accurate approximation of the actual curve.

Figure 1 indicates that the prediction is accurate overall, except initially. Such initial divergence is a generic property of extrapolation methods and is only circumvented when the functional forms assumed by the method (inadvertently, or intentionally due to additional information) accurately represent the nature of the function being extrapolated. In the beginning, the prediction is poor because insufficient data precede the point at which the prediction is to be made; eventually the prediction improves and fits the actual curve better.

To further strengthen confidence in the proposed technique, it is compared with ARIMA and Holt-Winters. ARIMA (autoregressive integrated moving average) is a more complex version of the autoregressive moving average model, with the addition of integration: the model uses the dependence between an observation and a residual error from a moving average model applied to lagged data, while the "integrated" aspect differences the series to make it stationary. The Holt-Winters technique is a popular time series forecasting approach that can account for both trend and seasonality. It builds on three smoothing methods: simple exponential smoothing (SES, which assumes that the level of the time series remains constant and so cannot be used with series exhibiting trend or seasonality), Holt's exponential smoothing (HES, which allows a trend component in the data), and Winter's exponential smoothing (WES, an enhancement of Holt's method that finally allows seasonality to be included). Figure 3 shows the comparative plot of predictions made over the specified range for the linear and exponential functions using ARIMA and Holt-Winters against the point predictions of the proposed technique. The exercise is omitted for the other functions since the conclusion remains the same or the other methods fail on negative values.
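To illustrate the experimental protocol just described (predicting every point from the data before it, then scoring against the actual curve), here is a short sketch; it reuses the hypothetical forecast helper from the earlier sketch, and the warm-up of five points is an assumption of convenience.

    import numpy as np

    def rolling_evaluation(f, x_start, x_end, h, warmup=5):
        # Sample f at step h, predict each point from its predecessors,
        # and report RMSE and MAPE against the actual values.
        xs = np.arange(x_start, x_end + h, h)
        ys = f(xs)
        preds, actuals = [], []
        for k in range(warmup, len(xs) - 1):
            y_hat, _ = forecast(xs[:k + 1], ys[:k + 1], xs[k + 1])
            preds.append(y_hat)
            actuals.append(ys[k + 1])
        preds, actuals = np.array(preds), np.array(actuals)
        rmse = np.sqrt(np.mean((preds - actuals) ** 2))
        mape = 100.0 * np.mean(np.abs((preds - actuals) / actuals))
        return rmse, mape

    # Example: rmse, mape = rolling_evaluation(np.exp, 0.0, 5.0, 0.1)
    # (np.exp avoids the zero crossings that make MAPE undefined for sin.)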
The proposed method is also applied to two widely known sequences, namely the Fibonacci and partition sequences, described below.

1. Fibonacci Sequence: defined by the recursive relation F(n) = F(n − 2) + F(n − 1) and listed as 1, 1, 2, 3, 5, 8, 13, 21, and so on. More information can be found on OEIS [37].
2. Partition Sequence: p(n) counts the number of ways n can be written as a sum of positive integers, and is listed as 1, 1, 2, 3, 5, 7, 11, 15, 22, and so on [36].

The actual and predicted data points for the considered sequences are depicted in Fig. 4. The obtained MAPE is 78.8562% for the Fibonacci sequence and 73.6793% for the partition sequence. Although the values do not match, the predicted curve increases at the same rate of growth. Reducing the step size is not an option in situations like these, where the function is not defined for non-integer inputs. Considering observations from the earlier results, one can instead refine the prediction by scaling the x-axis (say by 1/6), which is a reversible step; both this scaling and the standardization used below are sketched at the end of this section. The improved results are depicted in Fig. 5: the obtained MAPE is 3.0263% for the Fibonacci sequence and 8.2235% for the partition sequence.

In order to see how the proposed technique performs on real-life data, it is applied here to daily covid-19 cases in India from January 30, 2020 to September 7, 2020 [38]. The dataset spans 221 points (which is insufficient for ML algorithms to perform well). It is known that machine learning algorithms converge much faster with feature scaling than without it; additionally, scaling helps avoid early saturation, as in the case of sigmoid activations in neural networks. Since the proposed algorithm is driven by the Nelder-Mead method, a monotonic transformation like scaling seems to provide improved results. Within the dataset, the "New Cases Smoothed" feature column is considered and preprocessed by removing null values and applying the standardization $x_{\mathrm{std}} = \frac{x - \mathrm{mean}(x)}{\sqrt{\mathrm{var}(x)}}$. The proposed technique is applied to this preprocessed data to arrive at a prediction; reversing the preprocessing step then gives the final plot of actual versus predicted shown in Fig. 6. Evidently, it is a good fit, with a MAPE of 8.6810%. Since the model has only one tunable parameter σ*, it is very fast to train and lightweight, as the parameter can be stored in a 64-bit/128-bit floating-point representation; in fact, immense work has emerged in the last year on precisely this topic. The objective here was to show that our technique can learn unprecedented variation in data. This was partially evident from the prediction of the sinusoidal functions as well, but this example further strengthens the claim.

This paper proposes a forecasting approach in which the principles of the classical naïve method and the average (expectation) method are probabilistically modified and simultaneously employed to predict, with a crucial parameter of the distribution estimated from past data through loss minimization. Although this paper employs the Nelder-Mead algorithm for optimization, the reader is encouraged to pursue other techniques. The proposed technique converges in every considered scenario within a fraction of a second to produce the forecast. It is rigorously tested on several functions and sequences of different natures and growth rates, and compared with other popular techniques like ARIMA and Holt-Winters. It is also applied to covid-19 data to demonstrate that the technique adapts to unprecedented variations. This work encourages the application of probability and optimization to the field of forecasting.

Funding: No funding.
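As a closing illustration, the following minimal sketch shows the two reversible preprocessing steps used above, namely x-axis scaling for the integer sequences and standardization for the covid-19 series, under the same assumptions as the earlier sketches (the forecast helper is the hypothetical one defined previously; the 1/6 factor follows the text above).

    import numpy as np

    # Reversible x-axis scaling used for the Fibonacci/partition experiments.
    fib = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144], dtype=float)
    xs = np.arange(1, len(fib) + 1) / 6.0
    y_hat, _ = forecast(xs[:-1], fib[:-1], xs[-1])   # predict the last term

    # Reversible standardization used for the covid-19 series.
    def standardize(y):
        mu, sd = np.mean(y), np.sqrt(np.var(y))
        return (y - mu) / sd, mu, sd

    def destandardize(y_std, mu, sd):
        # Undo the standardization to plot predictions on the original scale.
        return y_std * sd + mu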
References

[1] Forecasting with big data: a review
[2] The tourism forecasting competition
[3] Real-time inflation forecasting in a changing world
[4] A literature review of wind forecasting methods
[5] Data-driven analysis and forecasting of highway traffic dynamics
[6] Exploring the forecasting approach for road accidents: analytical measures with hybrid machine learning
[7] Bayesian probabilistic population projections for all countries
[8] Electrical load forecasting models: a critical systematic review
[9] Short term electricity load forecasting using a hybrid model
[10] An adaptive hybrid model for short term electricity price forecasting
[11] A review on renewable energy and electricity requirement forecasting models for smart grid and buildings
[12] The challenges of modeling and forecasting the spread of covid-19
[13] Forecasting covid-19
[14] Data-based analysis, modelling and forecasting of the covid-19 outbreak
[15] Optimization method for forecasting confirmed cases of covid-19 in China
[16] Forecasting the novel coronavirus covid-19
[17] Exponential smoothing model selection for forecasting
[18] Exponential smoothing: the state of the art
[19] What is the 'best' method of forecasting?
[20] Selecting forecasting methods
[21] Machine learning-based demand forecasting in supply chains
[22] A comparison of ARIMA and LSTM in forecasting time series
[23] Reliability and validity in comparative studies of software prediction models
[24] Time series forecasting using a hybrid ARIMA and neural network model
[25] An artificial neural network (p, d, q) model for time-series forecasting
[26] A hybrid forecasting system based on a dual decomposition strategy and multi-objective optimization for electricity price forecasting
[27] Forecasting of photovoltaic power generation and model optimization: a review
[28] Combining time series models for forecasting
[29] Issues in sports forecasting
[30] Statistical and machine learning forecasting methods: concerns and ways forward
[31] Simple versus complex forecasting: the evidence
[32] Probabilistic forecasting
[33] A simplex method for function minimization
[34] Large-scale machine learning with stochastic gradient descent
[35] Efficient regularized Newton-type algorithm for solving convex optimization problem
[36] The Online Encyclopedia of Integer Sequences
[37] The Online Encyclopedia of Integer Sequences