title: An empirical overview of nonlinearity and overfitting in machine learning using COVID-19 data
authors: Peng, Yaohao; Nagata, Mateus Hiro
date: 2020-06-30
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110055

Abstract: In this paper, we applied support vector regression to predict the number of COVID-19 cases for the 12 most-affected countries, testing for different structures of nonlinearity using Kernel functions and analyzing the sensitivity of the models' predictive performance to different hyperparameter settings using 3-D interpolated surfaces. In our experiment, the model that incorporates the highest degree of nonlinearity (Gaussian Kernel) had the best in-sample performance, but also yielded the worst out-of-sample predictions, a typical example of overfitting in a machine learning model. On the other hand, the linear Kernel function performed badly in-sample but generated the best out-of-sample forecasts. The findings of this paper provide an empirical assessment of fundamental concepts in data analysis and evidence the need for caution when applying machine learning models to support real-world decision making, notably with respect to the challenges arising from the COVID-19 pandemic.

Machine learning algorithms have emerged as a popular paradigm in recent scientific research due to their flexibility to cope with the specificities of the data, not being limited by assumptions such as the functional form of the decision function or the probability distribution of the variables (Hsu et al., 2016). The versatility of this approach has allowed its application in many different contexts, ranging from forecasting financial variables (Henrique et al., 2019) to sentiment analysis in texts (Singh et al., 2017) and medical applications (Motwani et al., 2017; Sidey-Gibbons and Sidey-Gibbons, 2019).
Specifically in response to the COVID-19 pandemic, machine learning methods have also been applied to the detection of new cases using X-ray images (Ozturk et al., 2020) and the prediction of the time series of confirmed cases (Chimmula and Zhang, 2020). Given their ability to approximate highly complex patterns using nonlinear interactions on the data, the performance of machine learning models often exhibits a high sensitivity to variations in their hyperparameters, which are user-defined model-specific parameters that control the learning process. As discussed by Claesen and De Moor (2015), it is not always trivial to find the optimal combination of hyperparameters for a prediction task, or even which hyperparameters are relevant for that specific task. With respect to the recent COVID-19 pandemic, authors like Song and Karako (2020) pointed out the importance of real-time information dissemination, emphasizing the relevance of having not only precise and reliable information about the virus' characteristics and dissemination, but also speed in how that information is generated and spread. Analogously, Leon et al. (2020) discussed the importance of having comparable data to assist decision-making on COVID-19's containment, notably with regard to the impacts of different mitigation and/or suppression strategies across different locations. In this sense, this paper aims to analyze the empirical effects of different hyperparameter settings on prediction performance, especially concerning the degree of nonlinearity that is introduced to the models, using COVID-19 data from the 12 most-affected countries. Moreover, we provide a visual-friendly analysis of the overall sensitivity of the models to hyperparameter changes using 3-D interpolated surfaces.
This paper is structured as follows: section 2 discusses the bias-variance dilemma and related concepts based on Hoeffding's inequality, as well as the mathematical formulation of Support Vector Regression (SVR) and Kernel functions; section 3 describes the data, the steps of the empirical analysis and the discussion of the results; and section 4 discusses this study's limitations and propositions for future developments. The essence of statistical learning is to identify the relevant patterns from observed data and to generalize those patterns to future data, thus providing reasonable explanations for a certain phenomenon of interest. The main challenge of generalization is to make inferences about the population based on a sample taken from it; every sample has particularities that may not represent the true data-generating process, which would only be fully captured by having the data of the whole population. In this sense, it is not enough to evaluate a model's generalization ability based only on its predictive performance for the observed data (a small in-sample error E_in); instead, real-world decision making is interested in approximating the correct value for data that are yet to be observed - a small out-of-sample error E_out. The relationship between E_in and E_out can be seen in Hoeffding's inequality (Hoeffding, 1963), which formalizes the trade-off between capacity and complexity for the construction of a good algorithm for generalization - also known as the "bias-variance dilemma".
Hoeffding's inequality is expressed as:

P(|E_in(h) − E_out(h)| > ε) ≤ 2 · M · e^(−2·ε²·n)    (1)

where h is a predicting model constructed from the observed data, ε is a user-specified margin, n is the sample size and M is a measure of the model's complexity (one possible indicator to measure this is the Vapnik–Chervonenkis dimension (Vapnik et al., 1994)), which expresses the learning capability of a statistical learning algorithm - i.e.: the space of functions that the model can produce as a decision function for the input data. Nonetheless, the broader this set of "attainable functions" is, the more complex the model will be, such that in a scenario in which a "simpler" function would suffice, a more complex set of candidate functions would add more noise to the model's predictions, thus compromising its overall generalization ability. As stated in the "Occam's razor" principle, a simpler model with the same explanatory power tends to be superior to a more complex one. While E_in can be effectively reduced to zero without major challenges, doing so tends to be harmful for generalization purposes, since the researcher would basically be "hoping" for the future data to be equal to the sample taken. On the other hand, the generalization error E_out depends on the allocation between in-sample bias and out-of-sample variance (complexity). After algebraic manipulations of expression 1, the upper bound for E_out can be expressed as:

E_out(h) ≤ E_in(h) + Ω    (2)

which holds with probability at least 1 − δ, where δ = 2 · M · e^(−2·ε²·n) is a constant and Ω represents the penalization for the complexity of the model. Based on Hoeffding's inequality, E_out is thus bounded by the sum of two components: the in-sample error E_in (bias) and the algorithm's complexity Ω (variance). The first term is high when the sample data are not well described, while the second term is high when the predicting model is way too complex to fit well on unseen data - in both cases, the generalization power of the model is hindered.
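The bound in expression 1 can be checked numerically. The sketch below (a Monte Carlo illustration for a single fixed hypothesis, i.e. M = 1; all parameter values are illustrative, not from the paper) estimates the probability that a sample mean deviates from the true mean by more than ε, and compares it against 2 · e^(−2·ε²·n):

```python
import math
import random

# Monte Carlo check of Hoeffding's bound for a single hypothesis (M = 1):
# P(|nu - mu| > eps) <= 2 * exp(-2 * eps^2 * n), with nu the sample mean of
# n Bernoulli(mu) draws. All numbers here are illustrative choices.
def deviation_prob(mu=0.5, n=100, eps=0.1, trials=20000, seed=42):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(n)) / n  # sample mean
        if abs(nu - mu) > eps:
            hits += 1
    return hits / trials

empirical = deviation_prob()
bound = 2 * math.exp(-2 * 0.1**2 * 100)  # Hoeffding upper bound, ~0.271
```

The empirical deviation probability comes out well below the bound, as expected: Hoeffding's inequality is loose but distribution-free, which is what makes it useful as a worst-case guarantee.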
In this regard, the choice of the model and its hyperparameters is especially relevant, since a small change can induce significant differences in generalization effectiveness, as discussed in Claesen and De Moor (2015). While machine learning methods are flexible to the characteristics of the observed data, excessive complexity in those models can induce overfitting. As the name implies, overfitting occurs when the predicting algorithm ends up describing the in-sample data too well, incorporating not only the information related to the data-generating process but also the noise specific to that particular sample - in this scenario, the model would simply be "memorizing" the observed patterns from the past, and thus it would not actually be "learning" the relevant patterns. The effect of excessive complexity can be seen in the so-called "Runge's phenomenon" (Runge, 1901), which refers to the deviation between Lagrange polynomial interpolations and the actual function of interest - while the in-sample approximation error is kept at zero, the out-of-sample approximation error grows larger as the chosen degree of the interpolating polynomial increases. In this paper, we opted to use Support Vector Regression (henceforth SVR) to perform the forecasts, motivated by the fact that this method can generalize nonlinear interactions of a user-specified dimensionality by using Kernel functions, making it easy to compare the relative predictive performance of different structures of feature nonlinearity. SVR (Vapnik, 1995; Drucker et al., 1997) is a regression model that aims to find the optimal middle ground between a good description of the in-sample data ("bias") and the overall complexity of the decision function ("variance"), and is one of the most popular machine learning methods in the recent literature (Henrique et al., 2019).
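Runge's phenomenon is easy to reproduce. The sketch below (numpy only; the degrees chosen are illustrative) interpolates Runge's classic function f(x) = 1/(1 + 25x²) at equispaced nodes - so the in-sample error is zero by construction - and shows that the maximum error between the nodes grows as the polynomial degree increases:

```python
import numpy as np

# Runge's phenomenon: equispaced polynomial interpolation of f(x) = 1/(1+25x^2)
# keeps the in-sample error at zero while the error between nodes grows with
# the degree of the interpolating polynomial.
f = lambda x: 1.0 / (1.0 + 25.0 * x**2)
dense = np.linspace(-1, 1, 1001)  # "out-of-sample" evaluation grid

errors = {}
for degree in (4, 8, 12):
    nodes = np.linspace(-1, 1, degree + 1)        # equispaced interpolation nodes
    coeffs = np.polyfit(nodes, f(nodes), degree)  # interpolating polynomial
    errors[degree] = np.max(np.abs(np.polyval(coeffs, dense) - f(dense)))

# the maximum deviation grows with the degree, concentrated near the edges
assert errors[12] > errors[8] > errors[4]
```

This is the interpolation analogue of the overfitting discussed above: a richer function class fits the sample perfectly while degrading away from it.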
As illustrated by Hoeffding's inequality, the out-of-sample generalization error depends on both sources; in this sense, for the "variance side", SVR considers a band of tolerance ε to avoid overfitting; also, for the "bias side", a penalty term C is added to the objective function for points that lie outside this confidence interval by an amount ξ > 0. We applied the ε-SVR, which uses the ε-insensitive loss function (Vapnik, 1995):

L_ε(y, f(x)) = max{0, |y − f(x)| − ε}    (3)

ε-SVR solves a quadratic programming problem with a global optimal solution, as described in Drucker et al. (1997). The decision function of ε-SVR is given by:

f(x) = Σ_{t=1}^{T} (α_t* − α_t) · κ(x_t, x) + b    (4)

where κ(x_i, x_j) = ⟨ϕ(x_i), ϕ(x_j)⟩, i, j = 1, ..., T is the kernel function associated with the nonlinear mapping ϕ, which is the key element that introduces the nonlinear interactions between the original variables to the model. As discussed by Schölkopf (2001), κ represents the inner product of ϕ, thus it works as a metric of similarity in an arbitrarily high-dimensional feature space, without having to explicitly compute ϕ, which can have up to infinite components (such as in the Gaussian Kernel, for example). To verify the relative effect of considering higher degrees of nonlinearity on the out-of-sample predictive performance of the models, in this paper we used polynomial and Gaussian Kernels, two of the most commonly used functions in Kernel-based machine learning models for both classification and regression tasks. The polynomial Kernel of degree q, given by:

κ(x_i, x_j) = (x_i · x_j + α)^q    (5)

introduces nonlinear interactions (powers and cross-products) between all explanatory variables up to the q-th degree. In this sense, applying SVR with the quadratic Kernel (q = 2) for a dataset with k independent variables x_i, i = 1, 2, ..., k is equivalent to running a regression using all powers of x_i up to degree 2 (x_i, x_i²) and all cross-products x_i·x_j, with i, j = 1, 2, ..., k; i ≠ j.
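The equivalence between the degree-2 polynomial kernel and an explicit quadratic feature map can be verified numerically. A minimal sketch (the feature map `phi` below is one standard construction for q = 2; the vectors are arbitrary examples):

```python
import numpy as np

# Check that the degree-2 polynomial kernel (x.y + alpha)^2 equals the inner
# product of an explicit quadratic feature map: constant term, scaled linear
# terms, and all degree-2 products (squares and cross-products).
def poly_kernel(x, y, alpha=1.0, q=2):
    return (x @ y + alpha) ** q

def phi(x, alpha=1.0):
    feats = [alpha]                                  # degree-0 (intercept) term
    feats += [np.sqrt(2 * alpha) * xi for xi in x]   # scaled linear terms
    feats += [xi * xj for xi in x for xj in x]       # squares and cross-products
    return np.array(feats)

x = np.array([1.0, 2.0, -0.5])
y = np.array([0.3, -1.0, 2.0])
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
```

The kernel computes the same inner product without ever materializing `phi`, which is exactly the "kernel trick" referred to above: for q = 2 and k variables the explicit map already has 1 + k + k² components, while the kernel evaluation stays O(k).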
Analogously, the cubic Kernel (q = 3) includes all powers of x_i up to degree 3 (x_i, x_i², x_i³) and all cross-products up to degree 3 (x_i·x_j, x_i²·x_j, x_i·x_j², with i, j = 1, 2, ..., k; i ≠ j) as regressors. In summary, the polynomial Kernel can effectively generalize arbitrarily high-dimensional nonlinear regressions by changing only the hyperparameter q; α works as a bias term (intercept) that incorporates the interactions of degree 0. By taking the limit q → ∞ and decomposing equation 5 with a Taylor series, one can generalize an infinite-dimensional feature space for the data, yielding the Gaussian Kernel:

κ(x_i, x_j) = exp(−‖x_i − x_j‖² / (2·σ²))    (6)

The Gaussian Kernel (also known in the machine learning literature as "Radial Kernel" or "RBF Kernel") generalizes an infinite-degree polynomial with only one hyperparameter σ, which controls the dispersion of the empirical distribution¹. The choice of the global hyperparameters from SVR's formulation, ε and C, as well as the Kernel-specific hyperparameters α, q and σ, makes a decisive impact on the predictions' quality, since a different decision function is generated for each different combination. Therefore, in this paper we investigated empirically the effect of introducing high-dimensional interactions into a dynamic prediction of COVID-19's cases time series for the most-affected countries, evaluating the predictive gains from using higher degrees of nonlinearity (and consequently a more general set of potential candidates for the optimal decision function), and to what extent this additional complexity can hinder out-of-sample forecasting. We used daily data from the Center for Systems Science and Engineering at Johns Hopkins University up to May 31, 2020, and selected the countries that surpassed 150000 confirmed cases of COVID-19 at this date, namely (in alphabetical order): Brazil, France, Germany, India, Iran, Italy, Peru, Russia, Spain, Turkey, the United Kingdom, and the United States.
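The "infinite-degree polynomial" reading of equation 6 can also be checked numerically: the Gaussian kernel factors as exp(−‖x‖²/2σ²) · exp(−‖y‖²/2σ²) · exp(x·y/σ²), and truncating the Taylor series of the last factor - which is where the polynomial interactions of every degree live - converges to the exact kernel value. A small sketch (vectors and σ are arbitrary examples):

```python
import math
import numpy as np

# The RBF kernel as a limit of polynomial interactions: truncating the Taylor
# series of exp(x.y / sigma^2) and multiplying by the norm factors converges
# to the exact Gaussian kernel value as more terms are kept.
def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def rbf_truncated(x, y, sigma=1.0, terms=15):
    s = x @ y / sigma**2
    series = sum(s**k / math.factorial(k) for k in range(terms))  # e^s truncated
    norm = np.exp(-(x @ x + y @ y) / (2 * sigma**2))
    return norm * series

x, y = np.array([0.5, -0.2]), np.array([0.1, 0.4])
assert abs(rbf(x, y) - rbf_truncated(x, y)) < 1e-10
```

Each term s^k of the truncated series corresponds to the degree-k interactions of a polynomial kernel, which is the sense in which the Gaussian Kernel subsumes polynomials of every degree with a single hyperparameter σ.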
The dependent variable was the number of confirmed cases at period t, y_t. For each time period t, we used as predictors the number of confirmed cases in the seven preceding days: y_{t−7}, y_{t−6}, ..., y_{t−1}. To simulate a more realistic decision-making process, we started with an initial training set composed of the time series up to the day when the respective country reached 10000 confirmed cases - in this paper, we called this timestamp S. Therefore, S is the number of days passed at each country until the 10000-th confirmed case and, consequently, the initial size of the training set D_train. On the other hand, the test set D_test is always the number of confirmed cases one day after the end of the training set - that is, the first generated forecast is ŷ_{S+1}. Then, at each subsequent day after S, the training set was updated with the actual value of the next day (y_{S+1}), with the models being retrained for each combination of hyperparameters, generating the prediction ŷ_{S+2} and comparing it to its actual value y_{S+2}, and so on. For each model, we applied 10-fold cross-validation on the respective training set. The grid of hyperparameters that we used is displayed in the corresponding table. At the end of the process, we obtained the prediction errors between the forecasts ŷ and the actual values y for each day after S, and computed the Mean Absolute Error (MAE) for each combination of hyperparameters {ε, C, α, q, σ}, given by:

MAE = (1/N) · Σ_{i=1}^{N} |ŷ_i − y_i|    (7)

where N is the number of forecast days for each country.

¹ Regarding the generalization of nonlinear interactions by polynomial and Gaussian Kernels, a formal proof can be obtained in Peng (2016).
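The expanding-window scheme described above can be sketched as follows. This is a minimal illustration with scikit-learn's SVR on a synthetic cumulative series (in the paper, y would be a country's confirmed-case series and S the day of the 10000-th case; the values of S, C and ε below are placeholders, and the cross-validated grid search is omitted):

```python
import numpy as np
from sklearn.svm import SVR

# Expanding-window one-step-ahead forecasting: predict day t+1 from the seven
# preceding days, retrain daily, and score the forecasts with the MAE.
rng = np.random.default_rng(0)
y = np.cumsum(rng.poisson(100, size=120)).astype(float)  # synthetic cumulative series

def lagged(series, t, lags=7):
    return series[t - lags:t]  # the `lags` observations preceding day t

S = 60  # initial training cut-off (placeholder for the 10000th-case day)
errors = []
for t in range(S, len(y) - 1):
    # training pairs: features are the 7 lags of day i, target is y[i]
    X_train = np.array([lagged(y, i) for i in range(7, t + 1)])
    y_train = y[7:t + 1]
    model = SVR(kernel="linear", C=1.0, epsilon=0.1)
    model.fit(X_train, y_train)
    y_hat = model.predict(lagged(y, t + 1).reshape(1, -1))[0]
    errors.append(abs(y_hat - y[t + 1]))

mae = np.mean(errors)  # out-of-sample MAE for this hyperparameter combination
```

Repeating this loop over a grid of {ε, C, kernel hyperparameters} yields one MAE per combination, which is the quantity interpolated into the 3-D surfaces discussed in the results.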
After calculating all predictions of the number of confirmed cases y for all periods between S + 1 and May 31, 2020, we took the mean of the in-sample and out-of-sample MAEs, grouped by Kernel function.

Figure 12: Out-of-sample MAE surfaces for the USA

As shown in tables 2 and 3, the polynomial Kernels with degrees 1 and 2 had the best average out-of-sample MAEs, while the Gaussian Kernel had the best average performance for in-sample data, but the worst for out-of-sample data. In light of Hoeffding's inequality and the decomposition of the generalization error sources presented in equation 2, this is a sign that the Gaussian Kernel models have fallen into overfitting - a small E_in at the expense of a large Ω, with the benefit obtained from the first term not compensating the price paid for the second term. In terms of volatility of the surfaces - i.e., the relative sensitivity of the models to hyperparameter changes - figures 1 to 12 showed that the polynomial Kernels of degrees 3 and 4 (in green and teal, respectively) were the most volatile ones, exhibiting large oscillations in the forecasting errors for small changes in C or ε. On the other hand, the least volatile surfaces were those of the linear Kernel (in magenta) - an expected result, given that this function yields the least complex decision functions - and also the Gaussian Kernel (in blue), although the latter is theoretically able to generalize more complex polynomials than the other Kernels. Even though the shape of the surfaces was different across the 12 analyzed countries, their overall results were similar, both in terms of the ordering of the Kernels with respect to out-of-sample MAEs and in terms of the surfaces' volatility.
However, while the values shown in tables 2 and 3 are easy to interpret, they can be somewhat misleading, since they represent the average value of the surfaces shown in figures 1 to 12, while in real-world decision-making the researcher would not pick the average, but the minimum value of the surfaces to actually generate predictions - i.e., out of all hyperparameter combinations, only the one that minimizes the out-of-sample prediction error would be effectively used. In this sense, we also compiled the smallest MAE across hyperparameter combinations for in-sample and out-of-sample data, grouped by the respective Kernel function. The results are displayed in tables 4 and 5 below. Analyzing tables 4 and 5, the empirical consequences of the bias-variance trade-off become even clearer. Once again, the linear Kernel exhibited bad minimal in-sample MAEs, but had the best ones for out-of-sample data; on the other hand, the Gaussian Kernel performed almost perfectly on in-sample data for at least one hyperparameter combination for every country, while its best out-of-sample result was significantly worse than that of all other Kernel functions for every analyzed country - despite yielding the best E_in, complexity took its toll and resulted in the worst generalization. In the general context of statistical learning, just as more complex models can suffer from overfitting, simpler models like the linear Kernel are also prone to "the other side" of bad generalization: underfitting, which arises when the model is too simple to capture the relevant patterns of the data-generating process. In some contexts, the additional complexity of nonlinear models can effectively contribute to better out-of-sample performance for financial data, as also pointed out by works like Gu et al. (2020) and Wang et al. (2020). However, for the specific predictive task performed in this paper, the introduction of nonlinearity did not manage to enhance the forecasts' quality, and the extra complexity made them worse instead.
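The distinction between the two summaries - average MAE per Kernel (tables 2 and 3) versus minimum MAE per Kernel (tables 4 and 5) - amounts to a simple aggregation over the grid of results. A sketch with synthetic stand-in numbers (the kernel names and error magnitudes below are illustrative, not the paper's results):

```python
import numpy as np

# Building both summaries from a flat grid of (kernel, MAE) results: the mean
# over all hyperparameter combinations versus the minimum over them.
rng = np.random.default_rng(7)
results = [  # one out-of-sample MAE per hyperparameter combination (synthetic)
    (kernel, float(rng.gamma(shape, 100.0)))
    for kernel, shape in (("linear", 4.0), ("poly2", 3.0), ("rbf", 2.0))
    for _ in range(50)
]

summary = {}
for kernel in {k for k, _ in results}:
    maes = [m for k, m in results if k == kernel]
    summary[kernel] = {"mean": float(np.mean(maes)), "min": float(np.min(maes))}
```

The minimum is never above the mean, so a Kernel with a bad average surface can still hide a competitive best-case combination - which is exactly why the paper reports both views.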
As stated in the "no free lunch theorem" (Wolpert and Macready, 1997), in statistical inference there does not exist a globally superior learning algorithm that dominates all others for every generalization task - in other words, there is no "best" model that suits every application well; instead, there is the right method for the right application. In this sense, not all "nonlinearities" are equal, and the same Kernel function can yield quite different performances depending on the application: for example, in Henrique et al. (in press)'s paper on portfolio allocation using SVR, the inverse multiquadric Kernel function yielded the best results, while in Peng and Albuquerque (2019), which also used SVR but for exchange rate prediction, the same Kernel function yielded overall poor results. This paper conducted an experiment using COVID-19 data to provide an empirical view of some basic machine learning concepts, and has the potential to aid future researchers and professionals in better understanding the trade-offs involved in model building and their subsequent impacts on generalization performance. In this sense, this paper does not intend to exhaust the debate over COVID-19 predicting models, much less the debate over society's reaction to the impacts of the still ongoing pandemic; instead, we intend to draw attention to fundamental aspects of data analysis and machine learning model construction. As apparently small details can make a significant difference in real-life decision-making, we urge extra caution in this regard, especially in COVID-19 times.
One of the main features of machine learning models is their ability to capture nonlinear patterns from the data; therefore, the experiment performed in this paper could be extended to many other methods apart from SVR - such as random forests and deep neural networks - which also have many different hyperparameters (number of trees/hidden layers, number of observations in each terminal node, activation function, etc.) that directly influence these models' generalization performance. While we opted to focus on SVR to facilitate the comparison between the models, the empirical effects of overfitting and hyperparameter changes can also be analyzed using other machine learning models in future research. In addition, popular models such as LASSO and ridge regression can serve as linear benchmarks to analyze the relative performance of nonlinear models, not only for COVID-19 data but for general predicting tasks in various fields of knowledge. Finally, future developments can include COVID-19 data from other countries and other Kernel functions, as well as considering a larger interval grid for the hyperparameters for further details on the behavior of in-sample and out-of-sample prediction performance. Disclaimer: The views expressed in this work are the entire responsibility of the authors and do not necessarily reflect those of their respective affiliated institutions nor those of their members. ☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:

Declaration of Interest Statement

References

- A method of bivariate interpolation and smooth surface fitting for irregularly distributed data points
- akima: Interpolation of Irregularly and Regularly Spaced Data
- Time series forecasting of COVID-19 transmission in Canada using LSTM networks
- Hyperparameter Search in Machine Learning
- Support vector regression machines
- Empirical asset pricing via machine learning
- Modeling nonlinearity in multi-dimensional dependent data
- Literature review: Machine learning techniques applied to financial market prediction
- Portfolio selection with support vector regression: Multiple kernels comparison
- Probability inequalities for sums of bounded random variables
- Bridging the divide in financial market forecasting: machine learners vs. financial economists
- Nonlinear dynamics of equity, currency and commodity markets in the aftermath of the global financial crisis
- COVID-19: a need for real-time monitoring of weekly excess deaths
- Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: a 5-year multicentre prospective registry analysis
- Automated detection of COVID-19 cases using deep neural networks with X-ray images
- Support Vector Regression applied to exchange rates prediction
- Non-linear interactions and exchange rate prediction: Empirical evidence using support vector regression
- Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten
- The kernel trick for distances
- Machine learning in medicine: a practical introduction
- Optimization of sentiment analysis using machine learning classifiers
- COVID-19: Real-time dissemination of scientific information to fight a public health emergency of international concern
- Measuring the VC-dimension of a learning machine
- The nature of statistical learning theory
- Chaos and complexity in a fractional-order financial system with time delays
- No free lunch theorems for optimization