Inherent Complexity: A Problem for Statistical Model Evaluation

Jan-Willem Romeijn
University of Groningen

Abstract

This paper investigates a problem for statistical model evaluation, in particular for curve fitting: by employing a different family of curves we can fit a scatter plot almost perfectly at apparently minor costs in terms of model complexity. The problem is resolved by an appeal to prior probabilities. This leads to some general lessons about how to approach model evaluation.

1 Introduction

Theories often interface with empirical fact through a statistical model, namely a collection of hypotheses that each determine a probability distribution over possible observations. Most statistical inference is carried out on the basis of a model, for example by getting the data to choose among the hypotheses in it, or by redistributing the probability assignment over the hypotheses in the model.

Curve fitting is an instance of statistical inference. For example, the yearly number of car accidents with claimable damage follows a Poisson distribution, whose characteristics depend on the total distance covered by the vehicle. What determines the statistical model is the exact functional dependence of frequency on distance. Since vehicles that do not cover any distance will not incur any damage, the intercept will be zero. One statistical model may be that the dependence is linear, so that the hypotheses in the model differ in the slope of the line that relates distance to expected number of accidents. Another statistical model might postulate a more complicated relation between distance and expected number of accidents, e.g., a quadratic dependence, perhaps with the idea that long-distance drivers have proportionally fewer accidents.

While models are typically chosen at the outset, sometimes they are under scrutiny themselves. For example, we might compare the linear and the quadratic models sketched above. Statistical model evaluation allows us to compare such models on a variety of performance measures. Model evaluation is important for scientists and philosophers of science alike. It allows scientists to submit their modeling assumptions to empirical testing, and thereby address the uncertainty over their theoretical starting points. And it gives philosophers of science a concrete and formally precise handle on a fundamental kind of uncertainty. Examples of model evaluation abound, ranging from climate science and ecology to psychiatry and computational archaeology. If philosophers can motivate and develop norms for dealing with model uncertainty, this will have direct implications for the practice of science.

This paper contributes to our understanding of the norms that drive statistical model evaluation. After an introduction to model evaluation tools in section 2, I present a new problem for them in section 3. I then offer a diagnosis of the problem in section 4. In section 5 I show that the problem can be avoided if we involve prior probability assignments in the model evaluation. Throughout I will mostly avoid mathematical detail, to leave more space for conceptual considerations.

2 Statistical model evaluation

The curve fitting problem sketched in the introduction may not seem statistical. Given a family of curves, we simply choose a curve by minimizing the error, i.e., the sum of the discrepancies between curve and points.
In the so-called least-squares approach, for example, the error is calculated as the sum of the squares of the vertical distances between points and curve. No model seems to be involved in this. Underneath such a minimization procedure, however, we do find a statistical inference. One central modeling assumption is that the number of accidents N follows a Poisson distribution. A further assumption is that the mean of this distribution depends on the distance D covered by the vehicle,

P_\theta(\langle D, N \rangle) = \frac{\lambda(D)^N}{N!} \, e^{-\lambda(D)},    (1)

with \lambda(D) = \theta_1 D + \theta_2 D^2. We then choose \theta_1 > 0 > \theta_2 for the quadratic model, and \theta_1 > 0 with \theta_2 = 0 for the linear one. Note that the model dictates a distribution over N for all values of D, but that it does not determine a probability distribution over the values of D itself. The distance D is an explanatory variable, and we presume that it is randomly sampled from a uniform distribution.

The data consist of m pairs of distances and numbers of accidents, collected in a scatter plot:

S_{DN} = \{\langle d_1, n_1 \rangle, \langle d_2, n_2 \rangle, \ldots, \langle d_m, n_m \rangle\}.    (2)

For any curve and associated hypothesis we can calculate the probability of a scatter plot, i.e., the likelihood of the hypothesis for the data, by multiplying the probabilities of all the points,

P_\theta(S_{dn}) = \prod_{i=1}^{m} P_\theta(\langle d_i, n_i \rangle).    (3)

A data point \langle d, n \rangle lying outside the normal range for some hypothesis, e.g., with n high while d is low and \theta_2 is low too, will be improbable, and hence it will strongly decrease the likelihood of the hypothesis. To fit the curve we look for the value of \theta, denoted \hat{\theta}, that makes the probability of the scatter plot maximal. Generally speaking, maximizing the likelihood of the curve will correspond to minimizing the distance of points to the curve under some notion of distance. Figure 1 offers an impression of what these curves may look like.

[Figure 1: The polynomial curves fitted to the scatter plot.]
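To make the inference concrete, here is a minimal sketch of the maximum-likelihood fit of Equations (1)–(3) for the linear and quadratic models. The simulated scatter plot, the parameter values, and the optimizer settings are illustrative assumptions, not anything fixed by the example above.

```python
# A minimal sketch of the maximum-likelihood fit for the polynomial Poisson
# models of Equations (1)-(3). The simulated data, parameter values, and
# optimizer settings below are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(0)
L, m = 10.0, 50
d = rng.uniform(0.0, L, size=m)                                # distances, sampled uniformly
n = rng.poisson(np.clip(2.0 * d - 0.08 * d**2, 1e-9, None))    # simulated accident counts

def neg_log_likelihood(theta, quadratic):
    """Negative log of Equation (3) under the Poisson model of Equation (1)."""
    theta1, theta2 = (theta[0], theta[1]) if quadratic else (theta[0], 0.0)
    lam = theta1 * d + theta2 * d**2
    if np.any(lam <= 0):                                       # the Poisson mean must be positive
        return np.inf
    return -np.sum(poisson.logpmf(n, lam))

fit_lin = minimize(neg_log_likelihood, x0=[1.0], args=(False,), method="Nelder-Mead")
fit_quad = minimize(neg_log_likelihood, x0=[1.0, -0.01], args=(True,), method="Nelder-Mead")
print("linear    theta-hat:", fit_lin.x, "  max log-likelihood:", -fit_lin.fun)
print("quadratic theta-hat:", fit_quad.x, "  max log-likelihood:", -fit_quad.fun)
```

Because the linear model is nested in the quadratic one, the quadratic fit will attain at least as high a maximum log-likelihood; that gain in fit is precisely what the information criteria discussed next weigh against the extra parameter.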
Against this background it will be clear that evaluating the general shape of the curves, comparing linear and quadratic ones, is indeed part of statistical model evaluation. Note that I use the term "model evaluation", not the more often used "model selection". The selection of a model is a decision, and so involves decision-theoretic as well as inferential aspects. But in what follows I will only consider norms for the comparison of models from an epistemic standpoint.

A very common idea about model evaluation is that, next to the fit with the data, it involves the complexity of the model. If a neat fit with the data is achieved by adding many bells and whistles, we are rightly reluctant to put our trust in it. We then say that the model is fitting to noise, or overfitting. In the example, the best fitting curve from the quadratic model will have a higher likelihood than the best curve from the linear model. But this is not to say that the quadratic curve is better. The question is whether the gain in fit outweighs the cost of a more complicated model.

The extant model evaluation tools, most notably the various information criteria (Claeskens and Hjort, 2008), provide specific formats for this trade-off between simplicity and fit. The two most prominent tools, the Akaike and Bayesian information criteria, or AIC and BIC for short, express the simplicity by means of the number of free parameters in the model (cf. Akaike, 1973; Burnham and Anderson, 2002; Raftery, 1995; Schwarz, 1978). The linear model of the example has one free parameter, and the quadratic model has two. The ICs then differ in how they factor the number of parameters into the trade-off:

AIC(M_\theta) = 2 \log\big(P_{\hat{\theta}}(S)\big) - 2 \dim(M_\theta),    (4)

BIC(M_\theta) = 2 \log\big(P_{\hat{\theta}}(S)\big) - \log(m) \dim(M_\theta),    (5)

in which M_\theta is the model parameterized by the vector \theta, the number of free parameters is given by \dim(M_\theta), and \hat{\theta} is the hypothesis in the model with maximum likelihood for the data S, so that P_{\hat{\theta}}(S) is the likelihood of the maximum likelihood estimator for the data S. In the BIC the penalty for complexity is scaled according to the sample size m of the data.

The involvement of the complexity of models in their evaluation may seem intuitive on pragmatic or metaphysical grounds. A simpler model is easier to use, or we might think that the world itself is a simple place, perhaps because the Demiurge is an efficient or lazy being. The actual reason for the appearance of the complexity penalty in the ICs is epistemic, though. Moreover, the motivation is different for the various information criteria on offer. For example, the AIC factors in the number of parameters as a result of approximating the expected Kullback-Leibler divergence to the true hypothesis. And for the BIC the penalty for complexity drops out of an approximation of the past predictive performance of the model, as measured by the marginal likelihood.

The number of model parameters surfaces repeatedly as a criterion for model evaluation, under a variety of epistemic good-making features of models. Very roughly, the underlying reason is that the predictions and general empirical claims of more complex models will be less robust and reliable. In a more complex model the same number of data points will be used to determine a larger number of parameter values, and so the available information will have to be spread more thinly. For the AIC this shows up in the flatness of the likelihood function over the model, and for the BIC it appears as the flatness of the posterior probability distribution within the model. The general idea is that we can always introduce an additional parameter that improves the best fit in the model, but that we might then lack the data to properly back up a stable value for this additional parameter.

However, this intuition does not cover everything that is salient about complexity in model evaluation. There is another epistemic good-making feature, strongly related to complexity and the number of parameters, that needs to be taken into account when we compare models. This further feature concerns something like model size. It can be expressed by means of the prior probability distribution within the models, as the following model evaluation problem will reveal.

3 Cheap and almost perfect fit

Consider again the example of the scatter plot and the polynomial model. But instead of using the polynomial curves, as detailed above, imagine fitting the data with a model based on trigonometric functions, or sine curves for short. We use the Poisson distributions of Equation (1), but instead of choosing \lambda(D) to be polynomial we choose

\lambda(D) = \alpha_1 - \alpha_1 \cos(\alpha_2 D).    (6)

Figure 2 gives an impression of the fit that may be achieved by the so-called sine model. Importantly, all the points in the scatter plot are given close to maximal probability, because they all end up sitting arbitrarily close to the curve, and hence to the mean of the distribution at the given distance D. The key observation is that we have achieved this remarkable fit at the expense of only two parameters, \alpha_1 and \alpha_2.
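To see how cheaply such a fit comes, here is a brute-force sketch: fix \alpha_1 at half the height of the scatter plot and scan over integer t in \alpha_2 = 2\pi t / L until the curve passes within a tolerance \epsilon of every point. The toy data, the tolerance, and the search bound are all assumptions for illustration; as the argument that follows makes precise, the smaller \epsilon and the larger the number of points, the rarer the fitting values of t become.

```python
# Sketch: a two-parameter sine curve that passes within eps of every point of a
# toy scatter plot. The data, tolerance, and search bound are assumptions for
# illustration; they are not fixed by anything in the text.
import numpy as np

rng = np.random.default_rng(1)
L, m = 10.0, 4
d = rng.uniform(0.0, L, size=m)          # distances, sampled uniformly on (0, L)
n = rng.uniform(0.0, 10.0, size=m)       # values to be fitted; any pattern will do
H = n.max()
alpha1 = H / 2.0                         # amplitude: the curve then ranges over [0, H]
eps = 0.5                                # allowed discrepancy per point

t = np.arange(1, 500_001)                # candidate periods L/t, i.e. alpha2 = 2*pi*t/L
curve = alpha1 - alpha1 * np.cos(2.0 * np.pi * np.outer(t, d) / L)   # shape (T, m)
fits = t[np.all(np.abs(curve - n) < eps, axis=1)]

print(f"{fits.size} values of t fit all {m} points to within {eps}")
print("smallest fitting values of t:", fits[:5])
```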
It is known that we can obtain a perfect fit to m data points with a polynomial curve of degree m − 1. But fitting any number m of points with two parameters seems inexplicably efficient. Clearly, if we were to apply model evaluation criteria like AIC or BIC, or indeed any other method in which complexity is expressed by the number of parameters, the sine model wins out over the quadratic model, and most likely also over the linear model. What is going on?

[Figure 2: The sine curve that perfectly fits the scatter plot.]

Before providing a diagnosis, let me emphasize that the claim that a near-perfect fit is always possible is mathematically non-trivial. In the remainder of this section I will provide more detail to substantiate it. Notably, the fit does not hinge on the assumption of any particular distribution, be it Poisson, normal, or otherwise, or on any particular format of the data, be it real numbers, integers or otherwise. Moreover, given that the scatter plot will manifest on a finite domain 0 < D < L, we need not even suppose that the parameters are real valued: it is enough to consider sine curves with a period L/t for t ∈ N, as one does in a Fourier series. Despite all this, it turns out that there are always infinitely many almost perfect fits to a set of points. This abundance of solutions will turn out to be of crucial importance for the resolution of the problem.

Say that we have been given a scatter plot S_{dn} whose farthest points are at d_i = L and n_j = H. For convenience we set \alpha_1 = H/2, but any larger \alpha_1 will work too. Take any specific point \langle d, n \rangle from the scatter plot, consider the curves with \alpha_2 = 2\pi t / L, so that the period is L/t, for increasing t, and ask: for what values of t does the sine curve intersect the line D = d in very close proximity to the value n? Observe that d ∈ [kL/t, (k+1)L/t] for some integer k, and that the curve covers the whole of the range [0, H] twice over this interval of D. If we allow for a discrepancy of \epsilon between the curve and the value n, and assume that d falls within the first half of the interval, we must require that

\frac{L}{2\pi t} \cos^{-1}\!\left(1 - \frac{2(n-\epsilon)}{H}\right) < d - \frac{kL}{t} < \frac{L}{2\pi t} \cos^{-1}\!\left(1 - \frac{2(n+\epsilon)}{H}\right).    (7)

If d falls in the second half of the interval [kL/t, (k+1)L/t], we require an analogous constraint. Because the slope of the cosine is bounded between −1 and 1, we may replace the above inequalities with

\frac{L}{2\pi t} \cos^{-1}\!\left(1 - \frac{2n}{H}\right) - \frac{\epsilon L}{\pi t H} < d - \frac{kL}{t} < \frac{L}{2\pi t} \cos^{-1}\!\left(1 - \frac{2n}{H}\right) + \frac{\epsilon L}{\pi t H},    (8)

and similarly for d sitting in the second half of the interval. Consequently, for every t there is a specific region of total length 4\epsilon L/(\pi t H) within the interval of length L/t that contains d, such that whenever d falls in this region the resulting error lies within the \epsilon bound. The question is merely whether, for each t separately, d indeed lies within this specific region.

To establish when the latter obtains, we first recall that the d_i's from the scatter plot S_{dn} were randomly sampled from a uniform distribution over [0, L]. This means that the individual d from the sample is almost surely, i.e., with probability one, a random number: in particular, d/L is irrational, so that there is no pattern in how d shows up inside the intervals [kL/t, (k+1)L/t] for increasing t. The locations of d are evenly distributed over all parts of these intervals. Hence for any \epsilon > 0 there will be infinitely many t for which d falls within the portion of length 4\epsilon L/(\pi t H) inside the interval of length L/t. The relative size of the region in which the curve is sufficiently close to the value n remains constant at 4\epsilon/(\pi H) for increasing t.

And so there will be infinitely many sine curves that have an arbitrarily small error in fitting the point \langle d, n \rangle. This suffices as an argument for there being an infinity of curves that fit any finite number of points almost perfectly. For a single point, the fraction of sine curves will tend to 4\epsilon/(\pi H). So for a set of m points that are randomly distributed over D, the fraction will tend to (4\epsilon/\pi H)^m. When making \epsilon small, and thus maximizing the likelihoods, the fraction of curves with a good enough fit will be very small. But there will still be infinitely many fitting ones.
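The constancy of this fraction is easy to check numerically. The sketch below takes a single randomly placed point \langle d, n \rangle, scans the curves with period L/t over a large range of t, and compares the empirical fraction of fitting curves with the 4\epsilon/(\pi H) figure from the argument above; all specific numbers are illustrative assumptions, and the empirical fraction may come out somewhat larger, since 4\epsilon/(\pi H) arises as a conservative bound.

```python
# Numerical check of the argument above: for one randomly placed point (d, n),
# count how many curves with period L/t pass within eps of n, and compare the
# empirical fraction with 4*eps/(pi*H). All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
L, H = 10.0, 10.0
d = rng.uniform(0.0, L)                  # a single, randomly placed distance
n = rng.uniform(0.0, H)                  # the value the curve should approximate
eps = 0.1
alpha1 = H / 2.0

t = np.arange(1, 200_001)
lam = alpha1 - alpha1 * np.cos(2.0 * np.pi * t * d / L)    # curve value at d, for each t
fraction = np.mean(np.abs(lam - n) < eps)

print("empirical fraction of fitting curves:", fraction)
print("4*eps/(pi*H)                        :", 4 * eps / (np.pi * H))
```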
4 Diagnosis of the problem

The fact that there are infinitely many equally well-fitting sine curves incapacitates some of the standard model evaluation tools. The AIC, for example, is not defined for unidentified models. While being silent may be better than positively evaluating the intuitively incorrect sine model, a negative evaluation of the sine model seems preferable. Our discussion revolves around three observations: the sine model is not robust, counting parameters is a nontrivial affair, and the set of best fitting sine curves is not well-behaved. This sets us up for a solution to the problem along Bayesian lines in the next section.

First consider the robustness of the sine model. Imagine that we alter the scatter plot by slightly nudging a single data point. What will be the result if the curve is a polynomial of a given degree? Clearly, any curve that was fitted to the data will change a little as well. But the rough shape of the curve will not change a lot: a small change in the data is matched by a similarly small change to the best fitting curve. By contrast, if the curve is a trigonometric function, then nudging a single data point slightly will radically alter the best fitting curve. It will lead to a completely new set of best fits. We might say that the sine model is too versatile, lacking robustness, or skittish: it is oversensitive to the smallest of changes in the scatter plot.

The AIC and BIC do not accommodate this feature of models. MDL-based model evaluation tools and extensions of the AIC and BIC fare slightly better. The Fisher information approximation (FIA), for example, includes a so-called geometric complexity term, based on the Fisher information. One might say that this expresses model size in terms of how densely packed the model is with likelihood functions (Grunwald, 2007; Myung et al., 2000; Ly et al., 20XX). The term penalizes skittish models because they will in general cover a larger set of probable data patterns: small changes lead to wildly different functions, and in this sense the skittish models are indeed densely packed. Furthermore, in developments of the AIC and the BIC, as discussed in Bozdogan (1987) and further references therein, we also encounter the Fisher information. So there are natural extensions of the AIC and the BIC that accommodate something of the skittishness.

However, in all of these refined methods, the contribution of the Fisher information (FI) term is not of the required order of magnitude to resolve the problem of the sine curves. Except in the original AIC, the FI term is trumped by the term that captures complexity as the number of free parameters, which grows with log(m). And the FI term cannot compete with the fit term, which grows with m in all model evaluation tools. For larger data sets the influence of the FI term on the model score therefore dwindles, so that the sines seem preferable after all.
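To indicate the orders of magnitude involved, the FIA is usually written in an asymptotic form along the following lines; the exact conventions vary between presentations, so this display is a sketch rather than the canonical formula, and it is written as a code length, so that lower is better, unlike the scores in Equations (4) and (5):

FIA(M_\theta) = -\log P_{\hat{\theta}}(S) + \frac{\dim(M_\theta)}{2} \log\frac{m}{2\pi} + \log \int_\Theta \sqrt{\det I(\theta)}\, d\theta,

where I(\theta) is the Fisher information matrix. The first, fit term grows linearly with the sample size m, the second grows with log(m), and the third, geometric complexity term does not grow with m at all, which is why its influence on the comparison dwindles for larger data sets.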
A second observation concerns the deceptively low dimension of the sine model: it seems to harbor an inherent complexity that is not expressed in the number of parameters. The sine model illustrates that model dimension is an elusive notion. As a quick illustration, note that statistical parameters are often real numbers. But real numbers are such that we can package any amount of information into them. For example, a sufficiently complicated function will allow us to pack two real numbers into a single one, by constructing the numerical expansion of a number from two such expansions, e.g., 0.135... and 0.246... yield 0.123456..., and so on. While this sort of function is of course hopelessly contrived, it illustrates that counting statistical parameters does not give us a fair indication of model dimension.

This general observation has been made about model evaluation criteria more often, for example in Bozdogan (1987), who proposes to adapt the AIC by involving the sample size, thereby bringing it closer to the BIC. His motivation for adapting the penalty term is, by and large, that the notion of complexity is not adequately captured by the dimension term in the original AIC. Similar sentiments are expressed in Balasubramanian (2005), who develops minimum description length (MDL), and in Romeijn and van de Schoot (2008) and Romeijn et al. (2012), who investigate and extend the BIC. The latter two point to a more general notion of model size as a component of complexity. However, while these proposals are a step in the right direction, the adapted versions of AIC, BIC, and MDL still give the number of parameters a central role.

A more promising method for dealing with the problem of the sine curves is offered by the so-called deviance information criterion, or DIC for short (Spiegelhalter et al., 2002). The DIC was originally designed for comparing hierarchical Bayesian models, in which the number of free parameters is not clearly defined. Central to the DIC is the so-called deviance, i.e., the reduction in surprise due to estimation, which can be thought of as a degree of overfitting. The penalty for complexity in the DIC is given by the effective number of parameters, which is based on the notion of deviance. However, in this paper I will not investigate in detail how the DIC responds to the sine model.

A final observation brings us closest to the ultimate reason that trigonometric curves are problematic for the purpose at hand. Note that both polynomial and trigonometric curves can be used as a basis for the space of functions on a finite domain, in the algebraic sense that they parameterize that space: we can write down functions by their Taylor or Fourier series. We can collect the curves that almost perfectly fit some scatter plot into a set within the space of functions. But for the Taylor and Fourier series this set will look very different. In the Taylor parameterization, the set is a well-behaved region sitting somewhere in the linear span of at least m axes. But in the Fourier parameterization the set of well-fitting curves will look much wilder: disjointed and intersecting with distinct axes rather than being lumped together.

The implications of this are best brought out through a variant of the robustness considerations discussed above, namely by considering what happens if we add a point to the scatter plot. The original polynomial curve will not change too radically: the region of well fitting curves shifts slightly.
By contrast, the set of best fitting sine curves alters significantly with the addition of a point, not so much by being relocated but rather by being constrained severely. There are infinitely many sine curves that fit the scatter plot, but almost all of those curves will miss the additional point by a stretch, and so be eliminated from the set of well fitting curves. The solution of the problem hinges on exactly this elimination of hypotheses.

5 Priors to the rescue

This section develops a particular response to the problem of the sine curves. It relies on so-called Bayesian model selection, or Bayesian model evaluation (BME). Following BME, the sine model loses against polynomial models because of the specific failure of robustness introduced above. The message of this section is not that we should embrace BME as the new standard in model selection. I will not make a systematic comparison with other model evaluation criteria and their relation to the salient notion of robustness. Looking at the solution that BME provides and the central role for the so-called marginal likelihood in BME, we might expect that other approaches in which the marginal likelihood is central, e.g., the DIC and MDL-based criteria, will also provide a solution. Because we can only compute something like a marginal likelihood if we adopt some version of a prior within the model, the central point of this section is rather that solutions will have to rely on priors of some kind.

The central idea of BME is to compare models by their posterior probability assignment:

\frac{P(M_1|S)}{P(M_2|S)} = \frac{P(M_1)}{P(M_2)} \times \frac{P(S|M_1)}{P(S|M_2)}.    (9)

Assuming an equal prior for the models M_i, the posteriors are completely determined by the ratio of the so-called marginal likelihoods,

P(S|M_i) = \int_\Theta P(H_\theta|M_i)\, P(S|H_\theta \cap M_i)\, d\theta,    (10)

in which \Theta is the parameter space. The likelihoods P(S|H_\theta \cap M_i) are a different notation for the P_\theta(S) of the foregoing. Notice that the prior within the model, P(H_\theta|M_i), plays a key role in the computation of the marginal likelihood. Many approaches to model evaluation rely on the marginal likelihood of the model, including the BIC, the DIC, and MDL-based approaches. All these approaches must use some notion of a prior.

Now recall how sine curves manage to fit any scatter plot almost perfectly, in particular that there are infinitely many such curves. With the addition of a new point this set of best fitting curves will lose a large number of members, and this will severely impact the marginal likelihood of the sine model. Following Equation (3), we see that the likelihoods of sine curves that retain their fit will be multiplied by a close to maximal probability for every new data point. But this only holds for a small fraction 4\epsilon/(\pi H) of the sine curves. The fraction 1 − 4\epsilon/(\pi H) of sine curves will be multiplied by a factor that falls far short of the maximum probability. By comparison, the likelihoods of the well fitting polynomial curves will pick up a factor that is somewhat lower than the maximal probability for each point, though not falling very far short of the maximum. Importantly, this high but not maximal factor will apply to a set of curves that is more or less stable and that will accumulate more and more probability with the addition of data points. Consequently, the overall factor picked up by the marginal likelihood of the polynomial models will tend to this high but not maximal factor.
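Before putting this in a more precise form, here is a small numerical sketch of how the prior within each model enters the comparison of Equation (10): the marginal likelihood is estimated as the prior-weighted average of the likelihoods of the hypotheses in the model. The priors, the toy data, and the Monte Carlo settings are all assumptions for illustration; with a diffuse prior over \alpha_2, only a tiny fraction of the sine hypotheses fits all the points at once, so the sine model's estimate will typically come out much lower.

```python
# Sketch of Bayesian model evaluation via Equation (10): the marginal likelihood
# of a model is the prior-weighted average of the likelihoods of its hypotheses.
# Priors, toy data, and Monte Carlo settings are illustrative assumptions only.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
L, m = 10.0, 30
d = rng.uniform(0.0, L, size=m)
n = rng.poisson(np.clip(2.0 * d - 0.08 * d**2, 1e-9, None))      # data from a quadratic mean

def log_marginal_likelihood(mean_fn, prior_sampler, draws=20_000):
    """Monte Carlo estimate of log P(S|M) as the log of the prior average of P_theta(S)."""
    log_liks = np.empty(draws)
    for j in range(draws):
        lam = np.clip(mean_fn(d, prior_sampler()), 1e-9, None)   # Poisson mean must be positive
        log_liks[j] = poisson.logpmf(n, lam).sum()
    a = log_liks.max()                                           # log-sum-exp for stability
    return a + np.log(np.mean(np.exp(log_liks - a)))

# Quadratic model: lambda(D) = theta1*D + theta2*D^2, with an assumed prior over (theta1, theta2).
def quad(D, theta):
    return theta[0] * D + theta[1] * D**2

def quad_prior():
    return rng.uniform(0.0, 5.0), rng.uniform(-0.5, 0.0)

# Sine model: lambda(D) = alpha1 - alpha1*cos(alpha2*D), with an assumed prior over (alpha1, alpha2).
def sine(D, alpha):
    return alpha[0] - alpha[0] * np.cos(alpha[1] * D)

def sine_prior():
    return rng.uniform(0.0, 20.0), rng.uniform(0.0, 200.0)

print("log P(S|M_quad) ~", log_marginal_likelihood(quad, quad_prior))
print("log P(S|M_sine) ~", log_marginal_likelihood(sine, sine_prior))
```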
To put this in a more mathematical format, say that the average factor picked up by the likelihood of a sine curve outside of the set of best fitting curves is U, that the same factor applies to badly fitting polynomial curves, that the factor for a well fitting polynomial is V, and for a best fitting sine curve W, so that U < V < W. For the sine curves we obtain

P(\langle d_{m+1}, n_{m+1} \rangle \mid M_{Sine} \cap S_{dn}) \approx \left(1 - \frac{4\epsilon}{\pi H}\right) U + \frac{4\epsilon}{\pi H} W,    (11)

which is arbitrarily close to U. For the polynomial curves we will have

P(\langle d_{m+1}, n_{m+1} \rangle \mid M_{Poly} \cap S_{dn}) \approx (1 - R_m) U + R_m V,    (12)

in which R_m tends to 1 for increasing m, so that the factor tends to V. The result is that the sine model performs less well than the polynomial model on the BME criterion. On BME, therefore, the inherent complexity of the sine curves is adequately factored in.

It will be insightful to return to the observation that the set of well fitting sine curves is skittish. Well fitting polynomial curves of a given degree are concentrated in a particular region within the model, in which posterior probability will accumulate when data size increases: all of them will respond to new data points in roughly the same way. By contrast, with every new data point a small fraction of the well fitting sine curves is multiplied by a high likelihood, while a large portion picks up a low factor. It expresses the skittishness of sine curves that such a large portion of curves is suddenly far off in its predictions.

We can also turn this reasoning around to arrive at the observation about model size. Judged from the prior probability distribution within the sine model, the set of well fitting curves is very small indeed: after m points it has decreased to (4\epsilon/\pi H)^m. But considering the prior within the polynomial model, the set of well fitting curves retains a reasonable size. What this signals is that the sine model, although it has only two free parameters, has many more different statistical hypotheses packed into it. It is versatile at the cost of a particular kind of robustness. The use of a prior within the model enables us to bring this kind of robustness out.

6 Conclusion

We cannot turn the foregoing into an argument for BME: other model evaluation criteria may also have a response to the problem at stake. But there are several general lessons to take away. One is that we must never mistake the number of parameters in a model for its actual complexity. A related lesson is that we must not forget the deeper motivations for the model evaluation tools, i.e., the good-making features that the tools are based on. Concentrating on those features will guide us to a better understanding of our evaluations.

Another general lesson ties in with earlier work on the role of size in model evaluation (Romeijn et al., 2012), and indeed with scientific methodology as a whole. In the solution of the problem with the sine model, we can recognize a Popperian theme. Models that allow for fewer possible data patterns are preferable to those that allow for a very wide range of data patterns. To express some notion of model size in our evaluations, we have to adopt some measure over the space of distributions over data. So we must involve something akin to a prior. There is, however, a problem with the idea that we can objectively determine how densely distributions are packed together in a model.
To say that a set of distributions shows a wide variety in the data patterns that it can adapt to, we need to presuppose a notion of similarity among data patterns or, more generally speaking, a metric over the sample space. In this paper that metric was adopted implicitly, as part of the way in which we depict and conceptualize the data. This dependence on the metric of the sample space points to a potential subjectivity in adjudicating between statistical models, or at least a reliance on a natural conceptualization of the sample space. This idea deserves to be studied in its own right.

Acknowledgements

The author wishes to thank Elliott Sober and Tom Sterkenburg, as well as audiences in Santiago de Compostela, Padova, Helsinki, Gent, and Groningen.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (eds.), 2nd International Symposium on Information Theory, Akademiai Kiado, Budapest, pp. 267–281.

Balasubramanian, V. (2005). MDL, Bayesian inference, and the geometry of the space of probability distributions. In P. J. Grunwald et al. (eds.), Advances in Minimum Description Length: Theory and Applications, MIT Press, Cambridge (MA), pp. 81–99.

Bozdogan, H. (1987). Model selection and Akaike's Information Criterion. Psychometrika 52(3), pp. 345–370.

Burnham, K. P. and D. R. Anderson (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer.

Claeskens, G. and N. L. Hjort (2008). Model Selection and Model Averaging. Cambridge: Cambridge University Press.

Grunwald, P. (2007). The Minimum Description Length Principle. MIT Press, Cambridge (MA).

Ly, A., J. Verhagen, R. Grasman, and E.-J. Wagenmakers (20XX). A Tutorial on Fisher Information. Unpublished manuscript.

Myung, J. et al. (2000). Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences 97(21), pp. 11170–11175.

Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology 25, pp. 111–163.

Romeijn, J. W. and R. van de Schoot (2008). A philosophical analysis of Bayesian model selection. In H. Hoijtink, I. Klugkist, and P. A. Boelen (eds.), Bayesian Evaluation of Informative Hypotheses, Springer, New York.

Romeijn, J. W., R. van de Schoot, and H. Hoijtink (2012). One size does not fit all: derivation of a prior-adapted BIC. In D. Dieks, W. Gonzales, S. Hartmann, F. Stadler, T. Uebel, and M. Weber (eds.), Probabilities, Laws, and Structures. Berlin: Springer.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, pp. 461–464.

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B 64, pp. 583–639.