Simplicity and Model Selection

Guillaume Rochefort-Maranda

October 28, 2015

Forthcoming in the European Journal for Philosophy of Science. The official version is slightly different.

Contents

1 Introduction
2 Selecting a Model
  2.1 Constructing the Data Set
  2.2 Fitting a Polynomial Regression
  2.3 Fitting a Kernel Regression
3 Five Concepts of Simplicity
  3.1 Parametric Simplicity
  3.2 Theoretical Simplicity VS Theory-ladenness
  3.3 Computational Simplicity
  3.4 Epistemic Simplicity
  3.5 Dimensional Simplicity
4 Conclusion

1 Introduction

In this paper I compare parametric and nonparametric regression models with the help of a simulated data set. Doing so, I have two main objectives. The first is to differentiate five concepts of simplicity and assess their respective importance. The second is to show that the scope of the existing philosophical literature on simplicity and model selection is too narrow because it does not take the nonparametric approach into account (Sober 2002; Forster and Sober 1994; Forster 2001, 2007; Hitchcock and Sober 2004; Mikkelson 2006; Baker 2013).

More precisely, I point out that a measure of simplicity in terms of the number of adjustable parameters is inadequate to characterise nonparametric models and to compare them with parametric models. This allows me to weed out false claims about what makes one model simpler than another.

Furthermore, I show that the importance of simplicity in model selection cannot be captured by the notion of parametric simplicity. 'Simplicity' is an umbrella term. While parametric simplicity can be ignored, there are other notions of simplicity that need to be taken into consideration when we choose a model. Such notions are not discussed in the previously mentioned literature. The latter therefore paints an incomplete picture of why simplicity matters when we choose a model. Overall I support a pluralist view according to which we cannot give a general and interesting (epistemic or pragmatic) justification for the importance of simplicity in science.

This paper contains two main sections. In the first section, I construct a data set and explain how we can choose a regression model with an additive error term and a linear smoother by using a parametric approach (polynomial regression) and a nonparametric approach (kernel regression). This allows me to discuss five different concepts of simplicity in the second section. The R code needed to recreate the results of the analyses is included in the annex.

2 Selecting a Model

2.1 Constructing the Data Set

In order to construct a data set, I first define the following function f(x):

f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-6)^2}{2}} + \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-4)^2}{2}} + \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-8)^2}{2}}

Then I simulate 200 observations of f(x), with the x values uniformly distributed on the interval [0, 12]. To each observation I add noise that follows a normal distribution with \mu = 0 and \sigma = 0.2. The resulting pairs (x, y) can be visualised in Figure 1. The small blue dots represent the observations and the red line represents f(x).

[Figure 1: The Data Set]
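For concreteness, the simulation can be reproduced in a few lines of base R; this is a condensed version of the code given in the annex (same seed, 136):

# The true regression function: a mixture of three Gaussian bumps
f <- function(x) dnorm(x, 6, 1) + dnorm(x, 4, 1) + dnorm(x, 8, 1)

set.seed(136)                   # same seed as in the annex
x <- runif(200, 0, 12)          # 200 uniformly distributed x values
y <- f(x) + rnorm(200, 0, 0.2)  # add N(0, 0.2^2) noise
d <- data.frame(x, y)

# Visualise the data (blue dots) and the true function (red line)
plot(x, y, col = "dark blue", pch = 19)
curve(f, from = 0, to = 12, col = "red", lwd = 3, add = TRUE)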
Those 200 observations now constitute a data set, and the scientific problem that I will solve is to estimate f(x) with that data set. I shall consider that f(x) is unknown and make the following three assumptions:

• The distribution of the error (e_i) follows a normal distribution centred on 0 with an unknown variance.

• The error (e_i) is additive.

• The errors are uncorrelated.

In other words, I will assume the following:

y_i = f(x_i) + e_i, \quad e_i \overset{i.i.d.}{\sim} N(0, \sigma^2)

It is possible to confirm those assumptions, but I will take them for granted. Validating them would not serve my purpose, which is to compare a parametric and a nonparametric approach. Given the way in which I constructed the data, the assumptions would be confirmed anyway.

Now, there are many different options to choose from in order to estimate f(x). Here I will attempt to fit a polynomial and a kernel regression model. Both regressions are similar in the sense that their respective estimates \hat{f}(x) of the function f(x), evaluated on the observed data, can be defined with a linear operator S (a linear smoother) that does not depend on y:

\hat{f}(x) = Sy

However, the two regressions differ in the sense that a polynomial regression is a parametric model whereas a kernel regression is a nonparametric model. This distinction and its implications will become clearer below. For the purpose of this paper, it is sufficient to say that a parametric model yields an estimate \hat{f}(x) such that we only have to know its parameters in order to compute it for any given x. A nonparametric model, on the other hand, provides an estimate \hat{f}(x) such that we always need to know the observations in our data set in order to compute \hat{f}(x) for any given x.

Before we move on with the estimation of f(x), it is also worth mentioning that I did not construct that function naively. f(x) has some properties that will highlight an important difference between the parametric and the nonparametric approach. It will help me to illustrate a way in which simplicity can lead to a better approximation of the truth.

2.2 Fitting a Polynomial Regression

To fit a polynomial regression, we must first assume that f(x) has the following form:

f(x) = \sum_{k=0}^{p} \beta_k x^k

Secondly, we need to estimate the parameters \beta_k. We can do so by solving the following equation, which determines the parameters that minimise the squared difference between the observed y and f(x):

\hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{200} (y_i - f(x_i))^2

This yields an appropriate maximum likelihood estimate of f(x) (under the assumptions made in section 2.1, least squares estimates and maximum likelihood estimates are the same):

\hat{f}(x) = \sum_{k=0}^{p} \hat{\beta}_k x^k

Finally, we need to figure out the number of parameters p that will determine the best estimate of f(x) out of all the possible polynomial regression models that can fit the data set.

To understand the nature of this challenge, let us compare two different models. Figure 2 represents an estimate of f(x) provided by a model with 4 adjustable parameters. Figure 3, on the other hand, represents an estimate provided by a model with 11 adjustable parameters. In both figures the orange dashed line represents the estimate of the polynomial regression; the red line, f(x); and the small blue dots, the data.

[Figure 2: Polynomial Regression, Adjustable Parameters = 4]

[Figure 3: Polynomial Regression, Adjustable Parameters = 11]

The question is how to determine the best model of the two. One intuitive criterion would be to compare the mean squared error of each model computed with all the observations (x, y) in our data set. This quantity is called the training mean squared error (MSE_train):

MSE_{train} = \frac{1}{200} \sum_{i=1}^{200} (y_i - \hat{f}(x_i))^2
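These two models correspond to polynomials of degree 3 and degree 10. A minimal sketch of how their training errors can be computed (using the data frame d from section 2.1; the annex fits the same models with explicit I(x^k) terms):

reg3  <- lm(y ~ poly(x, 3,  raw = TRUE), data = d)   # 4 adjustable parameters
reg10 <- lm(y ~ poly(x, 10, raw = TRUE), data = d)   # 11 adjustable parameters

# Training MSE: average squared residual over the 200 observations
mse_train <- function(model) mean(residuals(model)^2)
mse_train(reg3)    # roughly 0.053, as reported below
mse_train(reg10)   # roughly 0.041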
Accordingly, one might conclude that the second model is better than the first because the MSE_train of the second model is smaller (0.0406 < 0.0532). But this criterion would be inadequate. Obviously, we are not interested in a model that can only fit the data at hand. What we really want is a model that fits observations that were not used to construct \hat{f}(x): (x_{new}, y_{new}). In other words, we would like to choose the model that has the smallest MSE_test:

MSE_{test} = \frac{1}{n} \sum_{i=1}^{n} (y_{i(new)} - \hat{f}(x_{i(new)}))^2

Unfortunately, the model that has the smallest MSE_train is not necessarily the one that has the smallest MSE_test. For example, when we are trying to fit a polynomial regression model, we can decrease MSE_train and increase MSE_test by adding too many adjustable parameters to our model (i.e., parameters whose values are not fixed before we fit the model to the data). When this happens, we say that our model is overfitting the data.

A more judicious choice would be to compute MSE_test directly with an independent data set. But in practice, we do not always have the luxury of an independent data set that we are willing to leave out of the construction of our model. A more common approach is to choose the model that minimises a combination of MSE_train and a penalty for the complexity of the model, where the complexity is measured by the number of adjustable parameters k. The goal is to choose a model that does not overfit the data.

The Akaike Information Criterion (AIC) is one of the many criteria that implement that idea. For this analysis (under the assumptions made in section 2.1), the AIC can be expressed as follows:

AIC = 200 \log\left(\frac{1}{200} \sum_{i=1}^{200} (y_i - \hat{f}(x_i))^2\right) + 2k

We will want to choose the model with the smallest AIC.

Another option is to estimate MSE_test by cross-validation (CV). One of the many ways to do cross-validation is to remove one observation from our data set, construct a model, and then compute the square of the difference between that observation and our prediction of it. If we repeat this procedure for every observation in our data set and average the results, we obtain a value that can guide our choice of model: the smaller the CV, the better. Here is how we can express CV, where \hat{f}_{(-i)} is the estimate of f obtained by omitting the pair (x_i, y_i):

CV = \frac{1}{200} \sum_{i=1}^{200} (y_i - \hat{f}_{(-i)}(x_i))^2

If we use both criteria to make our choice, we find that the second model has a smaller AIC (-618.6998 < -578.8056) and a smaller CV (0.04445178 < 0.05536721). In fact, further exploration indicates that the second model is the best polynomial model according to both criteria.
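That further exploration can be reproduced with a short loop over candidate degrees. The sketch below is my own addition (it is not in the annex); it applies the AIC expression given above with k = p + 1 parameters, and note that raw high-degree polynomials can become numerically unstable:

# Search over polynomial degrees p = 1, ..., 12 using the AIC defined above
aic_by_degree <- sapply(1:12, function(p) {
  fit <- lm(y ~ poly(x, p, raw = TRUE), data = d)
  200 * log(mean(residuals(fit)^2)) + 2 * (p + 1)   # k = p + 1 (intercept + slopes)
})
which.min(aic_by_degree)   # the text reports that degree 10 (11 parameters) wins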
2.3 Fitting a Kernel Regression

Now that we have found our best polynomial model (given CV and AIC), let us try to find the best kernel regression model. As we will see, this task is significantly different. When we constructed the polynomial regression model we used what is called a 'top-down' approach. We determined a priori the form of our estimate of f(x) and then tried to find the values of its adjustable parameters that best fit the data set. In other words, our estimate of f(x) was limited to the family of polynomial functions.

On the other hand, when we wish to fit a kernel regression model, we do not make such strong a priori restrictions on the form of f(x). Instead, we construct an estimate of f(x) under the assumption that nearby x values must have similar y values. This approach is said to be 'bottom-up' because the estimate depends more heavily on the observations that we have made. To be more precise, for any given x_0, a kernel regression provides a weighted mean of all the observed y values that lie within a certain range h of x_0. Its expression can be written as follows, where K is some unspecified kernel:

\hat{f}(x_0) = \sum_{i=1}^{200} \frac{K\left(\frac{x_0 - x_i}{h}\right)}{\sum_{j=1}^{200} K\left(\frac{x_0 - x_j}{h}\right)} \, y_i

A kernel is a function that determines the weight of the nearby observations. In this paper, I will use an Epanechnikov kernel. It is defined as follows:

K(u) = \begin{cases} \frac{3}{4}(1 - u^2), & \text{if } |u| \leq 1 \\ 0, & \text{otherwise} \end{cases}

The challenge here is to find the appropriate value for h. If h is too small, our estimate of f(x) will overfit the data. But if h is too large, our estimate will tend towards a horizontal line and the fit with the data set that we have will be awful. Just as in the parametric context, we will not be able to rely on MSE_train to choose our model. But we will be able to rely on the AIC and CV.

If we use CV, we find that the best estimate is obtained with h = 1.223. The CV score associated with that h is 0.04400635. We can visualise the resulting estimate in Figure 4. As before, the orange dashed line represents the estimate of the kernel regression; the red line, f(x); and the small blue dots, the data.

[Figure 4: Epanechnikov Kernel Regression (CV)]

However, the application of the AIC criterion is not as straightforward in this case. As we can see, the only adjustable parameter here is h. It is the only expression in the equation of our model that is not fixed before we attempt to fit a kernel regression model to the data (the kernel has been determined a priori). Hence the number of adjustable parameters is useless as a measure of complexity. To carry on, we need a more general definition of a parameter in order to use the AIC for our kernel regression. We have to determine what is called "the effective number of parameters" (Friedman et al. 2001, p.232).

As mentioned in section 2.1, both the polynomial and the kernel regression estimates, evaluated on the observed data (x, y), can be defined with a linear operator S that does not depend on y:

\hat{f}(x) = Sy

S is an interesting matrix because each element of its diagonal tells us how much weight is given to the observed y_i in order to compute the fitted value \hat{y}_i. This means that if we compute the trace of S (the sum of all the elements on the diagonal of S), we obtain a useful measure of the complexity of our model. Indeed, the regression line is likely to be more convoluted as we give more weight to each y_i to compute each \hat{y}_i. In fact, the trace of S defines the effective number of parameters (Hurvich et al. 1998). It is used to generalise the AIC, since it is also equal to the number of adjustable parameters in the parametric context. The appropriate definition of the AIC can thus be expressed as follows, where tr(S) is the trace of the matrix S:

AIC = 200 \log\left(\frac{1}{200} \sum_{i=1}^{200} (y_i - \hat{f}(x_i))^2\right) + 2\,tr(S)
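Concretely, the smoother matrix for the Epanechnikov kernel and its trace can be computed in a few lines. This condenses the corresponding loops from the annex (which uses tr() from the psych package), shown here at the example bandwidth h = 1.223:

library(psych)   # provides tr(), the matrix trace

epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

h <- 1.223
U <- outer(x, x, function(a, b) (a - b) / h)   # (x_i - x_j) / h for every pair
W <- epanechnikov(U)
S <- W / rowSums(W)      # row i holds the weights that produce yhat_i
yhat <- S %*% y          # a linear smoother: yhat = S y

200 * log(mean((y - yhat)^2)) + 2 * tr(S)   # generalised AIC at this bandwidth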
If we apply this criterion to our data set in order to choose a kernel regression model, we find that h = 1.222 minimises the AIC, with a value of -624.3763 and a number of effective parameters equal to 7.643. (Notice that the number of effective parameters is not necessarily an integer!) Its graph is very similar to the one presented in Figure 4.

If we compare our best parametric model with our best nonparametric models, we conclude that we can obtain a better CV and a better AIC with the nonparametric approach. Here are the results:

CV: 0.04400635 (best nonparametric) < 0.04445178 (best parametric)
AIC: -624.3763 (best nonparametric) < -618.6998 (best parametric)

3 Five Concepts of Simplicity

Given our data set and the choice criteria that we have defined, the upshot of the previous analysis is that we will choose a kernel (nonparametric) regression model over a polynomial (parametric) one. The choice of one particular kernel regression estimate, however, is underdetermined, since the two choice criteria that we used do not converge on the same bandwidth.

In this section, I rely on that analysis to discuss the importance of simplicity in model selection. I will define five different concepts of simplicity. In doing so, I want to bring some important nuances to the existing literature on this topic.

My first objective is to correct a mistake that we often find in the philosophical literature about what makes one model simpler than another. My second objective is to show that the importance we give to a particular notion of simplicity depends on the goal that we pursue when we select a model. Therefore, when we wish to explain why simplicity matters in science, we have no choice but to take more than one definition of simplicity into account. In other words, I wish to support a view according to which different goals justify the importance of different notions of simplicity. This is what I call a pluralist view of simplicity. This kind of work is different from that of other philosophers, such as Kevin Kelly, who wish to explain why simplicity is important when our goal is to find the truth; see (Kelly 2007b) for example. By looking at other goals, we get a better understanding of the scientific practice of model selection.

We will see that the importance of a particular concept of simplicity depends on whether we are interested in a good predictive model; a model that can be constructed under computational or time constraints; an interpretable model; or in the validity of certain kinds of models. The interesting point is that we cannot always achieve all of these goals without making compromises. I will make this clearer in the following sections.

3.1 Parametric Simplicity

Looking back at section 2, we see that simplicity played a crucial role when we used the AIC to select our models. The AIC is one of many criteria, like the Bayesian Information Criterion (BIC) and the Minimum Description Length criterion (MDL), that rely on the idea that our model should maximise its fit to the training data while being penalised for its complexity. The justification for these criteria is that we want to avoid models that overfit the data, i.e., we wish to avoid choosing models for which the MSE_test is larger than the MSE_train. This is essential for obtaining a good predictive model.
For this particular reason, philosophers of science have been quick to underscore the importance of parametric simplicity in model selection:

"Model selection involves a trade-off between simplicity and fit for reasons that are now fairly well understood" (Forster 2001, p.83).

"Simplicity matters. A sufficiently simple hypothesis, formulated on the basis of a given body of data, will not drastically overfit the data. It does not contain too many parameters whose values have been set according to the data. Thus, a simple hypothesis that successfully accommodates a given body of data can be expected to make more accurate predictions about new data than a more complex theory that fits the data equally well" (Hitchcock and Sober 2004, p.22).

"Perhaps the most interesting of the standard arguments in favor of simplicity is based upon the concept of 'overfitting'. The idea is that predicting the future by means of an equation with too many free parameters compared to the size of the sample is more likely to produce a prediction far from the true value" (Kelly 2007a, p.113).

The scientific relevance of simplicity has long been a matter of debate in philosophical circles. It is therefore easy to understand the appeal of a mathematically rigorous justification for the scientific relevance of parametric simplicity in model selection. It is no surprise that parametric simplicity has been the focus of several important articles written by philosophers such as Elliott Sober, Christopher Hitchcock, and Malcolm Forster. However, the neglect of nonparametric models often results in false claims:

"In statistics, one theory, hypothesis or model is simpler than another if it has fewer adjustable parameters" (Mikkelson 2006, p.441).

"[T]here is general agreement among those working in this area that simplicity is to be cashed out in terms of the number of free (or adjustable) parameters of competing hypotheses" (Baker 2013).

"Interestingly, all three methods already mentioned, the MDL criterion, BIC and AIC, define simplicity in exactly the same way -- as the paucity of adjustable parameters, or more exactly, the dimension of a family of functions" [emphasis added on 'define simplicity'] (Forster 2001, p.90).

As we now know, a nonparametric model, like a kernel regression, can be too complex according to a criterion like the AIC while having only one adjustable parameter. Therefore, it is false to claim that a model with fewer adjustable parameters is simpler, and it is false to claim that a criterion like the AIC defines simplicity in terms of the number of adjustable parameters. I believe that this kind of mistake is symptomatic of a lack of understanding of how parametric complexity can cause overfitting.

In the specific cases discussed in section 2, the number of parameters (effective parameters) is actually a measure of the weight given by a model to the observed y_i in order to compute their corresponding fitted values \hat{y}_i. This is what explains the link between parametric simplicity and overfitting. The more weight is given to y_i in order to compute its fitted value, the more our model will fit the data and thus model the irreducible error.

But more importantly, there is much more to simplicity than a property that allows us to avoid overfitting. In fact, a good criterion for avoiding overfitted models does not even need to take parametric simplicity into account. As we have seen, we can estimate MSE_test with CV and completely eliminate the need to rely on parametric simplicity (see also Forster 2007; Hitchcock and Sober 2004).
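This is worth seeing in code: for a linear smoother, leave-one-out CV can be computed directly from the smoother matrix, with no appeal to any parameter count. The annex uses the standard diagonal shortcut; reusing S and yhat from the sketch in section 2.3:

# Leave-one-out CV via the diagonal shortcut for linear smoothers
mean(((y - yhat) / (1 - diag(S)))^2)   # about 0.044 at h = 1.223 (cf. section 2.3)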
In what follows, I will complete the picture. (I am not suggesting that the previously mentioned philosophers are unaware that the picture is incomplete and that more work needs to be done; my intention is to bring the debates forward.) By comparing parametric with nonparametric models, we can identify at least four other concepts of simplicity: theoretical, computational, epistemic, and dimensional. They are all important facets of simplicity that are not discussed in the literature mentioned in the introduction. They only become apparent when we compare parametric models with their nonparametric counterparts.

3.2 Theoretical Simplicity VS Theory-ladenness

Going back to section 2.2, we can see that I made a substantial assumption about the form of f(x) in order to estimate it with a polynomial model. The quality of the estimate depended heavily on this assumption (which is why I defined f(x) the way I did). If we look at Figures 2 and 3 and compare the red and the orange dashed lines, we can see that a polynomial estimate will always fail to model the tails of f(x). In other words, a false a priori assumption about the form of f(x) can impose a limit on the quality of the estimate. This is why theory-laden approaches can be problematic.

On the other hand, we made no such a priori assumptions when we fitted a kernel regression model. We can immediately see how this paid off by looking at Figure 4: the estimate provided by the kernel regression is closer to the true function. Theoretical simplicity therefore seems to be of the utmost importance in this case.

But let us remember that we are not supposed to know the true function f(x). Thus we are not supposed to see that a polynomial regression will fail to model the tails of f(x), nor that the kernel regression estimate is closer to the true function. What we do know, however, is that we obtained the best CV score with a kernel regression. This gives us evidence that the MSE_test is lower for the kernel regression than it is for the polynomial regression. Thus we can now appreciate the importance of theoretical simplicity. Theoretically simpler models can have the best MSE_test; in other words, they can provide us with better predictive models.

3.3 Computational Simplicity

On the other hand, one of the drawbacks of nonparametric approaches is that they are computationally intensive. In the case of a kernel regression, for example, the computer must find the neighbouring observations for each x, compute a weighted mean, and then construct \hat{f}(x) point by point. In that respect, computational simplicity is a pragmatic virtue that the parametric approach has over the nonparametric one.

Generally speaking, if our goal is to provide an estimate of f(x) under time or computational restrictions, then computational simplicity will be an important virtue. Under such restrictions, we might also have to compromise on the idea of finding the best predictive model. But with the increasing power of our computers this is becoming less of an issue. This is why nonparametric approaches are now genuine alternatives to their traditional parametric counterparts. (It is also time for philosophers of science to 'catch the train'.)
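To make the contrast concrete, here is a small illustration of my own, reusing objects from the earlier sketches (reg10 from section 2.2 and the epanechnikov function from section 2.3): the polynomial prediction needs only its 11 estimated coefficients, while every kernel prediction revisits all 200 observations.

x_new <- seq(0, 12, length.out = 1000)

# Parametric: evaluate the fitted degree-10 polynomial from its coefficients
pred_poly <- predict(reg10, newdata = data.frame(x = x_new))

# Nonparametric: each prediction is a weighted mean over the whole data set
kernel_predict <- function(x0, h = 1.223) {
  w <- epanechnikov((x0 - x) / h)
  sum(w * y) / sum(w)
}
pred_kernel <- sapply(x_new, kernel_predict)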
In comparison with parametric regressions, such as linear or polynomial regressions, nonparametric regressions ”can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response” (James et al. 19 2013, p.25). To see this, let us assume that a dependent variable y can be expressed in function of x plus an additive error term. As before, let us assume that the errors are uncorrelated and follow a centred normal distribution. Now consider the following two 2 estimates of the function: bf (x) = 4 + 5x (1) bf (x0) = 200  i=1 K( x0�xi1.223 ) Â200i=1 K( x0�xi 1.223 ) yi (2) Just by looking at the parametric equation (1), we can easily obtain a wide variety of information about the relation between x and y. With practically no effort, we can interpret the parameters of our model. We can tell that the dependent variable y will increase by 5 units on average when x increases by 1 unit. We can also tell that 4 is the average value of y when x = 0. In other words, it is easy to understand how x is related to y. In contrast, things are not as simple with the nonparametric equation (2). The relation between x and y is much more obscure. For instance, in order to find the roots (if there is any) of equation (2), we would need to find the x0 (there maybe more than 1 or there maybe be none) such that close by x values in our data set have y values such that their weighted mean is equal to 0. Unless we have a computer, this problem is nowhere as simple as finding the intercept of equation (1). In other words, the parametric model given by equation (1) is epistem- ically simpler because it is easier to understand the relationship between x and y. This is how I define epistemic simplicity. This virtue is especially important if we want a model that allows us to understand the relation- 20 ship between our variables. The fact of the matter is that there are research contexts where we do not necessarily wish to make predictions with a model, but where we want to know how an independent variable is related with the dependent vari- ables. For instance, a scientist might be interested in knowing if maternal depression is positively related (at to what extend) with a child’s learning difficulties in school. In that context, it is important to be able to interpret the resulting estimate of the function between the two variables. We could therefore have to choose a parametric model over a nonparametric one even if the latter makes more accurate predictions and is parametrically simpler. In fact, there is often a compromise to make if we prefer interpretable over predictive models or vice versa. Depending on our goal (understand- ing or predicting) we might value parametric simplicity and epistemic simplicity differently: when inference is the goal, there are clear advantages to us- ing simple [...] statistical learning methods. In some settings, however, we are only interested in prediction, and the inter- pretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it pre- dict accurately -interpretability is not a concern (James et al. 2013, p.25). Of course, if our nonparametric regression model only has one inde- pendent variable x for one dependent variable y, then we can easily plot 21 that model in 2 dimensions in order to visualise and interpret it more eas- ily. This is what I did in section 2. 
In other words, the parametric model given by equation (1) is epistemically simpler because it is easier to understand the relationship between x and y. This is how I define epistemic simplicity. This virtue is especially important if we want a model that allows us to understand the relationship between our variables.

The fact of the matter is that there are research contexts where we do not necessarily wish to make predictions with a model, but where we want to know how an independent variable is related to the dependent variable. For instance, a scientist might be interested in knowing whether maternal depression is positively related (and to what extent) to a child's learning difficulties in school. In that context, it is important to be able to interpret the resulting estimate of the function between the two variables. We could therefore have to choose a parametric model over a nonparametric one even if the latter makes more accurate predictions and is parametrically simpler.

In fact, there is often a compromise to make between interpretable and predictive models. Depending on our goal (understanding or predicting), we might value parametric simplicity and epistemic simplicity differently:

"when inference is the goal, there are clear advantages to using simple [...] statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately - interpretability is not a concern" (James et al. 2013, p.25).

Of course, if our nonparametric regression model only has one independent variable x for one dependent variable y, then we can easily plot that model in two dimensions in order to visualise and interpret it more easily. This is what I did in section 2. But this is not a solution when the number of dimensions is high. This brings me to one last notion of simplicity that is at play in model selection.

3.5 Dimensional Simplicity

When we fit a regression model of any kind, we must be wary of dimensionality. Dimensional simplicity is important for the same reason as parametric simplicity (they are often the same). For instance, we can severely overfit a regression model by adding independent variables. But again, there is more to dimensional simplicity than a tool for avoiding overfitting.

Nonparametric models, more specifically, are plagued by what is known as 'the curse of dimensionality' (James et al. 2013, p.108). Recall that the epistemic foundation of a nonparametric model, like the one I presented in section 2, is the belief that nearby x values will have similar y values. In that example we had 200 observations taken uniformly along one independent variable x. The distance between them was small enough to warrant that belief. However, if we were to spread 60 observations uniformly over a space determined by 50 independent variables, the distance between the observations would be so great that this fundamental belief would be very questionable. Generally speaking, the number of observations that we need in order to keep the same quality of estimation grows exponentially with the number of dimensions.

In contexts where further observations are difficult to obtain, this means that it will be useful to implement various techniques, such as principal component analysis, in order to reduce the dimension of our data set. (Principal component analysis (PCA) is a technique that can 'summarise' the variance of a data set in a lower dimension.) In other words, dimensional simplification can be very important when we construct and choose a model. Not only does it allow us to avoid overfitting models, but it is essential to maintain the validity of a nonparametric model.
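As a hypothetical illustration (the numbers 60 and 50 echo the example above; the data here are random), base R's prcomp can compress such a design matrix into a handful of components before any nonparametric fit is attempted:

set.seed(1)
X <- matrix(rnorm(60 * 50), nrow = 60)    # 60 observations on 50 predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)$importance[3, 1:5]           # cumulative variance explained by 5 PCs
Z <- pca$x[, 1:5]                         # low-dimensional scores used as new predictors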
This conclusion seems to add weight to the following remarks by Sober:

"The legitimacy of parsimony stands or falls, in a particular research context, on subject matter specific (and a posteriori) considerations" (Sober 1994, p.141).

"I have argued in earlier publications that invocations of parsimony in science often should be viewed as expressions of subject-matter-specific background theories; it follows that different invocations in different scientific problems may rest on very different foundations. Thus conceived, the way to understand the use of parsimony in a given scientific domain is to uncover the background theory in play" (Sober 2009, p.238).

Against the theoretical background of a kernel regression, dimensional simplicity is particularly relevant. Unless we are in a context where the distance between our observations is too great to adequately fit a model like a kernel regression (an a posteriori consideration), we might not care as much about dimensional simplicity.

4 Conclusion

In sum, I have compared a parametric and a nonparametric approach to regression in order to differentiate five important notions of simplicity at play in model selection. I have thereby given a more complete account of the importance of simplicity in model selection than the one given in the current philosophical literature. The latter neglects the nonparametric approach and therefore has an unjustifiably narrow focus on the number of adjustable parameters as a measure of simplicity. Here are four take-away conclusions:

• The number of adjustable parameters is an inadequate measure of complexity for nonparametric models, like kernel regression models.

• The concept of an effective parameter is more appropriate for measuring simplicity when we are dealing with the family of linear-smoother regressions.

• Besides parametric simplicity, there are at least four other important concepts of simplicity in model selection: theoretical, computational, epistemic, and dimensional.

• This variety of concepts makes it impossible to give a general and interesting (epistemic or pragmatic) justification for the importance of simplicity in model selection. Different goals justify the importance of different notions of simplicity. This is what I call a pluralist view of simplicity.

Throughout this paper, I chose to discuss model selection within a frequentist framework. This approach is an important part of current scientific practice; in order to understand the latter, we need to understand the former. By making this choice, I do not mean to imply that other frameworks, such as the Bayesian framework, are less important or less justified. In fact, it would be interesting to compare the frequentist and the Bayesian approaches to model selection while taking nonparametric approaches into consideration. This is a topic for future work.

References

Baker, A. (2013). Simplicity. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Fall 2013 ed.).

Forster, M. and E. Sober (1994). How to Tell When Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions. The British Journal for the Philosophy of Science 45(1), 1–35.

Forster, M. R. (2001). The New Science of Simplicity. In Simplicity, Inference and Modelling: Keeping it Sophisticatedly Simple, pp. 83–119. Cambridge University Press.

Forster, M. R. (2007). A Philosopher's Guide to Empirical Success. Philosophy of Science 74(5), 588–600.

Friedman, J., T. Hastie, and R. Tibshirani (2001). The Elements of Statistical Learning, Volume 1. Springer Series in Statistics. Springer, Berlin.

Friend, M., N. B. Goethe, and V. S. Harizanov (2007). Induction, Algorithmic Learning Theory, and Philosophy, Volume 9. Springer Science & Business Media.

Hitchcock, C. and E. Sober (2004). Prediction Versus Accommodation and the Risk of Overfitting. The British Journal for the Philosophy of Science 55(1), 1–34.

Hurvich, C. M., J. S. Simonoff, and C.-L. Tsai (1998). Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60(2), 271–293.

James, G., D. Witten, T. Hastie, and R. Tibshirani (2013). An Introduction to Statistical Learning. Springer.

Kelly, K. T. (2007a). How Simplicity Helps You Find the Truth Without Pointing at It. In Induction, Algorithmic Learning Theory, and Philosophy, pp. 111–143. Springer.

Kelly, K. T. (2007b). Ockham's Razor, Empirical Complexity, and Truth-Finding Efficiency. Theoretical Computer Science 383(2), 270–289.

Lurz, R. W. (2009). The Philosophy of Animal Minds. Cambridge University Press.

Mikkelson, G. M. (2006). Realism Versus Instrumentalism in a New Statistical Framework. Philosophy of Science 73(4), 440–447.

Sober, E. (1994). From a Biological Point of View: Essays in Evolutionary Philosophy. Cambridge University Press.

Sober, E. (2002). Instrumentalism, Parsimony, and the Akaike Framework. Philosophy of Science 69(S3), S112–S123.
Sober, E. (2009). Parsimony and Models of Animal Minds. In The Philosophy of Animal Minds, pp. 237–257. Cambridge University Press.

Zellner, A., H. A. Keuzenkamp, and M. McAleer (2001). Simplicity, Inference and Modelling: Keeping it Sophisticatedly Simple. Cambridge University Press.

Annex

The real function and the data set

library(psych)

# The true regression function: a mixture of three Gaussian bumps
f <- function(x) {
  dnorm(x, 6, 1) + dnorm(x, 4, 1) + dnorm(x, 8, 1)
}

# Simulate 200 observations with N(0, 0.2^2) noise
set.seed(136)
x <- runif(200, 0, 12)
y <- f(x) + rnorm(200, 0, 0.2)
d <- data.frame(x, y)

# Figure 1
#plot(x, y, ylim=c(-0.5, 1), col="dark blue", lty=1, pch=19, lwd=1)
#par(new=T)
#curve(f(x), from=min(x), to=max(x), ylim=c(-0.5, 1), ylab="", col="red", lty=1, lwd=3)

Polynomial regressions

3rd order model

reg3 <- lm(y ~ x + I(x^2) + I(x^3))
reg3$coefficients
##  (Intercept)            x       I(x^2)       I(x^3)
## -0.244178747  0.263346283 -0.026423155  0.000377407

fitv <- reg3$fitted.values
datp <- cbind(d, fitv)
datp <- as.data.frame(datp)
datp <- datp[order(datp$x), ]

# Figure 2
#plot(x, y, ylim=c(-0.5, 1), col="dark blue", pch=19, lwd=1)
#lines(datp$x, datp$fitv, lwd=3, col="orange", lty=2, ylim=c(-0.5, 1))
#par(new=T)
#curve(f, from=min(x), to=max(x), col="red", lwd=3, ylim=c(-0.5, 1), ylab=" ")

AIC

# AIC = 200*log(MSE_train) + 2k, with k = 4 parameters
logs <- 200 * log((sum((reg3$fitted.values - y)^2)) / 200)
pen <- (2 * 4)
aicrreg3 <- logs + pen
aicrreg3
## [1] -578.8056

CV

# Leave-one-out cross-validation for the 3rd order model
cv3 <- rep(NA, 200)
for (i in 1:200) {
  reg <- lm(y[-i] ~ x[-i] + I(x[-i]^2) + I(x[-i]^3))
  ypr <- (reg$coefficients[1] + reg$coefficients[2] * (x[i])
          + reg$coefficients[3] * (x[i]^2) + reg$coefficients[4] * (x[i]^3))
  cv3[i] <- (y[i] - ypr)
  cvr3 <- sum(cv3^2)
}
cvr3 / 200
## [1] 0.05536721

10th order model

reg10 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10))
reg10$coefficients
##   (Intercept)             x        I(x^2)        I(x^3)        I(x^4)
##  3.329178e-02 -3.291740e-01  1.074385e+00 -1.518393e+00  1.010995e+00
##        I(x^5)        I(x^6)        I(x^7)        I(x^8)        I(x^9)
## -3.576808e-01  7.343625e-02 -9.068850e-03  6.649447e-04 -2.668868e-05
##       I(x^10)
##  4.518782e-07

fitv <- reg10$fitted.values
datp <- cbind(d, fitv)
datp <- as.data.frame(datp)
datp <- datp[order(datp$x), ]

# Figure 3
#plot(x, y, ylim=c(-0.5, 1), col="dark blue", pch=19, lwd=1)
#lines(datp$x, datp$fitv, lwd=3, col="orange", lty=2, ylim=c(-0.5, 1))
#par(new=T)
#curve(f, from=min(x), to=max(x), col="red", lwd=3, ylim=c(-0.5, 1), ylab=" ")

AIC

# AIC = 200*log(MSE_train) + 2k, with k = 11 parameters
logs <- 200 * log((sum((reg10$fitted.values - y)^2)) / 200)
pen <- (2 * 11)
aicrreg10 <- logs + pen
aicrreg10
## [1] -618.6998

CV

# Leave-one-out cross-validation for the 10th order model
cv10 <- rep(NA, 200)
for (i in 1:200) {
  reg <- lm(y[-i] ~ x[-i] + I(x[-i]^2) + I(x[-i]^3) + I(x[-i]^4)
            + I(x[-i]^5) + I(x[-i]^6) + I(x[-i]^7) + I(x[-i]^8)
            + I(x[-i]^9) + I(x[-i]^10))
  ypr <- (reg$coefficients[1] + reg$coefficients[2] * (x[i])
          + reg$coefficients[3] * (x[i]^2) + reg$coefficients[4] * (x[i]^3)
          + reg$coefficients[5] * (x[i]^4) + reg$coefficients[6] * (x[i]^5)
          + reg$coefficients[7] * (x[i]^6) + reg$coefficients[8] * (x[i]^7)
          + reg$coefficients[9] * (x[i]^8) + reg$coefficients[10] * (x[i]^9)
          + reg$coefficients[11] * (x[i]^10))
  cv10[i] <- (y[i] - ypr)
  cvr10 <- sum(cv10^2)
}
cvr10 / 200
## [1] 0.04445178

Kernel regressions

How to find the best h with CV
h <- seq(1, 2, 0.001)
cv <- rep(NA, length(h))

# For each candidate bandwidth, build the smoother matrix L and compute the
# leave-one-out CV score with the diagonal shortcut (y - yhat)/(1 - L[l,l]).
for (i in 1:length(h)) {
  u <- matrix(NA, nrow = 200, ncol = 200)
  for (j in 1:200) {
    u[j, ] <- (x[j] - x) / h[i]
  }
  ep <- function(x) {
    cond <- abs(x) <= 1
    ((3/4) * (1 - (x^2))) * cond
  }
  M <- ep(u)
  N <- apply(M, 1, sum)
  L <- matrix(NA, nrow = 200, ncol = 200)
  for (k in 1:200) {
    L[k, ] <- M[k, ] / N[k]
  }
  yhat <- L %*% y
  v <- rep(NA, 200)
  for (l in 1:200) {
    v[l] <- (y[l] - yhat[l]) / (1 - L[l, l])
  }
  cv[i] <- (sum(v^2))
}
min(cv) / 200
## [1] 0.04400635
h[which.min(cv)]
## [1] 1.223

How to find the best h with AIC

h <- seq(1, 2, 0.001)
aic <- rep(NA, length(h))

# For each candidate bandwidth, compute the generalised AIC with tr(L)
# as the effective number of parameters.
for (i in 1:length(h)) {
  u <- matrix(NA, nrow = 200, ncol = 200)
  for (j in 1:200) {
    u[j, ] <- (x[j] - x) / h[i]
  }
  ep <- function(x) {
    cond <- abs(x) <= 1
    ((3/4) * (1 - (x^2))) * cond
  }
  M <- ep(u)
  N <- apply(M, 1, sum)
  L <- matrix(NA, nrow = 200, ncol = 200)
  for (k in 1:200) {
    L[k, ] <- M[k, ] / N[k]
  }
  yhat <- L %*% y
  logsig <- 200 * log(sum((y - yhat)^2) / 200)
  pen <- (2 * tr(L))
  aic[i] <- logsig + pen
}
min(aic)
## [1] -624.3763
h[which.min(aic)]
## [1] 1.222

Kernel regression with the best h (CV)

hopt <- 1.223
uopt <- matrix(NA, nrow = 200, ncol = 200)
for (j in 1:200) {
  uopt[j, ] <- (x[j] - x) / hopt
}
ep <- function(x) {
  cond <- abs(x) <= 1
  ((3/4) * (1 - (x^2))) * cond
}
Mopt <- ep(uopt)
Nopt <- apply(Mopt, 1, sum)
Lopt <- matrix(NA, nrow = 200, ncol = 200)
for (k in 1:200) {
  Lopt[k, ] <- Mopt[k, ] / Nopt[k]
}
yhatopt <- Lopt %*% y
datpred <- cbind(d, yhatopt)
datpred <- as.data.frame(datpred)
datpred <- datpred[order(datpred$x), ]

# Figure 4
#plot(x, y, ylim=c(-0.5, 1), pch=19, lwd=1, col="dark blue")
#lines(datpred$x, datpred$yhatopt, lwd=3, col="orange", lty=2, ylim=c(-0.5, 1))
#par(new=T)
#curve(f, from=min(x), to=max(x), col="red", lwd=3, ylim=c(-0.5, 1), ylab=" ")