title: Regression Models and Hypothesis Testing
author: Ziemann, Volker
date: 2021-01-19
journal: Physics and Finance
DOI: 10.1007/978-3-030-63643-2_7

This chapter covers the basics of adapting regression models, also known as linear fits in physics, to find the parameters that best explain data in a model and then estimate the error bars of the parameters. The analysis of the model's reliability stimulates a discussion of χ²- and t-distributions and their role in testing hypotheses regarding the parameters; for example, whether a parameter can be omitted from the fit. A more elaborate method, based on the F-test, follows. The chapter closes with a discussion of parsimony as a guiding principle when building models.

ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this chapter (10.1007/978-3-030-63643-2_7) contains supplementary material, which is available to authorized users.

Both in the physical and social sciences, as well as in economics, we often try to describe the behavior of observable quantities, the measurements, in terms of model parameters. Frequently the dependence of the measurements on the parameters of a model, the fit parameters, is linear; the archetypical problem of this type is the fit of data points to a straight line. Let us therefore assume that we are given a list of n data points (s_i, y_i) for i = 1, ..., n that come from varying a variable s and observing another quantity y. Figure 7.1 shows two examples. Visual inspection indicates that the data points cluster around a straight line that we can describe by two parameters a and b through y_i = a s_i + b. Our task is thus to find the slope a and the intercept b. This is most easily accomplished by writing one copy of the equation for each of the n measurements, which defines one row in the following matrix-valued equation

\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} s_1 & 1 \\ \vdots & \vdots \\ s_n & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} .   (7.1)

Here the n measurements y_i are assembled into a column vector and the values of the independent variable s_i are assembled, together with n copies of unity, in an n × 2 matrix. The unknown parameters a and b are assembled in a second column vector. Note that the equation has the form y = Ax, where y is the column vector with the y_i, A is the matrix with the s_i, and x is a column vector with a and b. Our task is now to find values for a and b such that the line approximates the data points as well as possible (in some sense, defined below). Moreover, we want to assess the error bars on the fit parameters a and b. The larger scatter of the data points on the left-hand plot in Fig. 7.1 will likely reduce our confidence in the fitted values of a and b, compared to the values derived from the data points on the right-hand plot. The scatter of the data points y_i is commonly described by adding a vector Δy, so that y = Ax + Δy, and assuming that the components Δy_i of Δy are sampled from Gaussian distributions with standard deviations σ_i, the error bars of the measurements y_i. Furthermore, increasing the number of fit parameters, for example by fitting a third-order polynomial through the data points, will likely make the fit better. One might ask, however, whether a small improvement is really worth having to deal with an additional fit parameter, which may even come with large error bars. We will address questions such as this one throughout this chapter.
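The construction of the design matrix in (7.1) translates directly into code. The following Python sketch is my own minimal illustration, not taken from the book; the synthetic s-values, the slope 2.5, the intercept 1, and the noise level are assumed for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# synthetic data scattered around the assumed line y = 2.5 s + 1
n = 20
s = np.linspace(0.0, 10.0, n)
y = 2.5 * s + 1.0 + rng.normal(scale=1.0, size=n)

# design matrix of (7.1): one row (s_i, 1) per measurement
A = np.column_stack([s, np.ones(n)])

print(A.shape)   # (20, 2); the unknown vector x = (a, b) has length 2
```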
Before considering a number of examples we need to elaborate how to actually find the model parameters, such as the slope and intercept in a straight-line fit. In (7.1) we only try to determine two fit parameters, but we can easily generalize the method to many, say m, fit parameters x_j with j = 1, ..., m. Since we assume the dependence of the measurements y_i on the model parameters x_j to be linear, the corresponding equation is matrix-valued and in general has the form

y_i = \sum_{j=1}^m A_{ij} x_j + \Delta y_i ,   (7.2)

which can also be written as y = Ax + Δy. The information about the dynamics of the system, which relates the n measurements y_i to the m fit parameters x_j, is encoded in the known coefficients A_{ij} of the matrix A with i = 1, ..., n and j = 1, ..., m. The measurement errors are described by Δy_i with rms values σ_i. Systems of equations of the form (7.2) are over-determined linear systems if n, the number of observations or measurements, is larger than the number of parameters m. Such systems can easily be solved by standard linear-algebra methods in the least-squares sense. For this we minimize the so-called χ², which is the squared difference between the observations and the model

\chi^2 = \sum_{i=1}^n \left( \frac{y_i - \sum_{j=1}^m A_{ij} x_j}{\sigma_i} \right)^2 .   (7.3)

Here we introduce the error bar, or standard deviation, σ_i of the observation y_i to assign a weight to each measurement. If the error bar of a measurement is large, that measurement contributes less to the χ² and plays a less important role in the determination of the fit parameters x_j. If all error bars are equal, σ_i = σ for all i, the system is called homoscedastic; if they differ, it is called heteroscedastic.¹ If we consider a homoscedastic system and ignore the σ_i for the moment, we see that (7.3) can be written as the matrix equation χ² = (yᵗ − xᵗAᵗ)(y − Ax), where x and y are column vectors, albeit of different dimensionalities m and n. Minimizing the χ² with respect to the fit parameters x, we find 0 = ∂χ²/∂x = −2Aᵗ(y − Ax), where ∂/∂x denotes the gradient with respect to the components of x. Solving for x we obtain

x = (AᵗA)⁻¹ Aᵗ y .   (7.4)

For heteroscedastic systems, we introduce the diagonal matrix Λ that contains the inverse of the error bars on its diagonal, Λ_ii = 1/σ_i, which transforms (7.3) into χ² = (yᵗΛᵗ − xᵗAᵗΛᵗ)(Λy − ΛAx), such that both y and A are prepended with Λ. Performing these substitutions in (7.4), we obtain

x = (AᵗΛᵗΛA)⁻¹ AᵗΛᵗΛ\, y = J y  with  J = (AᵗΛᵗΛA)⁻¹ AᵗΛᵗΛ .   (7.5)

Here we introduce the matrix J that maps y onto x to show explicitly that the fit parameters x depend linearly on the measurements y. We denote the matrix by the symbol J to remind us that it has the same function as a Jacobian in a coordinate transformation. In our case the transformation is linear, and consequently the Jacobian has constant matrix elements. The error bars and the covariance matrix of the fit parameters x_j can be calculated by standard error-propagation techniques from the covariance matrix of the measurements y_i, which is given by C_ij(y) = ⟨Δy_i Δy_j⟩. The analysis is based on the realization that the covariance matrix C_ij(y) is defined through the second moments of the deviations Δy_i. On its diagonal, C_ij(y) contains the squared error bars of the individual measurements σ_i², and the off-diagonal elements carry information about correlations among the measurements. If, on the other hand, the measurements are uncorrelated, as we assume them to be, all off-diagonal elements are zero. In our example C_ij(y) is simply the square of the inverse of the matrix Λ, namely C(y) = Λ⁻², which has σ_1², ..., σ_n² on its diagonal.
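As an illustration of (7.4) and (7.5), the following sketch (my own minimal example; the synthetic data and the error-bar model are assumed, numpy is the only dependency) solves the weighted least-squares problem for the straight-line fit and returns both the fit parameters and the mapping matrix J.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# synthetic heteroscedastic data around the assumed line y = 2.5 s + 1
n = 20
s = np.linspace(0.0, 10.0, n)
sigma = 0.5 + 0.1 * s                       # assumed error bars sigma_i
y = 2.5 * s + 1.0 + rng.normal(size=n) * sigma

A = np.column_stack([s, np.ones(n)])        # design matrix from (7.1)
Lam = np.diag(1.0 / sigma)                  # diagonal weight matrix, Lam_ii = 1/sigma_i

# J = (A^T Lam^T Lam A)^{-1} A^T Lam^T Lam, see (7.5)
W = Lam.T @ Lam
J = np.linalg.inv(A.T @ W @ A) @ A.T @ W
x = J @ y                                   # fit parameters (a, b)

print("slope a =", x[0], " intercept b =", x[1])
```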
From (7.5) it follows that small deviations Δy of the measurements y_i lead to small deviations Δx = J Δy of the model parameters. This observation allows us to calculate the covariance matrix C_kl(x) = ⟨Δx_k Δx_l⟩. A little algebra then leads to

C(x) = J C(y) Jᵗ ,   (7.6)

which describes how covariance matrices transform under a change of variables given by (7.5). Furthermore, inserting the definition of J from (7.5) into the previous equation, we find after some additional algebra

C(x) = (AᵗΛᵗΛA)⁻¹ ,   (7.7)

which allows us to find the covariance matrix C(x) of the fit parameters x_j by simple matrix operations from the error bars of the initial measurements, buried in Λ, and the system matrix A that initially defined the problem in (7.2).

In order to illustrate the methodology, let us consider a few examples of regression models taken from several disciplines. Here we illustrate how we can use regression methods to extract information from data provided in tabular form, either to obtain quantitative numbers or to test some model. Our first example addresses the question whether education actually pays off. Do extra years of education increase earnings, and if so, at which rate? Moreover, can we find a quantitative description of how well education protects against unemployment? We base our analysis on information for 2018 from the US Bureau of Labor Statistics [1], which provides data on the median weekly earnings w and the unemployment rate u based on the highest level of educational attainment. Here we characterize the attainment by the additional years t of education past the age of fifteen; we assume a high-school diploma is completed after two years, a bachelor's degree after six, a master's after eight, a PhD after twelve, and a professional degree after fourteen years. The left plot in Fig. 7.2 shows the weekly wage w as a function of t as black asterisks and a straight-line fit of w = at + b as a dashed red line. The slope of the line is a = 106 $/year of education, such that each additional year, on average, increases the weekly earnings by 106 $. So, financially, education pays off! The right plot in Fig. 7.2 shows the unemployment rate u as a function of t. The data points do not follow a straight line, but plotting log(u) versus log(t) reveals a linear dependence. Fitting a straight line to log(u) = a log(t) + b provides fit parameters a ≈ −0.5 and b ≈ 1.76. The dashed red line on the right plot is based on these parameters. We thus find that the unemployment rate u scales as 1/√t. Education even protects against unemployment, at least to some extent.

The second example uses data about the number of infected persons and fatalities from the corona epidemic, provided by Johns Hopkins University [2]. We ask ourselves whether the reported number of infected people I and the reported number of deaths D are consistent. In the basic SIR model [3] of epidemics the rate of recovered and subsequently immune patients dR/dt is proportional to I. Analogously, we assume here that the rate of change of fatalities dD/dt is also proportional to the number infected I, or dD/dt = αI, with a proportionality constant α that may differ from country to country, depending on the health system and the accounting of the infected and the corona-related fatalities. Note that the reported numbers of infected persons in the data from [2] do not reflect those having recovered from the infection. We therefore expect the model to work only during the first few weeks of the epidemic, while we can neglect the recovery rate.
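Returning briefly to (7.6) and (7.7): the covariance matrix of the fit parameters is obtained with one extra matrix inversion. The sketch below is my own continuation of the earlier example with the same assumed synthetic data; the error bars of slope and intercept are the square roots of the diagonal of C(x).

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# same assumed synthetic data as in the previous sketch
n = 20
s = np.linspace(0.0, 10.0, n)
sigma = 0.5 + 0.1 * s
y = 2.5 * s + 1.0 + rng.normal(size=n) * sigma

A = np.column_stack([s, np.ones(n)])
W = np.diag(1.0 / sigma**2)                 # Lam^T Lam with Lam_ii = 1/sigma_i

Cx = np.linalg.inv(A.T @ W @ A)             # covariance of fit parameters, (7.7)
x = Cx @ A.T @ W @ y                        # fit parameters, equivalent to (7.5)

err = np.sqrt(np.diag(Cx))                  # error bars of slope and intercept
print("a =", x[0], "+/-", err[0])
print("b =", x[1], "+/-", err[1])
```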
Returning to the epidemic data: with these considerations in mind we process the data by integrating the model equation once to obtain D = α ∫ I dt. The data files from [2] contain day-by-day values of D and I. Therefore we store the values D_i and I_i for day i. The integral K_i of I_i is then given by K_i = Σ_{j≤i} I_j, and we then try to determine α from D_i = K_i α, which is of the type specified by (7.2) with K_i corresponding to a single-column matrix A and D_i to y_i. We then find α, the model parameter, from (7.4). Figure 7.3 shows the number of reported deaths from March 1 until April 16, 2020 in three countries. The solid red line shows the integral of the infected persons K_i, scaled with the country-specific factor α, which was determined by fitting the range up to 14 days before the end. We observe that the regression line for Germany underestimates the fatalities in the last 14 days, which could be explained by many unaccounted infected persons. The fit and the data for the US show good agreement between model and data, indicating that the assumptions in the model are approximately valid. The data for Italy show fewer actual fatalities than the model predicts. One could hypothesise that this might be attributed to a significant number of recovered persons, such that the underlying assumptions of our simple model are not valid towards the end of the range. As a disclaimer, we need to stress that this example is intended to illustrate regression methods, not to develop policies of how to deal with an epidemic.

Our third example is motivated by the fact that systems in equilibrium, if slightly perturbed, perform harmonic oscillations. Thus, the dynamics is governed by ẍ + ω²x = 0 with amplitude x and frequency ω. We now want to determine the frequency ω and the initial conditions x_0 and ẋ_0 at time t = 0 from a sequence of measurements of the amplitude x_n at times t = nΔt. This can be visualized as a stroboscopic recording of the oscillation at discrete points in time. To do so, we first realize that x = x_0 cos(ωt) + (ẋ_0/ω) sin(ωt) solves the differential equation and satisfies the initial conditions. Likewise, we find that the position x_{n+1} after (n+1)Δt is related to the position x_n after nΔt by x_{n+1} = x_n cos(ωΔt) + (ẋ_n/ω) sin(ωΔt), and we find the velocity ẋ_{n+1} = −ω x_n sin(ωΔt) + ẋ_n cos(ωΔt) by differentiation. Both equations can be written as a matrix-valued equation

\begin{pmatrix} x_{n+1} \\ \dot x_{n+1} \end{pmatrix} = \begin{pmatrix} \cos(\omega\Delta t) & \sin(\omega\Delta t)/\omega \\ -\omega\sin(\omega\Delta t) & \cos(\omega\Delta t) \end{pmatrix} \begin{pmatrix} x_n \\ \dot x_n \end{pmatrix} .   (7.8)

Inverting this equation, we find (1/ω) sin(ωΔt) ẋ_{n+1} = x_{n+1} cos(ωΔt) − x_n. Moreover, using (7.8) shifted by one time step results in x_{n+2} = x_{n+1} cos(ωΔt) + (ẋ_{n+1}/ω) sin(ωΔt). Inserting the former into the latter equation yields

x_{n+2} + x_n = 2\cos(\omega\Delta t)\, x_{n+1} .   (7.9)

This is a linear equation in the positions x_n, which allows us to determine cos(ωΔt), and thereby ω, by casting the equation in the form of (7.2) with x_{n+2} + x_n populating the vector y on the left-hand side and 2x_{n+1} filling the single-column matrix A. Here cos(ωΔt) corresponds to a single fit parameter, called x in (7.2). As opposed to Fourier methods, which require many consecutive samples to achieve a high frequency resolution, here we only need a minimum of three positions x_n from consecutive periods in order to determine cos(ωΔt). Using more data points makes the problem over-determined and improves the accuracy. Now that we have determined the frequency ω, we proceed to determine the initial conditions x_0 and ẋ_0 from subsequent readings of x_n = x_0 cos(nωΔt) + (ẋ_0/ω) sin(nωΔt). Furthermore, we assume that each measurement x_n is known to be within its error bars ±σ.
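The frequency determination from (7.9) reduces to a one-parameter regression. The sketch below is my own illustration with assumed values for ω, Δt, the initial conditions, and the noise level; it recovers cos(ωΔt) from a short stretch of noisy stroboscopic samples.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# assumed oscillation: x(t) = x0 cos(w t) + (v0/w) sin(w t)
omega, dt = 2.0 * np.pi * 0.123, 1.0        # assumed frequency and time step
x0, v0 = 1.0, 0.3
t = np.arange(50) * dt
x = x0 * np.cos(omega * t) + (v0 / omega) * np.sin(omega * t)
x += rng.normal(scale=0.01, size=x.size)    # measurement noise

# regression form of (7.9): y_n = x_{n+2} + x_n,  A_n = 2 x_{n+1}
y = x[2:] + x[:-2]
A = 2.0 * x[1:-1]

c = np.dot(A, y) / np.dot(A, A)             # least-squares estimate of cos(omega*dt), cf. (7.4)
omega_fit = np.arccos(c) / dt               # valid as long as omega*dt < pi
print("true omega:", omega, " fitted omega:", omega_fit)
```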
With the frequency known and equal error bars ±σ for each x_n, this allows us to write the problem of finding the initial conditions as a matrix-valued equation with one row per measurement,

\frac{x_n}{\sigma} = \frac{1}{\sigma}\left[\cos(n\omega\Delta t)\, x_0 + \frac{\sin(n\omega\Delta t)}{\omega}\, \dot x_0\right] ,   (7.10)

where we explicitly absorbed the matrix with the error bars into (7.10). This is of the type of (7.2) and can be solved by the methods discussed in Sect. 7.1, yielding the initial conditions x_0 and ẋ_0 of the harmonic oscillation. The error bars of the fit parameters can be determined with (7.7).

Going beyond fitting for one or two parameters, we consider the fit of a polynomial of order p, y_i = a_0 + a_1 t_i + ... + a_p t_i^p, to n data points (t_i, y_i) for i = 1, ..., n, which is straightforward to convert to an equation similar to (7.2). Here the matrix A is of size n × (p+1) and the vector x contains the p+1 coefficients a_0, ..., a_p. In Chap. 8, where we analyze time series, we will use regression methods similar to those discussed in this chapter to determine models for the temporal behavior of dynamical systems. In Sect. 11.1 we will encounter macroeconomic models that are described through difference equations, relating economic quantities, such as company output, profit, and investment rate, from one time period to the next. The validity of these models can be corroborated or rejected by analyzing the consistency of the models when comparing them with published data for companies.

The above manipulations allow us to calculate the fit parameters and their error bars, but we still do not know how reliable the calculations are, because we could have a small χ² and simply have misjudged the initial measurement errors σ_i. Therefore we will first introduce a parameter R² that characterizes how well the fitted model actually explains the observed data y_i, and then calculate the distribution of χ² values we can expect, so that we can judge how likely the obtained χ² actually is. The goodness-of-fit parameter R² = SSE/SST compares the spread of the measurement values y_i around their mean ȳ, the total sum of squares

\mathrm{SST} = \sum_{i=1}^n (y_i - \bar y)^2 ,

to the explained sum of squares, defined as the spread of the values ŷ_i predicted by the fitted model around the mean ȳ,

\mathrm{SSE} = \sum_{i=1}^n (\hat y_i - \bar y)^2 .

Here ŷ_i = Σ_j A_ij x_j are the values estimated with the model A_ij and the fit parameters x_j. R² thus characterizes how well the fitted model can explain the variation around the mean ȳ. Let us now inspect how SST and SSE are related to the χ² that we introduced in (7.3). We therefore insert and subtract the ŷ_i in the definition of SST

\mathrm{SST} = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 + 2\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) ,   (7.14)

where the first term is the sum of squared differences of the measured values and the model predictions y_i − ŷ_i = y_i − Σ_j A_ij x_j and is called the sum of squared residuals (SSR). We recognize it as σ²χ² from (7.3) if all error bars are equal and have magnitude σ. The second term we recognize as the SSE, and the last term is zero, which can be shown by inserting ŷ_i = Σ_j A_ij x_j and (7.4). Using (7.14) to write SSE = SST − SSR, the goodness-of-fit R² can be written using either SSE or SSR

R² = SSE/SST = 1 − SSR/SST .   (7.15)

We see that the smaller the SSR, or equivalently the χ², the closer R² approaches unity, where R² = 1 describes a perfect fit of the model to the data. Calculating R² is rather straightforward: first calculate the total sum of squares SST from the average and variance of the measurement or sample values; then perform the fit procedure and determine the sum of squared residuals SSR between the measurement samples and the corresponding fitted values; finally calculate R² from (7.15). We already noted that the SSR is closely related to the χ² of the fit. In the following section we derive the probability distribution function of the χ² values that we can expect.
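Before moving on, here is a small sketch of the R² bookkeeping just described (my own example, again with assumed synthetic straight-line data), computing SST, SSR, SSE, and R² according to (7.14) and (7.15).

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# assumed synthetic data and an unweighted straight-line fit
n = 30
s = np.linspace(0.0, 10.0, n)
y = 2.5 * s + 1.0 + rng.normal(scale=1.0, size=n)

A = np.column_stack([s, np.ones(n)])
x = np.linalg.solve(A.T @ A, A.T @ y)       # fit parameters from (7.4)
yhat = A @ x                                # model predictions

SST = np.sum((y - y.mean())**2)             # total sum of squares
SSR = np.sum((y - yhat)**2)                 # sum of squared residuals
SSE = np.sum((yhat - y.mean())**2)          # explained sum of squares

R2 = 1.0 - SSR / SST                        # goodness of fit, (7.15)
print("R^2 =", R2, " check SSE/SST =", SSE / SST)
```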
This distribution will help us to assess our confidence in the fitted parameters. The quantity we minimize to find a regression model is the χ², which is given by

\chi^2 = \sum_{i=1}^n \left( \frac{y_i - \sum_j A_{ij} x_j}{\sigma_i} \right)^2 = \sum_{i=1}^n z_i^2 .   (7.16)

Thus it is natural to ask for the distribution of a sum of n squares of normally distributed random variables, namely the probability distribution function ψ_n(q) such that ψ_n(q = χ²) dq is the probability of finding a value of q = χ² within the interval [q − dq/2, q + dq/2]. We can calculate this distribution by assuming that the individual constituents z_i of the sum above are normally distributed random variables and finding the distribution of q = Σ_{i=1}^n z_i². We start by writing the product of n independent and normalized Gaussian distribution functions for the z_i

\psi(z_1,\dots,z_n)\,dz_1\cdots dz_n = \frac{1}{(2\pi)^{n/2}}\, e^{-(z_1^2+\cdots+z_n^2)/2}\, dz_1\cdots dz_n = \frac{1}{(2\pi)^{n/2}}\, e^{-r^2/2}\, r^{n-1} dr\, d\Omega ,   (7.17)

where we introduce spherical coordinates with r² = z_1² + z_2² + ... + z_n² in the second equation. Since the integrand does not depend on the angular variables, we know that ∫dΩ = S_n is the surface area of an n-dimensional sphere. After substituting q = r² we arrive at

\psi_n(q)\,dq = \frac{S_n}{2(2\pi)^{n/2}}\, q^{n/2-1} e^{-q/2}\, dq ,   (7.18)

which implicitly defines the probability distribution ψ_n(q) for q = χ². The previous equation still depends on the unknown surface area S_n. We can, however, determine S_n from the requirement that the distribution function is normalized. Substituting p = q/2 we find

1 = \int_0^\infty \psi_n(q)\,dq = \frac{S_n}{2\pi^{n/2}} \int_0^\infty p^{n/2-1} e^{-p}\, dp = \frac{S_n\,\Gamma(n/2)}{2\pi^{n/2}} ,   (7.19)

where Γ(z) is the Gamma function [4], defined as

\Gamma(z) = \int_0^\infty p^{z-1} e^{-p}\, dp .   (7.20)

Solving for S_n, we find

S_n = \frac{2\pi^{n/2}}{\Gamma(n/2)} ,   (7.21)

which we insert into (7.18) to obtain the following expression for the probability distribution function

\psi_n(q)\,dq = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, q^{n/2-1} e^{-q/2}\, dq .   (7.22)

Its cumulative distribution up to a value χ̂² is

\int_0^{\hat\chi^2} \psi_n(q)\,dq = P(n/2, \hat\chi^2/2) ,

where P(a, x) is the incomplete gamma function [4]. Now we are in a position to answer the question whether a χ² that arises from a fitting procedure or regression analysis is actually probable. (Figure 7.4 shows the χ²-probability distribution function ψ_n(q) from (7.22) for n = 2, 5, and 10.) In particular, the probability of finding a value of χ² smaller than χ̂² is given by P(n/2, χ̂²/2). Conversely, the probability of finding a value that is even larger than χ̂² is given by 1 − P(n/2, χ̂²/2). The χ²-distribution plays an important role in testing hypotheses, but also in assessing the reliability of estimates from a small number of samples. The latter is the topic of the next section.

Assume that you are responsible for the quality control of the base materials for your favorite beer. For example, you need to work out how many bags of barley you have to test in order to assess the average quality of the entire delivery, which could be quite large, say, a hundred bags. This was the task that William Gosset ("Student" was his pen name) faced during the early years of the 20th century. He worked for the Guinness brewery in Dublin and had to ensure the quality of the delivered barley. We follow his lead and work out how well we can estimate the true mean μ and variance σ² of a Gaussian distribution from a small number of samples n. Since all we have are the n samples x_i, we start by calculating their average X̄_n and the variance S_n². These quantities are given by

\bar X_n = \frac{1}{n}\sum_{i=1}^n x_i  \quad\text{and}\quad  S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar X_n)^2 ,   (7.24)

where the x_i are the samples. In the following analysis we assume the samples x_i to be drawn from a Gaussian distribution N(x; μ, σ), defined by

N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2} .   (7.25)

Note that X̄_n and S_n² are commonly called the sample average and the unbiased sample variance, respectively.² We first need to understand how fast the estimates X̄_n and S_n² converge towards the true values μ and σ² as we increase n.
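Before continuing, a quick numerical aside on the χ²-distribution derived above. The sketch is my own; it evaluates ψ_n(q) from (7.22) and the tail probability 1 − P(n/2, χ̂²/2) using scipy's regularized incomplete gamma function, with the example value χ̂² = 20 chosen arbitrarily.

```python
import numpy as np
from scipy.special import gamma, gammainc

def chi2_pdf(q, n):
    """chi^2 probability distribution function psi_n(q) from (7.22)."""
    return q**(n / 2 - 1) * np.exp(-q / 2) / (2**(n / 2) * gamma(n / 2))

def chi2_tail(chi2_hat, n):
    """Probability of finding a value even larger than chi2_hat."""
    return 1.0 - gammainc(n / 2, chi2_hat / 2)   # gammainc is the regularized P(a, x)

# example: with n = 10 degrees of freedom, how unusual is chi^2 = 20?
print(chi2_pdf(np.array([2.0, 5.0, 10.0]), 10))
print("P(chi^2 > 20 | n = 10) =", chi2_tail(20.0, 10))
```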
To quantify this convergence, we calculate the expectation value of the squared difference between the estimate and the true value

\Big\langle (\bar X_n - \mu)^2 \Big\rangle = \Big\langle \Big(\frac{1}{n}\sum_{i=1}^n (x_i - \mu)\Big)^2 \Big\rangle = \frac{1}{n^2}\sum_{i=1}^n \big\langle (x_i-\mu)^2\big\rangle = \frac{\sigma^2}{n} ,   (7.26)

where we exploited the fact that the different x_i are statistically independent and therefore ⟨(x_i − μ)(x_j − μ)⟩ = σ²δ_ij. Finally we see that the estimate X̄_n on average approaches the true mean μ as σ/√n as we increase the number of samples n. Next we need to address the reliability of an estimate based on a small number n of samples. To achieve this goal we introduce a test statistic t, which is given by the deviation of the estimate X̄_n from the real mean μ, divided by the estimated standard deviation S_n/√n,

t = \frac{\bar X_n - \mu}{S_n/\sqrt{n}} = \frac{(\bar X_n - \mu)/(\sigma/\sqrt{n})}{S_n/\sigma} .   (7.27)

In the second equality in (7.27) we divide numerator and denominator by σ to visualize that the numerator x = (X̄_n − μ)/(σ/√n) indeed stems from a normal distribution with unit variance, given by

\psi_x(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} .   (7.28)

Moreover, the denominator y = S_n/σ is the square root of S_n²/σ², which stems from a χ²-distribution, the latter given by (7.22). Thus t = x/y is a random variable, defined by the ratio of a Gaussian random variable x and a random variable y, which is derived from the square root of numbers drawn from a χ²-distribution. Note that originally we start from n samples or independent measurements, and one degree of freedom was used to determine the average X̄_n. Therefore the estimate of the variance S_n² only contains information from ν = n − 1 degrees of freedom. The χ²-distribution that we need to consider is therefore one for ν = n − 1 degrees of freedom. On our way to calculate the distribution Φ_ν(t) of the test statistic t we first determine the distribution function of the square root of the χ² variable divided by √ν. We introduce y = √(q/ν) and change variables from q to y in (7.22), which yields

\phi_y(y)\,dy = \frac{2(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, y^{\nu-1} e^{-\nu y^2/2}\, dy .   (7.29)

In Appendix A we follow [5] and show that the sum of random variables and the sum of squares of random variables are statistically independent, and therefore we can calculate the probability distribution function Φ_ν(t) of t = x/y from the product of the two distribution functions in the following way

\Phi_\nu(t) = \int_0^\infty y\, \psi_x(ty)\, \phi_y(y)\, dy .   (7.30)

Inserting the distribution functions ψ_x(x) from (7.28) and φ_y(y) from (7.29) and rearranging terms we find

\Phi_\nu(t) = \frac{2(\nu/2)^{\nu/2}}{\sqrt{2\pi}\,\Gamma(\nu/2)} \int_0^\infty y^{\nu}\, e^{-(\nu + t^2)y^2/2}\, dy .   (7.31)

The substitution z = (ν + t²)y²/2 and some rearrangements lead to

\Phi_\nu(t) = \frac{(\nu/2)^{\nu/2}}{\sqrt{2\pi}\,\Gamma(\nu/2)} \left(\frac{2}{\nu+t^2}\right)^{(\nu+1)/2} \int_0^\infty z^{(\nu-1)/2} e^{-z}\, dz .   (7.32)

The integral is given by the Gamma function, defined in (7.20), as Γ((ν+1)/2), such that we finally arrive at

\Phi_\nu(t) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\pi\nu}\,\Gamma(\nu/2)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2} ,   (7.33)

which is the commonly found form of Student's t-distribution. Note that it is expressed in terms of ν, the number of degrees of freedom, rather than the number of samples n = ν + 1. It is instructive to consider the limiting cases of the distribution. In particular, for n = ν + 1 = 2 we recover the Lorentz, Breit-Wigner, or Cauchy distribution, and for infinitely many degrees of freedom it can be shown that we recover the Gaussian distribution. The distributions for other values of ν lie between these extremes. Figure 7.5 shows Φ_ν(t) for ν = 1, 2, and 100. In order to assess the reliability of testing only a few samples, we need to calculate the probability that our test statistic t = (X̄_n − μ)/(S_n/√n) lies within the central range of the distribution. If we specify this central range to lie within ±t̂, this probability is given by the integral

A(\hat t, \nu) = \int_{-\hat t}^{\hat t} \Phi_\nu(t)\,dt = 1 - I_x(\nu/2, 1/2) ,   (7.34)

where I_x(a, b) is the (regularized) incomplete beta function [4] and x = ν/(ν + t̂²). The second equality follows from the substitution s = t² and the definition of the incomplete beta function as an integral with the same integrand [4]. The left-hand plot of Fig. 7.6 shows A(t̂, ν) for ν = 1, 2, 5, and 100 as a function of t̂.
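The probability A(t̂, ν) from (7.34) is easy to evaluate numerically. The sketch below is my own; it uses scipy's regularized incomplete beta function and a root finder to reproduce the numbers quoted in the text, for example that for ν = 2 the 95% confidence level requires |t̂| ≈ 4.3.

```python
import numpy as np
from scipy.special import betainc
from scipy.optimize import brentq

def A(t_hat, nu):
    """Probability that |t| < t_hat for Student's t with nu degrees of freedom, (7.34)."""
    x = nu / (nu + t_hat**2)
    return 1.0 - betainc(nu / 2, 0.5, x)

def t_for_confidence(conf, nu):
    """Find t_hat such that A(t_hat, nu) equals the requested confidence level."""
    return brentq(lambda t: A(t, nu) - conf, 1e-6, 1e3)

print(A(2.0, 100))                    # close to 0.95: two 'sigma' for nearly Gaussian nu
print(t_for_confidence(0.95, 2))      # approximately 4.30
print(t_for_confidence(0.90, 2))      # approximately 2.92, used in the barley example below
```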
In the left-hand plot of Fig. 7.6, the solid black line corresponds to the case ν = 100, where Φ_ν(t) approaches a Gaussian, which contains 95% of the distribution within two standard deviations and corresponds to t̂ ≈ 2. The range containing a given percentage, say 95%, of the distribution defines the 95% confidence level. When taking fewer samples n = ν + 1, the probability to find a value of t within ±t̂ is reduced compared to the Gaussian. For example, if we only test three samples (ν = 2), the 95% confidence level is only reached at |t̂| ≈ 4.3. For convenience, we show the values |t̂| where A(t̂, ν), as a function of ν, reaches 95, 90, and 80% on the right-hand plot in Fig. 7.6. The solid black curve corresponds to the 95% confidence level, and indeed we find the point mentioned above at ν = 2 and |t̂| ≈ 4.3 on it. Furthermore, for large values of ν it approaches t̂ ≈ 2, as expected for Gaussian distributions. Below the curve for the 95% level we find the curves for 90 and 80%. They correspond to a smaller area around the center of the distribution function and consequently a smaller probability of finding a value of t close to the center. We point out that for ν > 10 the curves rapidly approach their asymptotic values, which agree with those of a Gaussian. This is the reason for the common practice of using two standard deviations to specify the 95% confidence level.

Let us come back to William Gosset at Guinness and ensure the quality of the delivery of a hundred bags of barley. Since we are lazy, we only test n = 3 randomly selected bags, which have quality indicators x = 16, 12, and 17. So, what is the range of values x that encloses the true mean μ with 90% confidence? To find out, we use (7.24) and calculate X̄_3 = 15 and S_3 ≈ 2.6 from the three samples. From the right-hand plot in Fig. 7.6 we find |t̂| ≈ 2.92 for ν = n − 1 = 2 degrees of freedom at the desired 90% confidence level. From (7.27) we now determine the corresponding range of μ to be μ = X̄_3 ± |t̂| S_3/√3 ≈ 15 ± 4.5, which is a rather large range. We therefore decide to take a fourth sample, which turns out to be x = 18. Repeating the above calculation, we now find the 90% range to be μ ≈ 15.8 ± 3.1. Maybe we should take a fifth sample, but that is left as an exercise.

In this section we determined the range that contains the "true" mean μ of a distribution with a certain level of confidence. In the next section we will address a related problem: the validation or rejection of a hypothesis about the value a parameter is expected to have.

In a regression analysis we might wonder whether we really need to include a certain fit parameter X or whether the model works equally well when omitting it. We can address this problem by testing the hypothesis X = 0. In particular, we reject the hypothesis X = 0 if the test statistic t = X/σ(X) is very large and lies in the tails of the distribution. Here X is the value of a fitted parameter from (7.5) and σ(X) is its error bar, extracted from (7.7). In particular, if we find a value of t that lies in the tails containing 10% probability, as shown by the red areas in Fig. 7.7, we say that "the hypothesis X = 0 is rejected at the 10% level." Once we have determined the test statistic t from the regression analysis we can calculate the probability, the p-value, of finding an even more extreme value as

p = \int_t^\infty \Phi_\nu(t')\,dt' ,   (7.35)

where we assumed that t is positive. If it is negative, we have to integrate from −∞ to t instead. The parameter ν = n − m is the number of degrees of freedom of the regression, where n and m are defined in the context of (7.2).
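Returning briefly to the barley example above, the quoted confidence ranges are easy to reproduce. The sketch is my own and reads the percentile from scipy.stats.t instead of from the right-hand plot of Fig. 7.6.

```python
import numpy as np
from scipy import stats

def confidence_range(samples, conf=0.90):
    """Return (mean, half-width) of the confidence interval for the true mean."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)                              # unbiased sample standard deviation S_n
    t_hat = stats.t.ppf(0.5 + conf / 2, df=n - 1)  # |t| that encloses 'conf' centrally
    return mean, t_hat * s / np.sqrt(n)

print(confidence_range([16, 12, 17]))              # approximately (15.0, 4.5)
print(confidence_range([16, 12, 17, 18]))          # approximately (15.75, 3.1)
```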
For an example of such a hypothesis test, let us return to the question posed in the introduction to this chapter, whether all fit parameters are really necessary. Therefore we consider fitting data to a straight line as shown in Fig. 7.1. It appears reasonable to fit the linear dependence y_i = a + b s_i to the dataset, but we might even consider a second-order polynomial y_i = a + b s_i + c s_i², which will follow the data points even more closely and result in a smaller χ², because we have the additional parameter c to approximate the dataset. But since we expect the data to lie on a straight line, we state the hypothesis that c = 0. We test it by fitting the second-order polynomial to the dataset, then determine the error bar σ(c) of the fit parameter c using (7.7), and finally calculate the test statistic t = c/σ(c). If we have many degrees of freedom, ν = n − m ≫ 1, we can approximate Φ_ν(t) by a Gaussian and check whether t is larger than 2, which would indicate that our hypothesis c = 0 is rejected at the 10% level. Conversely, if t is smaller, we corroborate the hypothesis that c is consistent with zero and we might as well omit it from the fit. The discussed method works well to test hypotheses about individual fit parameters, but occasionally we have to figure out whether we can omit a larger number of fit parameters at the same time. This is the topic of the following section.

In the previous section we used the t-statistic to determine whether one coefficient in a regression model is compatible with zero and can be omitted. This works well if there is only a single obsolete coefficient, but it might fail if several of the coefficients are significantly correlated. In that case the error bars of each of the coefficients are large, which suggests that each coefficient can be omitted, but the real origin of the problem is that the fitting procedure fails to work out whether to assign the uncertainty to one or the other(s) of the correlated coefficients. This can lead to serious misinterpretations of fit results. One way to resolve the dilemma with several potentially correlated coefficients, and to determine which ones to omit, is the F-test. In this test the contribution of a group of p fit parameters to the χ², denoted by χ²_p, is determined. We define it by the squared difference of the N "measurement" values y_i and the regression model, normalized to the measurement error σ_i, thus s_i = (y_i − Σ_{j=1}^p A_ij x_j)/σ_i. For χ²_p we then obtain

\chi^2_p = \sum_{i=1}^N s_i^2 = \sum_{i=1}^N \left(\frac{y_i - \sum_{j=1}^p A_{ij} x_j}{\sigma_i}\right)^2 .   (7.36)

In the next step we increase the number of fit parameters to q > p and test whether χ²_q is significantly smaller than χ²_p. In order to quantify this improvement, we introduce the F-statistic

F = \frac{(\chi^2_p - \chi^2_q)/(q - p)}{\chi^2_q/(N - q)} .   (7.37)

Here the denominator is the χ²_q per degree of freedom of the "larger" fit, whereas the numerator is the difference χ²_p − χ²_q per additional degree of freedom, which comes from the additional q − p fit parameters. The F-value thus measures the relative reduction of the χ² when increasing the number of fit parameters. We point out that the denominator depends on N − q squared random numbers s_i and the numerator on the q − p additional s_i. In Appendix A we motivate that the random numbers in the numerator and those in the denominator are independent, which allows us to derive the distribution of the F-statistic as the ratio of two independent χ²-distributions; one with n = q − p degrees of freedom, the other with m = N − q degrees of freedom.
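Before turning to the distribution of the F-statistic, here is a small sketch of the single-parameter test described above (my own example with assumed synthetic data that really lies on a straight line): fit a second-order polynomial, compute t = c/σ(c), and translate it into a p-value with Student's t-distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)

# assumed data generated by a straight line with unit error bars
n = 40
s = np.linspace(0.0, 10.0, n)
sigma = np.full(n, 1.0)
y = 2.5 * s + 1.0 + rng.normal(size=n) * sigma

# second-order polynomial model y = a + b s + c s^2
A = np.column_stack([np.ones(n), s, s**2])
W = np.diag(1.0 / sigma**2)
Cx = np.linalg.inv(A.T @ W @ A)             # covariance of fit parameters, (7.7)
x = Cx @ A.T @ W @ y

c, sigma_c = x[2], np.sqrt(Cx[2, 2])
t = c / sigma_c
nu = n - A.shape[1]                         # degrees of freedom of the regression
p_value = stats.t.sf(abs(t), df=nu)         # probability of an even more extreme value, cf. (7.35)

print("t =", t, " p-value =", p_value)      # large p-value: c is consistent with zero
```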
For easy reference, we again display the χ²-distribution function, already shown in (7.22),

\psi_n(q) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, q^{n/2-1} e^{-q/2} .

We will use it once to describe the χ²-variable x with n = q − p degrees of freedom that appears in the numerator of (7.37) and once for the χ²-variable y with the m = N − q degrees of freedom in the denominator. Using these variables, the distribution of the F-statistic f is given by

\Phi_{n,m}(f) = \int_0^\infty \frac{n}{m}\, y\, \psi_n\!\left(\frac{n f y}{m}\right)\psi_m(y)\, dy = \frac{\Gamma((n+m)/2)}{\Gamma(n/2)\,\Gamma(m/2)} \left(\frac{n}{m}\right)^{n/2} \frac{f^{\,n/2-1}}{(1 + n f/m)^{(n+m)/2}} ,

where we substitute z = (1 + nf/m)y/2 in the second step and then recognize the remaining integral as a representation of a Gamma function with argument (n + m)/2. Note that the specific combination of Gamma functions can be expressed as a Beta function B(a, b) = Γ(a)Γ(b)/Γ(a+b) [4]. The result clearly shows that the distribution depends only on the numbers of degrees of freedom m = N − q and n = q − p. We can now use it to assess whether we need to include n additional fit parameters in a model and whether this added complexity is worth the effort. Figure 7.8 shows the F-distribution function for a few values of n and m. In the first two cases we have a small number of degrees of freedom N − q = m = 5, where we fit q parameters to N data points. We then compare what distributions of our test statistic f we can expect if we add n additional fit parameters. The dashed blue line corresponds to a situation where we add n = 2 additional fit parameters. We see that the distribution function is peaked near zero, which indicates that small values of f are very likely. Adding a third fit parameter (n = 3) causes Φ_{3,5}(f) to assume the shape indicated by the dot-dashed red line. Now very small values near zero are less likely and the distribution shows a peak. The solid black line in Fig. 7.8 illustrates a case where we add n = 10 additional fit parameters to a fit that originally had N − q = m = 30 degrees of freedom. We observe that the peak of the distribution moves towards f = 1 but is rather broad and shows significant tails. In order to intuitively assess whether a found f-statistic is likely or not, we calculate the probability that its value is even smaller or larger, depending on whether we find a particularly small or large value of f. The probability can be expressed in terms of the cumulative distribution function of Φ_{n,m}(f) by

\int_0^{f} \Phi_{n,m}(f')\,df' = I_{\hat x}(n/2, m/2)  \quad\text{with}\quad  \hat x = \frac{n f}{n f + m} ,

where I_x̂(a, b) is the (regularized) incomplete beta function [4], which we already encountered in (7.34), and B(a, b) is the beta function [4]. Thus the probability, the p-value p(f), of finding an even smaller value than f is given by p(f) = I_x̂(n/2, m/2). Note that x̂ depends on f. Conversely, the probability of finding a larger value, which is relevant on the right-hand side of the maximum, is given by 1 − I_x̂(n/2, m/2). Rejecting hypotheses works in the same way as discussed above in Sect. 7.6. If an F-value f, computed from data, lies in the tails of the distribution and exceeds the value for a 10% tail fraction, we say that the hypothesis is rejected at the 10% level.

In philosophy, a razor is a criterion to remove explanations that are rather improbable. The image of shaving off something unwanted, for example one's beard, comes to mind. One well-known example is Occam's razor, which, translated from Latin, reads "plurality should not be posited without necessity." Today it is often rephrased as "the simpler solution is probably the correct one." Applied to our regression analysis and the fitting of parameters to models, it guides us to seek models with as few fit parameters as possible, which is also called the principle of parsimony.
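Closing the F-test discussion, the following sketch is my own worked example with assumed synthetic data: it compares a straight-line fit (p = 2) with a cubic fit (q = 4) and converts the F-value of (7.37) into a tail probability using the regularized incomplete beta function.

```python
import numpy as np
from scipy.special import betainc

rng = np.random.default_rng(seed=6)

# assumed data generated by a straight line with unit error bars
N = 30
s = np.linspace(0.0, 10.0, N)
y = 2.5 * s + 1.0 + rng.normal(size=N)

def chi2_fit(order):
    """Fit a polynomial of the given order and return its chi^2 (unit error bars)."""
    A = np.vander(s, order + 1)
    x = np.linalg.lstsq(A, y, rcond=None)[0]
    return np.sum((y - A @ x)**2)

p, q = 2, 4                                  # number of fit parameters in the two models
chi2_p, chi2_q = chi2_fit(p - 1), chi2_fit(q - 1)

f = ((chi2_p - chi2_q) / (q - p)) / (chi2_q / (N - q))   # F-statistic, (7.37)
n, m = q - p, N - q
x_hat = n * f / (n * f + m)
p_larger = 1.0 - betainc(n / 2, m / 2, x_hat)            # probability of an even larger f

print("F =", f, " P(F > f) =", p_larger)     # large probability: the extra parameters are not needed
```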
The principle of parsimony helps us to avoid adding unnecessary parameters to a model that may lead to over-fitting, which causes the model parameters to be overly affected, or over-constrained, by the noise in the data. Such a model then works very well with the existing data set with its particular noise spectrum, but its predictive power to explain new data is limited. A classical example of over-fitting is the fit of a polynomial of degree n − 1 to n data points. With many data points the polynomial is of very high order. It perfectly fits the data set and makes the χ² of the fit zero, but outside the range of the original data set large excursions typically occur. Any additional data point is unlikely to be well described by this highly over-constrained polynomial. A further reason to avoid too many fit parameters is that groups of parameters can be highly correlated and, again, this degeneracy is heavily affected by noise. It is therefore advisable to construct a model with the least number of fit parameters, to be parsimonious, in other words. Using fewer parameters makes models more robust against the spurious influence of noise. Now that we have the methods to fit parameters to models and assess their validity and robustness, we can take a closer look at time series and extract useful information from the raw data.

1. In an experiment you recorded the parameter y as you changed another parameter s in Table 7...
3. Use (7.5) and (7.6) to prove that (7.7) is correct.
4. If x is a Gaussian random variable, calculate the probability distribution functions of (a) y = x − a, (b) y = bx, and (c) y = cx².
5. You know that the data in the file ex7_5.dat, available from the book's web page, comes from a process that can be fitted by a polynomial of third order. (a) Plot the data. (b) Find the coefficients of a third-order polynomial in a regression analysis. (c) Estimate the error bars σ_y of the y-values (the "measurements") from the rms deviation of your fit polynomial to the data points. Note that this is a very heuristic approach to estimating error bars and can be criticised! (d) Calculate the covariance matrix, based on your estimate of the error bars σ_y, and deduce the error bars of the polynomial coefficients. Is there a coefficient that is so small as to be consistent with zero? (e) Determine its F-value f̂ and the probability to find an F-value that is even larger than f̂.
6. Let us pursue the quality control at Guinness from the end of Sect. 7.5 and assume that we test a fifth bag of barley with the test result x = 13. What is the 90% confidence range now? If we were to be satisfied with an 80% confidence level, what would that range be?

[1] US Bureau of Labor Statistics, earnings and unemployment rates by educational attainment, 2018.
[2] Novel Coronavirus COVID-19 (2019-nCoV) Data Repository, Johns Hopkins University.
[3] The Mathematics of Infectious Diseases.
[4] Handbook of Mathematical Functions.
[5] Statistical Inference.