A few statistical principles for data science

Noel Cressie

2021-02-03

In any other circumstance, it might make sense to define the extent of the terrain (Data Science) first, and then locate and describe the landmarks (Principles). But this data revolution we are experiencing defies a cadastral survey. Areas are continually being annexed into Data Science. For example, biometrics was traditionally statistics for agriculture in all its forms but now, in Data Science, it means the study of characteristics that can be used to identify an individual. Examples of non-intrusive measurements include height, weight, fingerprints, retina scan, voice, photograph/video (facial landmarks and facial expressions), and gait. A multivariate analysis of such data would be a complex project for a statistician, but a software engineer might appear to have no trouble with it at all.

In any applied-statistics project, the statistician worries about uncertainty and quantifies it by modelling data as realisations generated from a probability space. Another approach to uncertainty quantification is to find similar data sets, and then use the variability of results between these data sets to capture the uncertainty. Both approaches allow 'error bars' to be put on estimates obtained from the original data set, although the interpretations are different. A third approach, which concentrates on giving a single answer and gives up on uncertainty quantification, could be considered as Data Engineering, although it has staked a claim in the Data Science terrain. This article presents a few (actually nine) statistical principles for data scientists that have helped me, and continue to help me, when I work on complex interdisciplinary projects.

Science and Engineering have long been held separate by our universities and learned academies. Broadly speaking, Science is concerned with 'why,' and Engineering is concerned with 'how.' Data Science seems to include both (Cressie, 2020), and sometimes it may be hard to find the science in a Data Science solution. For example, an idea is born on how a particular data set could be analysed, perhaps with only a vague idea of the question being answered. Then algorithms and software are developed and, in some cases, the analysis results in superior performance as judged by figures-of-merit applied to that data set, which leads to a publication. Are the results extendible, and do they have 'error bars'? Usually, figures-of-merit are concerned with the efficiency of a method. However, validity should be established first, and error bars allow this to be done. No matter how efficient a prediction method is, if its error bars define a nominal 90% prediction interval but the true quantity (i.e., the predictand) is only contained in these intervals 60% of the time, the method lacks validity. This important figure-of-merit is called coverage and is analogous to one in the medical literature called specificity.

This article is aimed at data scientists who want their error bars to be based on probabilities and their coverages to be (approximately) as stated. There are data scientists who have error bars that are based on empirical distributions derived from ensembles of like data sets or from partitioning a single data set into like subsets.
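As a small numerical aside (mine, not the article's), the sketch below checks the coverage figure-of-merit described above: nominal 90% prediction intervals are built with an understated error standard deviation, and their empirical coverage comes out near 60%. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 100_000

true_sd = 2.0       # actual standard deviation of the prediction errors
assumed_sd = 1.0    # understated standard deviation used to build the error bars

# Predictands and an unbiased point predictor (here, simply their mean, 0)
y = rng.normal(0.0, true_sd, n_sim)
y_hat = np.zeros(n_sim)

# Nominal 90% prediction intervals: predictor +/- 1.645 x (assumed sd)
half_width = 1.645 * assumed_sd
covered = np.abs(y - y_hat) <= half_width
print(f"nominal coverage: 0.90, empirical coverage: {covered.mean():.2f}")  # about 0.59
```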
While the principles I present are for data scientists who quantify uncertainties using probabilities, they could act as a guide for both types. If your analyses include uncertainty quantifications, then you may find some of the principles in the following sections useful for hypothesis generation, for climbing the knowledge pyramid from data to information to knowledge, and for scientific inference.

When I was planning this article, I hovered between the words 'laws,' 'rules,' 'principles,' and 'guidelines' to establish some landmarks in Data Science. Laws should not be broken, exceptions prove the rule, and guidelines prevent accidents. But principles should make the analysis go better without interdicting other approaches: Improvising on something Groucho Marx once said, if you don't like my principles, you may have others you like better . . .

A data scientist might see variability between strata and within strata. A statistical scientist uses probability distributions to capture some or all of this variability. For example, probability densities would typically be used to model within-strata variabilities, and these densities may be different across strata in order to capture the between-strata variability. In many, many cases, those strata differences are modelled via a regression model, leaving the errors around the regression line (i.e., the within-strata variability) to be described by a probability distribution, often a Gaussian (sometimes called normal) distribution.

In my opinion, the worst word in the Statistics lexicon, which happens to have four letters, is 'mean.' It is the first moment of a probability distribution, and it is also used to describe the average of a collection of numbers, leading to many misunderstandings in Data Science. To data scientists who capture variability through the empirical distributions of summary statistics from similar data sets, the average is their first moment, so they could legitimately call it a mean of their empirical distribution. To avoid confusion, it will be henceforth assumed in this article that variability is captured by probability distributions, and a mean is (and only is) the first moment of that distribution.

In this section, five general statistical principles are presented. Different areas of study will have their own and, in Section 3, I present three principles that are important in Spatial Statistics. In an epilogical section (Section 4), I present one more, the principle of respecting units, which makes nine principles in total.

2.1 All models are wrong . . . but some are wrong-er than others

In my work on remote sensing of carbon dioxide, data z from a single sounding is a more-than-2,000-dimensional vector of radiances observed at wavelengths in the near infra-red part of the spectrum. Cressie (2018) contains a literature review that features the relationship between z and the state of the atmosphere y (a 40-to-50-dimensional vector) via the statistical model,

    z = f(y) + ε ,                                                        (1)

where unobserved y and ε are statistically independent; ε ∼ N(0, Σ_ε), where N(µ, Σ) refers to a multivariate Gaussian distribution with mean vector µ and covariance matrix Σ; f is a known vector of nonlinear forward models obtained from atmospheric physics/chemistry; Σ_ε is a covariance matrix known from experiments on the remote sensing instrument, performed in the laboratory before the satellite is launched; y = µ_α + α, where α ∼ N(0, Σ_α); and µ_α and Σ_α are the first two moments of the (hidden) atmospheric state y.
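A minimal simulation of model (1) might look as follows. It is mine, not the article's: the dimensions are toy-sized (the real sounding has 2,000+ radiances and a 40-to-50-dimensional state), and the nonlinear function f below is a made-up stand-in for the atmospheric physics/chemistry.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions standing in for a sounding (illustrative only)
n_state, n_obs = 3, 5

mu_alpha = np.array([1.0, 0.5, -0.2])         # prior mean of the atmospheric state y
Sigma_alpha = np.diag([0.3, 0.2, 0.1])        # prior covariance of y
Sigma_eps = 0.01 * np.eye(n_obs)              # instrument-error covariance (from the laboratory)

def f(y):
    """Hypothetical nonlinear forward model mapping the state to radiances."""
    A = np.arange(1, n_obs * n_state + 1).reshape(n_obs, n_state) / 10.0
    return np.exp(-A @ y)

# One simulated sounding from (1): z = f(y) + eps, with y = mu_alpha + alpha
y = rng.multivariate_normal(mu_alpha, Sigma_alpha)
eps = rng.multivariate_normal(np.zeros(n_obs), Sigma_eps)
z = f(y) + eps
print("hidden state y:", np.round(y, 3))
print("observed radiances z:", np.round(z, 3))
```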
The goal is to predict y from data z. In my experience, it is the specification of Σ_α (the covariance matrix of the state of the atmosphere) where the statistical model can be seriously wrong (Cressie, 2018). While ŷ_WM may be optimal under the working model (WM), it is not under the true model (TM), and the properties given by (2), namely the bias and variance of ŷ_WM evaluated under the TM, will expose the potential weaknesses of a WM chosen for its simplicity or its computational convenience. It is tempting (but wrong) for scientists to calculate instead the bias and variance of ŷ_WM evaluated under the WM itself, given by (3), which are often facile calculations compared to (2). This was the case for NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite. Over time, empirical distributions of retrieval-error ensembles showed that bias and variance should have been calculated from (2), not (3). This is discussed by Cressie (2018), and an illustration of applying Principle 2.1 is given in Nguyen et al. (2019). In the rest of this article, I shall assume that the WM and TM are one and the same. However, this principle is potentially applicable in all the examples below.

There is an equivalent way to write the model (1) that shows how statistical scientists can naturally build complexity into their models:

    z | y ∼ N(f(y), Σ_ε) ;  y ∼ N(µ_α, Σ_α) .                             (4)

This is a hierarchical statistical model (HM), which is defined by layers of conditional probabilities (here two layers). The top layer is the data model and can be written in abbreviated notation as [z|y]. It models what we see: the data (conditional on the process y). The next layer is the process model that can be written as [y]. It models what we want to get: the latent process, or state, hidden behind the data. In fact, in the remote sensing problem described in Section 2.1, the distribution of y is conditional on w, the meteorology of the atmosphere. If need be, the process model could be modified to now consist of [y|w] and [w]. Then the HM would be made up of three levels, and the sequence of conditional probabilities would be [z|y], [y|w], and [w]. In the specific case of OCO-2 retrievals, the meteorological process, w, is obtained from a numerical weather forecasting model and is considered known.

Principle 2.2: Build statistical models conditionally, through a data model and a process model. Infer the unknown process from the predictive distribution.

The predictive distribution is,

    [y|z] = [z|y][y] / [z] ,                                              (5)

where for convenience any parameters θ in [z|y] and [y] are dropped from the notation (but they are still there). The HM components are featured in the numerator, which is equal to the joint distribution, [z, y]; and the denominator, [z], is simply a 'normaliser' to ensure that [y|z] is a density (or probability mass) function that integrates (or sums) to 1. Equation (5) is Bayes' Theorem but, in this case, the unknowns are the state elements in y, not the parameters θ. No prior distributions on θ have been assumed in the making of this HM! However, a prior could be assumed or, pragmatically, an estimate of θ could be obtained from z. Derivation of the predictive distribution in (5) can be problematic in many cases of practical interest, such as when [y] is a highly multivariate probability distribution of a complex scientific process. Computational methods, particularly Markov chain Monte Carlo (MCMC), are in constant development to allow realisations from [y|z] to be simulated, from which distributional properties such as the predictive mean and the predictive quantiles can be used to predict the state y and to give its error bars, respectively.
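In the scalar case, the predictive distribution (5) can be evaluated by brute force on a grid rather than by MCMC. The sketch below (mine, illustrative only) does this for a hypothetical nonlinear forward model f(y) = exp(−y) and made-up parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Scalar HM: [z|y] = N(f(y), sigma_eps^2) and [y] = N(mu_alpha, sigma_alpha^2)
mu_alpha, sigma_alpha, sigma_eps = 1.0, 0.5, 0.05
f = lambda y: np.exp(-y)                     # hypothetical nonlinear forward model

# One simulated datum
y_true = rng.normal(mu_alpha, sigma_alpha)
z = f(y_true) + rng.normal(0.0, sigma_eps)

# Predictive distribution [y|z] on a fine grid: proportional to [z|y][y];
# the denominator [z] is just the normalising constant, computed numerically.
y_grid = np.linspace(mu_alpha - 5 * sigma_alpha, mu_alpha + 5 * sigma_alpha, 4001)
dy = y_grid[1] - y_grid[0]
log_prior = -0.5 * ((y_grid - mu_alpha) / sigma_alpha) ** 2
log_lik = -0.5 * ((z - f(y_grid)) / sigma_eps) ** 2
post = np.exp(log_prior + log_lik)
post /= post.sum() * dy                      # the 'normaliser' [z]

# Predictive mean and a 90% predictive interval (the 'error bars')
pred_mean = (y_grid * post).sum() * dy
cdf = np.cumsum(post) * dy
lo, hi = np.interp([0.05, 0.95], cdf, y_grid)
print(f"y_true = {y_true:.3f}, predictive mean = {pred_mean:.3f}, "
      f"90% interval = ({lo:.3f}, {hi:.3f})")
```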
The simplest special case of (4) is when the data (here, z) and the process (here, y) are univariate, and f(·) is the identity function:

    z | y ∼ N(y, σ²_ε) ;  y ∼ N(µ_α, σ²_α) .

Then [y|z] is Gaussian, characterised by its first two moments:

    E(y|z) = µ_α + {σ²_α / (σ²_α + σ²_ε)}(z − µ_α) ;  var(y|z) = σ²_α σ²_ε / (σ²_α + σ²_ε) .

The predictor, ŷ = E(y|z), is a weighted combination of the data and the mean of the unknown state y, where the weights depend on the signal-to-noise ratio, σ²_α/σ²_ε. These results can be generalised to the multivariate case with linear retrieval equations in (4), namely f(y) = c + Ky. In the case of OCO-2, K has more rows than columns, and the first two moments of the predictive distribution, which is Gaussian, are:

    E(y|z) = µ_α + G(z − c − Kµ_α) ;  var(y|z) = Σ_α − G K Σ_α ,          (6)

where G ≡ Σ_α K^⊤ (K Σ_α K^⊤ + Σ_ε)^{−1} is sometimes called the gain matrix. From (6), it can be seen that the precision matrix (i.e., the inverse of the variance matrix), written prec(·), of each component of the predictive distribution satisfies,

    prec(y|z) = prec(y) + K^⊤ prec(z|y) K ,

a result that holds when the matrix K is any size. This decomposition of precision demonstrates that, when going from [y] to [y|z], the precision increases (i.e., the variance decreases). Moreover, the predictive mean is unbiased; that is, E(E(y|z)) = E(y) = µ_α.

In geostatistics, the importance of (5) is deeply misunderstood, because Matheron (1963) formulated kriging in terms of the observed process, {z(s) : s ∈ D}, alone. Then the goal is to predict z(s_0) from data z = (z(s_1), . . . , z(s_n))^⊤, for which the generic kriging predictor is E(z(s_0)|z), the mean of [z(s_0)|z]. However, when measurement of the process is taken into account, there is a HM that differentiates between the observations z and the underlying latent process {y(s) : s ∈ D}, which is observed imperfectly through z. The spatial trend and the spatial covariance function are defined in the process model, although they are estimated from the noisy data z. Once the measurement error is included in the model, it is clear that geostatistics should do kriging using E(y(s_0)|z) and not E(z(s_0)|z); see Cressie and Wikle (2011, Section 4.1). Consequently, much of the earlier geostatistics software did not make inference on y(s_0) but chose to make inference on z(s_0), where the measurement error is confounded with the process error. User beware: some still do, but those written for environmental applications (e.g., geoR, gstat, FRK) give the correct kriging equations. The difference is most apparent when the kriging variance is computed as var(z(s_0)|z) at location s_0, but then interpreted incorrectly as the predictive variance var(y(s_0)|z) of the true process at s_0.

Building physical models usually involves ensuring that mass or energy is conserved. If the system is leaking energy, then the leak needs to be plugged or the energy needs to be followed as it moves into another system. Now consider a designed experiment where data z are obtained and the statistical model fitted is

    z = Xβ + ξ .                                                          (7)

The model in (7) is a linear regression with covariate (or design) matrix X, β is an unknown p-dimensional vector of regression coefficients, and ξ = (ξ_1, . . . , ξ_n)^⊤ consists of independent and identically distributed (iid) Gaussian random variables with mean 0 and variance σ²_ξ. Suppose that the measuring instrument was carefully calibrated and, in the study protocol, its sample variance from repeated measurements was reported as a number that I shall write as s²_ε > 0. Note that the uncertainty in the measurements usually needs to be quantified outside the experiment in order to identify the linear-model-error variance. Scientific interest is primarily in β, but σ²_ξ is by no means a nuisance parameter.
Its restricted maximum likelihood (REML) estimate is:

    s²_ξ = (z − Xβ̂)^⊤ (z − Xβ̂) / (n − p) ,

where β̂ is the maximum likelihood (and also the ordinary least squares) estimate of β. It could almost be a rule that in any study of this sort, you will see s²_ξ > s²_ε. Where did the component of variability, (s²_ξ − s²_ε), go?

Principle 2.3: In any well defined statistical model, there is conservation of variability.

Indeed, the model given by (7) is not defined well enough: The error term ξ_i is, in fact, made up of two components of variance,

    ξ_i = δ_i + ε_i ;  i = 1, . . . , n ,

where {ε_i : i = 1, . . . , n} represent measurement errors. Often forgotten are {δ_i : i = 1, . . . , n}, which represent model errors resulting from using the linear-regression model, {x_i^⊤ β : i = 1, . . . , n}, and which are key in accounting for the inexactness of using any model (linear or nonlinear). The scientific process {y_i : i = 1, . . . , n} is given by y_i = x_i^⊤ β + δ_i, and an observation of it is z_i = y_i + ε_i, for i = 1, . . . , n. The HM captures the variability of z and y beautifully, as follows:

    z | y ∼ N(y, σ²_ε I) ;  y ∼ N(Xβ, σ²_δ I) .                           (8)

Hence, from (8), we obtain the earlier result in vector form:

    z = Xβ + δ + ε ,

where δ and ε are independent mean-zero random vectors. Comparing (7) and (8), we see that the model given by (7) should be augmented with the equation,

    σ²_ξ = σ²_δ + σ²_ε .

Consequently, knowing s²_ε (an estimate of σ²_ε that is obtained outside the experiment) and having computed the REML estimate s²_ξ of σ²_ξ, an estimate of the model-error variance, σ²_δ, can be obtained as,

    s²_δ = s²_ξ − s²_ε

(provided the right-hand side is non-negative). In its simplest form, conservation of variability says that the total variability is equal to the variability due to model uncertainty plus the variability due to measurement uncertainty. When variability is captured using a probability space defined by a HM, this principle can be expressed as:

    var(z) = var(E(z|y)) + E(var(z|y)) .                                  (11)

For example, consider the HM defined by (8). Since var(z) = var(ξ) and var(y) = var(δ), we have

    var(z) = var(E(z|y)) + E(var(z|y)) = σ²_δ + σ²_ε = σ²_ξ ,

and variability is conserved. This might seem obvious to you, or perhaps even trivial. Again consider (7), and suppose you want to predict an unknown value, y_{n+1}, outside the data set, but only the p-dimensional estimate β̂ and the covariate, x_{n+1}, are at your disposal. Most regression textbooks would say that you should use as predictor, x_{n+1}^⊤ β̂. However, a scientist wants to predict the value of the scientific process, y_{n+1} = x_{n+1}^⊤ β + δ_{n+1}. Using the well defined HM (8), its predictor is ŷ_{n+1} = E(y_{n+1}|z) = x_{n+1}^⊤ β̂ + δ̂_{n+1}, where δ̂_{n+1} = E(δ_{n+1}|z). Since {δ_i} are iid N(0, σ²_δ), δ̂_{n+1} = E(δ_{n+1}|z) = 0; however, var(δ_{n+1}|z) is not zero, and that has to be recognised in order to perform valid predictive inference, as given below (e.g., Cressie, 2020). The prediction error is (ŷ_{n+1} − y_{n+1}), and its first two moments are:

    E(ŷ_{n+1} − y_{n+1}) = 0 ;  var(ŷ_{n+1} − y_{n+1}) = x_{n+1}^⊤ var(β̂) x_{n+1} + σ²_δ .

Principle 2.3 also shows up in the analysis of variance (ANOVA) method, where the 'between sum of squares' corresponds to the variance of the conditional expectation in (11), the 'within sum of squares' corresponds to the expectation of the conditional variance in (11), and the 'total sum of squares' corresponds to the left-hand side of (11). Conservation of variability implicitly includes variability and covariability. For example, if two random variables ε_1 and ε_2 have correlation ρ ≠ 0, then it should be recognised that var(ε_1 + ε_2) ≠ var(ε_1) + var(ε_2). This has often been ignored by scientists doing an error budget (e.g., in remote sensing retrievals; see Connor et al., 2016).
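The decomposition s²_δ = s²_ξ − s²_ε can be checked with a small simulation from the HM (8). The sketch below is my own, with made-up variances; it treats σ²_ε as known from instrument calibration and recovers the model-error variance from the REML residual variance.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 500, 3
sigma2_delta, sigma2_eps = 0.8, 0.2            # model-error and measurement-error variances

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])

# HM (8): process y = X beta + delta; data z = y + eps
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2_delta), n)
z = y + rng.normal(0.0, np.sqrt(sigma2_eps), n)

# OLS estimate of beta; REML estimate of sigma^2_xi uses the divisor (n - p)
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ beta_hat
s2_xi = resid @ resid / (n - p)

s2_eps = sigma2_eps                            # assumed known from repeated instrument measurements
s2_delta = max(s2_xi - s2_eps, 0.0)            # conservation of variability: s2_xi = s2_delta + s2_eps
print(f"s2_xi = {s2_xi:.3f} (sigma2_delta + sigma2_eps = {sigma2_delta + sigma2_eps})")
print(f"estimated model-error variance s2_delta = {s2_delta:.3f} (true value {sigma2_delta})")
```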
Depending on the sign of ρ, var(ε_1) + var(ε_2) may be either larger than or smaller than the total variability, since var(ε_1 + ε_2) = var(ε_1) + var(ε_2) + 2ρ var(ε_1)^{1/2} var(ε_2)^{1/2}. The rules of probability theory can explain easily how the variability can seem to disappear, and then they can show us where to find it. It is less natural to do so simply with empirical distributions, because the pairs {(z_i, y_i)} involve the unavailable, hidden variable {y_i}. Further, if {z_i} and {x_j} are two sets of observations, there is no guarantee that they will occur in pairs, and hence the empirical correlation, r, might be difficult to obtain.

The famous bias-variance trade-off is another manifestation of Principle 2.3. In the context of estimation of a fixed but unknown parameter θ with an estimator θ̂(z), where recall z is the data vector, the mean-squared error is the sum of the estimator's squared bias and its variance:

    E(θ̂(z) − θ)² = {E(θ̂(z)) − θ}² + var(θ̂(z)) .

Generally speaking, an estimator that decreases the bias will increase the variance, and vice versa. The mean-squared error might be decreased using different estimators, but there is a trade-off to be made when trying to decrease both bias and variance at the same time.

A statistical analysis cannot get very much simpler than fitting a simple linear regression (a special case of (7)) to {(z_i, x_i) : i = 1, . . . , n}, as follows:

    z_i = β_1 + β_2 x_i + ξ_i ;  i = 1, . . . , n ,                       (12)

where {ξ_i : i = 1, . . . , n} are iid N(0, σ²_ξ). However, is (12) appropriate if {z_i : i = 1, . . . , n} are photon counts at wavelengths in a given band of the electro-magnetic spectrum, or if they are percentages of trace elements in soil? Exploratory data analysis (EDA) might reveal a highly skewed histogram of {z_i}. After plotting the histogram of {log z_i}, we might achieve a more symmetric appearance; assume for the moment that we do. This would prompt a plot of not only {z_i} versus {x_i}, but also of {log z_i} versus {x_i}, and the latter might look more linear. Further EDA, in the form of residual plots, after fitting a simple linear regression to the data {z_i} and one to the transformed {log z_i}, would be carried out. And, for example, it might be that the residual variability of {z_i} appears to increase with x (i.e., heteroskedasticity), but the residual variability of {log z_i} shows no dependence on x (i.e., homoskedasticity). Admittedly, this is a story, not methodology, but in my experience with analysing data, it happens enough to propose the next principle (Cressie, 1985).

Principle 2.4: Seek a transformation of the data after which a linear relation, homoskedasticity, and Gaussianity occur together (at least approximately).

This principle looks more like a doctrine but, as I explain below, a fitting of (12) to the data {z_i} is following this principle, where the transformation is simply the identity. Suppose the quantity y has variabilities that can be expressed as 'large-scale' and 'small-scale.' Scientists usually put their theories in the large scale and their errors (both in the theory and in the measurements) in the small scale. Statistical scientists know that to make inference on the large scale of the scientific process, its small scale has to be modelled well, usually by a random error term, δ. However, the model I now give for the scientific process is more basic, given in terms of those large and small scales of variability: It is additive and of the following form,

    y = Σ_{k=1}^{p} µ^{(k)} + Σ_{l=1}^{N} δ^{(l)} ,                       (13)

where p and N are positive integers and, for convenience, the subscript 'i' has been dropped. For example, the large-scale variation in simple linear regression has p = 2, µ^{(1)} = β_1, and µ^{(2)} = β_2 x.
For the small-scale variation in (13), N is large, and {δ^{(l)} : l = 1, . . . , N} are physically interpretable components with variabilities that are very small but such that their total variability is the variability, σ²_δ, of the small-scale term δ defined below. The model (13) is a physical model where the large scales act additively with each other, the small scales act additively with each other, and the large scales and the small scales interact additively. It can be made statistical by assuming that these many small-scale effects, δ^{(1)}, . . . , δ^{(N)}, are random, statistically independent, have mean zero and variances that are O(1/N), with the same leading coefficient, denoted here as σ²_δ. Now let N be large and define the small-scale variations collectively as δ; then (13) can be written as,

    y = Σ_{k=1}^{p} µ^{(k)} + δ ,

where δ ≡ Σ_{l=1}^{N} δ^{(l)} has mean zero and variance approximately σ²_δ and, by the central limit theorem, is approximately Gaussian. The regression model (7) is of this form. Hence, Principle 2.4 covers many statistical models and offers a plausible explanation of why a linear relation, homoskedasticity, and Gaussianity are often found to occur together after a transformation of the data (e.g., Cressie, 1978). Of course, it is easy to construct probability models where the principle does not hold, but one should keep in mind that probability theory is used to model nature's variability, not the other way around. When nonlinearity is inherent to the physical system, such as would occur when there are barriers or thresholds, the quest looks to be futile. However, after following Principle 2.2, it may just be that the grail, Principle 2.4, is hidden deep in the process model (Berliner et al., 2000).

Taleb (2007) popularised the 'black swan' as a metaphor for a highly consequential event that was thought impossible until it was observed. My response is that 300+ years ago the best scientific minds in Europe were too certain about their science; that is, if they had been asked to put probabilities (according to abundance) on the colours that swans could be: red, orange, . . ., violet, white, black, they would have put 0, 0, . . . , 1, 0, which is a well defined probability mass function. Certainly, black-swan events are not predictable if the scientific model is 100% certain that they do not exist. Because their probability model gave black swans, and indeed coloured (including red) swans, zero probability, this unimagined event did not emerge out of the 'ether' in subsequent inferences, until one was observed. Do we have to wait until a highly unusual event occurs before we are forced to change the probability model? The real lesson from the black-swan meme is that scientific knowledge is never perfect, that modellers need to explore the parameter space thoroughly, and that they need to 'spend' some probability on highly unusual events.

At the time of writing this article, our species is under attack from a virus that was originally called 'novel coronavirus,' so new that it took several weeks before the infection it caused had a name: COVID-19. Virologists certainly had assigned a small positive probability that each new decade would have its own 'novel' pandemic. However, politicians (and most economists) appear to have put zero probability on a severe, worldwide economic disruption. Given a swine-flu-like pandemic has occurred, the conditional probability of some economic disruption is not zero. But, given a severe pandemic has occurred, the conditional probability of severe economic disruption is substantial. This conditional probability is then multiplied by the probability of a severe pandemic, which is by no means zero given the ability of viruses to mutate and occasionally jump from animals to humans.
The product of these two is the joint probability of a severe pandemic followed by severe economic disruption, which is not negligible. This has happened twice in the last hundred years, and it will likely happen more often with humans and animals sharing an ever-more-crowded environment.

Principle 2.5: When building probability models, look carefully where zero probabilities are assumed (perhaps implicitly) and, with the same care, move appropriate probabilities away from zero. Calculate joint probabilities from products of conditional (not marginal) probabilities, unless entropy is maximal.

Unfortunately, uncertainty quantification through joint probabilities all too often comes from multiplying marginal probabilities as if each event in the causal chain were independent. It does not 'hurt' for small probabilities to be assigned to Pr(bird is colour c | bird is a swan), where c also covers the colour 'black.' Then Pr(bird is a swan and it is black) = Pr(bird is black | bird is a swan) × Pr(bird is a swan). The point being made here is philosophical, and it could be applied to Pr(pandemic followed by global economic disruption), for different severities of both events. The worst thing would be to make Pr(global economic disruption | pandemic) equal to zero, since then there would be a lack of planning for the health and economic crises brought on by a pandemic (Mackenzie, 2020). If nothing were known about the conditional probability, a fall-back is the maximum-entropy model where the marginal probability, Pr(bird is colour c), is used in place of Pr(bird is colour c | bird is a swan). The marginal probability is not zero, since it is based on abundance and, for example, there are birds that are predominantly coloured red (e.g., Australia's king parrot). The maximum-entropy principle is discussed in Cressie et al. (2004). Experiences over the last 100 years mean that Pr(global economic disruption), whether caused by a pandemic or not, should not be zero, yet governments described the events of March-April 2020 as unimaginable. Events of small (but not zero) probability, with consequences expressed through a loss function, allow a non-zero expected loss to be calculated, which can then be used to make optimal decisions that mitigate the consequences (e.g., see Berger, 1985, for a discussion of decision theory).

I have just presented five statistical principles that should be useful in Data Science. Data often come with location information, in which case the data scientist will likely be using spatial-analysis methods. In the next section, I present three principles that are specific to Spatial Statistics. Those of us who work in Spatial Statistics will know of Tobler's First Law of Geography (Tobler, 1970). In spatial statistical science, it really is a 'principle' rather than a 'law' and, in the following subsections, I shall present it and two other principles that I have found useful in this area of research.

3.1 Patches in close proximity are commonly more alike . . .

In his famous 1935 book on experimental design, Fisher (1935, p. 66) wrote: 'After choosing the area we usually have no guidance beyond the widely verified fact that patches in close proximity are commonly more alike, as judged by the yield of crops, than those which are further apart.' A spatial statistician sees this as the making of a principle but, at that time, Fisher made a sharp right turn. In his analyses of field trials he applied randomisation to eradicate the pest: spatial correlation!
Cressie and Wikle (2011, Ch. 1 and Ch. 4) give some historical perspective to the work of researchers who took roads less travelled and developed this vibrant area we now call Spatial Statistics. Some of this development comes from Geography, and so it is fitting that the first and most important principle in this section has become known as the First Law of Geography. Originally articulated by Tobler (1970), it is given here in exactly his words.

Principle 3.1: Everything is related to everything else, but near things are more related than distant things (Tobler, 1970).

For example, consider a simple process model, appropriate for a small scene D:

    y(s) = β_1 + β_2 lat(s) + δ(s) ;  s ∈ D ,

where the mean of the process is a linear function of latitude, lat(s), and {δ(s) : s ∈ D} is a spatial error process with mean zero and stationary covariance function, C_δ(h) = cov(δ(s + h), δ(s)) = σ²_δ · exp(−‖h‖/φ). Notice that C_y(h) = cov(y(s + h), y(s)) = C_δ(h); C_y(0) = C_δ(0) = σ²_δ; and the scale parameter φ controls how 'more related' things are. With this parameterisation, φ = 0 is the degenerate case of no spatial correlation and, as φ increases, the spatial correlation increases for a given distance ‖h‖. The data model in this example is also simple, that of additive independent measurement error. The data vector is z = (z(s_1), . . . , z(s_n))^⊤ and

    z(s_i) = y(s_i) + ε_i ;  i = 1, . . . , n ,

where {ε_i : i = 1, . . . , n} are independent mean-zero errors. Assuming for the moment that all the parameters θ = (β^⊤, σ²_δ, φ, σ²_ε)^⊤ are known, inference on {y(s) : s ∈ D} comes from summaries of the predictive distribution, [{y(s) : s ∈ D}|z]. For a Gaussian process model y(·) and Gaussian measurement errors {ε_i}, the predictive distribution is also a Gaussian process whose first two moments, E(y(s)|z) and cov(y(s), y(u)|z), for s, u ∈ D, can be obtained analytically. From Section 2.2, a common predictor of y(s) is ŷ(s) = E(y(s)|z), whose statistical properties are needed for inference. It is straightforward to see that E(ŷ(s)) = E(y(s)) (i.e., the predictor is unbiased) and the mean-squared prediction error is:

    E(ŷ(s) − y(s))² = E(var(y(s)|z)) = var(y(s)|z) ,

which is the predictive variance (the last equality being due to the Gaussian assumptions made). Further summaries might come from well chosen percentiles (e.g., 2.5%, 25%, 50%, 75%, 97.5%) of [y(s)|z].

In spatial statistics, it might appear that the spatial-prediction problem is quite difficult, because the number of predictions to be made is often greater than the number of data. Even for the problem of parameter estimation, the number of 'degrees of freedom' in n spatially dependent data will not be n, as I now show. Applying Principle 3.1 with a stationary spatial covariance model C_z(h) = cov(z(s + h), z(s)), for h ∈ R^d, that exhibits only positive correlations, it is easy to see that, for z̄ ≡ (1/n) Σ_{i=1}^{n} z(s_i),

    var(z̄) > σ²_z / n ,

where σ²_z = C_z(0). Hence one can define the effective degrees of freedom, n_eff, as

    n_eff ≡ σ²_z / var(z̄) < n ;

that is, under the almost ubiquitous spatial model of positive spatial correlation, n_eff < n. In the remote sensing context where there are many observations taken over a short period of time, Zhang et al. (2017) gave an example where n = 2961 and n_eff = 202.3, less than one-tenth of the sample size! These calculations are a warning that, in the spatial (and spatio-temporal) setting, intuition learned from the 'iid errors' model has to be modified, sometimes substantially.
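The effective-degrees-of-freedom calculation can be reproduced with a few lines of code (my own illustration, with made-up locations and parameter values), using the exponential covariance C_z(h) = σ²_z exp(−‖h‖/φ):

```python
import numpy as np

# n locations on a regular grid in a 10 x 10 scene (illustrative)
m = 20
grid = np.linspace(0.0, 10.0, m)
xs, ys = np.meshgrid(grid, grid)
s = np.column_stack([xs.ravel(), ys.ravel()])
n = s.shape[0]

# Exponential covariance with sill sigma2_z and scale phi (made-up values)
sigma2_z, phi = 1.0, 2.0
h = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
C_z = sigma2_z * np.exp(-h / phi)

# var(zbar) = (1/n^2) * sum_ij C_z(s_i - s_j), and n_eff = sigma2_z / var(zbar)
var_zbar = C_z.sum() / n**2
n_eff = sigma2_z / var_zbar
print(f"n = {n}, var(zbar) = {var_zbar:.4f} (iid value {sigma2_z / n:.4f}), n_eff = {n_eff:.1f}")
```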
Critically, the probability model that captures the spatial variability has to be well specified, and we now discuss how this can be done. The classical spatial statistical model consists of a deterministic mean process, {µ(s) : s ∈ D}, and a random spatial error process, δ(s) = y(s) − µ(s), so that

    y(s) = µ(s) + δ(s) ;  s ∈ D ;                                         (14)

see Cressie (1993, Ch. 2). It is often a personal choice what components go into µ(s). For linear regression, µ(s) = x(s)^⊤ β, for s ∈ D, but the set of possible covariates, x(s), typically comes from the modeller's subject-matter knowledge, augmented by spatial trend terms that are linear and higher-order functions of the spatial coordinates. What is the effect of not having included an important spatially varying covariate, x_{p+1}(s) say, in µ(s)? From Principle 2.3, {δ(s)} has to absorb the contribution to spatial variability that {x_{p+1}(s)} would have made had it been included in the regression. This means that a fixed effect that has been inadvertently left out will be picked up in the spatial statistical model (14), as a random effect.

Principle 3.2: Assume a decomposition of a spatial process into fixed effects plus random effects. While it is not unique, the decomposition must be chosen to conserve variability.

This principle is a refinement of Principle 2.3 that is adapted here for Spatial Statistics. The variability of the deterministic-mean component is measured differently from that of a random-error component. Define the regional variance of µ(·) as,

    s²_µ ≡ (1/|D|) ∫_D (µ(s) − µ̄)² ds ,

where µ̄ = (1/|D|) ∫_D µ(s) ds is the regional average of µ(·) (Lahiri et al., 1999). Suppose that a stationary covariance function, C_δ(·), describes the covariability in {δ(s) : s ∈ D}, with σ²_δ = C_δ(0); recall that C_δ(h) = cov(δ(s + h), δ(s)), for h ∈ R^d. Let Ĉ_δ(·) denote the empirical covariance function, and s²_δ = Ĉ_δ(0). Then, according to Principle 3.2, approximately speaking, s²_µ + s²_δ should not change as different decompositions given by (14) are fitted. Now suppose that the additional covariate vector, x_{p+1} = (x_{p+1}(s_1), . . . , x_{p+1}(s_n))^⊤, is included in the regression, and it is not in the column space of X. Then generally, the new empirical covariance function, Ĉ_δ(·), yields a new estimate s²_δ = Ĉ_δ(0). The principle says that this new s²_δ should decrease to allow for the additional spatial variability in {µ(s) : s ∈ D}. (Understandably, the general shape of the new empirical covariance function will change as well.)

How will we ever know which model is better after each is fitted to the same spatial data {z(s_i)}? Model-selection criteria such as 'Akaike Information Criterion,' 'Deviance Information Criterion,' and cross-validation will remove bad models but, in the 'difficult middle,' there is no way to know whether a graph of {z(s_i)} versus {x_{p+1}(s_i)} is showing the behaviour of a component of the mean function or the behaviour of a component of the error process. Germane to Principle 3.2, it is well known to spatial statisticians that a single simulation of a strongly spatially dependent random effect, {δ(s) : s ∈ D}, can look like deterministic spatial trend. If a replicate of the data were available, and if the analogous plot showed the same behaviour as seen in the original plot, then {x_{p+1}(s) : s ∈ D} would probably belong in the mean function µ(·). From just one realisation, the decomposition (14) is not unique, and hence what is one person's mean function could be another person's spatial error.
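The variance bookkeeping behind Principle 3.2 can be illustrated with a toy fit (mine, with made-up covariates and, for simplicity, independent rather than spatially correlated errors): omitting a spatially varying covariate pushes its variability out of the fitted mean and into the residual 'spatial error,' while the total is conserved.

```python
import numpy as np

rng = np.random.default_rng(7)

n = 2000
s = rng.uniform(0.0, 10.0, size=(n, 2))                 # spatial locations in a 10 x 10 scene
x1 = s[:, 1] / 10.0                                     # latitude-like trend covariate
x2 = np.sin(s[:, 0])                                    # another spatially varying covariate
z = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(0.0, 0.5, n) # data

def decompose(X, z):
    """OLS fit; return the variance of the fitted mean and of the residual component."""
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    mu_hat = X @ beta
    return mu_hat.var(), (z - mu_hat).var()

X_without = np.column_stack([np.ones(n), x1])           # x2 omitted: absorbed by the error process
X_with = np.column_stack([np.ones(n), x1, x2])

for label, X in [("x2 omitted ", X_without), ("x2 included", X_with)]:
    s2_mu, s2_delta = decompose(X, z)
    print(f"{label}: s2_mu = {s2_mu:.3f}, s2_delta = {s2_delta:.3f}, "
          f"s2_mu + s2_delta = {s2_mu + s2_delta:.3f}")
```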
Most of us have heard of, or encountered, Simpson's Paradox, which basically says that two variables x and y could be positively correlated but, conditional on a third variable w, it is perfectly acceptable that x and y be negatively correlated (or uncorrelated, or positively correlated)! Usually, Simpson's Paradox is expressed in terms of categorical data, which we could do here by defining ordinal categories, or bins, from the ranges of w, x, and y. Let a sample {(w_i, x_i, y_i) : i = 1, . . . , n} be assigned to the pre-defined bins, creating a three-way contingency table with counts in each cell of the table. A two-way table showing (marginal) dependence between binned x and binned y is obtained by aggregating the counts across the third variable, namely binned w. Then a measure of dependence in the two-way table (e.g., the statistic gamma, due to Goodman and Kruskal, 1954) could show positive dependence, but the same measure applied to the two-way tables of binned x and binned y, conditional on each of the values of binned w, could all show negative (or no, or positive) dependence. What can you do about it? First, understand why it happens, and then spend the rest of your data-science career looking for those lurking variables like w that you may have missed! It manifests in other settings as well, as I now discuss.

Let x and y be jointly Gaussian random variables and related through the simple linear regression model given by (12); the probability model of how y varies with x is as follows: Conditional on x, the random variable y is Gaussian with mean and variance, respectively,

    E(y|x) = E(y) + {ρ_xy var(y)^{1/2} / var(x)^{1/2}}{x − E(x)} ;  var(y|x) = var(y)(1 − ρ²_xy) ,

where ρ_xy = corr(x, y). Now consider regression of y on both x and w, where again joint Gaussianity of random variables w, x, and y is assumed. To make the calculations easier to follow, in the rest of this subsection consider the simple case where E(w) = E(x) = E(y) = 0, and var(w) = var(x) = var(y) = 1. Then the covariances, cov(x, y), cov(w, y), and cov(x, w), are identical to correlations, which are denoted here as ρ_xy, ρ_wy, and ρ_xw, respectively. The regression lines to be compared are,

    E(y|x, w) = {(ρ_xy − ρ_wy ρ_xw)/(1 − ρ²_xw)} x + {(ρ_wy − ρ_xy ρ_xw)/(1 − ρ²_xw)} w ,        (15)

and E(y|x) = 0 + ρ_xy · x. Then, conditional on w,

    corr(x, y|w) = (ρ_xy − ρ_xw ρ_wy) / {(1 − ρ²_xw)^{1/2} (1 − ρ²_wy)^{1/2}} ,                  (16)

which is to be compared to corr(x, y) = ρ_xy. If w has zero correlation with both x and y, then from (16), corr(x, y|w) = corr(x, y) = ρ_xy, which is an intuitively reasonable result. In general, ρ_wy ≠ 0 and ρ_xw ≠ 0, so from (15) and (16) the dependence between x and y conditional on w can be quite different from their marginal dependence, even opposite in sign. What does this have to do with Spatial Statistics? The answer is 'plenty,' if you think about w as being a variable that describes a range of geographical strata. For example, the Australian Bureau of Statistics divides Australia up into small areas, at different levels of aggregation. A study of y = mean weekly income in NSW, Australia, regressed on selected demographic variables x at the finest level of aggregation, was given in Burden et al. (2015). Given the great divide in Australia between 'city' and 'country,' the most obvious variable w to choose is: w = 1 if the small area is in Greater Sydney (city), and otherwise w = 0 (country). In this set-up, it is easy to imagine that the two regressions, E(y|x, w = 1) and E(y|x, w = 0), would result in more interpretable results than the single regression, E(y|x). (In fact, Burden et al., 2015, used Principle 3.2 and captured the 'geography' by using a spatial error term rather than a geographic covariate w.) Simpson's Paradox is potentially everywhere in the spatial context, because regressions of y on x can be done at different levels of spatial aggregation.
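A small simulation (mine, with made-up correlations) confirms formula (16) and shows how the marginal and conditional dependence between x and y can have opposite signs:

```python
import numpy as np

rng = np.random.default_rng(8)

# Standardised, jointly Gaussian (x, w, y) with illustrative correlations
rho_xw, rho_wy, rho_xy = 0.8, 0.8, 0.5
R = np.array([[1.0,    rho_xw, rho_xy],
              [rho_xw, 1.0,    rho_wy],
              [rho_xy, rho_wy, 1.0   ]])               # order: x, w, y
x, w, y = rng.multivariate_normal(np.zeros(3), R, size=200_000).T

marginal = np.corrcoef(x, y)[0, 1]                      # corr(x, y)
conditional = (rho_xy - rho_xw * rho_wy) / np.sqrt((1 - rho_xw**2) * (1 - rho_wy**2))  # formula (16)

# Empirical check of (16): correlate the residuals of x on w and of y on w
rx = x - np.polyval(np.polyfit(w, x, 1), w)
ry = y - np.polyval(np.polyfit(w, y, 1), w)
conditional_emp = np.corrcoef(rx, ry)[0, 1]

print(f"corr(x, y)     = {marginal:+.3f}")              # positive
print(f"corr(x, y | w) = {conditional:+.3f} (formula), {conditional_emp:+.3f} (empirical)")  # negative
```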
A regression of y on x at the finest level of aggregation may show positive dependence, but when the variables are aggregated to a coarser level, the regression may show negative (or no, or positive) dependence! In Sociology, this phenomenon has been called the ecological effect (Robinson, 1950), an unfortunate and misleading name that has no direct connection to Ecology. In Geography, the combination of Simpson's Paradox and the ecological effect has been called the Modifiable Areal Unit Problem, or MAUP; a spatial statistical discussion of MAUP is given in Cressie (1996). Because it is so natural to aggregate processes over space, meta-data in the spatial data set should include the spatial support of each datum. Define y(B) ≡ (1/|B|) ∫_B y(s) ds; then y(B) has spatial support B with volume, |B| = ∫_B ds > 0. In geostatistics, B is called a block, and we say that y(B) is an aggregation (or block average) of the original process. Then a change-of-support (COS) is said to have occurred from point support to spatial support B, with a corresponding change of statistical properties. Data on mutually exclusive supports, {B_1, . . . , B_n}, are typically represented as:

    z(B_i) = y(B_i) + ε_i ;  i = 1, . . . , n .

The simplest model would be {ε_i : i = 1, . . . , n} iid N(0, σ²_ε), but modifications to account for the protocols of measurement and the possible overlap of the supports {B_i} have been developed (e.g., Wikle and Berliner, 2005). The two earlier principles of Spatial Statistics (nearby things are more alike; and decompose spatial variability into fixed effects plus random effects) are joined here by a change-of-support principle. This is the most important of a number of COS properties (e.g., Gotway and Young, 2002), whose discussion I have left for a future time. Let y(s) = y_0 + δ(s), for s ∈ D, where y_0 is a non-degenerate random variable with possibly non-zero mean and independent of {δ(s) : s ∈ D}. Then for B ⊂ D,

    var(y(B)) = σ²_0 + var(δ(B)) ,

where σ²_0 = var(y_0) > 0. Therefore, if var(δ(B)) decreases to 0 as the volume |B| increases, var(y(B)) decreases to σ²_0 > 0. It is this type of behaviour that is of interest to geoscientists analysing remote sensing data. In that literature (reviewed in Cressie, 2018), they distinguish between two types of error, 'systematic error' and 'random error,' as follows.

Random error: An error is a random error if the average of a collection of n of them has variability that decreases to zero like 1/n, as n becomes large.

Systematic error: An error is a systematic error if the average of a collection of n of them has variability that does not decrease to zero, as n becomes large.

These might be considered 'verbal working definitions,' but a statistical scientist looks at these and tries to find an appropriate probability model. The ones I give below are building blocks for parts of a HM. In practice, the most difficult aspect is to know which errors are of which type and how to group them together for inference. For 'random error,' the obvious probability model is: Let δ(s_1), . . . , δ(s_n) be iid random variables with mean 0 and variance σ²_δ. Now the average, δ̄ = (Σ_{i=1}^{n} δ(s_i))/n, has E(δ̄) = 0 and var(δ̄) = σ²_δ/n, which decreases to zero like 1/n. In a 'data-rich' situation and with enough averaging of the data, random errors can be shown to be annihilated by the averaging (using the law of large numbers). In a spatial setting, Principle 3.1 suggests that the independence assumption between the {δ(s_i)} is not appropriate.
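A minimal simulation of the two error types (mine, with made-up variances): averaging n iid 'random errors' drives their variability to zero like 1/n, but adding a common component y_0 leaves a floor of σ²_0 that no amount of averaging removes.

```python
import numpy as np

rng = np.random.default_rng(9)

sigma2_0, sigma2_delta = 0.5, 2.0        # variances of the systematic (y_0) and random (delta) parts
n_sim = 20_000                           # number of independent replications of the averaging

for n in (10, 100, 1000, 10000):         # number of errors being averaged
    delta_bar = rng.normal(0.0, np.sqrt(sigma2_delta / n), n_sim)   # average of n iid random errors
    y0 = rng.normal(0.0, np.sqrt(sigma2_0), n_sim)                  # one common component per replication
    avg_with_systematic = y0 + delta_bar
    print(f"n = {n:5d}: var(random average) = {delta_bar.var():.4f}, "
          f"var(average with y0) = {avg_with_systematic.var():.4f}")
# The first column tends to 0; the second tends to sigma2_0 = 0.5.
```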
Under increasing-domain asymptotics, it can be shown that var(δ̄) still decreases like 1/n (Cressie, 1993); however, strong spatial dependence can make the errors look more systematic than random (Morris and Ebey, 1984). For 'systematic error,' a simple probability model adds a common component: let e(s_i) = y_0 + δ(s_i), for i = 1, . . . , n, where y_0 has mean 0 and variance σ²_0 > 0 and is independent of the iid {δ(s_i)}; then var(ē) = σ²_0 + σ²_δ/n. This does not decrease to zero as n becomes large, and hence e(·) is a form of systematic error. Zhang et al. (2019) used spatial dependence in the {δ(s_i)} and an additive random effect y_0 with mean 0, as part of the OCO-2 calibration of remote sensing data to 'ground-truth' data.

Consider spatial prediction errors, {ŷ(s_i) − y(s_i) : i = 1, . . . , n}, which are located at {s_1, . . . , s_n} in the spatial domain D. Typically, these prediction errors have individual means that are not zero, and it is often the case in Spatial Statistics that there is no replication to estimate them. A way out of the conundrum is to assume exchangeability (e.g., Spiegelhalter et al., 2004, Section 3.17), which results in a spatial statistical model that exhibits both systematic error and random error; see Cressie (2018, Rejoinder).

In the previous sections, I presented five principles for Statistical Science and three special ones for Spatial Statistics. In what follows, I add a ninth principle that speaks to all of Science, not just Data Science, and I give some concluding remarks. There is one further principle that has been my 'rock' in the diverse applied-statistics projects in which I have been involved.

Principle 4.1: Respect the units of measurement: only add quantities expressed in the same units; carry out a unit analysis of every term in a model; and apply transcendental functions (such as log and exp) only to quantities that are unit-free.

The first part of the principle is illustrated with the cartoon I referred to above. You might say you would never add apples and oranges but, if in (7), z is measured in petagrams (Pg) of carbon in Earth's atmosphere and x_1 = 1, x_2 = t (in years), and x_3 = t², then a unit analysis using the second part of the principle would reveal that regression coefficient β_1 is in units of Pg, β_2 is in Pg/yr, and β_3 is in Pg/yr². Principle 4.1 is respected, since it is the regression components {β_k x_k} that are added. However, it is the regression coefficients {β_k} that are often interpreted, so I often standardise the covariates by, respectively, subtracting their averages and dividing by their sample standard deviations. Then, after this standardisation, β_1, β_2, . . . , β_p all have the same units as those of z.

Statistical scientists know that probabilities have no units. Using Principle 4.1, a unit analysis of the fundamental probability equation, ∫ f(y) dy = 1, where f(·) is the probability density function of the random variable y, whose units are, say, Pg of carbon, reveals that f(y) has units of (Pg)^{−1}. Using the same principle, E(y) has units of Pg, and so forth. While probability density functions have units, cumulative distribution functions and probability mass functions do not. The last part of Principle 4.1 applies to any special function that can be expressed as a Taylor series. The most common ones in science are logs and exponents, whose Taylor series are

    log_e(1 + x) = x − x²/2 + x³/3 − · · ·  and  e^x = 1 + x + x²/2! + x³/3! + · · · ,

which only make sense when x has no units; see the first part of Principle 4.1. Euler's number, e, has no units, because e = lim_{n→∞} (1 + 1/n)^n. Counts, percentages, and correlations have no units. Also, the Box-Cox transformation (Box and Cox, 1964), (x^λ − 1)/λ for λ ≠ 0, only makes sense if x has been rendered unit-free. One way to accomplish this is to specify x relative to a given standard. Every term in the process model (and data model) in a HM should be assigned its rightful units.
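A unit analysis can be mimicked numerically (my own sketch, with made-up carbon values): refitting the quadratic trend in (7) with time measured in months instead of years changes the raw regression coefficients by factors of 12 and 144, whereas coefficients of standardised (unit-free) covariates are unchanged and all carry the units of z (Pg).

```python
import numpy as np

rng = np.random.default_rng(10)

t_yr = np.arange(1980, 2021, dtype=float) - 2000.0          # time t in years (centred for convenience)
z_pg = 700 + 4.0 * t_yr + 0.02 * t_yr**2 + rng.normal(0.0, 2.0, t_yr.size)   # carbon in Pg (made up)

def fit(t, z):
    """OLS fit of z on (1, t, t^2); coefficients are in Pg, Pg per time unit, Pg per time unit squared."""
    X = np.column_stack([np.ones_like(t), t, t**2])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta

def fit_standardised(t, z):
    """Same fit after standardising t and t^2; every coefficient is then in Pg."""
    cols = [t, t**2]
    X = np.column_stack([np.ones_like(t)] + [(c - c.mean()) / c.std() for c in cols])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta

t_mo = 12.0 * t_yr                                          # the same time axis, measured in months
print("raw coefficients, t in years: ", np.round(fit(t_yr, z_pg), 4))    # approx. [700, 4, 0.02]
print("raw coefficients, t in months:", np.round(fit(t_mo, z_pg), 5))    # approx. [700, 0.333, 0.00014]
print("standardised, t in years:     ", np.round(fit_standardised(t_yr, z_pg), 3))
print("standardised, t in months:    ", np.round(fit_standardised(t_mo, z_pg), 3))   # identical
```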
The scientific models that make up parts of the process model are sometimes derived theoretically, sometimes empirically. Beware of a 'scientific constant.' It may be an (estimated) regression coefficient, in which case its units (the ratio of the dependent variable's units to the corresponding covariate's units) are key, since that 'constant' might change if the units change. In conclusion, it is a critical part of every collaboration to do a 'unit analysis,' which avoids obvious mistakes made by confusing different parts of different systems of units (e.g., metric and imperial), as well as the more subtle ones discussed above.

Principle 2.5: When building probability models, look carefully where zero probabilities are assumed (perhaps implicitly) and, with the same care, move appropriate probabilities away from zero. Calculate joint probabilities from products of conditional (not marginal) probabilities, unless entropy is maximal. [Could swans be red?]

Principle 3.1: Everything is related to everything else, but near things are more related than distant things (Tobler, 1970). [Patches in close proximity are commonly more alike . . .]

Principle 3.2: Assume a decomposition of a spatial process into fixed effects plus random effects. While it is not unique, the decomposition must be chosen to conserve variability. [What is one person's mean function could be another person's spatial error.]

These nine principles of Statistical Science are personal, leading to a certain amount of self-referencing, but I hope others find them useful. There are no theorem-proof results, but there are back-of-the-envelope calculations that I use to justify the principles in simple settings. At the very least, they should give data scientists boundaries in their analytics that should be respected, and criteria by which supervised and unsupervised machine-learning methods could be assessed. I once introduced Adrian Baddeley at a conference session I was chairing (ASC2010, Fremantle, Western Australia) as a national treasure. I repeat it here, and I wish him a very happy 65th birthday!

References

Berger (1985). Statistical Decision Theory.
Berliner et al. (2000). Long-lead prediction of Pacific SSTs via Bayesian dynamic modeling.
Box and Cox (1964). An analysis of transformations.
Burden et al. (2015). The SAR model for very large datasets: A reduced-rank approach.
Connor et al. (2016). Quantification of uncertainties in OCO-2 measurements of XCO2: Simulations and linear error analysis.
Cressie (1978). Removing nonadditivity from two-way tables with one observation per cell.
Cressie (1985). When are relative variograms useful in geostatistics.
Cressie (1993). Statistics for Spatial Data, rev. edn.
Cressie (1996). Change of support and the modifiable areal unit problem.
Cressie (2018). Mission CO2ntrol: A statistical scientist's role in remote sensing of atmospheric carbon dioxide (with discussion and a rejoinder).
Cressie (2020). When is it Data Science and when is it Data Engineering? Comment on "Prediction, Estimation, and Attribution" by B. Efron.
Cressie et al. (2004). Ecological bias: Use of maximum-entropy approximations.
Cressie and Wikle (2011). Statistics for Spatio-Temporal Data.
Fisher (1935). The Design of Experiments.
Goodman and Kruskal (1954). Measures of association for cross classifications.
Gotway and Young (2002). Combining incompatible spatial data.
Lahiri et al. (1999). Prediction of spatial cumulative distribution functions using subsampling (with discussion and a rejoinder).
Mackenzie (2020). Covid-19: The Pandemic that Should Never Have Happened, and How to Stop the Next One.
Matheron (1963). Traité de Géostatistique Appliquée.
McCullagh and Nelder. Generalized Linear Models.
Morris and Ebey (1984). An interesting property of the sample mean under a first-order autoregressive model.
Nguyen et al. (2019). Sensitivity of Optimal Estimation satellite retrievals to misspecification of the prior mean and covariance, with application to OCO-2 retrievals.
Robinson (1950). Ecological correlations and the behavior of individuals.
Spiegelhalter et al. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation.
Taleb (2007). The Black Swan: The Impact of the Highly Improbable. Random House.
Tobler (1970). A computer movie simulating urban growth in the Detroit region.
Wikle and Berliner (2005). Combining information across spatial scales.
Zhang et al. (2017). Statistical properties of atmospheric greenhouse gas measurements looking down from space and looking up from the ground.
Zhang et al. (2019). Inference for errors-in-variables models in the presence of systematic errors with an application to a remote sensing campaign.