Securing the Empirical Value of Measurement Results

Kent W. Staley
Saint Louis University

July 2, 2016

Abstract

Reports of quantitative experimental results often distinguish between the statistical uncertainty and the systematic uncertainty that characterize measurement outcomes. This paper discusses the practice of estimating systematic uncertainty in High Energy Physics (HEP). The estimation of systematic uncertainty in HEP should be understood as a minimal form of quantitative robustness analysis. The secure evidence framework is used to explain the epistemic significance of robustness analysis. However, the empirical value of a measurement result depends crucially not only on the resulting systematic uncertainty estimate, but on the learning aims for which that result will be used. Philosophically important conceptual and practical questions regarding systematic uncertainty assessment call for further investigation.

1 Introduction

The responsible reporting of measurement results requires the characterization of the quality of the measurement as well as its outcome. But the quality of a measurement is not one-dimensional. Standard practice in particle physics requires that all reports of measurement or estimation results must include quantitative estimates of both the statistical error or uncertainty and the systematic error or uncertainty. (This terminological ambivalence between error and uncertainty is addressed below. In the meantime, I will use ‘uncertainty’ to avoid awkwardness.) Such assessments of uncertainty are essential for the usefulness of measurement, for without them one cannot determine the consistency of two results from different experiments, of two different results from the same experiment, or of a single experimental result with a given theory (Beauchemin 2015).

Although a common practice among experimental particle physicists (and scientists in many other disciplines as well), the reporting of estimates of systematic uncertainty has gone largely unnoticed (or at least un-discussed) by philosophers of science, apart from a few discussions noted in what follows. Such neglect is unfortunate, for discussions of systematic uncertainty open a remarkable window into experimental reasoning. Whereas statistical uncertainty is simply reported, systematic uncertainty is also discussed. Even the most cursory presentation will at least note the main sources of systematic uncertainty, while more careful reports (such as the example discussed in the appendix) detail the ways in which systematic uncertainties arise as well as the methods by which they are assessed. Such discussions require forthright consideration by experimenters of the body of knowledge that they bring to bear on their investigation, the ways in which that knowledge relates to the conclusions they present, and the limitations on that knowledge. This process is epistemologically crucial to the establishment of experimental knowledge.

Moreover, philosophical insight regarding the estimation of systematic uncertainty would be highly valuable. Presently, there is no clear consensus across scientific disciplines regarding the basis or meaning of the distinction between statistical and systematic uncertainty, despite some concerted efforts discussed below. Scientists likewise debate the proper statistical framework in which systematic uncertainty should be evaluated, a debate with important philosophical aspects.
It is the contention of this paper that some progress may come from regarding the estimation of systematic uncertainty as an instance of robustness analysis applied to a model of a single experiment or measurement. More precisely, the determination of systematic uncertainty bounds on a measurement result consists of a weakening of the conclusion of an argument under the guidance of a robustness analysis of its premises within the bounds of what is epistemically possible. Experimentalists thereby establish the sensitivity of the measurement result that is a crucial factor in its empirical value, while also establishing the security of the evidence supporting the measurement result, which is necessary for the cogency of the argument supporting the claim expressing the measurement result. However, I will argue, these two achievements are in tension with one another: as one weakens the conclusion to enhance the security of the evidence, one diminishes the sensitivity of the measurement result itself. Just how much empirical value a measurement result maintains, however, depends not only on the extent to which the sensitivity of the measurement has been weakened by systematic uncertainty bounds, but also on the use to which it will be put.

This account builds on two significant recent contributions to the philosophical study of measurement and measurement quality: Eran Tal’s model-based account of measurement, according to which the evaluation of measurement accuracy is the outcome of a comparison amongst predictions drawn from a model of the measurement process (Tal 2012, 2016), and Hugo Beauchemin’s discussion of systematic uncertainty assessment as an essential component of measurement needed to determine the sensitivity of measurement results in HEP (Beauchemin 2015).

Section 2 discusses the concept of systematic uncertainty, surveying the ways in which uncertainty has been distinguished from error, and systematic uncertainty from statistical uncertainty. Section 3 uses an example of a typical HEP measurement to illustrate the complexities and importance of systematic uncertainty and to introduce some of the debates among particle physicists regarding the appropriate statistical framework for the estimation of systematic uncertainty. Section 4 outlines the secure evidence framework employed in the analysis. I present my argument for viewing systematic uncertainty estimation as a kind of robustness analysis aimed at establishing empirical value in section 5. A brief summary appears in section 6. As an appendix, I present a discussion of an illuminating example from recent particle physics: the ATLAS collaboration’s measurement of the tt̄ production cross section from single lepton decays. The case illustrates the proposed analysis of systematic uncertainty assessment, highlights some subtleties in its application, and exemplifies the pervasive character of modeling and simulation in systematic uncertainty estimation.

2 Systematic uncertainty: the very idea

To facilitate better conceptual understanding, we can begin by clarifying our terminology, with some help from discussions among metrologists. Above I referred to both error and uncertainty as being distinguished into systematic and statistical categories. The two terms have distinct histories of usage in science. The scientific analysis of error dates to the seventeenth century, while “the concept of uncertainty as a quantifiable attribute is relatively new in the history of measurement” (Boumans & Hon 2014, 7).
In practice, particle physicists have not always been careful to distinguish between error and uncertainty. Recent papers from the CMS and ATLAS collaborations focus their discussions on uncertainty rather than error, although usage is not perfectly uniform in this regard.

Metrologists, by contrast, have articulated systematic distinctions between error and uncertainty, as befits the science whose concern is the very act of measurement. Yet the usefulness and definition of these terms remain matters of debate among metrologists, whose Joint Committee for Guides in Metrology (JCGM) publishes the “Guide to the Expression of Uncertainty in Measurement” (GUM) (Joint Committee for Guides in Metrology Working Group I (JCGM-I) 2008) and the “International Vocabulary of Metrology” (VIM) (Joint Committee for Guides in Metrology Working Group II (JCGM-II) 2012). Those debates have turned significantly on the question of the definition of error as articulated in these canonical texts, particularly insofar as that definition appeals to an unobservable, even “metaphysical” concept of the “true value” of the measurand (JCGM-I 2008, 36). Some metrologists have defended the importance of retaining a concept of error defined in terms of a true value or “target value” (Mari & Giordani 2014; Willink 2013; Rabinovich 2007, 95).

It is not the purpose of this paper to debate these issues. For the sake of clarity, I will adopt the terminology of Willink (2013) and understand a measurement to be a process whereby one obtains a numerical estimate x (the measurement result or measurement estimate) of the target value θ of the measurand. This usage allows for the straightforward definition of measurement error as the difference between the measurement result x and the target value θ. We may then regard statistical and systematic error as components of the overall measurement error, and turn our attention to how they are to be distinguished.

Following the JCGM, we could define the former (also called random error) as the difference between the measurement result x and “the mean that would result from an infinite number of measurements of the same measurand carried out under repeatability conditions” (JCGM-I 2008, 37). Note that if the measurement procedure is itself biased, then the latter quantity, i.e., the quantity that would emerge as the mean measurement result in the long-run limit, will not be equal to the target value. It is this difference that is labeled systematic error. One may approach the concept of systematic error by imagining a measurement process in which it is absent. For such a process, the “mean that would result from an infinite number of measurements of the same measurand carried out under repeatability conditions” simply would be the target value. Systematic error, then, is a component of measurement error “that in replicate measurements remains constant or varies in a predictable way” (JCGM-II 2012, 22; see also Willink 2013) and therefore does not disappear in the long run.

Eran Tal, in proposing a model-based account of measurement, has noted an important limitation of this conceptualization of measurement error, which is that it obscures the central role played by the model of the measurement process. The JCGM’s appeal to an infinite number of measurements carried out under repeatability conditions relies implicitly on an unspecified standard as to what constitutes a repetition of a given measurement of a measurand.
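Restated compactly in Willink's notation (this is only a consolidation of the definitions just quoted, with μ_rep introduced here as shorthand for the long-run mean under repeatability conditions), the decomposition reads:

```latex
% x        : measurement result
% \theta   : target value of the measurand
% \mu_{rep}: mean of an infinite sequence of measurements under repeatability conditions
\underbrace{x - \theta}_{\text{measurement error}}
  = \underbrace{(x - \mu_{\mathrm{rep}})}_{\text{statistical (random) error}}
  + \underbrace{(\mu_{\mathrm{rep}} - \theta)}_{\text{systematic error}}
```

The second term is the component that survives replication; note, too, that writing down μ_rep already presupposes some standard for what counts as a repetition of the measurement.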
A model of the measurement process not only supplies that standard, it serves to articulate the quantity measured by a given process and thus helps to specify what kinds of measurement outcomes constitute errors. To make these roles of the model explicit, Tal proposes a “methodological” definition of systematic error as “a discrepancy whose expected value is nonzero between the anticipated or standard value of a quantity and an estimate of that value based on a model of a measurement process” (Tal 2012, 57). The notion of a true value or a target value of the measurand has been supplanted here by an “anticipated or standard value” that must be ascertained through a calibration process, which in turn is understood as a process of modeling the measuring system. Tal’s emphasis on the process by which the error is estimated renders the concept a methodological one, but not yet a purely epistemic concept, for which Tal reserves the term ‘uncertainty’ (ibid., 30).

For purposes of the quantitative treatment of uncertainty, the GUM offers the following definition of it: “parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurand” (ibid.). This definition clearly marks uncertainty as something potentially quantifiable, but also as something epistemic, requiring consideration of some kind of standard of reasonable attribution. What the GUM’s definition does not do, however, is to provide guidance for interpreting this notion of reasonability. Neither does it provide clear guidance in understanding how to characterize the distinction between evaluations of statistical uncertainty and systematic uncertainty.1

3 Systematic uncertainty in HEP

To better appreciate the distinction between statistical and systematic uncertainty, and to think more concretely about the epistemic work accomplished by the evaluation of systematic uncertainty, let us consider the disciplinary practices for dealing with systematic uncertainty as it arises in measurements in HEP. We begin with an example.

3.1 Measuring a cross section

Measurements of cross sections are a standard part of the experimental program of HEP research groups. The cross section σ quantifies the probability of an interaction process yielding a certain outcome, such as the interaction of two protons in an LHC collision event yielding a top quark–anti-top quark pair (the tt̄ production cross section). At its crudest level, such a measurement is simply a matter of counting how many times N, in a given data set, a tt̄ pair was produced, and then dividing N by the number L of collision events that occurred during data collection (the latter number particle physicists call the luminosity). We might call this the “fantasyland” approach to measuring the tt̄ production cross section: σtt̄ = N/L.

Reality intervenes in several ways to drive the physicist out of fantasyland:

(1) tt̄ pairs are not directly observable in particle physics data, but must be identified via the identification of their decay products. These products are also not directly observable but must be inferred from the satisfaction of data selection criteria (“cuts”). Events that satisfy these criteria are candidates for being events containing tt̄ pairs.

(2) tt̄ candidate events may not contain actual tt̄ pairs, i.e., they may not be signal events. Other particle processes can produce data that are indistinguishable from tt̄ decay events. These events are background.
It is the nature of background candidate events that they cannot, given the cuts in terms of which candidates are defined, be distinguished from the signal candidate events that one is aiming to capture; one can only estimate the number to be expected, Nb, and subtract it from the total number of candidate events observed, Nc.

(3) Just as some events that do not contain tt̄ pairs will get counted as candidate events, some events that do contain tt̄ pairs will not get thus counted. This problem has two facets. (3a) The tt̄ production cross section as a theoretical quantity might be thought of in terms of an idealized experiment in which every tt̄ pair created would be subject to detection in an ideal detector with no gaps in its coverage. Since actual detectors do not have the ability to detect every tt̄ event, this limitation of the detector must be taken into account by estimating the acceptance A. (3b) The production and decay of tt̄ pairs are stochastic processes and the resulting decay products will exhibit a distribution of properties. The cuts that are applied to reduce background events will have some probability of eliminating signal events. The solution to this is to estimate the efficiency ε of the cuts: the fraction of signal events that will be selected by the cuts.

(4) The physical properties of the elements in a collision event are not perfectly recorded by the detector. Candidate events are defined in terms of quantitative features of the physical processes of particle production and decay. For example, top quarks decay nearly always to a W boson and a b quark. A tt̄ pair will therefore result in two W bosons, each of which in turn can decay either into a quark-antiquark pair or into a lepton-neutrino pair. To identify tt̄ candidate events via the decay mode in which one of the W bosons decays to a muon (µ) and a muon neutrino (νµ), physicists might impose a cut that requires the event to include a muon with a transverse momentum pT^µ of at least 10 GeV, in order to discriminate against background processes that produce muons with smaller transverse momenta. Whether a given event satisfies this criterion or not depends on a measurement output of the relevant part of the detector, and this measuring device has a finite resolution, meaning that an event that satisfies the requirement pT^µ > 10 GeV might not in fact include a muon with such a large transverse momentum. Conversely, an event might fail the pT^µ cut even though it does in fact include such a muon. These detector resolution effects require physicists who wish to calculate the tt̄ production cross section to base that calculation not simply on the number of candidate events as determined from the comparison of detector outputs to data selection criteria, but on the inferred physical characteristics of the events, taking into account detector resolution effects. This process, called unfolding, requires applying a transformation matrix (estimated by means of simulation) to the detector outputs. As Beauchemin emphasizes, unfolding is not a matter of correcting the data, but of inferring from the detector outputs (via the transformation matrix) the underlying distribution, to which the cuts are then applied (Beauchemin 2015, 23).2

(5) Finally, the luminosity is also not a quantity that is susceptible to direct determination, since distinct events might not get discriminated by the detector, a single event might mistakenly get counted as two distinct events, and some events might be missed altogether.
The luminosity must therefore be estimated.

We have thus gone from the fantasyland calculation σtt̄ = N/L to the physicists’ calculation

σtt̄ = (Nc − Nb) / (ε A L). (1)

This calculation is not merely more complex than the fantasyland calculation. Every quantity involved in it is the outcome of an inference from a mixture of theory, simulation, and data (from the current experiment or from other experiments).3 Each has its own sources of uncertainty that the careful physicist is obliged to take into account.

3.2 Methodological debates

But how ought one to take these uncertainties into account? HEP lacks a clear consensus.

Discussions about the conceptualization of error and uncertainty, and about their classification into categories such as statistical and systematic, are inseparably bound up with debates regarding the statistical framework in which these quantities should be estimated and expressed. When discussion focuses on statistical error alone, the applicability of a strictly frequentist conception of probability stirs up no significant controversy. One can clearly incorporate into one’s model of an experiment or measuring device a distribution function representing the relative frequency with which the experiment or device would indicate a range of output values (results) for a given value of the measurand. Such a model, which will inevitably involve some idealization, can be warranted by a chain of calibrations. Indeed, as argued by Tal (2012), the warranting of inferences from measurement results in general requires such idealization. One can then incorporate this distribution of measurement errors into one’s account of the measurement’s impact on the uncertainty regarding the value of the measurand.

Systematic errors cannot be treated in this same straightforward manner. Consider the paradigmatic example of systematic error: a biased measuring device. Suppose that a badly constructed ruler for measuring length systematically adds 0.5 cm whenever one measures a 10.0 cm length. Repeated measurements of a given 10.0 cm standard length would produce results that cluster according to some distribution around the expectation value 10.5 cm. The difference between the 10.0 cm standard length and the expectation value of 10.5 cm just is the systematic error on such measurements. If we know that this bias is present, we can eliminate it by correction.

The problem of the estimation of systematic uncertainty arises precisely when one cannot apply the correction strategy because the magnitude of the error is unknown. The investigator knows that a systematic error might be present, and the problem is to give reasonable bounds on its possible magnitude. To this problem the notion of a frequentist probability distribution has no obvious direct applicability; the error is either systematically present (and with some particular, but unknown, magnitude), or it is not.4 As a consequence, investigators employing frequentist statistics to evaluate statistical uncertainty must report systematic uncertainty separately, as particle physicists typically do. The quantities denoted ‘statistical’ and ‘systematic’ in a statement such as ‘σtt̄ = 187 ± 11 (stat.) +18/−17 (syst.) pb’ are conceptually heterogeneous. Combining them into a single quantity and calling it the “total uncertainty” is problematic.

One response to this problem is to adopt the Bayesian conception of probability as a measure of degree of belief.
Such a shift from frequentist to Bayesian probabilities is a natural concomitant of the shift from an Error Approach to an Uncertainty Approach as discussed in the VIM, because the expectation value of a Bayesian probability distribution is no longer understood as the mean in the long-run limit, but as the average of all possible measurement results weighted by how strongly the investigator believes that a given result will obtain, when applied to a given measurand. A putative advantage of this Bayesian approach is that it allows for the straightforward synthesis of statistical and systematic uncertainties into a single quantity. Both Sinervo (2003) and the JCGM (JCGM-I 2008, 57) cite this as a point in favor of a Bayesian understanding of the probabilities in a quantitative treatment of systematic uncertainty.

Adopting a Bayesian approach comes with well-known difficulties, however, also acknowledged by Sinervo (see also Barlow 2002). Investigators must provide a prior distribution for each parameter that contributes to the systematic uncertainty in a given measurement. Just what the constraints on such prior distributions ought to be (aside from simple coherence) is very unclear.

A third approach to the problem employs a hybrid of Bayesian and frequentist techniques. The Cousins–Highland method relies on a calculation that takes a frequentist probability distribution (giving rise to the statistical error) and “smears” it out by applying a Bayesian probability distribution to whatever parameters of that distribution are sources of systematic uncertainty (Cousins & Highland 1992). The basic idea is this: suppose that one has a set of observations xi, i = 1, ..., n, distributed according to p(x|θ), and that the data {xi} are to be used to make inferences about the parameter θ. Now, suppose that such inferences require assumptions about the value of λ, an additional parameter, the value of which is subject to some uncertainty. The hybrid method involves introducing a prior distribution, π(λ), to enable the calculation of a modified probability distribution pCH(x|θ) = ∫ p(x|θ, λ) π(λ) dλ, which then becomes the basis for statistical inferences (Sinervo 2003, 128).

The Cousins–Highland hybrid approach yields, as critics have noted (Cranmer 2003; Sinervo 2003), neither a coherent Bayesian nor a coherent frequentist conception. The statistical distribution p(x|θ) is intended to be frequentist, but the prior distribution on λ has no truly frequentist significance, leaving the modified distribution pCH(x|θ) without any coherent probability interpretation. Cousins and Highland defend the approach on the grounds that it adheres as closely as possible to a frequentist approach while avoiding “physically unacceptable” consequences of a “consistently classical” approach, viz., that when deriving an upper limit on a quantity from two otherwise identical experiments, the stricter limit will be derivable from the one that has a larger systematic uncertainty (Cousins & Highland 1992; Cousins 1995).

The discussion thus far serves to illustrate some of the ways in which the conceptual underpinnings of systematic uncertainty estimation remain unsettled. The conceptual disorder not only poses an intellectual problem, but contributes to ongoing confusion and controversy over the appropriate methodology for estimating uncertainty.
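To make the hybrid construction concrete, here is a minimal numerical sketch of the Cousins–Highland smearing for a toy counting experiment (the numbers and the Gaussian form of π(λ) are illustrative assumptions, not drawn from any actual analysis): the observed count is Poisson-distributed with a mean that depends on the parameter of interest θ and an efficiency-like nuisance parameter λ, and the frequentist probability is averaged over draws from the prior.

```python
import numpy as np
from scipy.stats import poisson

# Toy counting experiment: n ~ Poisson(theta * lam + b), with theta the parameter
# of interest, lam an efficiency-like nuisance parameter, and b a fixed background.
n_obs, b = 12, 3.0
lam_nominal, lam_sigma = 0.80, 0.08      # prior knowledge of lam: Gaussian(0.80, 0.08)

def p_classical(n, theta, lam):
    """Frequentist sampling probability p(n | theta, lam)."""
    return poisson.pmf(n, theta * lam + b)

def p_cousins_highland(n, theta, n_draws=100_000, seed=1):
    """Smeared probability p_CH(n | theta), i.e. the integral of
    p(n | theta, lam) * pi(lam) over lam, approximated by Monte Carlo
    draws from the Gaussian prior pi(lam)."""
    rng = np.random.default_rng(seed)
    lam = rng.normal(lam_nominal, lam_sigma, n_draws)
    lam = np.clip(lam, 1e-6, None)       # keep the Poisson mean positive
    return p_classical(n, theta, lam).mean()

# The smeared curve is broader in theta than the unsmeared one evaluated at the
# nominal lam: this is how the lam-uncertainty is folded into inferences about theta.
for theta in (8.0, 11.0, 14.0):
    print(theta,
          round(float(p_classical(n_obs, theta, lam_nominal)), 4),
          round(p_cousins_highland(n_obs, theta), 4))
```

As the critics quoted above point out, the weights π(λ) in this average have no frequentist meaning, so the resulting pCH has no settled probability interpretation; the sketch is meant only to display the mechanics of the smearing.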
Moreover, the problems spill over into the use of uncertainty bounds when determining the compatibility of one measurement result with another, or with a theoretical prediction. The problem then arises as to how statistical and systematic uncertainties will be combined. It is common to add them in quadrature, for example when using a χ2 fit test, but doing so introduces distributional assumptions that may be unwarranted, and that may not reflect the manner in which the systematic uncertainty was in fact determined in the first place. One could simply add the two uncertainty components in a linear manner, but this would in many cases significantly and unnecessarily reduce the sensitivity of the measurement results (an issue that will be explored further below). HEP currently lacks a satisfactory methodology for treating systematic uncertainty.

Without attempting to resolve such thorny methodological disputes at once, we can progress towards a more satisfactory treatment of uncertainty estimation by first grasping more clearly the epistemic work such estimates seek to accomplish. I will argue that by working within the secure evidence framework, we can at least partially assimilate the epistemic work accomplished by systematic uncertainty estimation to that accomplished by robustness analysis.

That systematic uncertainty estimation provides a means for investigating the robustness of a measurement result has already been argued in an illuminating essay by Beauchemin (2015). By quantifying both the uncertainty of a measurement result and its sensitivity to the phenomena under investigation, he argues, scientists can quantify the scientific value of a measurement result and provide criteria for minimizing the circularity that arises from the theory-laden aspect of measurement. Here I aim to build upon the insights of Beauchemin by providing a broader epistemological framework for understanding what is achieved through robustness analysis in the context of estimating systematic uncertainty. The secure evidence framework I invoke involves no explicit commitment to either frequentist or Bayesian statistical frameworks. The relevant modality in the framework is possibility, which I take to be conceptually prior to probability, insofar as no probability function can be specified without specifying a space of objects (whether events, propositions, or sentences) over which the function is to be defined.

4 The secure evidence framework

Here I explain the secure evidence framework that will be used to analyze these issues (Staley 2004; Staley & Cobb 2011; Staley 2012, 2014). On the one hand, we might wish to think of the evidence or support for a hypothesis provided by the outcome of a test of that hypothesis in objective terms, such that facts about the epistemic situation of the investigator are irrelevant. On the other hand, it seems quite plausible that evaluating the claims that an investigator makes about such evidential relations may require determining what kinds of errors investigators are in a position to justifiably regard as having been eliminated, which does depend on their epistemic situation. The secure evidence framework provides a set of concepts for understanding the relationship between the situation of an epistemic agent (either individual or corporate) and the objective evidential relationships that obtain between the outcomes of tests and the hypotheses that are tested.
This account relies centrally on a concern with possible errors, and explicitly understands the relevant modality for possible errors to be epistemic. An epistemic agent who evaluates the evidential bearing of some body of data x0 with regard to a hypothesis H must also consider the possibilities for error in the evaluation thus generated. This is the “critical mode” of evidential evaluation. Evidential judgments rely on premises, and errors in the premises of such a judgment may result in errors in the conclusion. A responsible evaluation of evidence therefore requires consideration of the ways the world might be, such that a putative evidential judgment would be incorrect. The evaluator must reflect on what, among the propositions relevant to the judgment in question, may safely be regarded as known, and what propositions must be regarded as assumed, but possibly incorrect.

Such possibilities of error are here regarded as epistemic possibilities. Often, when one makes a statement in the indicative that something might be the case, one is expressing an epistemic possibility, with what must be the case functioning as its dual expressing epistemic necessity (Kratzer 1977). An expression roughly picking out the same modality (at least for the singular first-person case) is ‘for all I know’, as in ‘for all I know there is still some ice cream in the freezer’. Theorists have offered a range of views regarding the semantics of epistemic modality (see Egan & Weatherson 2011), with various versions of contextualism and relativism among the contending positions. For our purposes we need only note that what is epistemically possible for an epistemic agent does depend on that agent’s knowledge state and that when an agent acquires new knowledge it always follows that some state of affairs that was previously epistemically possible for that agent ceases to be so.

The epistemic possibilities that are relevant to the critical mode of evidential judgment are error scenarios, which are to be understood as follows: Suppose that P is some proposition and S is an epistemic agent considering the assertion of P, whose epistemic situation (her situation regarding what she knows, is able to infer from what she knows, and her access to information) is K. Then a way the world (or some relevant part of the world) might be, relative to K, such that P is false, we will call an error scenario for P relative to K. Of special importance for this discussion are error scenarios for evidence claims, where a statement EC is an evidence claim if it expresses a proposition of the form ‘data x from test T are evidence for hypothesis H’ (such hypotheses may include statements about the value of some measurand in a measurement process, and it is here assumed that measurement procedures and hypothesis testing are epistemologically susceptible to the same analysis).

Suppose that, relative to a certain epistemic situation K, there is a set of scenarios that are epistemically possible, and call that set Ω0. If proposition P is true in every scenario in the range Ω0, then P is fully secure relative to K. If P is true across some more limited portion Ω1 of Ω0 (Ω1 ⊆ Ω0), then P is secure throughout Ω1. To put this notion more intuitively, a proposition is secure for an epistemic agent just insofar as that proposition remains true, whatever might be the case for that agent. Thus defined, security is applicable to any proposition, but the application of interest here is to evidence claims.
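The definition can be stated compactly (this is only a restatement of what was just said, writing Ω0(K) for the set of scenarios epistemically possible relative to K):

```latex
% Security of a proposition P relative to an epistemic situation K
P \text{ is secure throughout } \Omega_1 \subseteq \Omega_0(K)
  \iff P \text{ is true in every scenario } \omega \in \Omega_1 ;
\qquad
P \text{ is fully secure relative to } K
  \iff P \text{ is secure throughout } \Omega_0(K) .
```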
Note that inquirers might never be called upon to quantify the degree of security of any of their inferences. The methodologically significant concept is not security per se, but the securing of evidence, i.e., the pursuit of strategies that increase or assess the relative security of an evidence claim. Two strategies for making inferences more secure are the weakening of evidential conclusions, to render them immune to otherwise threatening error scenarios, and the strengthening of premises, in which additional information is gathered that rules out previously threatening error scenarios. Robustness analysis constitutes a strategy for assessing the security of an evidence claim by investigating classes of potential error scenarios to determine which scenarios are and which are not compatible with a given evidential conclusion (Staley 2014). The present analysis aims to show how the consideration and evaluation of systematic uncertainty constitutes an application of the weakening strategy under the guidance of robustness analysis.

5 Systematic uncertainty assessment as robustness analysis

To see how the evaluation of systematic uncertainty can be viewed as a variety of robustness analysis, consider again the example of measuring the tt̄ production cross section in proton-proton collisions. Recall how equation 1 relates that quantity to other empirically accessible quantities: σtt̄ = (Nc − Nb)/(εAL). This equation is a premise of an argument for a conclusion about the value of σtt̄, as are statements attributing values to each of the variables in the equation.

If we take the conclusion of such an argument to be the attribution of some definite value σtt̄ = σtt̄0, then (because the data are finite) the premises would fail to provide cogent support for the conclusion. Were we to repeat the exact same experiment, it is very probable that a different value assignment would result. It is not clear that such a statement should be considered an experimental conclusion at all. It appears to be a statement about the value of the quantity σtt̄, but its usefulness for empirical inquiry is severely limited by the fact that the comparison of any two such point-value determinations is effectively certain to yield the result that they disagree.

To restore cogency and empirical value, it is necessary to replace the conclusion σtt̄ = σtt̄0 with one stating σtt̄ = σtt̄0 ± δ. Call this the unmodulated conclusion. The addendum ‘±δ’ expresses the statistical uncertainty that is a function of the number of candidate events, Nc. But it will also be necessary to report the systematic uncertainty resulting from imperfect knowledge of the acceptance, efficiency, background, and luminosity, as well as the unfolding matrix used to determine the number of candidate events. That conclusion, which retains the statement of statistical uncertainty but adds the results of an assessment of systematic uncertainty, we can call the modulated conclusion.

Importantly, lack of knowledge enters into statistical and systematic uncertainties in quite different ways. In the case of statistical uncertainty, the relevant lack of knowledge concerns the exact value of the quantity σtt̄, the measured value of which is reported in the conclusion. The aim of the inquiry is to reduce this uncertainty, and knowledge of the quantities on the right hand side of equation 1 is a means toward the achievement of this aim.
The statistical uncertainty that remains once those means have been deployed is a consequence of the fact that any finite number of observations yields only partial information about the value of σtt̄. For the purposes of assessing statistical uncertainty, however, the premises in the argument for this conclusion are assumed to be determinately and completely known. We take ourselves to know how many candidate events were counted, for example, and as long as that number is finite, there will be some corresponding statistical uncertainty on the estimate of σtt̄. In other words, for the purposes of assessing statistical uncertainty, we concern ourselves only with the possibility of errors in the conclusion of the argument, errors that take the form of assigning incorrect values to the measurand. For this purpose, the model of the measuring process is taken to be adequate and the premises of the argument are assumed to be free of error.

Systematic uncertainty may then be regarded as arising from the consideration that in fact the premises are not determinately and completely known. Physicists are not in a position to know that a premise attributing a definite value to, say, ε is true. To derive an estimate of σtt̄, some value must be assigned, but, having made such an assignment, investigators must confront the fact that other value assignments are compatible with what they know about the detector and the physical processes involved in the experiment.

The assessment of systematic uncertainty tackles this problem of incomplete knowledge by, in some way, exploring the extent to which varying the assumed values asserted in the premises, within the bounds of what is possible given the investigators’ knowledge, makes a difference to the conclusion drawn regarding the value of the measured quantity. This effectively generates an ensemble of arguments corresponding to the considered range of possible value-assignment premises. By considering such a range of arguments with possibly correct premises, the investigators can then report a weakened conclusion such as σtt̄ = 187 ± 11 (stat.) +18/−17 (syst.) pb, the correctness of which would be supported by the soundness of any one of the arguments in the ensemble.

From the perspective of the secure evidence framework, such assessments can be regarded as a combination of robustness and weakening strategies. Investigators begin with a set of data xi, i = 1, ..., n, reporting observations relevant to the measurement of some quantity θ. Deriving an estimate θ̂0 from these data (with a statistical uncertainty depending on the value of n) requires assigning values λj = λj0, j = 1, ..., m, to each of a set of m imperfectly known parameters. This yields the unmodulated conclusion, which we might also regard as the unsecured conclusion, in the sense that the strategies for securing evidence claims mentioned previously have not yet been applied to it.

Consideration then turns to limitations on the investigators’ knowledge of the initial set of model assumptions. Using the robustness analysis strategy, alternate sets of assumptions that are compatible with the investigator’s epistemic state are considered, taking the form λj = λj0 + εj for each λj and for a range of values of εj, depending on the extent to which existing knowledge constrains the possible values of λj.
This yields a range of derived estimates lying within an interval θ̂0 +δ1/−δ2, which then can provide the basis for a logically weakened conclusion incorporating both statistical and systematic uncertainties. The convergence of estimates generated by the ensemble provides the basis for the robustness of this modulated conclusion.

The account just given is misleading in one respect. The evaluation of systematic uncertainty is typically not a matter of directly varying the input quantities in equation 1. Instead, physicists look upstream, to the methods and models employed in the determination of those input quantities, and introduce variation there. (The appendix gives examples of how this is done.) In some cases, this is a matter of varying the value of some parameter in a model; in other cases it is a matter of swapping one model for another. What is at issue in such variation is not so much the question of what might be the true value of the parameter or the true model (notions that might be inapplicable in a given case) as which models or parameter values within a model might be adequate for the purposes of the inference that is being undertaken (W. S. Parker 2009; W. Parker 2012).

The secure evidence framework gives us a new perspective on what is distinctive about the assessment of systematic uncertainty: it involves consideration of what might be the case regarding parameters in the model of the experiment to which values must be assigned in order to derive an estimate of the measured quantity. Systematic uncertainty is thus concerned with possible errors and inadequacies in the premises of an argument based on a model of the measurement process. The determination of systematic uncertainty involves the determination of reasonable bounds on the possibility of errors in those premises and inadequacies in the model of the process. Such analysis rests on an ensemble of completed models, each of which corresponds to a potential error scenario regarding claims about the value of the measured quantity. Only through the consideration of the outputs of such an ensemble can a systematic uncertainty be assessed, whereas claims about statistical uncertainty require only a single completed model.

Such a perspective also helps us to understand why systematic uncertainty is important in science and should be important in the philosophy of science. As noted in the introduction and documented in the appendix, while statistical uncertainties are simply reported, a good assessment of systematic uncertainties involves careful consideration of a wide range of factors relevant to the conclusion being drawn from the data, as well as careful probing of the limitations on the investigators’ prior knowledge of those factors. Whereas a model of the experiment is used to arrive at an unmodulated conclusion, the modulated conclusion depends on a stage of critical assessment of that model, achieved through a process of robustness analysis.

The empirical value of a measurement result depends not only on its security, however, but also on its sensitivity, as explained in a recent paper by Beauchemin (2015). The sensitivity of a measurement result depends on the comparison of “(1) the difference in the values of a given observable when calculated with different theoretical assumptions to be tested with the measurement; and (2) the uncertainty on the measurement result” (ibid., 29).
Only if the systematic uncertainty is sufficiently small in comparison to the differences between different theoretically calculated quantity values can the measurement result be used to discriminate empirically between the competing theoretical assumptions. Sensitivity considerations thus reveal the cost of applying the weakening strategy to enhance the security of a measurement result: One can weaken the conclusion of an argument for such a result by enlarging the systematic uncertainty bounds applied to it, thus achieving a conclusion that is secure across a broader range of epistemically possible scenarios. However, if the systematic uncertainties thus reported exceed the differences between values of the measurand yielded by the theoretical assumptions amongst which one aims to discriminate, then the level of security achieved renders the measurement result empirically worthless relative to that testing aim.

6 Conclusion

These considerations highlight some important remaining questions. The most pressing concern the determination of relevant epistemic possibilities for purposes of setting systematic uncertainty bounds. The determination of systematic uncertainty involves the determination of reasonable bounds on the possibility of errors in the premises of the measurement argument and inadequacies in the model of the measurement process. Which possibilities deserve consideration? What criteria ought to be considered when determining standards of reasonableness on such bounds? The previously mentioned debates over methodology can be thought of as debates over the best way to approach such questions.

Given the scant attention that the evaluation of systematic uncertainty has received in the philosophical literature, it is to be expected that conclusions drawn at this stage of inquiry should be provisional and exploratory in nature. I have proposed that, while many questions regarding the assessment of systematic uncertainty remain unresolved, a good first step towards a philosophical appreciation of this practice is to regard it as a kind of robustness analysis. The secure evidence framework provides a context for understanding robustness analysis in general that also supports this identification of systematic uncertainty assessment as a kind of robustness analysis. That this view of systematic uncertainty assessment coheres with Tal’s model-based account of measurement is an additional virtue of the present account.

Further discussion is, of course, needed. Of particular importance is to engage the philosophical aspects of the debate over statistical methodology in systematic uncertainty estimation. Gaining clarity about the aims of systematic uncertainty estimation is crucial for this purpose. I have argued here for the view that the aim of this practice is to secure the evidence supporting the modulated conclusions that are the result of the assessment of systematic uncertainty.

7 Appendix: measuring the tt̄ production cross section

Because assessments of systematic uncertainty are considered essential to the publication of any experimental result in particle physics, the choice of an example to illustrate the practice is largely arbitrary. Here I present a recent example from the ATLAS group at the Large Hadron Collider (LHC) at CERN, which also illustrates the pervasive character of modeling and simulation in the analysis of experimental data in contemporary High Energy Physics (HEP).
The cross section for a given state quantifies the rate at which that state is produced out of some particle process. Cross sections serve as crucial parameters in the Standard Model, and, especially in the case of the top quark, provide important constraints on the viability of numerous Beyond–Standard Model theories as well. Both ATLAS and CMS, its neighbor at the LHC, have published a number of measurements of the production cross section for both top–anti-top (tt̄) pairs (Aad et al. 2012; Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 2012a; Khachatryan et al. 2011; Chatrchyan et al. 2013) and for single top quarks (Aad, Abbott, Abdallah, Khalek, et al. 2012). The present discussion focuses on a measurement of the top quark pair production cross section based on a search for top quark decays indicated by a single high-momentum lepton (electron or muon) and jets produced by strong interaction processes characterized by Quantum Chromodynamics (QCD).

Let us begin with the result that ATLAS reports:5

σtt̄ = 187 ± 11 (stat.) +18/−17 (syst.) pb. (2)

We have already seen the essential logic of such a measurement: one seeks to estimate the rate at which tt̄ pairs are produced from the number of tt̄ candidate events in the data. Making that inference, however, demands estimates of the quantities in equation 1, and the accurate estimation of those quantities demands skill and expert judgment. The complete estimate of uncertainty in this case draws on more considerations than I can address in a brief discussion (see figure 1), but the following should suffice to communicate the nature of the problem.

Consider first the estimation of signal acceptance and efficiency. Estimating these quantities requires consideration of characteristics of the detector, drawing on the engineering knowledge of the detector’s design and construction as well as experimental knowledge of its performance characteristics. This background knowledge forms the basis for a computer simulation of the detector itself. Estimating signal acceptance also involves knowledge of the characteristics of the signal itself: how are tt̄ pairs produced in the proton-proton collisions generated at the LHC, and how do they behave once they have been produced? To model the tt̄ signal, ATLAS uses a variety of simulation models and a variety of parameter values within those models. It is this variation of modeling assumptions and their role in the production of systematic uncertainty estimates that I wish to emphasize.

[Figure 1: Table of statistical and systematic uncertainties for two different analyses, one requiring events to include a jet from a b-quark (“tagged”) and one without that requirement (“untagged”). Note that everything below the first line (“statistical error”) is a contribution to the systematic uncertainty. The total systematic uncertainty is calculated by adding the individual contributions in quadrature. From Aad et al. 2012b, 250.]
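As an aside on how the headline numbers combine: the ATLAS Letter also reports a separate ±6 pb luminosity uncertainty for the tagged analysis and a combined total of 187 +22/−21 pb, and that total is consistent with adding the quoted components in quadrature, the same rule the caption of Figure 1 describes for the individual systematic contributions:

```latex
% Combining the quoted components of the tagged-analysis result in quadrature
% (stat. = 11 pb, syst. = +18/-17 pb, lumi. = 6 pb):
\delta^{+} = \sqrt{11^{2} + 18^{2} + 6^{2}} \approx 22\ \mathrm{pb},
\qquad
\delta^{-} = \sqrt{11^{2} + 17^{2} + 6^{2}} \approx 21\ \mathrm{pb}.
```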
To simulate the production of tt̄ pairs in the collider environment, numerous complex stochastic QCD processes must be estimated, none of which can be calculated exactly from theory. The underlying event is the interaction between colliding high-energy protons. Particles involved in the collisions and their subsequent decay products also emit QCD radiation, which is relevant to the calculation of the probability of various outcomes. Both the Initial State Radiation (ISR, prior to the beam collision) and Final State Radiation (FSR, subsequent to the collision) must be modeled as well. Finally, quarks and gluons that are produced in these processes become hadrons (bound states of quarks with other quarks), a process known as hadronization.

To estimate the rate at which tt̄ pairs produced in √s = 7 TeV proton-proton collisions will qualify as candidate events, ATLAS must simulate these physical processes using a collection of computer simulations that have been developed over the years by physicists. The simulations are based on theoretical principles and constrained by existing data from previous particle physics endeavors. They use the Monte Carlo method of generating approximate solutions to equations that cannot be solved analytically. ATLAS relies primarily on the Herwig (Hadron Emission Reactions With Interfering Gluons) event generator. To model the further development of a collision event, ATLAS used the event generator MC@NLO (for Monte Carlo at Next-to-Leading-Order). This is a simulation that calculates QCD processes at the level of next-to-leading-order accuracy but also models the parton showers of QCD radiation that result from proton-proton collisions. To further complicate things, the outcome of a proton-proton collision depends on the way in which the momentum of the proton is distributed among its constituent partons, described probabilistically by the Parton Distribution Function (PDF). So crucial is the judicious choice of PDF in the simulation of particle processes at the LHC that a special LHC working group (PDF4LHC) has devoted its efforts to the formulation of recommendations for the choice of PDF sets for particular LHC analyses (Botje, Butterworth, Cooper-Sarkar, de Roeck, et al. 2011).

To evaluate the systematic uncertainty in their acceptance estimate, ATLAS has to consider the potential errors introduced by their reliance on particular simulations and particular assumptions that must be stipulated to apply those simulations. They state directly that “The use of simulated tt̄ samples to calculate the signal acceptance gives rise to various sources of systematic uncertainty. These arise from the choice of the event generator and PDF set, and from the modeling of initial and final state radiation” (Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 245). Evaluation of these uncertainties involves the quantitative assessment of how much difference variations in those assumptions make to the estimate they generate.

In explanation of their approach to this task, ATLAS notes that to evaluate uncertainties due to the “choice of generator and parton shower model” they compared the results they had obtained using MC@NLO with those obtained using an alternate simulation called Powheg, using either Herwig or Pythia (an alternate event generator) to model the hadronization process.
Yet another generator, called AcerMC, in combination with Pythia, was used to assess the uncertainty introduced by ISR/FSR assumptions, “varying the parameters controlling the ISR/FSR emission by a factor of two up and down” (ibid.). Finally, to evaluate the “uncertainty in the PDF set used to generate tt̄ samples,” ATLAS employed “a range of current PDF sets” following the procedure recommended by the PDF4LHC working group. Figure 1 gives the results of these procedures for two different analyses, one requiring events to include a jet from a b-quark (“tagged”) and one without that requirement (“untagged”), under the heading “tt̄ signal modelling.” That table also tabulates all other sources of systematic uncertainty, yielding totals arrived at by adding the individual contributions in quadrature (i.e., the total systematic uncertainty is equal to the square root of the sum of the squares of the individual contributions). One of the categories of systematic uncertainty is “object selection,” under which heading we find the entries “JES [Jet Energy Scale] and jet energy resolution” and “Lepton reconstruction, identification and trigger.” The motivation for these entries concerns the way in which candidate events are defined, which is in terms of the identification of decay products with certain properties. For example, the ATLAS analysis focused on tt̄ decays with a single high-momentum lepton (electron or muon) and jets from QCD processes. The implementation was based on the idea that measurements of energy deposits in the detector could be used to identify a track as resulting from the passage of an electron (or muon), whose momentum transverse to the beam could then be measured to determine whether it satisfied the threshold requirement of pT > 20 GeV. Only events including such a high-momentum lepton and at least three jets with pT > 25 GeV (and meeting further requirements) could be counted as candidate events. The identification of an event as including a high-momentum lepton and three high-momentum jets, however, has its own uncertainty, and this is what the “object selection” uncertainty seeks to quantify. There is always some chance that energy will be deposited in the various detector components in a way that will “fool” the detector into thinking that a high-pT electron has passed when it has not. The uncertainty that results from this possibility entails that the number of candidate events itself is to some extent uncertain. Nonetheless, on the present account, such uncertainty remains systematic in character insofar as its estimation relies on the consideration of alternate values of a parameter in a model of the experiment. To assess systematic uncertainty in the untagged analysis, ATLAS reports that they

performed pseudo-experiments (PEs) with simulated samples which included the various sources of uncertainty. For example, for the JES uncertainty, PEs were performed with jet energies scaled up and down according to their uncertainties and the impact on the cross-section was evaluated. (Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 250)

Although a different methodology was used in the tagged analysis, that methodology likewise relied on variation of parameter values in a model of the experiment. It is precisely the strategy of varying the inputs to a model-based estimation procedure within the bounds of what is possible, given the limitations on one’s knowledge, that is indicative of a robustness analysis in this context.
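The pseudo-experiment strategy quoted above can be given a similarly schematic illustration. The toy Python fragment below is not the ATLAS procedure, which varies fitted templates rather than a simple counting selection; it shows only the general pattern of scaling jet energies up and down by an assumed JES uncertainty, re-applying a selection, and recording the resulting shift. The 25 GeV jet threshold is the one described above; the 4% scale shift and the toy jet spectrum are invented for illustration.

import random

random.seed(1)

JET_THRESHOLD = 25.0   # GeV; the jet pT cut described in the text
JES_SHIFT = 0.04       # assumed (invented) 4% jet energy scale uncertainty

def toy_events(n):
    """Generate n toy events, each a list of four jet pT values in GeV."""
    return [[random.expovariate(1 / 40.0) for _ in range(4)] for _ in range(n)]

def selected_fraction(events, scale=1.0):
    """Fraction of events with at least three jets above threshold after
    scaling every jet energy by `scale`."""
    passed = sum(
        1 for jets in events
        if sum(pt * scale > JET_THRESHOLD for pt in jets) >= 3
    )
    return passed / len(events)

events = toy_events(100_000)
nominal = selected_fraction(events)
up = selected_fraction(events, 1.0 + JES_SHIFT)
down = selected_fraction(events, 1.0 - JES_SHIFT)

# The relative shift in the selected fraction propagates into the extracted
# cross-section (sigma = N_observed / (acceptance x luminosity)), so the up
# and down shifts are recorded as the JES contribution to the systematic
# uncertainty.
print(f"nominal selected fraction: {nominal:.4f}")
print(f"JES scaled up:   {100 * (up - nominal) / nominal:+.1f}%")
print(f"JES scaled down: {100 * (down - nominal) / nominal:+.1f}%")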
References

Aad, G., Abbott, B., Abdallah, J., Abdelalim, A., Abdesselam, A., Abdinov, O., . . . Zwalinski, L. (2012a). Measurement of the top quark pair production cross section in pp collisions at √s = 7 TeV in dilepton final states with ATLAS. Physics Letters B, 707(5), 459–477. doi: 10.1016/j.physletb.2011.12.055

Aad, G., Abbott, B., Abdallah, J., Abdelalim, A., Abdesselam, A., Abdinov, O., . . . Zwalinski, L. (2012b). Measurement of the top quark pair production cross-section with ATLAS in the single lepton channel. Physics Letters B, 711(3–4), 244–263. doi: 10.1016/j.physletb.2012.03.083

Aad, G., Abbott, B., Abdallah, J., Khalek, S. A., Abdelalim, A., Abdesselam, A., . . . Zwalinski, L. (2012). Measurement of the t-channel single top-quark production cross section in pp collisions at √s = 7 TeV with the ATLAS detector. Physics Letters B, 717(4–5), 330–350. doi: 10.1016/j.physletb.2012.09.031

Aad, G., Abbott, B., Abdallah, J., Khalek, S. A., Abdelalim, A., Abdinov, O., . . . Zwalinski, L. (2012). Measurement of the top quark pair cross section with ATLAS in pp collisions at √s = 7 TeV using final states with an electron or a muon and a hadronically decaying τ lepton. Physics Letters B, 717(1–3), 89–108. doi: 10.1016/j.physletb.2012.09.032

Barlow, R. (2002). Systematic errors: Facts and fictions. Retrieved from arXiv:hep-ex/0207026

Beauchemin, P.-H. (2015). Autopsy of measurements with the ATLAS detector at the LHC. Synthese, 1–38. doi: 10.1007/s11229-015-0944-5

Botje, M., Butterworth, J., Cooper-Sarkar, A., de Roeck, A., et al. (2011). The PDF4LHC working group interim recommendations (Tech. Rep.). (arXiv:1101.0538)

Boumans, M., & Hon, G. (2014). Introduction. In M. Boumans, G. Hon, & A. Petersen (Eds.), Error and uncertainty in scientific practice (pp. 1–12). London: Pickering and Chatto.

Chatrchyan, S., Khachatryan, V., Sirunyan, A., Tumasyan, A., Adam, W., Aguilo, E., . . . Swanson, J. (2013). Measurement of the tt̄ production cross section in pp collisions at √s = 7 TeV with lepton + jets final states. Physics Letters B, 720(1–3), 83–104. doi: 10.1016/j.physletb.2013.02.021

Cousins, R. D. (1995). Why isn’t every physicist a Bayesian? American Journal of Physics, 63, 398–410.

Cousins, R. D., & Highland, V. L. (1992). Incorporating systematic uncertainties into an upper limit. Nuclear Instruments and Methods in Physics Research, A320, 331–335.

Cranmer, K. (2003). Frequentist hypothesis testing with background uncertainty. In L. Lyons, R. Mount, & R. Reitmeyer (Eds.), Statistical problems in particle physics, astrophysics, and cosmology: Proceedings of PHYSTAT 2003 (pp. 261–264). Stanford, CA: SLAC.

Egan, A., & Weatherson, B. (Eds.). (2011). Epistemic modality. New York: Oxford University Press.

Joint Committee for Guides in Metrology Working Group I. (2008). Evaluation of measurement data – guide to the expression of uncertainty in measurement. Joint Committee for Guides in Metrology. Retrieved from http://www.bipm.org/en/publications/guides/gum.html

Joint Committee for Guides in Metrology Working Group II. (2012). International vocabulary of metrology – basic and general concepts and associated terms. Joint Committee for Guides in Metrology. Retrieved from http://www.bipm.org/en/publications/guides/vim.html

Khachatryan, V., Sirunyan, A., Tumasyan, A., Adam, W., Bergauer, T., Dragicevic, M., . . . Weinberg, M. (2011).
First measurement of the cross section for top-quark pair production in proton–proton collisions at √s = 7 TeV. Physics Letters B, 695(5), 424–443. doi: 10.1016/j.physletb.2010.11.058

Kratzer, A. (1977). What ‘must’ and ‘can’ must and can mean. Linguistics and Philosophy, 1, 337–55.

Mari, L., & Giordani, A. (2014). Modelling measurement: Error and uncertainty. In M. Boumans, G. Hon, & A. Petersen (Eds.), Error and uncertainty in scientific practice (pp. 79–96). London: Pickering and Chatto.

Parker, W. (2012). Scientific models and adequacy-for-purpose. The Modern Schoolman, 87.

Parker, W. S. (2009). Confirmation and adequacy-for-purpose in climate modelling. Aristotelian Society Supplementary Volume, 83(1), 233–249. doi: 10.1111/j.1467-8349.2009.00180.x

Parker, W. S. (2015). Computer simulation, measurement, and data assimilation. The British Journal for the Philosophy of Science. doi: 10.1093/bjps/axv037

Rabinovich, S. (2007). Towards a new edition of the “Guide to the expression of uncertainty in measurement”. Accreditation and Quality Assurance, 12(11), 603–608. doi: 10.1007/s00769-007-0284-3

Sinervo, P. (2003). Definition and treatment of systematic uncertainties in high energy physics and astrophysics. In L. Lyons, R. Mount, & R. Reitmeyer (Eds.), Statistical problems in particle physics, astrophysics, and cosmology: Proceedings of PHYSTAT 2003 (pp. 122–129). Stanford, CA: SLAC.

Staley, K. W. (2004). Robust evidence and secure evidence claims. Philosophy of Science, 71, 467–488.

Staley, K. W. (2012). Strategies for securing evidence through model criticism. European Journal for Philosophy of Science, 2, 21–43. doi: 10.1007/s13194-011-0022-x

Staley, K. W. (2014). Experimental knowledge in the face of theoretical error. In M. Boumans, G. Hon, & A. Petersen (Eds.), Error and uncertainty in scientific practice (pp. 39–55). London: Pickering and Chatto.

Staley, K. W., & Cobb, A. (2011). Internalist and externalist aspects of justification in scientific inquiry. Synthese, 182, 475–492. doi: 10.1007/s11229-010-9754-y

Tal, E. (2012). The epistemology of measurement: A model-based account (Unpublished doctoral dissertation). University of Toronto, Toronto.

Tal, E. (2016). Making time: A study in the epistemology of measurement. The British Journal for the Philosophy of Science, 67(1), 297–335. doi: 10.1093/bjps/axu037

Willink, R. (2013). Measurement uncertainty and probability. New York: Cambridge University Press.

Notes

1. In fact, the GUM eschews this distinction in favor of a purely operationalist distinction between Type A uncertainty and Type B uncertainty, based on the method by which uncertainty is evaluated. An evaluation of uncertainty “by the statistical analysis of series of observations” is Type A. Any other means of evaluation is classified as Type B.

2. In the example discussed by Beauchemin, the measurement aims, not at the total cross section as here discussed, but at the differential cross section with respect to transverse momentum of the leading jet, making the application of unfolding to the measurement of that specific quantity all the more important.

3. This discussion affirms Wendy Parker’s recent argument that computer simulations can be “embedded” in measurement practices (W. S. Parker 2015).

4. See, however, Cranmer 2003 for a step towards a strictly frequentist approach.
Willink (2013) also argues that a frequentist construal of systematic uncertainty bounds is applicable for the consumer, rather than the producer, of measurement results, if one adopts an enlarged view of the measurement process to include the “background steps” that introduce systematic errors, so that any one measurement result can be regarded as having been drawn from a population that includes a variety of different background steps.

5. The paper reports two cross-section estimates using different techniques. The second estimate (σtt̄ = 173 ± 17 (stat.) +18/−16 (syst.) pb) does not employ a technique for tagging jets containing b quarks. ATLAS reports the systematic uncertainty due to the estimate of luminosity separately, adding another ±6 pb to the measurements from each method. Both results, ATLAS states, agree with one another and with QCD calculations, but the method using b-tagging “has a better a priori sensitivity and constitutes the main result of this Letter” (Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 244).