Securing the Empirical Value of Measurement Results

Kent W. Staley
Saint Louis University

July 2, 2016

Abstract

Reports of quantitative experimental results often distinguish between the statistical uncertainty and the systematic uncertainty that characterize measurement outcomes. This paper discusses the practice of estimating systematic uncertainty in High Energy Physics (HEP). The estimation of systematic uncertainty in HEP should be understood as a minimal form of quantitative robustness analysis. The secure evidence framework is used to explain the epistemic significance of robustness analysis. However, the empirical value of a measurement result depends crucially not only on the resulting systematic uncertainty estimate, but on the learning aims for which that result will be used. Philosophically important conceptual and practical questions regarding systematic uncertainty assessment call for further investigation.

1 Introduction

The responsible reporting of measurement results requires the characterization of the quality of the measurement as well as its outcome. But the quality of a measurement is not one-dimensional. Standard practice in particle physics requires that all reports of measurement or estimation results must include quantitative estimates of both the statistical error or uncertainty and the systematic error or uncertainty. (This terminological ambivalence between error and uncertainty is addressed below. In the meantime, I will use ‘uncertainty’ to avoid awkwardness.) Such assessments of uncertainty are essential for the usefulness of measurement, for without them one cannot determine the consistency of two results from different experiments, of two different results from the same experiment, or of a single experimental result with a given theory (Beauchemin 2015).

Although a common practice among experimental particle physicists (and scientists in many other disciplines as well), the reporting of estimates of systematic uncertainty has gone largely unnoticed (or at least un-discussed) by philosophers of science, apart from a few discussions noted in what follows. Such neglect is unfortunate, for discussions of systematic uncertainty open a remarkable window into experimental reasoning. Whereas statistical uncertainty is simply reported, systematic uncertainty is also discussed. Even the most cursory presentation will at least note the main sources of systematic uncertainty, while more careful reports (such as the example discussed in the appendix) detail the ways in which systematic uncertainties arise as well as the methods by which they are assessed. Such discussions require forthright consideration by experimenters of the body of knowledge that they bring to bear on their investigation, the ways in which that knowledge relates to the conclusions they present, and the limitations on that knowledge. This process is epistemologically crucial to the establishment of experimental knowledge.

Moreover, philosophical insight regarding the estimation of systematic uncertainty would be highly valuable. Presently, there is no clear consensus across scientific disciplines regarding the basis or meaning of the distinction between statistical and systematic uncertainty, despite some concerted efforts discussed below. Scientists likewise debate the proper statistical framework in which systematic uncertainty should be evaluated, a debate with important philosophical aspects.
It is the contention of this paper that some progress may come from regarding the estimation of systematic uncertainty as an instance of robustness analysis applied to a model of a single experiment or measurement. More precisely, the determination of systematic uncertainty bounds on a measurement result consists of a weakening of the conclusion of an argument under the guidance of a robustness analysis of its premises within the bounds of what is epistemically possible. Experimentalists thereby establish the sensitivity of the measurement result that is a crucial factor in its empirical value, while also establishing the security of the evidence supporting the measurement result, which is necessary for the cogency of the argument supporting the claim expressing the measurement result. However, I will argue, these two achievements are in tension with one another: as one weakens the conclusion to enhance the security of the evidence, one diminishes the sensitivity of the measurement result itself. Just how much empirical value a measurement result maintains, however, depends not only on the extent to which the sensitivity of the measurement has been weakened by systematic uncertainty bounds, but also on the use to which it will be put.

This account builds on two significant recent contributions to the philosophical study of measurement and measurement quality: Eran Tal’s model-based account of measurement, according to which the evaluation of measurement accuracy is the outcome of a comparison amongst predictions drawn from a model of the measurement process (Tal 2012, 2016), and Hugo Beauchemin’s discussion of systematic uncertainty assessment as an essential component of measurement needed to determine the sensitivity of measurement results in HEP (Beauchemin 2015).

Section 2 discusses the concept of systematic uncertainty, surveying the ways in which uncertainty has been distinguished from error, and systematic uncertainty from statistical uncertainty. Section 3 uses an example of a typical HEP measurement to illustrate the complexities and importance of systematic uncertainty and to introduce some of the debates among particle physicists regarding the appropriate statistical framework for the estimation of systematic uncertainty. Section 4 outlines the secure evidence framework employed in the analysis. I present my argument for viewing systematic uncertainty estimation as a kind of robustness analysis aimed at establishing empirical value in section 5. A brief summary appears in section 6. As an appendix, I present a discussion of an illuminating example from recent particle physics: the ATLAS collaboration’s measurement of the tt̄ production cross section from single lepton decays. The case illustrates the proposed analysis of systematic uncertainty assessment, highlights some subtleties in its application, and exemplifies the pervasive character of modeling and simulation in systematic uncertainty estimation.

2 Systematic uncertainty: the very idea

To facilitate better conceptual understanding, we can begin by clarifying our terminology, with some help from discussions among metrologists. Above I referred to both error and uncertainty as being distinguished into systematic and statistical categories. The two terms have distinct histories of usage in science. The scientific analysis of error dates to the seventeenth century, while “the concept of uncertainty as a quantifiable attribute is relatively new in the history of measurement” (Boumans & Hon 2014, 7).
In practice, particle physicists have not always been careful to distinguish between error and uncertainty. Recent papers from the CMS and ATLAS collaborations focus their discussions on uncertainty rather than error, although usage is not perfectly uniform in this regard.

Metrologists, by contrast, have articulated systematic distinctions between error and uncertainty, as befits the science whose concern is the very act of measurement. Yet the usefulness and definition of these terms remain matters of debate among metrologists, whose Joint Committee for Guides in Metrology (JCGM) publishes the “Guide to the Expression of Uncertainty in Measurement” (GUM) (Joint Committee for Guides in Metrology Working Group I (JCGM-I) 2008) and the “International Vocabulary of Metrology” (VIM) (Joint Committee for Guides in Metrology Working Group II (JCGM-II) 2012). Those debates have turned significantly on the question of the definition of error as articulated in these canonical texts, particularly insofar as that definition appeals to an unobservable, even “metaphysical” concept of the “true value” of the measurand (JCGM-I 2008, 36). Some metrologists have defended the importance of retaining a concept of error defined in terms of a true value or “target value” (Mari & Giordani 2014; Willink 2013; Rabinovich 2007, 95).

It is not the purpose of this paper to debate these issues. For the sake of clarity, I will adopt the terminology of Willink (2013) and understand a measurement to be a process whereby one obtains a numerical estimate x (the measurement result or measurement estimate) of the target value θ of the measurand. This usage allows for the straightforward definition of measurement error as the difference between the measurement result x and the target value θ. We may then regard statistical and systematic error as components of the overall measurement error, and turn our attention to how they are to be distinguished.

Following the JCGM, we could define the former (also called random error) as the difference between the measurement result x and “the mean that would result from an infinite number of measurements of the same measurand carried out under repeatability conditions” (JCGM-I 2008, 37). Note that if the measurement procedure is itself biased, then the latter quantity, i.e., the quantity that would emerge as the mean measurement result in the long-run limit, will not be equal to the target value. It is this difference that is labeled systematic error. One may approach the concept of systematic error by imagining a measurement process in which it is absent. For such a process, the “mean that would result from an infinite number of measurements of the same measurand carried out under repeatability conditions” simply would be the target value. Systematic error, then, is a component of measurement error “that in replicate measurements remains constant or varies in a predictable way” (JCGM-II 2012, 22; see also Willink 2013) and therefore does not disappear in the long run.

Eran Tal, in proposing a model-based account of measurement, has noted an important limitation of this conceptualization of measurement error, which is that it obscures the central role played by the model of the measurement process. The JCGM’s appeal to an infinite number of measurements carried out under repeatability conditions relies implicitly on an unspecified standard as to what constitutes a repetition of a given measurement of a measurand.
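Restated compactly in Willink's notation (this is only a consolidation of the definitions just quoted, with μ_rep introduced here as shorthand for the long-run mean under repeatability conditions), the decomposition reads:

```latex
% x        : measurement result
% \theta   : target value of the measurand
% \mu_{rep}: mean of an infinite sequence of measurements under repeatability conditions
\underbrace{x - \theta}_{\text{measurement error}}
  = \underbrace{(x - \mu_{\mathrm{rep}})}_{\text{statistical (random) error}}
  + \underbrace{(\mu_{\mathrm{rep}} - \theta)}_{\text{systematic error}}
```

The second term is the component that survives replication; note, too, that writing down μ_rep already presupposes some standard for what counts as a repetition of the measurement.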
A model of the measurement process not only supplies that standard, it serves to articulate the quantity measured by a given process and thus helps to specify what kinds of measurement outcomes constitute errors. To make these roles of the model explicit, Tal proposes a “methodological” definition of systematic error as “a discrepancy whose expected value is nonzero between the anticipated or standard value of a quantity and an estimate of that value based on a model of a measurement process” (Tal 2012, 57). The notion of a true value or a target value of the measurand has been supplanted here by an “anticipated or standard value” that must be ascertained through a calibration process, which in turn is understood as a process of modeling the measuring system. Tal’s emphasis on the process by which the error is estimated renders the concept a methodological one, but not yet a purely epistemic concept, for which Tal reserves the term ‘uncertainty’ (ibid., 30).

For purposes of the quantitative treatment of uncertainty, the GUM offers the following definition of it: “parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurand” (ibid.). This definition clearly marks uncertainty as something potentially quantifiable, but also as something epistemic, requiring consideration of some kind of standard of reasonable attribution. What the GUM’s definition does not do, however, is to provide guidance for interpreting this notion of reasonability. Neither does it provide clear guidance in understanding how to characterize the distinction between evaluations of statistical uncertainty and systematic uncertainty.1

3 Systematic uncertainty in HEP

To better appreciate the distinction between statistical and systematic uncertainty, and to think more concretely about the epistemic work accomplished by the evaluation of systematic uncertainty, let us consider the disciplinary practices for dealing with systematic uncertainty as it arises in measurements in HEP. We begin with an example.

3.1 Measuring a cross section

Measurements of cross sections are a standard part of the experimental program of HEP research groups. The cross section σ quantifies the probability of an interaction process yielding a certain outcome, such as the interaction of two protons in an LHC collision event yielding a top quark–anti-top quark pair (the tt̄ production cross section). At its crudest level, such a measurement is simply a matter of counting how many times N, in a given data set, a tt̄ pair was produced, and then dividing N by the number L of collision events that occurred during data collection (the latter number particle physicists call the luminosity). We might call this the “fantasyland” approach to measuring the tt̄ production cross section: σtt̄ = N/L.

Reality intervenes in several ways to drive the physicist out of fantasyland:

(1) tt̄ pairs are not directly observable in particle physics data, but must be identified via the identification of their decay products. These products are also not directly observable but must be inferred from the satisfaction of data selection criteria (“cuts”). Events that satisfy these criteria are candidates for being events containing tt̄ pairs.

(2) tt̄ candidate events may not contain actual tt̄ pairs, i.e., they may not be signal events. Other particle processes can produce data that are indistinguishable from tt̄ decay events. These events are background.
It is the nature of background candidate events that they cannot, given the cuts in terms of which candidates are defined, be distinguished from the signal candidate events that one is aiming to capture; one can only estimate the number to be expected, Nb, and subtract it from the total number of candidate events observed, Nc.

(3) Just as some events that do not contain tt̄ pairs will get counted as candidate events, some events that do contain tt̄ pairs will not get thus counted. This problem has two facets. (3a) The tt̄ production cross section as a theoretical quantity might be thought of in terms of an idealized experiment in which every tt̄ pair created would be subject to detection in an ideal detector with no gaps in its coverage. Since actual detectors do not have the ability to detect every tt̄ event, this limitation of the detector must be taken into account by estimating the acceptance A. (3b) The production and decay of tt̄ pairs are stochastic processes and the resulting decay products will exhibit a distribution of properties. The cuts that are applied to reduce background events will have some probability of eliminating signal events. The solution to this is to estimate the efficiency ε of the cuts: the fraction of signal events that will be selected by the cuts.

(4) The physical properties of the elements in a collision event are not perfectly recorded by the detector. Candidate events are defined in terms of quantitative features of the physical processes of particle production and decay. For example, top quarks decay nearly always to a W boson and a b quark. A tt̄ pair will therefore result in two W bosons, each of which in turn can decay either into a quark-antiquark pair or into a lepton-neutrino pair. To identify tt̄ candidate events via the decay mode in which one of the W bosons decays to a muon (µ) and a muon neutrino (νµ), physicists might impose a cut that requires the event to include a muon with a transverse momentum pT^µ of at least 10 GeV, in order to discriminate against background processes that produce muons with smaller transverse momenta. Whether a given event satisfies this criterion or not depends on a measurement output of the relevant part of the detector, and this measuring device has a finite resolution, meaning that an event that satisfies the requirement pT^µ > 10 GeV might not in fact include a muon with such a large transverse momentum. Conversely, an event might fail the pT^µ cut even though it does in fact include such a muon. These detector resolution effects require physicists who wish to calculate the tt̄ production cross section to base that calculation not simply on the number of candidate events as determined from the comparison of detector outputs to data selection criteria, but on the inferred physical characteristics of the events, taking into account detector resolution effects. This process, called unfolding, requires applying a transformation matrix (estimated by means of simulation) to the detector outputs. As Beauchemin emphasizes, unfolding is not a matter of correcting the data, but of inferring from the detector outputs (via the transformation matrix) the underlying distribution, to which the cuts are then applied (Beauchemin 2015, 23).2

(5) Finally, the luminosity is also not a quantity that is susceptible to direct determination, since distinct events might not get discriminated by the detector, a single event might mistakenly get counted as two distinct events, and some events might be missed altogether.
The luminosity must therefore be estimated.

We have thus gone from the fantasyland calculation σtt̄ = N/L to the physicists’ calculation

σtt̄ = (Nc − Nb) / (ε A L). (1)

This calculation is not merely more complex than the fantasyland calculation. Every quantity involved in it is the outcome of an inference from a mixture of theory, simulation, and data (from the current experiment or from other experiments).3 Each has its own sources of uncertainty that the careful physicist is obliged to take into account.

3.2 Methodological debates

But how ought one to take these uncertainties into account? HEP lacks a clear consensus.

Discussions about the conceptualization of error and uncertainty, and about their classification into categories such as statistical and systematic, are inseparably bound up with debates regarding the statistical framework in which these quantities should be estimated and expressed. When discussion focuses on statistical error alone, the applicability of a strictly frequentist conception of probability stirs up no significant controversy. One can clearly incorporate into one’s model of an experiment or measuring device a distribution function representing the relative frequency with which the experiment or device would indicate a range of output values (results) for a given value of the measurand. Such a model, which will inevitably involve some idealization, can be warranted by a chain of calibrations. Indeed, as argued by Tal (2012), the warranting of inferences from measurement results in general requires such idealization. One can then incorporate this distribution of measurement errors into one’s account of the measurement’s impact on the uncertainty regarding the value of the measurand.

Systematic errors cannot be treated in this same straightforward manner. Consider the paradigmatic example of systematic error: a biased measuring device. Suppose that a badly constructed ruler for measuring length systematically adds 0.5 cm whenever one measures a 10.0 cm length. Repeated measurements of a given 10.0 cm standard length would produce results that cluster according to some distribution around the expectation value 10.5 cm. The difference between the 10.0 cm standard length and the expectation value of 10.5 cm just is the systematic error on such measurements. If we know that this bias is present, we can eliminate it by correction.

The problem of the estimation of systematic uncertainty arises precisely when one cannot apply the correction strategy because the magnitude of the error is unknown. The investigator knows that a systematic error might be present, and the problem is to give reasonable bounds on its possible magnitude. To this problem the notion of a frequentist probability distribution has no obvious direct applicability; the error is either systematically present (and with some particular, but unknown, magnitude), or it is not.4 As a consequence, investigators employing frequentist statistics to evaluate statistical uncertainty must report systematic uncertainty separately, as particle physicists typically do. The quantities denoted ‘statistical’ and ‘systematic’ in a statement such as ‘σtt̄ = 187 ± 11 (stat.) +18/−17 (syst.) pb’ are conceptually heterogeneous. Combining them into a single quantity and calling it the “total uncertainty” is problematic.

One response to this problem is to adopt the Bayesian conception of probability as a measure of degree of belief.
Such a shift from frequentist to Bayesian probabilities is a natural concomitant of the shift from an Error Approach to an Uncertainty Approach as discussed in the VIM, because the expectation value of a Bayesian probability distribution is no longer understood as the mean in the long-run limit, but as the average of all possible measurement results weighted by how strongly the investigator believes that a given result will obtain, when applied to a given measurand. A putative advantage of this Bayesian approach is that it allows for the straightforward synthesis of statistical and systematic uncertainties into a single quantity. Both Sinervo (2003) and the JCGM (JCGM-I 2008, 57) cite this as a point in favor of a Bayesian understanding of the probabilities in a quantitative treatment of systematic uncertainty.

Adopting a Bayesian approach comes with well-known difficulties, however, also acknowledged by Sinervo (see also Barlow 2002). Investigators must provide a prior distribution for each parameter that contributes to the systematic uncertainty in a given measurement. Just what the constraints on such prior distributions ought to be (aside from simple coherence) is very unclear.

A third approach to the problem employs a hybrid of Bayesian and frequentist techniques. The Cousins–Highland method relies on a calculation that takes a frequentist probability distribution (giving rise to the statistical error) and “smears” it out by applying a Bayesian probability distribution to whatever parameters of that distribution are sources of systematic uncertainty (Cousins & Highland 1992). The basic idea is this: suppose that one has a set of observations xi, i = 1, ..., n, distributed according to p(x|θ), and that the data {xi} are to be used to make inferences about the parameter θ. Now, suppose that such inferences require assumptions about the value of λ, an additional parameter, the value of which is subject to some uncertainty. The hybrid method involves introducing a prior distribution, π(λ), to enable the calculation of a modified probability distribution pCH(x|θ) = ∫ p(x|θ, λ) π(λ) dλ, which then becomes the basis for statistical inferences (Sinervo 2003, 128).

The Cousins–Highland hybrid approach yields, as critics have noted (Cranmer 2003; Sinervo 2003), neither a coherent Bayesian nor a coherent frequentist conception. The statistical distribution p(x|θ) is intended to be frequentist, but the prior distribution on λ has no truly frequentist significance, leaving the modified distribution pCH(x|θ) without any coherent probability interpretation. Cousins and Highland defend the approach on the grounds that it adheres as closely as possible to a frequentist approach while avoiding “physically unacceptable” consequences of a “consistently classical” approach, viz., that when deriving an upper limit on a quantity from two otherwise identical experiments, the stricter limit will be derivable from the one that has a larger systematic uncertainty (Cousins & Highland 1992; Cousins 1995).

The discussion thus far serves to illustrate some of the ways in which the conceptual underpinnings of systematic uncertainty estimation remain unsettled. The conceptual disorder not only poses an intellectual problem, but contributes to ongoing confusion and controversy over the appropriate methodology for estimating uncertainty.
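To make the hybrid construction concrete, here is a minimal numerical sketch of the Cousins–Highland smearing for a toy counting experiment (the numbers and the Gaussian form of π(λ) are illustrative assumptions, not drawn from any actual analysis): the observed count is Poisson-distributed with a mean that depends on the parameter of interest θ and an efficiency-like nuisance parameter λ, and the frequentist probability is averaged over draws from the prior.

```python
import numpy as np
from scipy.stats import poisson

# Toy counting experiment: n ~ Poisson(theta * lam + b), with theta the parameter
# of interest, lam an efficiency-like nuisance parameter, and b a fixed background.
n_obs, b = 12, 3.0
lam_nominal, lam_sigma = 0.80, 0.08      # prior knowledge of lam: Gaussian(0.80, 0.08)

def p_classical(n, theta, lam):
    """Frequentist sampling probability p(n | theta, lam)."""
    return poisson.pmf(n, theta * lam + b)

def p_cousins_highland(n, theta, n_draws=100_000, seed=1):
    """Smeared probability p_CH(n | theta), i.e. the integral of
    p(n | theta, lam) * pi(lam) over lam, approximated by Monte Carlo
    draws from the Gaussian prior pi(lam)."""
    rng = np.random.default_rng(seed)
    lam = rng.normal(lam_nominal, lam_sigma, n_draws)
    lam = np.clip(lam, 1e-6, None)       # keep the Poisson mean positive
    return p_classical(n, theta, lam).mean()

# The smeared curve is broader in theta than the unsmeared one evaluated at the
# nominal lam: this is how the lam-uncertainty is folded into inferences about theta.
for theta in (8.0, 11.0, 14.0):
    print(theta,
          round(float(p_classical(n_obs, theta, lam_nominal)), 4),
          round(p_cousins_highland(n_obs, theta), 4))
```

As the critics quoted above point out, the weights π(λ) in this average have no frequentist meaning, so the resulting pCH has no settled probability interpretation; the sketch is meant only to display the mechanics of the smearing.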
Moreover, the problems spill over into the use of uncertainty bounds when determining the compatibility of one measurement result with another, or with a theoretical prediction. The problem then arises as to how statistical and systematic uncertainties will be combined. It is common to add them in quadrature, for example when using a χ2 fit test, but doing so introduces distributional assumptions that may be unwarranted, and that may not reflect the manner in which the systematic uncertainty was in fact determined in the first place. One could simply add the two uncertainty components in a linear manner, but this would in many cases significantly and unnecessarily reduce the sensitivity of the measurement results (an issue that will be explored further below). HEP currently lacks a satisfactory methodology for treating systematic uncertainty.

Without attempting to resolve such thorny methodological disputes at once, we can progress towards a more satisfactory treatment of uncertainty estimation by first grasping more clearly the epistemic work such estimates seek to accomplish. I will argue that by working within the secure evidence framework, we can at least partially assimilate the epistemic work accomplished by systematic uncertainty estimation to that accomplished by robustness analysis.

That systematic uncertainty estimation provides a means for investigating the robustness of a measurement result has already been argued in an illuminating essay by Beauchemin (2015). By quantifying both the uncertainty of a measurement result and its sensitivity to the phenomena under investigation, he argues, scientists can quantify the scientific value of a measurement result and provide criteria for minimizing the circularity that arises from the theory-laden aspect of measurement. Here I aim to build upon the insights of Beauchemin by providing a broader epistemological framework for understanding what is achieved through robustness analysis in the context of estimating systematic uncertainty. The secure evidence framework I invoke involves no explicit commitment to either frequentist or Bayesian statistical frameworks. The relevant modality in the framework is possibility, which I take to be conceptually prior to probability, insofar as no probability function can be specified without specifying a space of objects (whether events, propositions, or sentences) over which the function is to be defined.

4 The secure evidence framework

Here I explain the secure evidence framework that will be used to analyze these issues (Staley 2004; Staley & Cobb 2011; Staley 2012, 2014). On the one hand, we might wish to think of the evidence or support for a hypothesis provided by the outcome of a test of that hypothesis in objective terms, such that facts about the epistemic situation of the investigator are irrelevant. On the other hand, it seems quite plausible that evaluating the claims that an investigator makes about such evidential relations may require determining what kinds of errors investigators are in a position to justifiably regard as having been eliminated, which does depend on their epistemic situation. The secure evidence framework provides a set of concepts for understanding the relationship between the situation of an epistemic agent (either individual or corporate) and the objective evidential relationships that obtain between the outcomes of tests and the hypotheses that are tested.
This account relies centrally on a concern with possible errors, and explicitly understands the relevant modality for possible errors to be epistemic. An epistemic agent who evaluates the evidential bearing of some body of data x0 with regard to a hypothesis H must also consider the possibilities for error in the evaluation thus generated. This is the “critical mode” of evidential evaluation. Evidential judgments rely on premises, and errors in the premises of such a judgment may result in errors in the conclusion. A responsible evaluation of evidence therefore requires consideration of the ways the world might be, such that a putative evidential judgment would be incorrect. The evaluator must reflect on what, among the propositions relevant to the judgment in question, may safely be regarded as known, and what propositions must be regarded as assumed, but possibly incorrect.

Such possibilities of error are here regarded as epistemic possibilities. Often, when one makes a statement in the indicative that something might be the case, one is expressing an epistemic possibility, with what must be the case functioning as its dual expressing epistemic necessity (Kratzer 1977). An expression roughly picking out the same modality (at least for the singular first-person case) is ‘for all I know’, as in ‘for all I know there is still some ice cream in the freezer’. Theorists have offered a range of views regarding the semantics of epistemic modality (see Egan & Weatherson 2011), with various versions of contextualism and relativism among the contending positions. For our purposes we need only note that what is epistemically possible for an epistemic agent does depend on that agent’s knowledge state and that when an agent acquires new knowledge it always follows that some state of affairs that was previously epistemically possible for that agent ceases to be so.

The epistemic possibilities that are relevant to the critical mode of evidential judgment are error scenarios, which are to be understood as follows: Suppose that P is some proposition and S is an epistemic agent considering the assertion of P, whose epistemic situation (her situation regarding what she knows, is able to infer from what she knows, and her access to information) is K. Then a way the world (or some relevant part of the world) might be, relative to K, such that P is false, we will call an error scenario for P relative to K. Of special importance for this discussion are error scenarios for evidence claims, where a statement EC is an evidence claim if it expresses a proposition of the form ‘data x from test T are evidence for hypothesis H’ (such hypotheses may include statements about the value of some measurand in a measurement process, and it is here assumed that measurement procedures and hypothesis testing are epistemologically susceptible to the same analysis).

Suppose that, relative to a certain epistemic situation K, there is a set of scenarios that are epistemically possible, and call that set Ω0. If proposition P is true in every scenario in the range Ω0, then P is fully secure relative to K. If P is true across some more limited portion Ω1 of Ω0 (Ω1 ⊆ Ω0), then P is secure throughout Ω1. To put this notion more intuitively, a proposition is secure for an epistemic agent just insofar as that proposition remains true, whatever might be the case for that agent. Thus defined, security is applicable to any proposition, but the application of interest here is to evidence claims.
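The definition can be stated compactly (this is only a restatement of what was just said, writing Ω0(K) for the set of scenarios epistemically possible relative to K):

```latex
% Security of a proposition P relative to an epistemic situation K
P \text{ is secure throughout } \Omega_1 \subseteq \Omega_0(K)
  \iff P \text{ is true in every scenario } \omega \in \Omega_1 ;
\qquad
P \text{ is fully secure relative to } K
  \iff P \text{ is secure throughout } \Omega_0(K) .
```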
Note that inquirers might never be called upon to quantify the degree of security of any of their inferences. The methodologically significant concept is not security per se, but the securing of evidence, i.e., the pursuit of strategies that increase or assess the relative security of an evidence claim. Two strategies for making inferences more secure are the weakening of evidential conclusions, to render them immune to otherwise threatening error scenarios, and the strengthening of premises, in which additional information is gathered that rules out previously threatening error scenarios. Robustness analysis constitutes a strategy for assessing the security of an evidence claim by investigating classes of potential error scenarios to determine which scenarios are and which are not compatible with a given evidential conclusion (Staley 2014). The present analysis aims to show how the consideration and evaluation of systematic uncertainty constitutes an application of the weakening strategy under the guidance of robustness analysis.

5 Systematic uncertainty assessment as robustness analysis

To see how the evaluation of systematic uncertainty can be viewed as a variety of robustness analysis, consider again the example of measuring the tt̄ production cross section in proton-proton collisions. Recall how equation 1 relates that quantity to other empirically accessible quantities: σtt̄ = (Nc − Nb)/(εAL). This equation is a premise of an argument for a conclusion about the value of σtt̄, as are statements attributing values to each of the variables in the equation.

If we take the conclusion of such an argument to be the attribution of some definite value σtt̄ = σtt̄0, then (because the data are finite) the premises would fail to provide cogent support for the conclusion. Were we to repeat the exact same experiment, it is very probable that a different value assignment would result. It is not clear that such a statement should be considered an experimental conclusion at all. It appears to be a statement about the value of the quantity σtt̄, but its usefulness for empirical inquiry is severely limited by the fact that the comparison of any two such point-value determinations is effectively certain to yield the result that they disagree.

To restore cogency and empirical value, it is necessary to replace the conclusion σtt̄ = σtt̄0 with one stating σtt̄ = σtt̄0 ± δ. Call this the unmodulated conclusion. The addendum ‘±δ’ expresses the statistical uncertainty that is a function of the number of candidate events, Nc. But it will also be necessary to report the systematic uncertainty resulting from imperfect knowledge of the acceptance, efficiency, background, and luminosity, as well as the unfolding matrix used to determine the number of candidate events. That conclusion, which retains the statement of statistical uncertainty but adds the results of an assessment of systematic uncertainty, we can call the modulated conclusion.

Importantly, lack of knowledge enters into statistical and systematic uncertainties in quite different ways. In the case of statistical uncertainty, the relevant lack of knowledge concerns the exact value of the quantity σtt̄, the measured value of which is reported in the conclusion. The aim of the inquiry is to reduce this uncertainty, and knowledge of the quantities on the right hand side of equation 1 is a means toward the achievement of this aim.
The statistical uncertainty that remains once those means have been deployed is a consequence of the fact that any finite number of observations yields only partial information about the value of σtt̄. For the purposes of assessing statistical uncertainty, however, the premises in the argument for this conclusion are assumed to be determinately and completely known. We take ourselves to know how many candidate events were counted, for example, and as long as that number is finite, there will be some corresponding statistical uncertainty on the estimate of σtt̄. In other words, for the purposes of assessing statistical uncertainty, we concern ourselves only with the possibility of errors in the conclusion of the argument, errors that take the form of assigning incorrect values to the measurand. For this purpose, the model of the measuring process is taken to be adequate and the premises of the argument are assumed to be free of error.

Systematic uncertainty may then be regarded as arising from the consideration that in fact the premises are not determinately and completely known. Physicists are not in a position to know that a premise attributing a definite value to, say, ε is true. To derive an estimate of σtt̄, some value must be assigned, but, having made such an assignment, investigators must confront the fact that other value assignments are compatible with what they know about the detector and the physical processes involved in the experiment.

The assessment of systematic uncertainty tackles this problem of incomplete knowledge by, in some way, exploring the extent to which varying the assumed values asserted in the premises, within the bounds of what is possible given the investigators’ knowledge, makes a difference to the conclusion drawn regarding the value of the measured quantity. This effectively generates an ensemble of arguments corresponding to the considered range of possible value-assignment premises. By considering such a range of arguments with possibly correct premises, the investigators can then report a weakened conclusion such as σtt̄ = 187 ± 11 (stat.) +18/−17 (syst.) pb, the correctness of which would be supported by the soundness of any one of the arguments in the ensemble.

From the perspective of the secure evidence framework, such assessments can be regarded as a combination of robustness and weakening strategies. Investigators begin with a set of data xi, i = 1, ..., n, reporting observations relevant to the measurement of some quantity θ. Deriving an estimate θ̂0 from these data (with a statistical uncertainty depending on the value of n) requires assigning values λj = λj0, j = 1, ..., m, to each of a set of m imperfectly known parameters. This yields the unmodulated conclusion, which we might also regard as the unsecured conclusion, in the sense that the strategies for securing evidence claims mentioned previously have not yet been applied to it.

Consideration then turns to limitations on the investigators’ knowledge of the initial set of model assumptions. Using the robustness analysis strategy, alternate sets of assumptions that are compatible with the investigator’s epistemic state are considered, taking the form λj = λj0 + εj for each λj and for a range of values of εj, depending on the extent to which existing knowledge constrains the possible values of λj.
This yields a range of derived estimates lying within an interval θ̂0 +δ1/−δ2, which then can provide the basis for a logically weakened conclusion incorporating both statistical and systematic uncertainties. The convergence of estimates generated by the ensemble provides the basis for the robustness of this modulated conclusion.

The account just given is misleading in one respect. The evaluation of systematic uncertainty is typically not a matter of directly varying the input quantities in equation 1. Instead, physicists look upstream, to the methods and models employed in the determination of those input quantities, and introduce variation there. (The appendix gives examples of how this is done.) In some cases, this is a matter of varying the value of some parameter in a model; in other cases it is a matter of swapping one model for another. What is at issue in such variation is not so much the question of what might be the true value of the parameter or the true model (notions that might be inapplicable in a given case) as which models or parameter values within a model might be adequate for the purposes of the inference that is being undertaken (W. S. Parker 2009; W. Parker 2012).

The secure evidence framework gives us a new perspective on what is distinctive about the assessment of systematic uncertainty: it involves consideration of what might be the case regarding parameters in the model of the experiment to which values must be assigned in order to derive an estimate of the measured quantity. Systematic uncertainty is thus concerned with possible errors and inadequacies in the premises of an argument based on a model of the measurement process. The determination of systematic uncertainty involves the determination of reasonable bounds on the possibility of errors in those premises and inadequacies in the model of the process. Such analysis rests on an ensemble of completed models, each of which corresponds to a potential error scenario regarding claims about the value of the measured quantity. Only through the consideration of the outputs of such an ensemble can a systematic uncertainty be assessed, whereas claims about statistical uncertainty require only a single completed model.

Such a perspective also helps us to understand why systematic uncertainty is important in science and should be important in the philosophy of science. As noted in the introduction and documented in the appendix, while statistical uncertainties are simply reported, a good assessment of systematic uncertainties involves careful consideration of a wide range of factors relevant to the conclusion being drawn from the data, as well as careful probing of the limitations on the investigators’ prior knowledge of those factors. Whereas a model of the experiment is used to arrive at an unmodulated conclusion, the modulated conclusion depends on a stage of critical assessment of that model, achieved through a process of robustness analysis.

The empirical value of a measurement result depends not only on its security, however, but also on its sensitivity, as explained in a recent paper by Beauchemin (2015). The sensitivity of a measurement result depends on the comparison of “(1) the difference in the values of a given observable when calculated with different theoretical assumptions to be tested with the measurement; and (2) the uncertainty on the measurement result” (ibid., 29).
Only if the systematic uncertainty is sufficiently small in comparison to the differences between different theoretically calculated quantity values can the measurement result be used to discriminate empirically between the competing theoretical assumptions. Sensitivity considerations thus reveal the cost of applying the weakening strategy to enhance the security of a measurement result: One can weaken the conclusion of an argument for such a result by enlarging the systematic uncertainty bounds applied to it, thus achieving a conclusion that is secure across a broader range of epistemically possible scenarios. However, if the systematic uncertainties thus reported exceed the differences between values of the measurand yielded by the theoretical assumptions amongst which one aims to discriminate, then the level of security achieved renders the measurement result empirically worthless relative to that testing aim.

6 Conclusion

These considerations highlight some important remaining questions. The most pressing concern the determination of relevant epistemic possibilities for purposes of setting systematic uncertainty bounds. The determination of systematic uncertainty involves the determination of reasonable bounds on the possibility of errors in the premises of the measurement argument and inadequacies in the model of the measurement process. Which possibilities deserve consideration? What criteria ought to be considered when determining standards of reasonableness on such bounds? The previously mentioned debates over methodology can be thought of as debates over the best way to approach such questions.

Given the scant attention that the evaluation of systematic uncertainty has received in the philosophical literature, it is to be expected that conclusions drawn at this stage of inquiry should be provisional and exploratory in nature. I have proposed that, while many questions regarding the assessment of systematic uncertainty remain unresolved, a good first step towards a philosophical appreciation of this practice is to regard it as a kind of robustness analysis. The secure evidence framework provides a context for understanding robustness analysis in general that also supports this identification of systematic uncertainty assessment as a kind of robustness analysis. That this view of systematic uncertainty assessment coheres with Tal’s model-based account of measurement is an additional virtue of the present account.

Further discussion is, of course, needed. Of particular importance is to engage the philosophical aspects of the debate over statistical methodology in systematic uncertainty estimation. Gaining clarity about the aims of systematic uncertainty estimation is crucial for this purpose. I have argued here for the view that the aim of this practice is to secure the evidence supporting the modulated conclusions that are the result of the assessment of systematic uncertainty.

7 Appendix: measuring the tt̄ production cross section

Because assessments of systematic uncertainty are considered essential to the publication of any experimental result in particle physics, the choice of an example to illustrate the practice is largely arbitrary. Here I present a recent example from the ATLAS group at the Large Hadron Collider (LHC) at CERN, which also illustrates the pervasive character of modeling and simulation in the analysis of experimental data in contemporary High Energy Physics (HEP).
The cross section for a given state quantifies the rate at which that state is produced out of some particle process. Cross sections serve as crucial parameters in the Standard Model, and, especially in the case of the top quark, provide important constraints on the viability of numerous Beyond–Standard Model theories as well. Both ATLAS and CMS, its neighbor at the LHC, have published a number of measurements of the production cross section for both top–anti-top (tt̄) pairs (Aad et al. 2012; Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 2012a; Khachatryan et al. 2011; Chatrchyan et al. 2013) and for single top quarks (Aad, Abbott, Abdallah, Khalek, et al. 2012). The present discussion focuses on a measurement of the top quark pair production cross section based on a search for top quark decays indicated by a single high-momentum lepton (electron or muon) and jets produced by strong interaction processes characterized by Quantum Chromodynamics (QCD).

Let us begin with the result that ATLAS reports:5

σtt̄ = 187 ± 11 (stat.) +18/−17 (syst.) pb. (2)

We have already seen the essential logic of such a measurement: one seeks to estimate the rate at which tt̄ pairs are produced from the number of tt̄ candidate events in the data. Making that inference, however, demands estimates of the quantities in equation 1, and the accurate estimation of those quantities demands skill and expert judgment. The complete estimate of uncertainty in this case draws on more considerations than I can address in a brief discussion (see figure 1), but the following should suffice to communicate the nature of the problem.

Consider first the estimation of signal acceptance and efficiency. Estimating these quantities requires consideration of characteristics of the detector, drawing on the engineering knowledge of the detector’s design and construction as well as experimental knowledge of its performance characteristics. This background knowledge forms the basis for a computer simulation of the detector itself. Estimating signal acceptance also involves knowledge of the characteristics of the signal itself: how are tt̄ pairs produced in the proton-proton collisions generated at the LHC, and how do they behave once they have been produced? To model the tt̄ signal, ATLAS uses a variety of simulation models and a variety of parameter values within those models. It is this variation of modeling assumptions and their role in the production of systematic uncertainty estimates that I wish to emphasize.

[Figure 1: Table of statistical and systematic uncertainties for two different analyses, one requiring events to include a jet from a b-quark (“tagged”) and one without that requirement (“untagged”). Note that everything below the first line (“statistical error”) is a contribution to the systematic uncertainty. The total systematic uncertainty is calculated by adding the individual contributions in quadrature. From Aad et al. 2012b, 250.]
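As an aside on how the headline numbers combine: the ATLAS Letter also reports a separate ±6 pb luminosity uncertainty for the tagged analysis and a combined total of 187 +22/−21 pb, and that total is consistent with adding the quoted components in quadrature, the same rule the caption of Figure 1 describes for the individual systematic contributions:

```latex
% Combining the quoted components of the tagged-analysis result in quadrature
% (stat. = 11 pb, syst. = +18/-17 pb, lumi. = 6 pb):
\delta^{+} = \sqrt{11^{2} + 18^{2} + 6^{2}} \approx 22\ \mathrm{pb},
\qquad
\delta^{-} = \sqrt{11^{2} + 17^{2} + 6^{2}} \approx 21\ \mathrm{pb}.
```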
To simulate the production of tt̄ pairs in the collider environment, numerous complex stochastic QCD processes must be estimated, none of which can be calculated exactly from theory. The underlying event is the interaction between colliding high-energy protons. Particles involved in the collisions and their subsequent decay products also emit QCD radiation, which is relevant to the calculation of the probability of various outcomes. Both the Initial State Radiation (ISR, prior to the beam collision) and Final State Radiation (FSR, subsequent to the collision) must be modeled as well. Finally, quarks and gluons that are produced in these processes become hadrons (bound states of quarks with other quarks), a process known as hadronization.

To estimate the rate at which tt̄ pairs produced in √s = 7 TeV proton-proton collisions will qualify as candidate events, ATLAS must simulate these physical processes using a collection of computer simulations that have been developed over the years by physicists. The simulations are based on theoretical principles and constrained by existing data from previous particle physics endeavors. They use the Monte Carlo method of generating approximate solutions to equations that cannot be solved analytically. ATLAS relies primarily on the Herwig (Hadron Emission Reactions With Interfering Gluons) event generator. To model the further development of a collision event, ATLAS used the event generator MC@NLO (for Monte Carlo at Next-to-Leading-Order). This is a simulation that calculates QCD processes at the level of next-to-leading-order accuracy but also models the parton showers of QCD radiation that result from proton-proton collisions. To further complicate things, the outcome of a proton-proton collision depends on the way in which the momentum of the proton is distributed among its constituent partons, described probabilistically by the Parton Distribution Function (PDF). So crucial is the judicious choice of PDF in the simulation of particle processes at the LHC that a special LHC working group (PDF4LHC) has devoted its efforts to the formulation of recommendations for the choice of PDF sets for particular LHC analyses (Botje, Butterworth, Cooper-Sarkar, de Roeck, et al. 2011).

To evaluate the systematic uncertainty in their acceptance estimate, ATLAS has to consider the potential errors introduced by their reliance on particular simulations and particular assumptions that must be stipulated to apply those simulations. They state directly that “The use of simulated tt̄ samples to calculate the signal acceptance gives rise to various sources of systematic uncertainty. These arise from the choice of the event generator and PDF set, and from the modeling of initial and final state radiation” (Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 245). Evaluation of these uncertainties involves the quantitative assessment of how much difference variations in those assumptions make to the estimate they generate.

In explanation of their approach to this task, ATLAS notes that to evaluate uncertainties due to the “choice of generator and parton shower model” they compared the results they had obtained using MC@NLO with those obtained using an alternate simulation called Powheg, using either Herwig or Pythia (an alternate event generator) to model the hadronization process.
Yet another generator, called AcerMC, in combination with Pythia, was used to assess the uncertainty introduced by ISR/FSR assumptions, “varying the parameters controlling the ISR/FSR emission by a factor of two up and down” (ibid.). Finally, to evaluate the “uncertainty in the PDF set used to generate tt̄ samples,” ATLAS employed “a range of current PDF sets” following the procedure recommended by the PDF4LHC working group. Figure 1 gives the results of these procedures for two different analyses, one requiring events to include a jet from a b-quark (“tagged”) and one without that requirement (“untagged”), under the heading “tt̄ signal modelling.” That table also tabulates all other sources of systematic uncertainty, yielding totals arrived at by adding the individual contributions in quadrature (i.e., the total systematic uncertainty is equal to the square root of the sum of the squares of the individual contributions). One of the categories of systematic uncertainty is “object selection,” under which heading we find the entries “JES [Jet Energy Scale] and jet energy resolution” and “Lepton reconstruction, identification and trigger.” The motivation for these entries concerns the way in which candidate events are defined, which is in terms of the identification of decay products with certain properties. For example, the ATLAS analysis focused on tt̄ decays with a single high-momentum lepton (electron or muon) and jets from QCD processes. The implementation was based on the idea that measurements of energy deposits in the detector could be used to identify a track as resulting from the passage of an electron (or muon), whose momentum transverse to the beam could then be measured to determine whether it satisfied the threshold requirement of pT > 20 GeV. Only events including such a high-momentum lepton and at least three jets with pT > 25 GeV (and meeting further requirements) could be counted as candidate events. The identification of an event as including a high-momentum lepton and three high-momentum jets, however, has its own uncertainty, and this is what the “object selection” uncertainty seeks to quantify. There is always some chance that energy will be deposited in the various detector components in a way that will “fool” the detector into thinking that a high-pT electron has passed when it has not. The uncertainty that results from this possibility entails that the number of candidate events itself is to some extent uncertain. Nonetheless, on the present account, such uncertainty remains systematic in character insofar as its estimation relies on the consideration of alternate values of a parameter in a model of the experiment. To assess systematic uncertainty in the untagged analysis, ATLAS reports that they

performed pseudo-experiments (PEs) with simulated samples which included the various sources of uncertainty. For example, for the JES uncertainty, PEs were performed with jet energies scaled up and down according to their uncertainties and the impact on the cross-section was evaluated. (Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 250)

Although a different methodology was used in the tagged analysis, that methodology likewise relied on variation of parameter values in a model of the experiment. It is precisely the strategy of varying the inputs to a model-based estimation procedure within the bounds of what is possible, given the limitations on one’s knowledge, that is indicative of a robustness analysis in this context.
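The pseudo-experiment strategy quoted above can be given a similarly schematic illustration. The toy Python fragment below is not the ATLAS procedure, which varies fitted templates rather than a simple counting selection; it shows only the general pattern of scaling jet energies up and down by an assumed JES uncertainty, re-applying a selection, and recording the resulting shift. The 25 GeV jet threshold is the one described above; the 4% scale shift and the toy jet spectrum are invented for illustration.

import random

random.seed(1)

JET_THRESHOLD = 25.0   # GeV; the jet pT cut described in the text
JES_SHIFT = 0.04       # assumed (invented) 4% jet energy scale uncertainty

def toy_events(n):
    """Generate n toy events, each a list of four jet pT values in GeV."""
    return [[random.expovariate(1 / 40.0) for _ in range(4)] for _ in range(n)]

def selected_fraction(events, scale=1.0):
    """Fraction of events with at least three jets above threshold after
    scaling every jet energy by `scale`."""
    passed = sum(
        1 for jets in events
        if sum(pt * scale > JET_THRESHOLD for pt in jets) >= 3
    )
    return passed / len(events)

events = toy_events(100_000)
nominal = selected_fraction(events)
up = selected_fraction(events, 1.0 + JES_SHIFT)
down = selected_fraction(events, 1.0 - JES_SHIFT)

# The relative shift in the selected fraction propagates into the extracted
# cross-section (sigma = N_observed / (acceptance x luminosity)), so the up
# and down shifts are recorded as the JES contribution to the systematic
# uncertainty.
print(f"nominal selected fraction: {nominal:.4f}")
print(f"JES scaled up:   {100 * (up - nominal) / nominal:+.1f}%")
print(f"JES scaled down: {100 * (down - nominal) / nominal:+.1f}%")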
References

Aad, G., Abbott, B., Abdallah, J., Abdelalim, A., Abdesselam, A., Abdinov, O., . . . Zwalinski, L. (2012a). Measurement of the top quark pair production cross section in pp collisions at √s = 7 TeV in dilepton final states with ATLAS. Physics Letters B, 707(5), 459–477. doi: 10.1016/j.physletb.2011.12.055

Aad, G., Abbott, B., Abdallah, J., Abdelalim, A., Abdesselam, A., Abdinov, O., . . . Zwalinski, L. (2012b). Measurement of the top quark pair production cross-section with ATLAS in the single lepton channel. Physics Letters B, 711(3–4), 244–263. doi: 10.1016/j.physletb.2012.03.083

Aad, G., Abbott, B., Abdallah, J., Khalek, S. A., Abdelalim, A., Abdesselam, A., . . . Zwalinski, L. (2012). Measurement of the t-channel single top-quark production cross section in pp collisions at √s = 7 TeV with the ATLAS detector. Physics Letters B, 717(4–5), 330–350. doi: 10.1016/j.physletb.2012.09.031

Aad, G., Abbott, B., Abdallah, J., Khalek, S. A., Abdelalim, A., Abdinov, O., . . . Zwalinski, L. (2012). Measurement of the top quark pair cross section with ATLAS in pp collisions at √s = 7 TeV using final states with an electron or a muon and a hadronically decaying τ lepton. Physics Letters B, 717(1–3), 89–108. doi: 10.1016/j.physletb.2012.09.032

Barlow, R. (2002). Systematic errors: Facts and fictions. Retrieved from arXiv:hep-ex/0207026

Beauchemin, P.-H. (2015). Autopsy of measurements with the ATLAS detector at the LHC. Synthese, 1–38. doi: 10.1007/s11229-015-0944-5

Botje, M., Butterworth, J., Cooper-Sarkar, A., de Roeck, A., et al. (2011). The PDF4LHC working group interim recommendations (Tech. Rep.). (arXiv:1101.0538)

Boumans, M., & Hon, G. (2014). Introduction. In M. Boumans, G. Hon, & A. Petersen (Eds.), Error and uncertainty in scientific practice (pp. 1–12). London: Pickering and Chatto.

Chatrchyan, S., Khachatryan, V., Sirunyan, A., Tumasyan, A., Adam, W., Aguilo, E., . . . Swanson, J. (2013). Measurement of the tt̄ production cross section in pp collisions at √s = 7 TeV with lepton + jets final states. Physics Letters B, 720(1–3), 83–104. doi: 10.1016/j.physletb.2013.02.021

Cousins, R. D. (1995). Why isn’t every physicist a Bayesian? American Journal of Physics, 63, 398–410.

Cousins, R. D., & Highland, V. L. (1992). Incorporating systematic uncertainties into an upper limit. Nuclear Instruments and Methods in Physics Research, A320, 331–335.

Cranmer, K. (2003). Frequentist hypothesis testing with background uncertainty. In L. Lyons, R. Mount, & R. Reitmeyer (Eds.), Statistical problems in particle physics, astrophysics, and cosmology: Proceedings of PHYSTAT 2003 (pp. 261–264). Stanford, CA: SLAC.

Egan, A., & Weatherson, B. (Eds.). (2011). Epistemic modality. New York: Oxford University Press.

Joint Committee for Guides in Metrology Working Group I. (2008). Evaluation of measurement data – guide to the expression of uncertainty in measurement. Joint Committee for Guides in Metrology. Retrieved from http://www.bipm.org/en/publications/guides/gum.html

Joint Committee for Guides in Metrology Working Group II. (2012). International vocabulary of metrology – basic and general concepts and associated terms. Joint Committee for Guides in Metrology. Retrieved from http://www.bipm.org/en/publications/guides/vim.html

Khachatryan, V., Sirunyan, A., Tumasyan, A., Adam, W., Bergauer, T., Dragicevic, M., . . . Weinberg, M. (2011).
First measurement of the cross section for top-quark pair production in proton–proton collisions at √s = 7 TeV. Physics Letters B, 695(5), 424–443. doi: 10.1016/j.physletb.2010.11.058

Kratzer, A. (1977). What ‘must’ and ‘can’ must and can mean. Linguistics and Philosophy, 1, 337–55.

Mari, L., & Giordani, A. (2014). Modelling measurement: Error and uncertainty. In M. Boumans, G. Hon, & A. Petersen (Eds.), Error and uncertainty in scientific practice (pp. 79–96). London: Pickering and Chatto.

Parker, W. (2012). Scientific models and adequacy-for-purpose. The Modern Schoolman, 87.

Parker, W. S. (2009). Confirmation and adequacy-for-purpose in climate modelling. Aristotelian Society Supplementary Volume, 83(1), 233–249. doi: 10.1111/j.1467-8349.2009.00180.x

Parker, W. S. (2015). Computer simulation, measurement, and data assimilation. The British Journal for the Philosophy of Science. doi: 10.1093/bjps/axv037

Rabinovich, S. (2007). Towards a new edition of the “Guide to the expression of uncertainty in measurement”. Accreditation and Quality Assurance, 12(11), 603–608. doi: 10.1007/s00769-007-0284-3

Sinervo, P. (2003). Definition and treatment of systematic uncertainties in high energy physics and astrophysics. In L. Lyons, R. Mount, & R. Reitmeyer (Eds.), Statistical problems in particle physics, astrophysics, and cosmology: Proceedings of PHYSTAT 2003 (pp. 122–129). Stanford, CA: SLAC.

Staley, K. W. (2004). Robust evidence and secure evidence claims. Philosophy of Science, 71, 467–488.

Staley, K. W. (2012). Strategies for securing evidence through model criticism. European Journal for Philosophy of Science, 2, 21–43. doi: 10.1007/s13194-011-0022-x

Staley, K. W. (2014). Experimental knowledge in the face of theoretical error. In M. Boumans, G. Hon, & A. Petersen (Eds.), Error and uncertainty in scientific practice (pp. 39–55). London: Pickering and Chatto.

Staley, K. W., & Cobb, A. (2011). Internalist and externalist aspects of justification in scientific inquiry. Synthese, 182, 475–492. doi: 10.1007/s11229-010-9754-y

Tal, E. (2012). The epistemology of measurement: A model-based account (Unpublished doctoral dissertation). University of Toronto, Toronto.

Tal, E. (2016). Making time: A study in the epistemology of measurement. The British Journal for the Philosophy of Science, 67(1), 297–335. doi: 10.1093/bjps/axu037

Willink, R. (2013). Measurement uncertainty and probability. New York: Cambridge University Press.

Notes

1. In fact, the GUM eschews this distinction in favor of a purely operationalist distinction between Type A uncertainty and Type B uncertainty, based on the method by which uncertainty is evaluated. An evaluation of uncertainty “by the statistical analysis of series of observations” is Type A. Any other means of evaluation is classified as Type B.

2. In the example discussed by Beauchemin, the measurement aims, not at the total cross section as here discussed, but at the differential cross section with respect to transverse momentum of the leading jet, making the application of unfolding to the measurement of that specific quantity all the more important.

3. This discussion affirms Wendy Parker’s recent argument that computer simulations can be “embedded” in measurement practices (W. S. Parker 2015).

4. See, however, Cranmer 2003 for a step towards a strictly frequentist approach.
Willink (2013) also argues that a frequentist construal of systematic uncertainty bounds is applicable for the consumer, rather than the producer, of measurement results, if one adopts an enlarged view of the measurement process to include the “background steps” that introduce systematic errors, so that any one measurement result can be regarded as having been drawn from a population that includes a variety of different background steps.

5. The paper reports two cross-section estimates using different techniques. The second estimate (σtt̄ = 173 ± 17 (stat.) +18/−16 (syst.) pb) does not employ a technique for tagging jets containing b quarks. ATLAS reports the systematic uncertainty due to the estimate of luminosity separately, adding another ±6 pb to the measurements from each method. Both results, ATLAS states, agree with one another and with QCD calculations, but the method using b-tagging “has a better a priori sensitivity and constitutes the main result of this Letter” (Aad, Abbott, Abdallah, Abdelalim, et al. 2012b, 244).