Evidence and Experimental Design in Sequential Trials

Jan Sprenger†‡

Philosophy of Science, Vol. 76, No. 5 (December 2009), pp. 637–649. Published by The University of Chicago Press on behalf of the Philosophy of Science Association. Stable URL: https://www.jstor.org/stable/10.1086/605818

Abstract. To what extent does the design of statistical experiments, in particular sequential trials, affect their interpretation? Should postexperimental decisions depend on the observed data alone, or should they account for the stopping rule that was used? Bayesians and frequentists are apparently deadlocked in their controversy over these questions. To resolve the deadlock, I suggest a three-part strategy that combines conceptual, methodological, and decision-theoretic arguments. This approach maintains the pre-experimental relevance of experimental design and stopping rules but vindicates their evidential, postexperimental irrelevance.

†To contact the author, please write to: Tilburg Center for Logic and Philosophy of Science, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands; e-mail: j.sprenger@uvt.nl.

‡I would like to thank José Bernardo, Bruce Glymour, Valeriano Iranzo, Kevin Korb, Deborah Mayo, Jonah Schupbach, Gerhard Schurz, Aris Spanos, Kent Staley, Roger Stanev, Carl Wagner, the referees of Philosophy of Science, and especially Teddy Seidenfeld for their helpful and stimulating feedback.

1. Exposition. What relevance does the design of a statistical experiment in science have once the experiment has been performed and the data have been observed? Do data speak for themselves, or do they have to be assessed in conjunction with the design that was used to generate them? Few questions in the philosophy of statistics are the subject of greater controversy.

The paradigmatic example is the inferential role of stopping rules in sequential trials. Such trials, which can be compared to the repeated toss of a coin, accumulate evidence from several independent and identically distributed trials. Sequential trials are standardly applied in medicine when the efficacy of a drug is tested by giving it to several patients, one after another. The stopping rule describes under which circumstances the trial is terminated and is thus a centerpiece of the experimental design. Possible stopping rules could be "give the drug to 100 patients," "give the drug
until the number of failures exceeds the number of recoveries," or "give the drug until funds are exhausted." In other words, stopping rules indicate the number of repetitions of the trial as a function of some feature of the observed data. Technically speaking, a stopping rule $t$ is a function from a measurable space $(\mathcal{X}^\infty, \mathcal{A}^\infty)$ (the infinite product of the sample space) into the natural numbers such that, for each $n \in \mathbb{N}$, the set $\{x \mid t(x) = n\}$ is measurable.[1]

[1] I confine myself to noninformative stopping rules, that is, stopping rules that are independent of the prior distribution of the parameter. This means that for a sequence of random variables $(X_n)_{n \in \mathbb{N}}$ representing the trial results, the event $\{t = n\}$ is measurable with respect to $\sigma(X_1, \ldots, X_n)$. See Schervish 1995, 565.
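To make this definition concrete, here is a minimal sketch (in Python; the function and parameter names are my own, not the paper's) that simulates a sequential Bernoulli trial and encodes the two example rules above as functions of the data observed so far. The hard cap on the sample size plays the role of "until funds are exhausted."

```python
import random

def run_sequential_trial(p_recovery, stop, max_n=10_000):
    """Draw i.i.d. Bernoulli outcomes (1 = recovery, 0 = failure) and
    terminate as soon as the stopping rule fires on the data so far."""
    data = []
    while len(data) < max_n:
        data.append(1 if random.random() < p_recovery else 0)
        if stop(data):
            break
    return data

# "Give the drug to 100 patients": a fixed sample size rule.
fixed_n = lambda data: len(data) >= 100

# "Give the drug until the number of failures exceeds the number of recoveries."
failures_exceed = lambda data: data.count(0) > data.count(1)

random.seed(1)
print(len(run_sequential_trial(0.6, fixed_n)))          # always 100
print(len(run_sequential_trial(0.6, failures_exceed)))  # data dependent
```

Note that each rule inspects only the data observed so far, never the unknown parameter; this is the noninformativeness assumed in footnote [1].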
In the above example, the question about the relevance of stopping rules can be recast as the question of whether our inference about the efficacy of the drug should be sensitive to the proposed ways to conduct and to terminate the experiment. If stopping rules were really indispensable and we performed fewer trials than the stopping rule prescribed (for example, because funds are exhausted or because unexpected side effects occur), a proper statistical interpretation of the observed data would be difficult, if not impossible. The (ir)relevance of stopping rules thus has severe implications for scientific practice and the proper interpretation of sequential trials. Therefore, both scientists and philosophers of science should pay great attention to the question of whether stopping rules are a crucial and indispensable part of statistical evidence.

The statistical community is deeply divided over that question. From a frequentist (Neyman-Pearsonian, error-statistical) point of view, a biased stopping rule, such as sampling on until the result favors our pet hypothesis, will lead us to equally biased conclusions (Mayo 1996, 343–345). Bayesians, however, claim that "the design of a sequential experiment is . . . what the experimenter actually intended to do" (Savage 1962, 76). Since such intentions are "locked up in [the experimenter's] head" (76), stopping rules cannot matter for sound inference (see also Edwards, Lindman, and Savage 1963, 239). The following principle captures the Bayesian position in a nutshell:[2]

Stopping Rule Principle (SRP). In a sequential experiment with observed data $x = (x_1, \ldots, x_n)$, all experimental information about $c$ is contained in the function $P_n(x \mid c)$; the stopping rule $t$ that was used provides no additional information about $c$. (See Berger and Berry 1988, 34.)

[2] See also Royall 1997, 68–71. Note that the first part of the SRP contains the Likelihood Principle (Birnbaum 1962; Berger and Wolpert 1984).

However, the debate is characterized by a mutual deadlock, because each side presupposes its own inferential framework and measures by its own standards. For instance, Howson and Urbach (2006, 251) bluntly claim that unless the Bayesian position is led into absurdity in Bayesian terms, "there is no case whatever for the Bayesian to answer." Frequentists respond in a similar way to the Bayesian charge, pointing out that error probabilities are the hallmark of a sound inference and that they do depend on stopping rules (Mayo 1996, 348). But do we really have to wear Bayesian or frequentist glasses in order to enter the debate? Isn't there a strategy to overcome the stalemate between Bayesians and frequentists?

I believe that we can break the deadlock, and here I outline my strategy. First, we make explicit the distinction between pre-experimental and postexperimental evidential relevance. This will help us to disentangle and to classify the existing arguments. Second, we elicit which conception of statistical evidence best responds to the practical needs of empirical scientists. This has immediate consequences for the relevance of stopping rules. Third, we assert the pre-experimental relevance and postexperimental irrelevance of stopping rules and vindicate this standpoint from a decision-theoretic perspective. Thus, instead of solely relying on foundational intuitions, we combine arguments from mathematical statistics and decision theory with a methodological perspective on the needs of experimental practice.

2. Measures of Evidence: A Practitioner's Perspective. The two senses in which stopping rules can be relevant correspond to two stages of a sequential trial: first, the pre-experimental stage, where the trial is planned and the stopping rule is determined, and second, the postexperimental stage, where observed data are interpreted and transformed into an evidential assessment. For the latter project, we need evidence measures that summarize raw data so as to make us see which of two competing hypotheses is favored over its rival. Such quantifications help us to endorse or to reject scientific hypotheses or to make policy-relevant decisions. For instance, frequentist statistics is concerned with statistical testing and the comparison of two mutually exclusive hypotheses, the null hypothesis $H_0$ and the alternative $H_1$. After looking at the data, one of them is accepted and the other one is rejected. Such decision rules are characterized and ranked according to their error probabilities, that is, the probability of erroneously rejecting the null hypothesis (type I error) and the probability of erroneously rejecting the alternative hypothesis (type II error).[3]

[3] Other frequentist procedures, such as constructing confidence intervals, are equally justified by the error probabilities that characterize the procedure. Thus, in the remainder of the paper, I focus on the hypothesis testing framework.

However, such error probabilities characterize (pre-experimentally) a particular testing procedure and do not directly tell us (postexperimentally) the strength of the observed evidence. For this reason, frequentists supplement their error analysis by a measure of evidence, such as p-values, significance levels, or, most recently, degrees of severity (Mayo and Spanos 2006, 337–346). Neglecting subtle differences between those measures, they are united in measuring the evidence against $H_0$ by summing up the likelihoods of those observations that have a greater discrepancy from the null hypothesis than the observed value $x$:

$$p := P_{H_0}(T(X) \geq T(x)). \qquad (1)$$
Here $T$ is a suitable (minimally sufficient) transformation of the data that indicates the discrepancy between the data and $H_0$. In other words, p-values (taken pars pro toto)[4] give the probability that, if the null hypothesis were true and the experiment were repeated, the results would speak at least as strongly against $H_0$ as the actual data do. Thus, p-values summarize the evidential import of the data and measure the tenability of the null hypothesis in the light of incoming evidence, such that we can base further decisions on them. Indeed, p-values are widespread in the empirical sciences and are often a compulsory benchmark for experimental reports, for example, in medicine or experimental psychology. In particular, only results with a p-value lower than .05 are generally believed to be (statistically) significant and publishable (Goodman 1999).

[4] In Mayo and Spanos's (2006, 342) framework, the severity with which the alternative passes a test against the null is equal to $1 - p$.

All those measures of evidence are sensitive to the stopping rule that was used. This comes as no surprise, since each stopping rule induces a different sample space; for example, in a fixed sample size scheme and a variable sample size scheme, different observations are possible. In other words, p-values depend not only on the likelihoods of the actually observed results but also on the likelihoods of results that could have been observed under the actual experimental design, as equation (1) makes clear. Hence, for a frequentist statistician who works with p-values, significance levels, degrees of severity, or the like, the strength of the observed evidence depends on the stopping rule that was used.

To see whether this is a desirable or an embarrassing property, we should clarify our expectations of a measure of statistical evidence. Evidence about a parameter is required for inferences about that parameter, for example, for sensible estimates and for decisions to work with this rather than that value (e.g., $c = c_0$ instead of $c = c_1$). An evidence measure transforms the data to provide the basis for a scientific inference. In order to be suitable for public communication in the scientific community and for use in research reports, a measure of evidence should be free of subjective bias and immune to manipulation, as well as independent of prior opinions. While we can disagree on the a priori plausibility of a hypothesis, we should agree on the strength of the observed evidence; that is the very point of evidence-based approaches in science and policy making. Therefore, we need a method of quantifying the information contained in the data that is independent of idiosyncratic convictions and immune to deliberate manipulation.

To clarify the point, consider an example. A malicious experimenter conducts a sequential trial with a certain stopping rule, but the evidence against the null that she finds is not as strong as desired. In particular, the p-value is not significant enough to warrant rejection of the null and publication of the results ($p_1 \approx .05$). What can she do?
The first option consists in outright fraud: she could fake some data (e.g., replace some observed failures by successes) and make the results significant in that way. While tempting, such a deception of the scientific community is risky and would be heavily punished if discovered; the career of our experimenter would be over once and for all. Therefore, a second option looks more attractive: to report not the true stopping rule $t_1$ (fixed sample size) but a modified stopping rule $t_2$ under which the data $D$ yield a p-value smaller than .05.[5] The results are now "statistically significant" and get published. But clearly, as readers of a scientific journal, we want to be protected against such tricks. The crucial point is that the malicious experimenter did not manipulate the data: she was merely insincere about her intentions as to when to terminate the experiment. Using fake data involves considerable risk: if continued replications fail to reproduce the results, our experimenter will lose her reputation. By contrast, she can never be charged with insincerely reporting her intentions. The crucial point here is not the intuition that "intentions cannot matter for the strength of evidence" but rather that the scientific community is unable to control whether these intentions have been correctly reported. This inability to detect subjective distortion and manipulation of statistical evidence is a grave problem for frequentist methodology.

[5] For instance, she could have tested the null hypothesis $c = 0.5$ in 46 Bernoulli (success/failure) trials with a fixed sample size and have obtained 29 successes with $p_1 = .052$. However, under a negative binomial stopping rule (sample until you get 17 failures), the p-value would have been $p_2 = .036 < .05$.
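The figures in footnote [5] can be checked directly. Below is a minimal sketch (Python with scipy; I assume one-sided tests of $H_0: c = 0.5$, matching the footnote):

```python
from scipy.stats import binom, nbinom

# Stopping rule t1, fixed sample size: 46 trials, 29 successes observed.
# One-sided p-value: P(at least 29 successes in 46 trials | c = 0.5).
p1 = binom.sf(28, 46, 0.5)   # sf(k) = P(X > k); approx .052

# Stopping rule t2, negative binomial: sample until the 17th failure.
# "At least 29 successes before the 17th failure" is the same event as
# "at least 29 successes among the first 45 trials":
p2 = binom.sf(28, 45, 0.5)   # approx .036

# Equivalently, via scipy's negative binomial distribution (which counts
# failures before the 17th success; with c = 0.5 the labels are symmetric):
p2_check = nbinom.sf(28, 17, 0.5)

print(round(p1, 3), round(p2, 3), round(p2_check, 3))
```

The data and their likelihoods are identical in both cases; only the set of counterfactual outcomes entering the tail sum of equation (1) differs between the two designs.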
What kind of answer could the frequentist give? To propose a standardized stopping rule $t$, such as fixed sample size, does not help: experimenters could still use another stopping rule $t'$ and report the results as if they had been generated by $t$. What about the (actually made) proposal to fix and to publicly declare a stopping rule in advance? This sounds good, but a stopping rule that covers all eventualities in advance is hard, if not impossible, to find. What if research funds expire because the trial proves to be more expensive than expected? What if unexpected technical problems blur the measurements or force the termination of the experiment? These problems are no remote thought experiments: they frequently occur in scientific practice. Considering all those external influences in advance and assigning probabilities to them (!), as required for explicit stopping rules, is certainly not feasible. Should we then consider data from prematurely stopped experiments as entirely worthless because this course of events was not accounted for in planning the experiment and formulating the stopping rule? At this point, we cannot be just a little bit frequentist: if we believe in the evidential, postexperimental relevance of stopping rules, then we have to be silent on the meaning of data where the stopping rule is unavailable. But if we throw the data into the trash bin, we give away a great deal of what reality tells us, impeding scientific progress as well as responsible, evidence-based policy making.

In fact, no journal article that reports p-values (and is thereby implicitly committed to the relevance of stopping rules) ever bothers about fine-tuning the stopping rule to the external circumstances under which the experiment was conducted. Thus, empirical scientists do not take the relevance of stopping rules as seriously as their widespread adherence to the frequentist framework of statistical inference suggests. In fact, they have no other choice if they want to maintain ordinary experimental practice. Specifying the stopping rule in advance sounds good, but specifying the correct, comprehensive stopping rule (which we would need to interpret the results properly) is practically impossible. Thus, the frequentist understanding of evidence, whether explicated as p-values, significance levels, or degrees of severity, is unable to cope with the practical problems that arise when the relevance of stopping rules is taken seriously.

The nonfrequentist alternatives, such as likelihood ratios or their generalization, Bayes factors, fare much better. These measures of evidence build only on publicly accessible factors, such as the likelihoods of the observed data under competing hypotheses and possibly explicit prior distributions:

$$B(H_1, H_0, x) := \frac{P(H_1 \mid x)}{P(H_0 \mid x)} \cdot \frac{P(H_0)}{P(H_1)} = \frac{\int_{c \in H_1} P(c \mid H_1)\, P(x \mid c, H_1)\, dc}{\int_{c \in H_0} P(c \mid H_0)\, P(x \mid c, H_0)\, dc}. \qquad (2)$$

For the case of two competing point hypotheses $H_0$ and $H_1$, the Bayes factor collapses into the likelihood ratio of the two hypotheses:

$$L(H_1, H_0, x) = \frac{P(x \mid c = c_1)}{P(x \mid c = c_0)}. \qquad (3)$$

It is easy to check that both (2) and (3) conform to the SRP and remain unaffected by stopping rules.[6] Furthermore, Lele (2004) has shown that in comparing point hypotheses, the likelihood ratio is the only measure of evidence that satisfies a number of reasonable invariance conditions.[7]

[6] Note that this does not hold for improper priors, where the integral over the probability density is not equal to one; see Mayo and Kruse 2001.

[7] A prima facie counterargument against Bayes factors consists in the "subjectivity" of the prior probabilities in $B(H_1, H_0, \cdot)$. But priors can be reported separately and disentangled from the "impact of the evidence."
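The stopping-rule invariance of (2) and (3) can be exhibited concretely. In the sketch below (Python with scipy; the point hypotheses $c_0 = 0.5$ and $c_1 = 0.6$ are my illustrative choices), the likelihood ratio for the data of footnote [5] comes out the same under both designs, because the design-dependent combinatorial constants cancel in the ratio:

```python
from scipy.stats import binom, nbinom

c0, c1 = 0.5, 0.6   # two hypothetical point hypotheses
s, f = 29, 17       # 29 successes, 17 failures, as in footnote [5]

# Likelihood ratio in the fixed-sample-size (binomial) model:
lr_fixed = binom.pmf(s, s + f, c1) / binom.pmf(s, s + f, c0)

# ... and in the negative binomial model (sample to the 17th failure).
# scipy's nbinom counts failures before the n-th success, so successes
# before the 17th failure are modeled by swapping the outcome labels:
lr_negbin = nbinom.pmf(s, f, 1 - c1) / nbinom.pmf(s, f, 1 - c0)

print(lr_fixed, lr_negbin)  # identical values
```

Any factor that depends only on the design multiplies numerator and denominator alike and drops out; this is the SRP in miniature.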
The preceding arguments have dealt with the evidential, postexperimental irrelevance of experimental design. Now we have to integrate this position into a decision-theoretic framework and defend it against attempts to render it incoherent. Furthermore, we have to explore why stopping rules often appear to be relevant and whether they are pre-experimentally relevant, that is, relevant for the responsible and efficient planning of an experiment.

3. Coherent Testing: A Decision-Theoretic Argument. How can frequentist statisticians respond to the charge? Usually, they aim at a reductio ad absurdum of the evidential irrelevance of stopping rules (e.g., Mayo 1996; Mayo and Kruse 2001); that is, they try to beat the Bayesians and their allies at their own game. One example is the following: Medical scientists conduct a phase II trial, that is, a trial with 100–300 participants that tests the efficacy of a newly invented drug. If the drug proves to be effective in the phase II trial, a large-scale randomized controlled (phase III) trial will take place. Because of the costs of the experiments, the desire for subsequent funding, pressure from the pharmaceutical industry, and so forth, the scientists would be happier with a significant result (i.e., rejecting the null hypothesis that the new drug is not effective) than with an insignificant one. Thus, upon learning that the collected results do not achieve the required significance level, our scientists decide to sample on and to include new patients. Finally, they obtain a result that, if reported as a fixed sample size experiment, would move the test into phase III. Shouldn't we be suspicious about such a move? Isn't such a conclusion less trustworthy than a conclusion drawn from the same data but achieved in an "honest" way, without any interim decisions? Mayo writes:

the try-and-try-again method allows experimenters to attain as small a level of significance as they choose (and thereby reject the null hypothesis at that level), even though the null hypothesis is true. (1996, 343)

Because of the dodgy way in which the conclusion was achieved, frequentist statisticians are ostensibly justified in asserting that such data do not provide genuine evidence against the null hypothesis, whereas Bayesians are allegedly unable to detect that the experiment was biased toward a particular conclusion (see the discussion in Savage 1962). Evidently, the above example can easily be transferred to other testing problems in science.
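How far the try-and-try-again method can push significance is easy to see in simulation. The following sketch (Python; the one-sided z-test on standard normal data, the interim look after every single observation, and the cap of 500 observations are all my illustrative choices, not the paper's) samples under a true null hypothesis and stops as soon as nominal .05 significance is reached:

```python
import math
import random

def z_test_rejects(xs, z_crit=1.6449):
    """One-sided z-test of H0: mean = 0 for N(mean, 1) data."""
    return sum(xs) / math.sqrt(len(xs)) > z_crit

def try_and_try_again(n_max=500):
    """Sample on until 'significance' is reached; H0 is true throughout."""
    xs = []
    for _ in range(n_max):
        xs.append(random.gauss(0.0, 1.0))
        if z_test_rejects(xs):
            return True   # stop and report a 'significant' result
    return False

random.seed(0)
runs = 2000
rate = sum(try_and_try_again() for _ in range(runs)) / runs
print(rate)   # several times the nominal .05
```

Allowing ever more looks drives the realized rejection probability toward one: under the null, the running z-statistic eventually exceeds any fixed critical value with probability one. This is exactly the sense of Mayo's remark.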
Counterexamples of the above type raise two kinds of worries. The first is a pre-experimental one, namely, that certain stopping rules inevitably drive our inference in a particular direction. Hence, Bayesians apparently neglect bias and manipulation as a source of impressively high posterior probabilities. This worry is addressed by the results of Kadane, Schervish, and Seidenfeld (1996), who prove that the posterior probability of a hypothesis cannot be arbitrarily manipulated: if we stop an experiment if and only if the posterior of a hypothesis rises above a certain threshold, there will be a substantial chance that the experiment never terminates. It is therefore not possible to reason to a foregone conclusion and to appraise a wrong hypothesis, or to discredit a true hypothesis, come what may.

Of course, this does not mean that Bayesians should deny the importance of experimental design. On the contrary, when each single sample comes at a certain cost (as in medical trials, where surveillance is expensive), Bayesians and frequentists alike have to design the experiment in such a way that the expected sample size is minimized. Indeed, a large body of literature deals with designing sequential experiments, from both Bayesian and frequentist perspectives (e.g., Wald 1947; Armitage 1975; Berry 1987). So both sides are well advised to affirm the pre-experimental relevance of stopping rules and also of error probabilities. The crucial question is the postexperimental issue: once we have observed the data, do we gain anything from learning the stopping rule according to which they have been produced?

To decide the question, note that in science, hypothesis testing is used to substantiate decisions of all kinds, such as establishing a working hypothesis for further research, moving a trial into the next stage, or approving a new medical drug. Thus, we should adopt a decision-theoretic perspective in which gains and losses for right and wrong decisions, and the risk of various testing strategies, are taken into account. As I do not want to beg the question, I focus on a frequentist understanding of risk with respect to hypothesis tests and decision rules. In the remainder of the section I demonstrate that not the Bayesians but the frequentists are beaten at their own game.

Let me elaborate. In testing hypotheses and making decisions, frequentists rely on prespecified error probabilities. In particular, they specify the level of the type I error, the probability of erroneously rejecting the null hypothesis (e.g., $\alpha = .05$), and aim at the most powerful test (i.e., the test with the lowest type II error) at this level. This gives a decision rule for accepting or rejecting the null hypothesis. In particular, upon learning that stopping rule $t_1$ was used, frequentist inference interprets the data as produced by the statistical model induced by $t_1$. In other words, the frequentist hypothesis test (usually the most powerful test at level $\alpha$) and the associated decision rule are based on calculations in that model; and likewise if they learn that $t_2$ was used, and so forth. Such a stopping rule-sensitive procedure is, in the frequentist understanding, preferred to a procedure that interprets the data as generated by an arbitrary stopping rule. I take this to be the canonical way to phrase the postexperimental relevance of stopping rules in frequentist terms (see Schervish, Seidenfeld, and Kadane 2002 on fixed-level testing). In the calculations below, this standpoint is expressed in the decision rule $d_S$.

Now, assume that the following conditions are met:

1. Let $c \in \mathbb{R}^m$ be the parameter of interest, with $H_0 : c \in \Omega_0 \subseteq \mathbb{R}^m$ and $H_1 : c \in \Omega_1 \subseteq \mathbb{R}^m$. Let $(\mathcal{X}, \mathcal{A}, P_c,\ c \in \Omega_0 \cup \Omega_1)$ be the corresponding statistical model, with observed data $x \in \mathcal{X}^n$.

2. Let $S_x$ be the set of noninformative stopping rules $t$ such that, for all $y \in \mathcal{X}^\infty$, if $y_i = x_i$ for all $i \leq n$, then $t(y) = n$. In other words, $S_x$ is the set of (noninformative) stopping rules that could have been used to generate the data $x$.

3. Let $m$ be a probability measure on $(S_x, \mathcal{B})$, and let $d_S : S_x \to \{0, 1\}$, for each $t \in S_x$, be the following 0-1 decision rule: $H_0$ is rejected if and only if $H_1$ passes, conditional on the observed data $x$, an $\alpha$-level significance test against $H_0$ in the model $(\mathcal{X}^n, \mathcal{A}^n, P^t_{\Omega_0}, P^t_{\Omega_1})$.

4. Let $d_t$ be a 0-1 decision rule that interprets the data $x$ invariably as a result of an experiment with stopping rule $t$ and rejects $H_0$ if and only if $H_1$ passes, conditional on the observed data $x$, an $\alpha$-level significance test against $H_0$ in the model $(\mathcal{X}^n, \mathcal{A}^n, P^t_{\Omega_0}, P^t_{\Omega_1})$. Since $t$ is treated as a constant, either $d_t = 0$ or $d_t = 1$.

5. Let $L = (l_{ij})_{i,j \in \{0,1\}}$ be the loss matrix, $l_{ij}$ being the loss suffered by opting for $H_i$ when $H_j$ is true, with $l_{00} < l_{10}$ and $l_{01} > l_{11}$.

Proposition. Assume conditions 1–5. Let $R(c, \cdot)$ be the frequentist risk of a decision rule, understood as the expected loss if $c$ happens to be the true parameter. Then:

• For each $t \in S_x$, either $R(c, d_t) < R(c, d_S)$ for all $c \in \Omega_0$, or $R(c, d_t) < R(c, d_S)$ for all $c \in \Omega_1$.
• For each $c \in \Omega_0 \cup \Omega_1$, either $R(c, 0) < R(c, d_S)$ or $R(c, 1) < R(c, d_S)$.

Proof. For each $t \in S_x$, either $d_t = 0$ or $d_t = 1$. Assume first $d_t = 0$, and let $c \in H_0$. Then $R(c, d_t) = l_{00}$ and $R(c, d_S) = m_0 l_{00} + (1 - m_0) l_{10}$, where $m_0 := m\{t \in S_x \mid H_0$ is accepted on the basis of $x$ in $P^t_{\Omega_0}\}$; hence

$$R(c, d_S) - R(c, d_t) = m_0 l_{00} + (1 - m_0) l_{10} - l_{00} = (1 - m_0)(l_{10} - l_{00}) > 0.$$

Similarly, for $d_t = 1$, where we choose $c \in H_1$:

$$R(c, d_S) - R(c, d_t) = m_0 l_{01} + (1 - m_0) l_{11} - l_{11} = m_0 (l_{01} - l_{11}) > 0.$$

The second part of the proposition follows immediately. ∎

Corollary. Preferring $d_S$ over $d_t = 0$ and $d_t = 1$ leads to incoherence, for any value of $c$, in the sense that a Dutch book (namely, a sure loss) can be constructed against these preferences.

Proof. Follows straightforwardly from the second part of the proposition. Compare the argument given in Section 5.1 of Schervish, Kadane, and Seidenfeld 2003.

Remark 1. The proposition sounds complicated, but it merely captures the intuitive conjecture that which decision rule minimizes the frequentist risk depends on the true value of $c$. Also note that the result is independent of $m$; that is, when we get to know, postdata, which stopping rule was used, it does not matter whether that particular stopping rule was likely to be chosen at the outset.

Remark 2. The frequentist's dilemma bears a close relationship to the problem of testing a hypothesis at a fixed level when a random choice between different experiments is made (Cox 1958) or when the value of a nuisance parameter is unknown (Schervish et al. 2003). In both cases, sticking to fixed-level testing leads to incoherence.[8]

[8] Teddy Seidenfeld reminded me that $d_S$ is an inadmissible (dominated) decision rule, in the sense that a test that is randomized over the elements of $S_x$ could achieve a lower type II error than $d_S$ while maintaining the same type I error level $\alpha$ (see Cox 1958). However, since $m$ is in general unknown, this remains a result of purely theoretical interest.
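The arithmetic of the proposition can be made tangible with illustrative numbers (a toy sketch; the 0-1 losses and the value $m_0 = 0.7$ are hypothetical choices of mine, not the paper's):

```python
# 0-1 loss: no loss for a correct verdict, unit loss for a wrong one.
l00, l10, l01, l11 = 0.0, 1.0, 1.0, 0.0

# Suppose that under 70% (by the measure m) of the candidate stopping
# rules in S_x, the data x lead the alpha-level test to accept H0:
m0 = 0.7

# If the true parameter c lies in Omega_0:
risk_dS_H0 = m0 * l00 + (1 - m0) * l10   # 0.3
risk_d0_H0 = l00                         # 0.0 -> d_t = 0 beats d_S

# If c lies in Omega_1:
risk_dS_H1 = m0 * l01 + (1 - m0) * l11   # 0.7
risk_d1_H1 = l11                         # 0.0 -> d_t = 1 beats d_S

print(risk_dS_H0, risk_d0_H0, risk_dS_H1, risk_d1_H1)
```

Whatever the true value of $c$, one of the two constant rules has strictly lower risk than the stopping rule-sensitive rule $d_S$, and which one it is depends on $c$; this is the incoherence that the corollary converts into a sure loss.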
A practical application of this result is an experiment in which we are told the data but not the stopping rule. Assume that postmortem elicitation would take some time and effort. The above results tell us that if we decide to treat the data as generated by, for example, a fixed sample size experiment, we will do better than waiting for the true stopping rule to be reported for some values of $c$, while doing worse for others. Thus, in a frequentist framework, there can be no general argument for taking the stopping rule into account, as opposed to neglecting it. More precisely, if a frequentist prefers stopping rule-sensitive fixed-level testing to a fixed-level test of $H_0$ with respect to an arbitrary stopping rule, her set of preferences is incoherent. In the medical trial example given at the outset, this implies that caring about the actually used stopping rule (instead of treating the trial as, say, a fixed-sample experiment) makes certain presuppositions about our beliefs concerning the drug efficacy $c$: for certain values of $c$, the expected loss will decrease, while for others it will increase. Thus, prior expectations about $c$ have to be formulated to decide between the two options. But these kinds of expectations about $c$ (such as prior distributions) are what frequentist statisticians and philosophers of statistics, by the very nature of their approach, want to avoid.[9]

[9] Frequentists, while conceding that their decision rule $d_S$ is strictly speaking incoherent, might maintain that it is at least risk averse in the following sense: the expected loss of $d_S$ will always figure between the expected losses of $d_t = 0$ and $d_t = 1$. This argument will be pursued in further work.

Bayesians, on the other hand, avoid these troubles by assessing evidence in terms of Bayes factors and posterior probabilities, which are not at all affected by stopping rules. Hence, the practical argument against the postexperimental relevance of stopping rules from Section 2 obtains a theoretical, decision-theoretic vindication.

4. Evaluation: A Philosopher's Conclusion. The debate about the relevance of experimental design and stopping rules is blurred by a lack of clarity about which kind of relevance is meant; equivocation and confusion result. Moreover, the debate is characterized by a mutual deadlock. To resolve it, I have suggested distinguishing pre- and postexperimental relevance and choosing a position that corresponds to the practical needs of empirical science. Such a position has to reject the postexperimental, evidential relevance of stopping rules. First, that standpoint would yield measures of evidence that are easily manipulable, without any means of control on behalf of scientific institutions; such measures of evidence cannot play their proper role in scientific communication. Second, that standpoint would also lead to decision-theoretic incoherence. In particular, a frequentist who claims the postexperimental relevance of stopping rules cannot avoid referring to prior expectations about the unknown parameter, undermining the very foundations of frequentist inference.

The valid core of the frequentist argument is the pre-experimental relevance of stopping rules as a means of providing for efficient, cost-minimizing sampling. The lack of disentanglement between the two concepts of relevance has obfuscated the debate and led to the belief that stopping rules should matter postexperimentally, too. This belief is, however, fallacious. Hence, experimental design, and in particular the design of stopping rules, remains indispensable for scientific inference, but in a narrower sense than frequentist statisticians and philosophers of science believe.

REFERENCES

Armitage, Peter (1975), Sequential Medical Trials. Oxford: Blackwell.

Berger, James O., and Donald A. Berry (1988), "The Relevance of Stopping Rules in Statistical Inference" (with discussion), in S. Gupta and J. O. Berger (eds.), Statistical Decision Theory and Related Topics IV. New York: Springer, 29–72.

Berger, James O., and Robert L. Wolpert (1984), The Likelihood Principle. Hayward, CA: Institute of Mathematical Statistics.

Berry, Donald A. (1987), "Statistical Inference, Designing Clinical Trials, and Pharmaceutical Company Decisions", Statistician 36: 181–189.

Birnbaum, Allan (1962), "On the Foundations of Statistical Inference", Journal of the American Statistical Association 57: 269–306.

Cox, David R. (1958), "Some Problems Connected with Statistical Inference", Annals of Mathematical Statistics 29: 357–372.
Edwards, Ward, Harold Lindman, and Leonard J. Savage (1963), "Bayesian Statistical Inference for Psychological Research", Psychological Review 70: 450–499.

Goodman, Steven N. (1999), "Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy", Annals of Internal Medicine 130: 995–1004.

Howson, Colin, and Peter Urbach (2006), Scientific Reasoning: The Bayesian Approach. 3rd ed. La Salle, IL: Open Court.

Kadane, Joseph B., Mark J. Schervish, and Teddy Seidenfeld (1996), "When Several Bayesians Agree That There Will Be No Reasoning to a Foregone Conclusion", in Lindley Darden (ed.), PSA 1996: Proceedings of the 1996 Biennial Meeting of the Philosophy of Science Association, vol. 1. East Lansing, MI: Philosophy of Science Association, S281–S289.

Lele, Subhash (2004), "Evidence Functions and the Optimality of the Law of Likelihood" (with discussion), in Mark Taper and Subhash Lele (eds.), The Nature of Scientific Evidence. Chicago: University of Chicago Press, 191–216.

Mayo, Deborah G. (1996), Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.

Mayo, Deborah G., and Michael Kruse (2001), "Principles of Inference and Their Consequences", in D. Corfield and J. Williamson (eds.), Foundations of Bayesianism. Dordrecht: Kluwer, 381–403.

Mayo, Deborah G., and Aris Spanos (2006), "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction", British Journal for the Philosophy of Science 57: 323–357.

Royall, Richard (1997), Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall.

Savage, Leonard J. (1962), The Foundations of Statistical Inference: A Discussion. London: Methuen.

Schervish, Mark (1995), Theory of Statistics. New York: Springer.

Schervish, Mark J., Joseph B. Kadane, and Teddy Seidenfeld (2003), "Measures of Incoherence: How Not to Gamble If You Must", in J. Bernardo et al. (eds.), Bayesian Statistics 7: Proceedings of the 7th Valencia Conference on Bayesian Statistics. Oxford: Oxford University Press, 385–402.

Schervish, Mark J., Teddy Seidenfeld, and Joseph B. Kadane (2002), "A Rate of Incoherence Applied to Fixed-Level Testing", Philosophy of Science 69 (Proceedings): S248–S264.

Wald, Abraham (1947), Sequential Analysis. New York: Wiley.