Bias and Conditioning in Sequential Medical Trials

Cecilia Nardini and Jan Sprenger*†

Philosophy of Science, Vol. 80, No. 5 (December 2013), pp. 1053–1064. Published by the University of Chicago Press on behalf of the Philosophy of Science Association. Stable URL: https://www.jstor.org/stable/10.1086/673732

*To contact the authors, please write to: Cecilia Nardini, University of Milan and European Institute of Oncology (IEO), Campus IFOM-IEO, Via Adamello, 16, 20139 Milan, Italy; e-mail: nardini.folsatec@gmail.com. Jan Sprenger, Tilburg Center for Logic and Philosophy of Science (TiLPS), Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands; e-mail: j.sprenger@uvt.nl.

†The authors would like to thank the senior and junior members of the FOLSATEC PhD program, as well as David Teira and the PSA audience. Jan Sprenger would also like to thank the Netherlands Organization for Scientific Research (NWO) for supporting this research through Veni grant no. 016.104.079.

Randomized controlled trials are currently the gold standard within evidence-based medicine. Usually they are monitored for early signs of effectiveness or harm. However, evidence from trials stopped early is often charged with bias toward implausibly large effects. To our mind, this skeptical attitude is unfounded and caused by the failure to perform appropriate conditioning in the statistical analysis of the evidence. We contend that conditional hypothesis tests give a superior appreciation of the obtained evidence and significantly improve the practice of sequential medical trials, while staying firmly rooted in frequentist methodology.

1. Introduction. Randomized controlled trials (RCTs)—trials where patients are randomly assigned to a treatment and a control group, while controlling for possible confounders—are currently the gold standard within evidence-based medicine (Worrall 2007). Usually they are conducted as sequential trials allowing for monitoring for early signs of effectiveness or harm.

In sequential trials, data are typically monitored as they accumulate. That is, we have interim looks at the data, and we may decide to stop the trial before the planned sample size is reached. By terminating a trial when overwhelming evidence for the effectiveness or harmfulness of a new drug is
available, we can bound the prohibitive costs of a medical trial and protect in-trial patients against receiving inferior treatments. Thus, monitoring contributes to meeting ethical and epistemic requirements that clinical investigators are confronted with.

However, the early termination of sequential trials raises an important ethical concern: Is it mandatory to stop a trial as soon as the new treatment shows convincing signs of superiority? Or should the trial be continued in order to achieve a result that would convince the wider medical community of the superiority of the new treatment? On the one hand, the health of actual patients must not be jeopardized by administering an inferior treatment; on the other hand, establishing sound and univocal scientific conclusions will facilitate an effective cure of future patients.

The issue is complicated by the fact that evidence from trials stopped early is often met with skepticism in the medical literature: "RCTs stopped early for benefit . . . show implausibly large treatment effects. . . . Clinicians should view the results of such trials with skepticism" (Montori et al. 2005, 2203). This standpoint is affirmed by the recent STOPIT-2 metastudy, where Bassler et al. (2010, 1187) blame truncated RCTs for "appreciable overestimates of effect."

While we cannot adjudicate the far-reaching question about the ethical legitimacy of monitoring, we side with Worrall (2008, 418) that "no informed view of the ethical issues . . . can be adopted without first taking an informed view of the evidential-epistemological ones." In particular, we think that the skeptical attitude about trials stopped early for benefit stems from a fallacious statistical interpretation of such trials. These misinterpretations are, to our mind, mainly caused by a lack of awareness about issues in statistical methodology that also trouble other disciplines, such as economics and psychology. Indeed, the two grand schools of statistical inference—Bayesian and frequentist inference—are in outright conflict about how to plan and to evaluate a sequential trial.

Our essay takes the following route. First, we expose the arguments for and against the presence of bias in early stopped trials and explain why this problem is related to principled questions in statistical methodology (sec. 2). Subsequently, we argue that the real problem is the use of unconditional error assessments in sequential trials, rather than the often-invoked divide between Bayesians and frequentists (sec. 3). Then we show that conditional frequentist tests reconcile the need for valid postexperimental appraisal of the evidence with the realities of the current regulatory framework in medicine and, in particular, with the implied preference for frequentist analysis (sec. 4). Finally, we wrap up our results and sketch how a superior methodological framework can improve the design and practice of sequential trials and eventually lead to better decisions (sec. 5).

2. The Assessment of Truncated Trials. The practice of stopping RCTs early for benefit has been subject to severe epistemological criticism.
Skepticism surrounds the results of these trials, due to the fact that they show implausibly large treatment effects, relative to what the medical community would be inclined to expect. In a review of 134 trials stopped early for benefit, Montori et al. (2005) point to an inverse correlation between sample size and treatment effect: the smaller the sample size achieved by the trial at the moment of stopping, the larger the estimate it provided for the effect. These findings are supported by a more recent study by Bassler et al. (2010), where truncated trials report significantly higher effects than trials that were not stopped early.

Some prominent cases seem to corroborate this skepticism. Mueller et al. (2007) report a case of two leukemia treatments where interim analyses suggested a high relative risk reduction (53% and 45%) in a particular chemotherapy regimen. However, that assessment had to be reversed after completion of the trial. In the medical community, such cases fuel mistrust toward anticipated claims of benefit and nourish the fear of promoting a treatment that is actually less efficacious. Therefore, stopping a trial early might lead to a result that the medical community does not trust, canceling the epistemic and ethical benefits that monitoring possesses in the long run.

However, not all methodologists share this pessimistic view on trials stopped early. Goodman, Berry, and Wittes (2010) observe that pronounced effect size differences between truncated and completed trials are actually predictable: highly efficacious treatments will naturally be more prone to early termination for benefit. Hence, the observed difference in estimated effect size is precisely what we should expect. Comparing truncated to completed trials amounts, as highlighted by Berry, Carlin, and Connor (2010), to selecting the trials to be compared on the basis of their outcome.

In this context, prior knowledge or empirically based prior expectations are highly relevant for sound decision making. Unfortunately, at present they enter the final decisions only in a methodologically unsatisfactory ad hoc way. This observation suggests that systematic use of Bayesian inference may address the problem. A Bayesian represents subjective uncertainty by means of a prior probability distribution over the values of the quantity of interest (e.g., relative risk reduction). By means of Bayes's theorem, this distribution is updated to a posterior probability distribution that synthesizes the observed evidence with the background knowledge. In the Bayesian framework, implausibly large observed effects can be balanced by prior expectations and lead to a more conservative conclusion than in standard frequentist methodology. In particular, it can be explained that truncated trials provide, ceteris paribus, less confidence than trials with a comparable effect size that were completed (Goodman 2007). The smaller the actual sample, the more will the posterior distribution resemble the prior distribution, for a given effect size. So it appears that the worries of Montori et al. (2005) and Bassler et al. (2010)—overestimation of treatment effect in truncated RCTs—could be alleviated by switching the statistical framework.
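The shrinkage mechanism can be made concrete with a toy calculation. The sketch below is our own illustration, not an analysis from the literature: it assumes a Beta-binomial model with a skeptical Beta(10, 10) prior on the response rate and invented trial counts, and shows how the same large observed effect is discounted heavily when it comes from a small, truncated sample but hardly at all when it comes from a completed one.

```python
# Beta-binomial sketch (all numbers are illustrative assumptions):
# theta = response rate under the new treatment; the skeptical prior
# Beta(10, 10) is centered on theta = 0.5, i.e., "no effect."
a0, b0 = 10.0, 10.0

def posterior_mean(successes, n):
    """Posterior mean of theta after updating Beta(a0, b0) on n Bernoulli trials."""
    return (a0 + successes) / (a0 + b0 + n)

# Truncated trial: 12/15 responders, an observed rate of 0.80.
print(posterior_mean(12, 15))   # ~0.63: the prior pulls the estimate back
# Completed trial with the same observed rate: 80/100 responders.
print(posterior_mean(80, 100))  # 0.75: the data now dominate the prior
```

With 15 patients the posterior stays close to the skeptical prior; with 100 patients the same observed rate yields a much higher posterior mean, which is exactly the ceteris paribus point about truncated versus completed trials.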
Despite the advantages just outlined, there are some serious counterarguments to the viability of Bayesianism in clinical trials. First of all, the specification of a prior probability function (and a decision model) is problematic in a number of ways (see Moyé 2008). Second, in Bayesian statistics, experimental design is apparently irrelevant for the postexperimental conclusions. This is unacceptable to regulatory bodies that are keen to promote proper design of medical trials as a means to ensure the validity of trial results (e.g., Food and Drug Administration 2010).

Even though some of these worries are regulatory rather than epistemological, they are certainly legitimate. Indeed, we believe that solving the interpretational problems with truncated trials does not require one to pass from the frequentist to the Bayesian paradigm. As we will argue in the upcoming sections, it is more fruitful to turn to a different distinction: namely, to replace unconditional by conditional procedures.

3. Problems with Unconditional Inference in Sequential Medical Trials. Sequential medical trials usually control the reliability of a testing procedure from a preexperimental point of view, by means of type I and type II error rates. These error probabilities are extremely important for proper experimental design, and they get a lot of attention from a regulatory point of view. Moreover, Mayo and Kruse (2001), among others, have argued that if the sampling plan is violated, error probabilities cannot be properly controlled and are actually inflated far beyond acceptable levels.

However, adherence to a proper sequential sampling plan is not sufficient to secure a reliable result. Arguably, what is most disturbing to the medical community is the fact that, according to current procedures, a truncated trial has prima facie the same reliability as a trial carried to the planned end. This is because Neyman and Pearson's type I and II error rates are unconditional quantities; that is, they are insensitive to whether the data are just at the significance boundary or far beyond it.

In line with this observation, we contend that the unconditional nature of Neyman-Pearson hypothesis tests is the culprit for their epistemological shortcomings. To motivate this claim, we walk the reader through an example by Cox (1958) and Royall (1997, 74–75). Suppose that we test H0: N(0, σ²) against H1: N(1, σ²) with known σ² and that the toss of a fair coin decides whether we draw N = 1 or N = 100 independently and identically distributed observations. It seems natural to apply the most powerful test at the 5% level in either case. However, the probabilistic mixture of the two most powerful tests at the 5% level is not the most powerful test in the overall experiment. We can do better if we reject H0 for x̄ > 1.282 in the case of N = 1, while rejecting H0 if x̄ > 0.508 in the case of N = 100. Both procedures are tests at the 5% level, but the second, "gerrymandered" test has a greater power (69%) than the mixture of unconditional tests (63%). One may be inclined to dismiss the second test because not all of its components are tests at the 5% level. In the N = 1 case, the nominal significance level of the test is 10%. However, from an unconditionalist (preexperimental) viewpoint, only the overall error rates should count.
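These power figures are easy to verify. The following sketch is a minimal check of the example, taking σ = 1 (consistent with the critical values quoted above); scipy supplies the normal tail probabilities.

```python
from scipy.stats import norm

# Cox's mixture example: a fair coin picks N = 1 or N = 100 observations
# from N(mu, 1); we test H0: mu = 0 against H1: mu = 1 via the sample mean.

def power(crit, n, mu=1.0):
    """P(reject H0) = P(sample mean > crit) when the true mean is mu."""
    return norm.sf((crit - mu) * n ** 0.5)

# Mixture of the two most powerful 5%-level tests: critical value z_0.05 / sqrt(N).
z05 = norm.isf(0.05)
print(0.5 * (power(z05, 1) + power(z05 / 10, 100)))   # ~0.63

# "Gerrymandered" test: a 10%-level component for N = 1 and a ~0%-level
# component for N = 100, so the overall size is still (0.10 + ~0)/2 = 5%.
print(0.5 * (power(1.282, 1) + power(0.508, 100)))    # ~0.69
```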
Here, the superior power features speak for the second, gerrymandered test. This feature of the prevalent Neyman-Pearson methodology reveals the tension between the preexperimental design of unconditional procedures and the need to efficiently learn from the actual data. Unconditional error rates and confidence intervals do not address that second goal: "Now if the object of the analysis is to make statements by a rule with certain specified long-run properties, the unconditional test . . . is in order. . . . If, however, our objective is to say what we can learn from the data we have, the unconditional test is surely no good" (Cox 1958, 360). The example can, of course, be easily generalized. It undermines the view that unconditional, preexperimental error probabilities can qualify the goodness of an inference.

Therefore, practitioners who rely on unconditional procedures have to find informative and reliable postdata assessments of the evidence. Often, they report the observed p-value to quantify the conclusiveness of the rejection of the null. However, p-values really combine the worst of all worlds. Since comprehensive and devastating criticisms of using p-values in scientific experiments have been delivered elsewhere (Royall 1997; Goodman 1999), we only mention their most fundamental failures: they neither possess a valid frequency interpretation nor do they provide a useful measure of confidence in the null hypothesis.

Moving to confidence intervals is often suggested as a way of circumventing the p-value problem (e.g., Cumming and Finch 2005). However, a 95% confidence interval merely specifies the set of parameter values that are consistent with the observation at the 95% level. This does not mean that we should have 95% confidence that the confidence interval includes the parameter value. In fact, the degree of confidence is just an average coverage rate over intervals from repeated random samples; it is not the coverage probability of the one particular interval that the investigator happens to get. Therefore, some confidence intervals may include the entire sample space (see Seidenfeld 1981), raising the question of whether the entire notion is a misnomer.

These problems of unconditional inference can be overcome by conditioning on the relevant chunks of information. In the next section, we will see how conditional inference may resolve the methodological confusion about interpreting truncated RCTs without abandoning the framework of frequentist statistics.

4. Conditional Frequentist Inference. Conditional inference tries to improve upon unconditional procedures by quantifying the degree of confidence that we can have in our conclusions as a function of the observed evidence. More precisely, conditional inference builds on the strength of the observed evidence. As we will show in this section, it can be justified from both the Bayesian and the frequentist perspective. The idea comes up for the first time in Cox's (1958) seminal paper and has been developed later by Kiefer (1977) and Berger (2003), together with various coauthors. The main idea can be motivated by a very simple example (Kiefer 1977; Berger 2003).
Two observations X1 and X2 are taken with probability law

\[
X_i = \begin{cases} \theta + 1 & \text{with probability } 1/2, \\ \theta - 1 & \text{with probability } 1/2. \end{cases}
\]

If we now construct a confidence interval for θ, then the interval C_θ(·, ·) defined by

\[
C_\theta(X_1, X_2) := \begin{cases} X_1 + 1 & \text{if } X_1 = X_2, \\ (X_1 + X_2)/2 & \text{if } X_1 \neq X_2, \end{cases}
\]

has an unconditional coverage of 75%. Yet this does not seem to be a sensible conclusion regarding the confidence that the data warrant with respect to the true value of θ. Depending on whether we observe |X1 − X2| = 0 or |X1 − X2| = 2, we are entitled to a statement with (a posteriori) confidence 50% or 100%, respectively. The unconditional coverage of 75% neglects that, after learning the strength of the evidence (i.e., the value of |X1 − X2|), we are in a much better position to assess the confidence which the data grant about our inference. Thus, conditioning on the value of |X1 − X2| improves the accuracy of our conclusions (see Cox 1958, 361–63).
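A short simulation makes these coverage claims vivid. It is our own illustration of the example; the true value θ = 4.0 is arbitrary.

```python
import random

# X1, X2 are theta +/- 1 with equal probability; the interval is {X1 + 1}
# when X1 = X2 and {(X1 + X2)/2} when X1 != X2.
random.seed(1)
theta = 4.0
hits = {True: 0, False: 0}   # coverage counts, keyed by the event X1 == X2
runs = {True: 0, False: 0}
for _ in range(100_000):
    x1 = theta + random.choice((-1.0, 1.0))
    x2 = theta + random.choice((-1.0, 1.0))
    same = x1 == x2
    guess = x1 + 1 if same else (x1 + x2) / 2
    runs[same] += 1
    hits[same] += guess == theta
print(hits[True] / runs[True])        # ~0.50: conditional coverage, |X1 - X2| = 0
print(hits[False] / runs[False])      # 1.00: conditional coverage, |X1 - X2| = 2
print(sum(hits.values()) / 100_000)   # ~0.75: the unconditional coverage
```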
It is also noteworthy that the probability distribution of |X1 − X2| does not depend on the value of θ. That is, |X1 − X2| is an ancillary statistic with regard to θ. In particular, conditioning on the value of |X1 − X2| is quite different from Bayesian conditionalization: where Bayesians change their subjective probability distributions by conditioning on the entire data, conditioning on the value of |X1 − X2| just helps to better appreciate the (frequentist) interpretation of the data.

If this idea is applied to hypothesis testing, which is the major issue in medical trials, unconditional error rates are replaced by conditional error probabilities. In the following, we will outline the basic idea of conditional tests, following Berger, Brown, and Wolpert (1994).

Consider, for the purpose of mathematical convenience, the case of testing a point null hypothesis H0: θ = θ0 against the simple alternative H1: θ = θ1 in some probability model (X, B(X), θ ∈ Θ). Define f0(x) and f1(x) as the probability densities of data x ∈ X under the hypotheses H0 and H1, and let the Bayes factor B(x) := f0(x)/f1(x) be the ratio of the probability density functions. Now, let F0 and F1 be the cumulative distribution functions corresponding to the Bayes factor:

\[
F_0(c) := P_{H_0}(B(X) \leq c), \qquad F_1(c) := P_{H_1}(B(X) \leq c).
\]

We now divide X into a partition (X_s), s ∈ [0, 1], defined by

\[
X_s := \{\, x \in X \mid B(x) = s \ \vee\ B(x) = F_0^{-1}(1 - F_1(s)) \,\}. \tag{1}
\]

The different X_s represent, intuitively, different observed strengths of evidence. This can also be made precise mathematically: under the assumption F0(1) = 1 − F1(1), which is satisfied for many distributions used in practice, the X_s have the same probability density under H0 and H1, for all values of s. In other words, their distribution is independent of which hypothesis is true (Berger et al. 1994, 1789–90).

This ancillarity property is shared with the statistic |X1 − X2| in the above toy example. Therefore, X_s is excellently suited for the purpose of conditioning: it takes the observed strength of the evidence into account without already telling us something about the parameter of interest. Thus, conditioning exploits a crucial strength of the Bayesian paradigm—identifying a sensible measure of evidence—without assigning a subjective probability to competing hypotheses.

The conditional error probability can now be calculated by conditioning on the particular set X_s in which the observed data fall. In particular, we can define a conditional frequentist test by

\[
T^*(X) = \begin{cases} \text{reject } H_0 & \text{if } B(X) < 1, \\ \text{accept } H_0 & \text{if } B(X) \geq 1, \end{cases} \tag{2}
\]

and for observed B(x) = s, we report conditional error probabilities

\[
\alpha(s) = P_{H_0}(\text{reject } H_0 \mid X \in X_s) = \frac{s}{1+s}, \tag{3}
\]

\[
\beta(s) = P_{H_1}(\text{accept } H_0 \mid X \in X_s) = \frac{1}{1+s}, \tag{4}
\]

where the latter equalities have been proven by Berger et al. (1994, theorem 1). Clearly, by using the conditional instead of the unconditional error probabilities, we gain a much better appreciation of the chance of a wrong decision, given the particular data that we have observed. The higher the Bayes factor, the more confident we can be about an acceptance of the null, and vice versa. In particular, the classical, unconditional test just detects whether the data are within or outside the rejection region (and leaves the rest to the notorious p-values), whereas the conditional test allows for a fine-grained, properly frequentist discrimination among trials with significant outcomes.

We turn now to briefly discussing a couple of objections that could be made from within the frequentist perspective. First, it could be argued that T* makes it far too easy to reject the null (B(X) < 1), whereas in medicine, evidence has to be really strong before we are convinced of the efficacy of a new treatment and approve of the drug. To this we simply respond that T* has been selected because of its simplicity; we can easily change the rejection region according to contextual requirements. To obtain a sensible conditional test, we will often have to use nonancillary conditioning statistics and to include a no-decision region (Berger, Boukai, and Wang 1997, 145–47). However, these features align well with the caution toward premature conclusions that prevails in the medical community and do not pose any problem for the practitioner.

Second, there may be worries about the scope of the above procedure, which we have only explained for the easiest possible case of hypothesis testing. However, Berger et al. (1997) have extended conditional tests to simple versus composite testing problems and, in particular, to the two-sided null hypothesis testing problems that frequently occur in RCTs.

Third, the use of the Bayes factor may indicate that the conditional test is actually a Bayesian test in frequentist clothes. Indeed, for impartial priors p(H0) = p(H1) = 1/2, the posteriors

\[
p(H_0 \mid x) = \bigl(1 + B(x)^{-1}\bigr)^{-1} = \frac{B(x)}{1 + B(x)}, \qquad
p(H_1 \mid x) = \bigl(1 + B(x)\bigr)^{-1} = \frac{1}{1 + B(x)}
\]

just correspond to the conditional error probabilities for rejecting and accepting H0, respectively. However, B(X) possesses a frequentist interpretation, too, since it identifies the most powerful frequentist test in the simple versus simple testing problem.¹ Thus, Bayesians and frequentists can conduct the same (conditional) test and obtain the same numerical conclusions.

1. This is the content of the Neyman-Pearson lemma. Furthermore, Berger (2003) introduced a conditional test that relies on the p-value as the conditioning statistic and yields the same postdata error probabilities as T*.
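In code, the test T* and its error report of equations (2)–(4) amount to a few lines. The sketch below covers only the simple-versus-simple case discussed here and takes the Bayes factor as given.

```python
def conditional_test(b):
    """Conditional frequentist test T* (eq. 2) with the conditional error
    report of eqs. (3)-(4), for an observed Bayes factor b = B(x) = f0/f1."""
    if b < 1:
        return "reject H0", b / (1 + b)   # alpha(s): conditional type I error
    return "accept H0", 1 / (1 + b)       # beta(s): conditional type II error

# Under impartial priors these reports equal the posteriors p(H0|x) = B/(1+B)
# and p(H1|x) = 1/(1+B), so Bayesian and frequentist readings coincide.
print(conditional_test(0.09))   # ('reject H0', 0.0825...): strong evidence
print(conditional_test(2.5))    # ('accept H0', 0.2857...): weak evidence for H0
```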
For the medical practitioner, philosophical questions about the interpretation of probability are clearly secondary as long as there is methodological agreement on procedures and postexperimental data assessment (see Berger 2003). In this sense, conditional inference is a genuine reconciliation of Bayesian and frequentist methodology and a real asset for practitioners.

We would like to conclude this section with an application of conditional inference to sequential medical trials. The example involves a trial of adjuvant therapy in resectable hepatocellular carcinoma (Lau et al. 1999). The trial was stopped early based on interim findings, but additional data were available after the decision to stop was made. Pocock and White (1999) describe the situation in detail:

    At the planned interim analysis, the local disease recurrence rates for the active treatment (intra-arterial lipiodol-iodine-131) and control (no adjuvant treatment) groups were three/14 (21%) and 11/16 (69%) respectively (p = 0.01). According to the predefined stopping rule, p < 0.029 was sufficient for early stopping. . . . Thus, the investigators decided to stop the trial. [However], 13 more patients were randomised before the trial was stopped, and the investigators also decided to postpone analysis while patients already randomised were followed up. Hence, the report (18 months after the trial was stopped) reveals updated recurrence rates of six/21 (29%) and 13/22 (59%), respectively (p = 0.04). Thus the absolute difference in recurrence rates shrank from 48% to 30% during the interval between stoppage and publication. (1999, 944)

Such shrinkage of the estimated benefit between the interim and the final analysis is precisely what fuels clinicians' worries about "stopping on a random high" and adds to their skepticism about truncated trials.

In this situation, conditional error rates can provide real guidance. We set up an alternative hypothesis H1 according to Lau et al.'s expectations that "131I-lipiodol would reduce the rate of recurrence [postulated to be 50%] by 50% and double the disease-free survival rate" (1999, 798). Using this value in the calculation of the Bayes factor, B(x) = 0.09, yields a conditional type I error rate of α* = 9% at the interim analysis, instead of the unconditional error rate of α = 5%.² Moreover, we can dismiss the apparently strong unconditional p-value of p = 0.01, which is just indicative of an unexpectedly high performance. By contrast, the conditional error reflects the greater statistical uncertainty associated with the small sample when the decision to stop the trial was made. At the end of the trial, the conditional test still rejects the null, but the probability of error is now higher: the calculation based on B(x) = 0.16 yields a 14% probability of error, which is in line with the reservations of the clinicians involved.

2. Since the trial was stopped following a proper group sequential rule, α remains the same regardless of when the trial is terminated, unlike in Wald's (1947) classical sequential probability ratio test.
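The text does not spell out how these Bayes factors were computed, but a back-of-the-envelope reconstruction shows where numbers like B(x) = 0.09 can come from. The model below is our own simplification, not Lau et al.'s actual analysis: treatment-arm recurrences are taken to be binomial with rate 0.5 under H0 (the postulated baseline) and 0.25 under H1 (the postulated halving), so the control arm cancels out of the likelihood ratio.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial likelihood of k recurrences among n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def bayes_factor(k, n):
    """B(x) = f0(x)/f1(x) for the treatment arm: recurrence rate 0.5 under
    H0 (no benefit) vs. 0.25 under H1 (rate halved, as Lau et al. expected)."""
    return binom_pmf(k, n, 0.5) / binom_pmf(k, n, 0.25)

for k, n in [(3, 14), (6, 21)]:   # interim and final treatment-arm data
    b = bayes_factor(k, n)
    print(round(b, 2), round(b / (1 + b), 2))   # B(x) and alpha(s), eq. (3)
# -> 0.09 / 0.08 at the interim look and 0.15 / 0.13 at the end: close to
#    the 9% and 14% reported above, which rest on the authors' fuller model.
```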
We now briefly wrap up the advantages of conditional over unconditional inference. First, the assessment of the error probability depends on the observed data and is thus far more informative than in the unconditional framework. This alleviates the interpretational problem mentioned in section 2, since the conditional error allows medical readers to assess the confidence in the outcome based on the observed data. Clearly, medical investigators should be more concerned with the actual probability of drawing the wrong inference than with the absolute (unconditional) error rate of the testing procedure, also because clinicians have to make ethical decisions for their actual patients (see Nardini 2013).

As a further point, the error probabilities (3) and (4) are independent of the stopping rule, that is, the sampling plan determining when the trial is terminated. In an RCT, the stopping rule can never be fully specified, since one cannot cover in advance all eventualities that might happen during a sequential trial. Independence from the stopping rule entails that interpretation of the results and assessment of error are possible even if the stopping rule was misspecified or could not be adhered to due to unforeseen circumstances. This is a substantial practical asset (see Sprenger 2009), as the simulation below illustrates.
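The sketch is our own illustration with assumed numbers. Applying a nominal 5%-level test after every observation inflates the unconditional type I error several-fold, whereas the conditional report α(s) = s/(1 + s) needs only the Bayes factor observed at whatever point the trial actually stopped.

```python
import math
import random

# Data generated under H0: mu = 0 (standard normal observations); we test
# H0: mu = 0 against H1: mu = 1 and peek after every observation.
random.seed(0)
REPS, MAX_N, Z_CRIT = 10_000, 50, 1.96
false_rejections = 0
for _ in range(REPS):
    total = 0.0
    for n in range(1, MAX_N + 1):
        total += random.gauss(0.0, 1.0)
        if abs(total) / n ** 0.5 > Z_CRIT:   # naive 5%-level test at each look
            false_rejections += 1
            break
print(false_rejections / REPS)   # ~0.3: far above the nominal 0.05

# The conditional report, by contrast, uses only the observed Bayes factor
# B(x) = f0(x)/f1(x) = exp(n/2 - sum(x)), however the trial came to a halt:
n, total = 10, 9.0               # a hypothetical stopped data set
s = math.exp(n / 2 - total)
print(s / (1 + s))               # alpha(s) of eq. (3), here ~0.018
```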
This should not be misunderstood as the claim that predata analysis and experimental design are superfluous. Unfortunately, Berger et al. (1994, 1803) make a claim in that direction, but given the strong emphasis on careful design by methodologists and regulatory bodies (see Moyé 2008; Food and Drug Administration 2010), such a claim is unlikely to increase the acceptance of the conditional approach among medical practitioners. We would like to stress that no such claim is required for making a case for the superiority of the conditional frequentist approach. Moreover, since conditional tests can be conducted from both a Bayesian and a frequentist perspective, practitioners do not have to decide for either camp.

Finally, there are interesting implications for the philosophy of statistics: if the "error statisticians" (e.g., Mayo 1996) are right that learning from error is indeed a cornerstone of inductive inference, then a move to conditional inference may protect their framework against the objections that we have mentioned in section 3. In particular, there is no need to tie an error-statistical methodology to unconditional inference. However, further developing this line of thought goes beyond the scope of this article.

5. Conclusions. In this article, we have analyzed the impact of statistical methodology on a substantive ethical and societal question, namely, data monitoring in sequential medical trials. In the medical literature, trials stopped early for benefit are often charged with being biased toward implausibly large treatment effects (e.g., Bassler et al. 2010).

We think that this worry is based upon a misinterpretation of sequential trials that is in turn due to shortcomings of standard frequentist procedures. It has been argued (e.g., Goodman 2007) that a Bayesian perspective overcomes this problem: if a trial is stopped early because of an implausibly large effect, blending its result with a (conservative) prior probability distribution naturally mitigates the conclusion. However, as a matter of research tradition and regulatory requirements—in particular, concerns about individual biases in generating prior distributions—the Bayesian framework does not provide an easy way out.

In this essay, we contend that the real issue is not the contrast between Bayesian and frequentist methodology. Rather, we are concerned about the shortcomings of unconditional inference. We have elaborated that while unconditional error probabilities may be helpful in the design of an experiment, they do not tell us what we have actually learned from the data. We have therefore defended proper conditioning—calculating error probabilities conditional on the strength of the observed evidence—as a way of curing the deficits of unconditional frequentist inference. This approach has a natural application to sequential testing and both a valid Bayesian and a valid frequentist interpretation.

As we have demonstrated in a brief example, this approach holds considerable promise for the interpretation of early stopped trials in medicine. The possibility of postdata assessments of the probability of an erroneous conclusion represents an invaluable asset for the practitioner and the decision maker. The results of a medical trial tell much more than the simple acceptance or rejection of a scientific hypothesis: they indicate where evidence is strong and where it is inconclusive, pointing to the need for further research. Conditional inference, we believe, can improve the methodology of clinical trials because it allows us to take this additional information into account. In conclusion, a clearer view on issues in statistical methodology can help to better appreciate data from sequential medical trials and lead to more efficient and ethically superior decisions in medical research.

REFERENCES

Bassler, Dirk, et al. 2010. "Stopping Randomized Trials Early for Benefit and Estimation of Treatment Effects." Journal of the American Medical Association 303:1180–87.
Berger, James O. 2003. "Could Fisher, Jeffreys and Neyman Have Agreed on Testing?" Statistical Science 18:1–12.
Berger, James O., Ben Boukai, and Yinping Wang. 1997. "Unified Frequentist and Bayesian Testing of a Precise Hypothesis." Statistical Science 12:133–60.
Berger, James O., Lawrence D. Brown, and Robert L. Wolpert. 1994. "A Unified Conditional Frequentist and Bayesian Test for Fixed and Sequential Simple Hypothesis Testing." Annals of Statistics 22:1787–1807.
Berry, Scott M., Bradley P. Carlin, and Jason Connor. 2010. "Bias and Trials Stopped Early for Benefit." Journal of the American Medical Association 304:156.
Cox, David. 1958. "Some Problems Connected with Statistical Inference." Annals of Mathematical Statistics 29:357–72.
Cumming, Geoff, and Sue Finch. 2005. "Inference by Eye: Confidence Intervals and How to Read Pictures of Data." American Psychologist 60:170–80.
Food and Drug Administration. 2010. "Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials." http://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm071072.htm.
Goodman, Steven N. 1999. "Toward Evidence-Based Medical Statistics." Pt. 1, "The P Value Fallacy." Annals of Internal Medicine 130:995.
———. 2007. "Stopping at Nothing? Some Dilemmas of Data Monitoring in Clinical Trials." Annals of Internal Medicine 146:882.
Goodman, Steven N., Donald Berry, and Janet Wittes. 2010. "Bias and Trials Stopped Early for Benefit." Journal of the American Medical Association 304:157.
Kiefer, Jack. 1977. "Conditional Confidence Statements and Confidence Estimators." Journal of the American Statistical Association 72:789–808.
Lau, Wan-Yee, et al. 1999. "Adjuvant Intra-arterial Lipiodol-Iodine-131 for Resectable Hepatocellular Carcinoma: A Prospective Randomised Trial." Lancet 353:797–801.
Mayo, Deborah G. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
Mayo, Deborah G., and Michael Kruse. 2001. "Principles of Inference and Their Consequences." In Foundations of Bayesianism, ed. David Corfield and Jon Williamson, 381–421. Dordrecht: Kluwer Academic.
Montori, Victor M., et al. 2005. "Randomized Trials Stopped Early for Benefit: A Systematic Review." Journal of the American Medical Association 294:2203–9.
Moyé, Lemuel A. 2008. "Bayesians in Clinical Trials: Asleep at the Switch." Statistics in Medicine 27:469–82.
Mueller, Paul S., et al. 2007. "Ethical Issues in Stopping Randomized Trials Early because of Apparent Benefit." Annals of Internal Medicine 146:878–81.
Nardini, Cecilia. 2013. "Monitoring Clinical Trials: Benefit or Bias?" Theoretical Medicine and Bioethics 34:259–74.
Pocock, Stuart, and Ian White. 1999. "Trials Stopped Early: Too Good to Be True?" Lancet 353:943–44.
Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall.
Seidenfeld, Teddy. 1981. "On After-Trial Properties of Best Neyman-Pearson Confidence Intervals." Philosophy of Science 48:281–91.
Sprenger, Jan. 2009. "Evidence and Experimental Design in Sequential Trials." Philosophy of Science 76:637–49.
Wald, Abraham. 1947. Sequential Analysis. New York: Wiley.
Worrall, John. 2007. "Evidence in Medicine and Evidence-Based Medicine." Philosophy Compass 2:981–1022.
———. 2008. "Evidence and Ethics in Medicine." Perspectives in Biology and Medicine 51:418–31.