Testing a Precise Null Hypothesis: The Case of Lindley's Paradox

Jan Sprenger*†

Philosophy of Science, Vol. 80, No. 5 (December 2013), pp. 733–744. Published by The University of Chicago Press on behalf of the Philosophy of Science Association. Stable URL: https://www.jstor.org/stable/10.1086/673730

Testing a point null hypothesis is a classical but controversial issue in statistical methodology. A prominent illustration is Lindley's Paradox, which emerges in hypothesis tests with large sample size and exposes a salient divergence between Bayesian and frequentist inference. A close analysis of the paradox reveals that both Bayesians and frequentists fail to satisfactorily resolve it. As an alternative, I suggest Bernardo's Bayesian Reference Criterion: (i) it targets the predictive performance of the null hypothesis in future experiments; (ii) it provides a proper decision-theoretic model for testing a point null hypothesis; (iii) it convincingly addresses Lindley's Paradox.

*To contact the author, please write to: Tilburg Center for Logic and Philosophy of Science, Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands; e-mail: j.sprenger@uvt.nl.

†The author wishes to thank the Netherlands Organisation for Scientific Research (NWO) for support of his research through Veni grant 016.104.079, as well as José Bernardo, Cecilia Nardini, and the audience at PSA 2012, San Diego, for providing helpful input and feedback.

1. Introduction: Lindley's Paradox. Lindley's Paradox exposes a salient divergence between subjective Bayesian and frequentist reasoning when a parametric point null hypothesis $H_0: \theta = \theta_0$ is tested against an unspecified alternative $H_1: \theta \neq \theta_0$. Since the paradox has repercussions for the interpretation of statistical tests in general, it is of high philosophical interest.

To illustrate the paradox, we give an example from parapsychological research (Jahn, Dunne, and Nelson 1987). The case at hand involved the test of a subject's claim to affect a series of randomly generated zeros and ones ($\theta_0 = 0.5$) by means of extrasensory capacities (ESP). The subject claimed that his ESP would make the sample mean differ significantly from 0.5. A very large data set ($N = 104{,}490{,}000$) was collected to test this hypothesis. The sequence of zeros and ones, $X_1, \ldots, X_N$, was described by a binomial model $B(\theta, N)$. The null hypothesis asserted that the results were
generated by a machine operating with a chance of $H_0: \theta = \theta_0 = 1/2$, whereas the alternative was the unspecified hypothesis $H_1: \theta \neq 1/2$.

Jahn et al. (1987) report that in 104,490,000 trials, 52,263,471 ones and 52,226,529 zeros were observed. Frequentists would now calculate the z-statistic, which is

$$z(x) := \sqrt{\frac{N}{\theta_0(1-\theta_0)}} \left( \frac{1}{N} \sum_{i=1}^{N} x_i - \theta_0 \right) \approx 3.61,$$

and reject the null hypothesis on the grounds of the very low p-value it induces: $p := P_{H_0}(|z(X)| \geq |z(x)|) \ll 0.01$. Thus, the data would be interpreted as strong evidence for the presence of ESP.

Compare this to the result of a Bayesian analysis. Jefferys (1990) assigns a conventional positive probability $p(H_0) = \varepsilon > 0$ to the null hypothesis, a uniform prior over the alternative, and calculates a Bayesian measure of evidence in favor of the null, namely, the Bayes factor. The evidence x provides for $H_0$ vis-à-vis $H_1$ is written as $B_{01}$ and defined as the ratio of posterior and prior odds:

$$B_{01}(x) := \frac{p(H_0 \mid x)}{p(H_1 \mid x)} \cdot \frac{p(H_1)}{p(H_0)} \approx 12.$$

Hence, the data clearly favor the null over the alternative and do not provide evidence for the presence of ESP.
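Both figures can be reproduced from the reported counts. The sketch below is my own illustration, not part of the original analyses: it computes the z-statistic with its two-sided p-value and recovers Jefferys's Bayes factor from the closed form $B_{01} = (N+1)\binom{N}{k}2^{-N}$, which holds because a uniform prior over the alternative yields the marginal likelihood $\int_0^1 \binom{N}{k}\theta^k(1-\theta)^{N-k}\,d\theta = 1/(N+1)$.

```python
import math
from scipy import stats

N, k = 104_490_000, 52_263_471   # trials and observed ones (Jahn et al. 1987)
theta0 = 0.5

# Frequentist analysis: z-statistic and two-sided p-value
z = math.sqrt(N / (theta0 * (1 - theta0))) * (k / N - theta0)
p = 2 * stats.norm.sf(abs(z))

# Bayesian analysis: B01 = (N+1) * C(N,k) * 2^(-N), evaluated on the
# log scale because the binomial coefficient overflows double precision.
log_binom = math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
B01 = math.exp(math.log(N + 1) + log_binom - N * math.log(2))

print(f"z ≈ {z:.2f}, p ≈ {p:.1e}, B01 ≈ {B01:.1f}")
# z ≈ 3.61 and p ≈ 3e-4 (highly significant), yet B01 ≈ 12 favors the null.
```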
This divergence between Bayesians and frequentists has, since the seminal paper of Lindley (1957), been known as Lindley's Paradox. In Lindley's original formulation, the paradox is stated as follows: assume that we compare observation sets of different sample size N, all of which attain, in frequentist terms, the same p-value (e.g., the highly significant value of .01). In that case, as N increases, the Bayesian evaluation of the data will become ever more inclined toward the null hypothesis. Thus, a result that seems to refute the null from a frequentist point of view can strongly support it from a Bayesian perspective. Put formally (for the case of Gaussian models):

Lindley's Paradox: In a Gaussian model $N(\theta, \sigma^2)$ with known variance $\sigma^2$, $H_0: \theta = \theta_0$, $H_1: \theta \neq \theta_0$, assume $p(H_0) > 0$ and any regular proper prior distribution on $\{\theta \neq \theta_0\}$. Then, for any testing level $\alpha \in [0, 1]$, we can find a sample size $N(\alpha)$ and independent, identically distributed data $x = (x_1, \ldots, x_N)$ such that

1. The sample mean $\bar{x}$ is significantly different from $\theta_0$ at level $\alpha$;
2. $p(H_0 \mid x)$, that is, the posterior probability that $\theta = \theta_0$, is at least as big as $1 - \alpha$ (cf. Lindley 1957, 187).

As the ESP example makes clear, Lindley's Paradox actually extends beyond Gaussian models with known variance. It exposes a general divergence between Bayesians and frequentists in hypothesis tests with large sample size.

In this article, I consider the following questions: First, which statistical analysis of the ESP example, and Lindley's Paradox in general, is most adequate? Second, which implications does Lindley's Paradox have for the methodological debates between Bayesians and frequentists? Third, does our analysis have ramifications for the interpretation of point null hypothesis tests?

I will argue that both the subjective Bayesian and the standard frequentist ways to conceive of Lindley's Paradox are unsatisfactory and that alternatives have to be explored. In particular, I believe that José Bernardo's approach (the Bayesian Reference Criterion, or BRC) holds considerable promise as a decision model of hypothesis testing, both in terms of the implied utility structure and as a reply to Lindley's Paradox.

2. Testing a Precise Null Hypothesis: Frequentist versus Bayesian Accounts. Lindley's Paradox deals with tests of a precise null hypothesis $H_0: \theta = \theta_0$ against an unspecified alternative $H_1: \theta \neq \theta_0$ for large sample sizes. But why are we actually testing a precise null hypothesis if we know in advance that this hypothesis is, in practice, never exactly true? For instance, in tests for the efficacy of a medical drug, it can be safely assumed that even the most unassuming placebo will have some minimal effect, positive or negative.

The answer is that precise null hypotheses often give us a useful idealization of reality. This is rooted in Popperian philosophy of science: "only a highly testable or improbable theory is worth testing and is actually (and not only potentially) satisfactory if it withstands severe tests" (Popper 1963, 219–20). Accepting such a theory is not understood as endorsing the theory's truth but as choosing it as a guide for future predictions and theoretical developments.

Frequentists have taken the baton from Popper and explicated the idea of severe testing by means of statistical hypothesis tests. Their mathematical rationale is that if the discrepancy between data and null hypothesis is large enough, we can infer the presence of a significant effect and reject the null hypothesis. For measuring the discrepancy in the data $x := (x_1, \ldots, x_N)$ with respect to the postulated mean value $\theta_0$ of a normal model, one canonically uses the statistic

$$z(x) := \frac{\sqrt{N}}{\sigma} \left( \frac{1}{N} \sum_{i=1}^{N} x_i - \theta_0 \right)$$

that we have already encountered above. Higher values of z denote a higher divergence from the null, and vice versa. Since the distribution of z usually varies with the sample size, some kind of standardization is required. Many practitioners use the p-value or significance level, that is, the "tail area" of the null hypothesis under the observed data, namely, $p := P_{H_0}(|z(X)| \geq |z(x)|)$. On that reading, a low p-value indicates evidence against the null: the chance that z takes a value at least as high as z(x) would be very small if the null were indeed true. Conventionally, p < .05 means significant evidence against the null and p < .01 very significant evidence. In the context of hypothesis testing, it is then common to say that the null hypothesis is rejected at the .05 level, and so on.

Subjective Bayesians choose a completely different approach to hypothesis testing. For them, scientific inference obeys the rules of probabilistic calculus. Probabilities represent honest, subjective degrees of belief, which are updated by means of Bayesian Conditionalization. A Bayesian inference about a null hypothesis is based on the posterior probability $p(H_0 \mid x)$, the synthesis of data x and prior $p(H_0)$. Bayes's theorem can be used to calculate the posterior on the basis of the prior and the likelihood of the data.
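In odds form, Bayes's theorem states that posterior odds equal the Bayes factor times prior odds. A minimal sketch (the helper name is mine) shows how the posterior probability of the null in the ESP example follows from Jefferys's $B_{01} \approx 12$:

```python
def posterior_null(prior_null: float, bf_01: float) -> float:
    """Posterior probability of H0 via Bayes's theorem in odds form:
    posterior odds = B01 * prior odds."""
    posterior_odds = bf_01 * prior_null / (1 - prior_null)
    return posterior_odds / (1 + posterior_odds)

# An even prior p(H0) = 1/2 and B01 = 12 give p(H0 | x) ≈ 0.92; even a
# skeptical prior p(H0) = 0.1 still yields a posterior of about 0.57.
print(posterior_null(0.5, 12), posterior_null(0.1, 12))
```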
If we investigate the source of Lindley's Paradox, one might conjecture that an "impartial" but unrealistically high prior for $H_0$ (e.g., $p(H_0) = 1/2$) is the culprit for the high posterior probability of the null. However, Lindley's findings persist if the analysis is conducted in terms of Bayes factors, as in the ESP example. These measures of evidence are independent of the particular prior of $H_0$. For instance, if the prior over the alternatives to the null follows an $N(\theta_0, \tilde{\sigma}^2)$ distribution, then the Bayes factor in favor of the null can be computed as

$$B_{01}(x) = \frac{p(H_0 \mid x)}{p(H_1 \mid x)} \cdot \frac{p(H_1)}{p(H_0)} = \frac{p(x \mid H_0)}{p(x \mid H_1)} = \sqrt{1 + \frac{N \tilde{\sigma}^2}{\sigma^2}}\; e^{-\frac{N z(x)^2}{2(N + \sigma^2/\tilde{\sigma}^2)}},$$

which converges, for increasing N, to infinity, as the second factor is bounded (Bernardo 1999, 102). This demonstrates that the precise value of $p(H_0)$ is immaterial for the outcome of the subjective Bayesian analysis.
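To see the convergence concretely, here is a minimal sketch that fixes the z-statistic at a "highly significant" value and lets N grow; the choices z = 2.576 (two-sided p ≈ .01) and $\tilde{\sigma} = \sigma$ are illustrative assumptions of mine, not values from the text.

```python
import math

def bayes_factor_01(N: int, z: float, var_ratio: float = 1.0) -> float:
    """B01 in the Gaussian model with known variance and an
    N(theta0, sigma_tilde^2) prior over the alternative;
    var_ratio = sigma^2 / sigma_tilde^2."""
    return math.sqrt(1 + N / var_ratio) * math.exp(-N * z**2 / (2 * (N + var_ratio)))

z = 2.576  # the same "p ≈ .01" result at every sample size
for N in (10**2, 10**4, 10**6, 10**8):
    print(f"N = {N:>9}: B01 ≈ {bayes_factor_01(N, z):,.1f}")
# B01 grows roughly like sqrt(N): an identically "significant" result
# supports the null ever more strongly as the sample size increases.
```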
Why is it that this result diverges so remarkably from the frequentist finding of significant evidence against the null? If the p-value, and consequently the value of $z(X) = c$, remains constant for increasing N, we can make use of the central limit theorem: z(X) converges, for all underlying distributions with bounded second moments, in distribution to $N(0, 1)$. Thus, as $N \to \infty$, we obtain that $c\sigma \approx \sqrt{N}(\bar{X} - \theta_0)$, and $\bar{X} \to \theta_0$. In other words, the sample mean gets ever closer to $\theta_0$, favoring the null over the alternatives. For the deviance between the variance-corrected sample mean z and $H_0$ will be relatively small compared to the deviance between z and all those hypotheses in $H_1$ that are remote from $\theta_0$. By contrast, significance tests do not consider the probability of the data under these alternatives.

In other words, as soon as we take our priors over $H_1$ seriously, as an expression of our uncertainty about which alternatives to $H_0$ are more likely than others, we will, in the long run, end up with results that favor $\theta_0$ over an unspecified alternative. Bayesians read this as the fatal blow for frequentist inference since an ever smaller deviance of the sample mean $\bar{x}$ from the parameter value $\theta_0$ will suffice for a highly significant result. Obviously, this makes no scientific sense. Small, uncontrollable biases will be present in any record of data, and frequentist hypothesis tests are unable to distinguish between statistical significance (p < .05) and scientific significance (a real effect is present). A Bayesian analysis, however, accounts for this insight: as $\bar{X} \to \theta_0$, an ever greater chunk of the alternative $H_1$ will be far away from $\bar{X}$, favoring the null hypothesis.

These phenomena exemplify more general and foundational criticisms of frequentist inference, in particular the objection that p-values grossly overstate evidence against the null (Cohen 1994; Royall 1997; Goodman 1999). For instance, even the minimum of $p(H_0 \mid x)$ under a large class of priors is typically much higher than the observed p-value (Berger and Sellke 1987).

Still, the subjective Bayesian stance on hypothesis tests is not entirely satisfactory either. Assigning a strictly positive degree of belief $p(H_0) > 0$ to a precise hypothesis $\theta = \theta_0$ is a misleading and inaccurate representation of our subjective uncertainty. In terms of degrees of belief, $\theta_0$ is not that different from any value $\theta_0 \pm \varepsilon$ in its neighborhood. Standardly, we would assign a continuous prior over the real line, and there is no reason why a set of (Lebesgue) measure zero, namely, $\{\theta = \theta_0\}$, should have a strictly positive probability. But if we set $p(H_0) = 0$, then for most priors (e.g., an improper uniform prior) the posterior probability distribution will not peak at the null value but somewhere else. Thus, the apparently innocuous assumption $p(H_0) > 0$ has a marked impact on the result of the Bayesian analysis.

A natural reply to this objection contends that $H_0$ is actually an idealization of the hypothesis $|\theta - \theta_0| < \varepsilon$, for some small ε, rather than a precise hypothesis $\theta = \theta_0$. Then, it would make sense to use strictly positive priors. Indeed, it has been shown that point null hypothesis tests approximate, in terms of Bayes factors, a test of whether a small interval around the null contains the true parameter value (theorem 1 in Berger and Delampady 1987). Seen that way, it does make sense to assign a strictly positive prior to $H_0$. Unfortunately, this will not help us in the situation of Lindley's Paradox: when $N \to \infty$, the convergence results break down, and testing a point null is no longer analogous to testing whether a narrow interval contains θ (Bernardo 1999, 102). In the asymptotic limit, the Bayesian cannot justify the strictly positive probability of $H_0$ as an approximation to testing the hypothesis that the parameter value is close to $\theta_0$—which is the hypothesis of real scientific interest.
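The breakdown can be illustrated numerically. In the sketch below (my own construction: the half-width ε = 0.01, the fixed z = 2.576, and the normal approximation to the posterior are all illustrative assumptions), the point-null Bayes factor and the Bayes factor for the interval hypothesis $|\theta - \theta_0| \leq \varepsilon$ under a single uniform prior are of the same order for small N but drift apart by many orders of magnitude as N grows.

```python
import math
from scipy import stats

theta0, eps, z = 0.5, 0.01, 2.576   # illustrative interval half-width, fixed z

def point_null_bf(N: int, k: int) -> float:
    """B01 for H0: theta = 1/2 against a uniform prior on the alternative."""
    log_binom = math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
    return math.exp(math.log(N + 1) + log_binom - N * math.log(2))

def interval_bf(N: int, k: int) -> float:
    """Bayes factor for |theta - theta0| <= eps under a single uniform prior:
    posterior odds of the interval divided by its prior odds, with a normal
    approximation to the posterior of theta."""
    post = stats.norm(k / N, math.sqrt(theta0 * (1 - theta0) / N))
    inside = post.cdf(theta0 + eps) - post.cdf(theta0 - eps)
    outside = max(post.sf(theta0 + eps) + post.cdf(theta0 - eps), 1e-300)
    prior_odds = 2 * eps / (1 - 2 * eps)
    return (inside / outside) / prior_odds

for N in (10**3, 10**4, 10**5, 10**6):
    k = round(N * theta0 + z * math.sqrt(N * theta0 * (1 - theta0)))
    print(f"N = {N:>7}: point B01 ≈ {point_null_bf(N, k):.3g}, "
          f"interval B01 ≈ {interval_bf(N, k):.3g}")
```

At N = 10³, where the posterior spread still exceeds ε, the two Bayes factors are roughly comparable; once the posterior concentrates well inside the interval, the interval hypothesis receives overwhelming support that the point-null Bayes factor no longer tracks.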
This may be the toughest challenge posed by Lindley's Paradox. In the debate with frequentists, Bayesians like to appeal to "foundations," but assigning a strictly positive probability to a precise hypothesis is hard to justify as a foundationally sound representation of subjective uncertainty.

Moreover, the Bayesian analysis fails to explain why hypothesis tests have such an appeal to scientific practitioners, even to those that are statistically well educated. Why should we bother testing a hypothesis if only posterior probabilities are relevant? Why even consider a precise hypothesis if it is known to be wrong? The next section will highlight these questions and briefly discuss the function of hypothesis tests in scientific inquiry.

3. Intermezzo: A Note on Precise Hypotheses. Since both Bayesians and frequentists struggle to deliver satisfactory responses to Lindley's Paradox, one may conjecture that the real problem is with testing a precise hypothesis as such. For instance, if we constructed a 95% confidence interval for θ in the ESP case, it would not include $\theta_0$. But it would be close enough to $\theta_0$ as to avoid the impression that the null was grossly mistaken.¹ Hence, Lindley's Paradox seems to vanish into thin air if we only adopt a different frequentist perspective.

1. A similar point can be made in the error-statistical framework (Mayo 1996): only a small discrepancy from the null hypothesis would be warranted with a high degree of severity. Mayo speaks about acceptances and rejections, too, but in fact she is interested in severely warranted discrepancies from the null, not in decisions to accept or to reject a point null hypothesis.

However, this proposal is not satisfactory either. Confidence intervals do not state which hypotheses are credible—they only list the hypotheses that are consistent with the data, in the sense that these hypotheses would not be rejected in a significance test. Therefore, confidence intervals are intimately connected to significance tests and share a lot of their foundational problems (cf. Seidenfeld 1981; Sprenger 2013). Second, confidence intervals do not involve a decision-theoretic component; they are interval estimators. In particular, they do not explain why tests of a precise null have any role in scientific methodology. Since any proper resolution of Lindley's Paradox should address this question, a confidence interval approach evades rather than solves the paradox.

On this note, one ought to realize that tests of a precise null usually serve two purposes: to find out whether an intervention has a significant effect and, since any intervention will have some minute effect, to decide whether the null hypothesis can be used as a proxy for the more general model. The point null is usually much easier to test and to handle than any composite hypothesis, so we have positive reasons to "accept" it, as long as the divergence from the data is not too large.

This view of scientific inference is hardly compatible with an orthodox Bayesian approach. For instance, the assumption $p(H_0) > 0$ neglects that hypothesis tests ask, in the first place, whether $H_0$ is a reasonable simplification of a more general model—and not whether we entertain a high degree of belief in a precise value of θ. Also, point null hypothesis tests are by definition asymmetric, but a subjective Bayesian analysis in terms of Bayes factors or posterior probabilities is essentially symmetric.

In total, subjective Bayesians have a hard time explaining why informative and precise but improbable hypotheses should sometimes be preferred over more general alternatives. The challenge for the Bayesian consists in modeling that we may be less interested in the truth of $H_0$ than in its usefulness. The next section presents an answer to this effect developed by José Miguel Bernardo (1999, 2012).

4. The BRC Approach to Hypothesis Testing. This section presents a full Bayesian decision model for point null hypothesis testing that addresses Lindley's Paradox: José Bernardo's BRC (1999, 2012). The point consists in shifting the focus from the truth of $H_0$ to its predictive value and in stipulating a specific utility structure. While classical Bayesian accounts of hypothesis testing involve simple exogenous utilities (e.g., a loss of zero for correct decisions and one for wrong decisions) and use the posterior probability as the only criterion for accepting or rejecting the null, Bernardo's approach is based on endogenous, prediction-based utilities. In the remainder, I sketch a simplified version of Bernardo's BRC in order to elaborate the main ideas of philosophical interest.

Since the work of R. A. Fisher, the replication of previously observed effects has been recognized as a main goal of experimental research in science and as a main motivation for significance tests (cf. Schmidt and Hunter 1997).
Therefore, a central component of Bernardo's decision model focuses on the expected predictive accuracy of the null for future data. Hence, we need a function that evaluates the predictive score of a hypothesis, given some data y. The canonical approach consists in the logarithmic score $\log p(y \mid \theta)$ (Good 1952): if an event considered to be likely occurs, then the score is high; if an unlikely event occurs, the score is low. This is a natural way of rewarding good and punishing bad predictions.

A generalization of this scoring rule describes the score of data y under parameter value θ as $q(\theta, y) = a \log p(y \mid \theta) + b(y)$, where a is a scaling term, and b(y) is a function that depends on the data only. Informally speaking, $q(\cdot, \cdot)$ is decomposed into a prediction term and a term that depends on the desirability of an outcome, where the latter will eventually turn out to be irrelevant. This is a useful generalization of the logarithmic score.

Consequently, if θ is the true parameter value, the utility of taking $H_0$ as a proxy for the more general model $H_1$ is

$$\int q(\theta_0, y)\, dP_{Y \mid \theta} = a \int \log p(y \mid \theta_0)\, p(y \mid \theta)\, dy + \int b(y)\, p(y \mid \theta)\, dy.$$

The overall utility U of a decision, however, should depend not only on the predictive score, as captured in q, but also on the cost $c_j$ of selecting a specific hypothesis $H_j$. As explained above, $H_0$ should be preferred to $H_1$ ceteris paribus because it is more informative, simpler, and less prone to the risk of overfitting (in case there are nuisance parameters). Therefore, it is fair to set $c_1 > c_0$. Writing $U(\cdot, \theta) = \int q(\cdot, y)\, dP_{Y \mid \theta} - c_j$, we obtain

$$U(H_0, \theta) = a \int \log p(y \mid \theta_0)\, p(y \mid \theta)\, dy + \int b(y)\, p(y \mid \theta)\, dy - c_0,$$
$$U(H_1, \theta) = a \int \log p(y \mid \theta)\, p(y \mid \theta)\, dy + \int b(y)\, p(y \mid \theta)\, dy - c_1.$$

Note that the utility of accepting $H_0$ is evaluated against the true parameter value θ and that the alternative is not represented by a probabilistic average (e.g., the posterior mean) but by its best unknown element, θ. Much better than subjective Bayesianism, this approach represents the essential asymmetry in testing a point null hypothesis. Consequently, the difference in expected utility, conditional on the posterior density of θ, can be written as

$$\int_{\theta \in \Theta} \big( U(H_1, \theta) - U(H_0, \theta) \big)\, p(\theta \mid x)\, d\theta$$
$$= a \int_{\theta \in \Theta} \int \log \frac{p(y \mid \theta)}{p(y \mid \theta_0)}\, p(y \mid \theta)\, p(\theta \mid x)\, dy\, d\theta + \int b(y)\, p(y \mid \theta)\, dy - \int b(y)\, p(y \mid \theta)\, dy + c_0 - c_1$$
$$= a \int_{\theta \in \Theta} \left( \int \log \frac{p(y \mid \theta)}{p(y \mid \theta_0)}\, p(y \mid \theta)\, dy \right) p(\theta \mid x)\, d\theta + c_0 - c_1.$$

This means that the expected utility difference between inferring to the null hypothesis and keeping the general model is essentially a function of the expected log-likelihood ratio between the null hypothesis and the true model, calibrated against a "utility constant" $d^*(a, c_0 - c_1)$. For the latter, Bernardo suggests a conventional choice that recovers the well-probed scientific practice of regarding 5 standard deviations as compelling evidence against the null.² The exact value of $d^*$ depends, of course, on the context—on how much divergence is required to balance the advantages of working with a simpler, more informative, and more accessible model (Bernardo 1999, 108).

2. This evidential standard was also used in the recent discovery of the Higgs particle. For Bayesian justifications of this practice, see Berger and Delampady (1987) and Berger and Sellke (1987).
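The inner integral in the last line is just the Kullback-Leibler divergence of $p(\cdot \mid \theta_0)$ from $p(\cdot \mid \theta)$. As a sanity check on the derivation (my own illustration, with arbitrary parameter values), the sketch below verifies numerically that for a Gaussian model this expected log-likelihood ratio equals the closed form $(\theta - \theta_0)^2 / 2\sigma^2$:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

theta0, theta, sigma = 0.0, 0.7, 1.0   # illustrative values

# Inner integral of the utility difference: the Kullback-Leibler divergence
# ∫ log[p(y|θ)/p(y|θ0)] p(y|θ) dy, computed by quadrature ...
def integrand(y):
    return (stats.norm.logpdf(y, theta, sigma)
            - stats.norm.logpdf(y, theta0, sigma)) * stats.norm.pdf(y, theta, sigma)

kl_numeric, _ = quad(integrand, -np.inf, np.inf)

# ... and its known closed form for two Gaussians with equal variance
kl_closed = (theta - theta0)**2 / (2 * sigma**2)

print(kl_numeric, kl_closed)   # both ≈ 0.245
```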
Wrapping up all this, we will reject the null if and only if $E_\theta[U(H_1, \theta)] > E_\theta[U(H_0, \theta)]$, which amounts to the Bayesian Reference Criterion: Data x are incompatible with the null hypothesis $H_0: \theta = \theta_0$, assuming that they have been generated from the probability model $(p(\cdot \mid \theta),\ \theta \in \Theta)$, if and only if

$$\int_{\theta \in \Theta} p(\theta \mid x) \left( \int \log \frac{p(y \mid \theta)}{p(y \mid \theta_0)}\, p(y \mid \theta)\, dy \right) d\theta > d^*(a, c_0 - c_1). \qquad (1)$$

This approach has a variety of remarkable features. First, it puts hypothesis testing on firm decision-theoretic grounds, with predictive value being the primary criterion. This foundational soundness distinguishes BRC vis-à-vis frequentist procedures.

Second, accepting the null, that is, using $\theta_0$ as a proxy for θ, amounts to claiming that the difference in expected predictive success of $\theta_0$ and the true parameter value θ will be offset by the fact that $H_0$ is more elegant, more informative, and easier to test. Hence, BRC does not only establish a trade-off between different epistemic virtues: it is also in notable agreement with Popper's view that "science does not aim, primarily, at high probabilities. It aims at high informative content, well backed by experience" (1934/1959, 399). In marked difference to the orthodox Bayesian approach, accepting $H_0$ no longer involves commitment to the truth or likelihood of $H_0$.

Third, the approach is better equipped than subjective Bayesianism to account for frequentist intuitions since under some conditions, the results of BRC agree with the results of a frequentist analysis, as we shall see below. Fourth, it is invariant of the particular parametrization; that is, the final inference does not depend on whether we work with θ or a 1:1 transformation g(θ). Fifth, it is neutral with respect to the kind of prior probabilities that are fed into the analysis.

5. Revisiting Lindley's Paradox. We now investigate how Bernardo's approach deals with Lindley's Paradox and return to the ESP example from the introduction. It turns out that the BRC quantifies the expected loss from using $\theta_0$ as a proxy for the true value θ as substantial. Using a Beta(1/2, 1/2) reference prior for θ (Bernardo 1979), the expected loss under the null hypothesis is calculated as $d(\theta = 1/2) \approx \log 1{,}400 \approx 7.24$. This establishes that "under the accepted conditions, the precise value $\theta_0 = 1/2$ is rather incompatible with the data" (Bernardo 2012, 18). In other words, the predictive loss from taking the null as a proxy for the posterior-corrected alternative will be substantial.

Of course, the rejection of the null hypothesis does not prove the ESP of our subject; a much more plausible explanation is a small bias in the random generator. This is actually substantiated by looking at the posterior distribution of θ: due to the huge sample size, we find that for any nonextreme prior probability function, we obtain the posterior $\theta \sim N(0.50018, 0.000049)$, which shows that most of the posterior mass is concentrated in a narrow interval that does not contain the null. In this sense, we are justified to reject the null without having to infer to a substantial discrepancy between θ and $\theta_0$.
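A numerical sketch of this expected-loss calculation, under stated assumptions: I take the loss to be the posterior expectation of N times the intrinsic discrepancy between Bernoulli(θ) and Bernoulli($\theta_0$), that is, the smaller of the two Kullback-Leibler divergences, averaged over the Beta(k + 1/2, N − k + 1/2) reference posterior. This is my reconstruction rather than Bernardo's own computation; the quadrature lands at about 7, of the same order as his reported $\log 1{,}400 \approx 7.24$ (the residual gap presumably reflects details of his exact formulation).

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

N, k = 104_490_000, 52_263_471
theta0 = 0.5

# Reference posterior for a Bernoulli parameter: Beta(k + 1/2, N - k + 1/2)
post = stats.beta(k + 0.5, N - k + 0.5)

def kl(p, q):
    """Kullback-Leibler divergence of Bernoulli(q) from Bernoulli(p), per trial."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def intrinsic(theta):
    """Intrinsic discrepancy between theta and theta0 for N observations."""
    return N * np.minimum(kl(theta, theta0), kl(theta0, theta))

# Expected intrinsic loss: average the discrepancy over the posterior of theta
grid = np.linspace(post.ppf(1e-10), post.ppf(1 - 1e-10), 40_001)
d = trapezoid(intrinsic(grid) * post.pdf(grid), grid)
print(f"expected loss d ≈ {d:.2f}")   # ≈ 7: far above any conventional threshold
```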
Although BRC has a sound basis in Bayesian decision theory, the results of a BRC analysis disagree with Jefferys's subjective Bayesian analysis. Why is this the case? First, the conventional utility structure is substantially changed in BRC, and the final decision is no longer a simple function of the posterior probability of $H_0$. Second, a Bayes factor comparison effectively compares the likelihood of the data under $H_0$ to the averaged likelihood of the data under $H_1$. However, this quantity is strongly influenced by whether there are some extreme hypotheses in $H_1$ that fit the data poorly. Compared to the huge amount of data that we have just collected, the impact of these hypotheses (mediated via the conventional uniform prior) should be minute. These arguments explain why most people would tend to judge the data as incompatible with the precise null but fail to see a scientifically interesting effect. Thus, BRC indeed gives a convincing account of Lindley's Paradox in the ESP example.

6. Conclusion and Outlook. We have demonstrated how Lindley's Paradox—the extreme divergence of Bayesian and frequentist inference in tests of a precise null hypothesis with large sample size—challenges the standard methods of both Bayesian and frequentist inference. Neither a classical frequentist nor a subjective Bayesian analysis provides a convincing account of the problem. Therefore, I have presented Bernardo's BRC as a full Bayesian model of testing point null hypotheses. It turns out that BRC gives a sensible Bayesian treatment of Lindley's Paradox, due to its focus on predictive performance and likely replication of the effect. Although BRC has sound foundations in subjective expected utility theory, it preserves testing a precise hypothesis as a distinct form of statistical inference and can be motivated from a broadly Popperian perspective.

Of course, the BRC approach is not immune to objections (see the discussion pieces in Bernardo [2012]). However, BRC definitely underlines that Bayesian inference in science need not necessarily infer to highly probable models—a misconception that is perpetuated in post-Carnapian primers on Bayesian inference and that has attracted understandable criticism. For instance, Earman (1992, 33) takes, in his exposition of Bayesian reasoning, the liberty of announcing that "issues in Bayesian decision theory will be ignored." Contrary to Earman, I claim that Bayesian reasoning cannot dispense with the decision-theoretic dimension if it aims at scientific relevance. A purely epistemic approach to theory choice, as exemplified in much of Bayesian confirmation theory, falls short of an appropriate model of scientific reasoning. Therefore, this article is not only a contribution to statistical methodology: it highlights the need to appreciate the subtle interplay of probabilities and (predictive) utilities in Bayesian inference and to change our perspective on the use of Bayesian reasoning in science.

REFERENCES

Berger, James O., and Mohan Delampady. 1987. "Testing Precise Hypotheses." Statistical Science 2:317–52.

Berger, James O., and Thomas Sellke. 1987. "Testing a Point Null Hypothesis: The Irreconcilability of p-Values and Evidence." Journal of the American Statistical Association 82:112–39.

Bernardo, José M. 1979. "Reference Posterior Distributions for Bayesian Inference." Journal of the Royal Statistical Society B 41:113–47.
———. 1999. "Nested Hypothesis Testing: The Bayesian Reference Criterion." In Bayesian Statistics, vol. 6, Proceedings of the Sixth Valencia Meeting, ed. J. M. Bernardo et al., 101–30. Oxford: Oxford University Press.

———. 2012. "Integrated Objective Bayesian Estimation and Hypothesis Testing." In Bayesian Statistics, vol. 9, Proceedings of the Ninth Valencia Meeting, ed. J. M. Bernardo et al., 1–68. Oxford: Oxford University Press.

Cohen, Jacob. 1994. "The Earth Is Round (p < .05)." American Psychologist 49:997–1001.

Earman, John. 1992. Bayes or Bust? Cambridge, MA: MIT Press.

Good, I. J. 1952. "Rational Decisions." Journal of the Royal Statistical Society B 14:107–14.

Goodman, S. N. 1999. "Towards Evidence-Based Medical Statistics." Pt. 1, "The P Value Fallacy." Annals of Internal Medicine 130:1005–13.

Jahn, R. G., B. J. Dunne, and R. D. Nelson. 1987. "Engineering Anomalies Research." Journal of Scientific Exploration 1:21–50.

Jefferys, William H. 1990. "Bayesian Analysis of Random Event Generator Data." Journal of Scientific Exploration 4:153–69.

Lindley, Dennis V. 1957. "A Statistical Paradox." Biometrika 44:187–92.

Mayo, Deborah G. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.

Popper, Karl R. 1934/1959. Logik der Forschung. Berlin: Akademie. English trans. The Logic of Scientific Discovery (New York: Basic, 1959).

———. 1963. Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Harper.

Royall, Richard. 1997. Scientific Evidence: A Likelihood Paradigm. London: Chapman & Hall.

Schmidt, Frank L., and John E. Hunter. 1997. "Eight Common but False Objections to the Discontinuation of Significance Testing in the Analysis of Research Data." In What If There Were No Significance Tests? ed. Lisa L. Harlow et al., 37–64. Mahwah, NJ: Erlbaum.

Seidenfeld, Teddy. 1981. "On After-Trial Properties of Best Neyman-Pearson Confidence Intervals." Philosophy of Science 48:281–91.

Sprenger, Jan. 2013. "Bayesianism vs. Frequentism in Statistical Inference." In Oxford Handbook of Probability and Philosophy, ed. A. Hájek and C. Hitchcock. Oxford: Oxford University Press, forthcoming.