key: cord-0931313-w53i6xzd
title: Data as evidence
authors: Cahusac, Peter
date: 2020-05-22
journal: Exp Physiol
DOI: 10.1113/ep088664
sha: c6a79088aa97469957943d3edb54d6da1c76ceaa
doc_id: 931313
cord_uid: w53i6xzd

In an earlier article, Gordon Drummond summarized ongoing changes in how statistics are being used in experimental physiology (Drummond, 2020). He described the near-ubiquitous use of the P value, cautioning against declaring an analysis "statistically significant". He mentioned alternative approaches, including Bayesian and likelihood approaches. This article focusses on the latter approach, although we will first take another look at the P value. The likelihood approach is then introduced with a very artificial example that allows us to grasp the concept easily. A more realistic example is then described, with associated calculations. A further example using real categorical data is explained, showing how it relates to, and is superior to, the oft-used χ² test. A final discussion reveals that the likelihood approach, although mathematically and statistically sound, is poorly supported by literature and training. This article is protected by copyright. All rights reserved.

Where to start?

We start with a simple, and very artificial, example to help understand the concept. It is well known that height is nearly normally distributed. The heights of a large cohort of men and women across 20 countries were measured (see this fascinating webpage: https://ourworldindata.org/human-height#height-is-normally-distributed, accessed 12 March 2020). From these data we have the means and standard deviations, which allow us to plot the normal distributions for men and women in Figure 1.

Figure 1. The distribution of heights for men and women from 20 countries (aggregate of Europe, North America, Australia and East Asia) in a cohort born between 1980 and 1994. The curves are likelihood functions calculated using the obtained means and standard deviations.
Likelihood functions are scaled so that their maximum occurs at 1 (see vertical axis). The mean for each distribution occurs at the peak likelihood value. The curve for men (dashed) is slightly wider since the distribution has a larger standard deviation. A randomly selected person has a height of 180 cm, indicated by the thin vertical line. The likelihoods, shown as horizontal dashed lines, represent the heights of the two curves where the thin vertical line hits them. The brackets are labelled with the likelihood values, and at the top right the likelihood ratio calculation is given. At the top left of the figure the log LR, the S, is shown.

If we had the raw data we could have plotted frequency histograms for men and women separately, where the vertical axis would be the number of men and women in each respective histogram bin. Adding smoothed curves over the histogram bars would give us something very similar to Figure 1. Instead, we could produce the plots knowing only the means and standard deviations for men and women, and assuming the distributions were normal. This would give us a bimodal plot, with a peak at the mean for each sex. The vertical axis would now represent the probability density for height as we move along the horizontal axis from short to tall people. If we scaled this plot so that each of the peaks were 1, then these distributions would be the likelihood functions for each sex. A short-cut to doing this is to use the following simple equation for each sex:

L = exp(−z²/2), where z = (x − mean)/SD    (Equation 1)

This would be applied to values for height starting at, say, 140 cm and incrementing by 0.1 cm until 210 cm. This could be done for women first (using their mean and standard deviation), producing the left-hand curve, and repeated for men. This is easy in Excel: try it! There are clearly two separate curves, although there is a fair amount of overlap between them. The maximum height of the curves, at the respective means, is 1.
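As a sketch of Equation 1, the scaled likelihood curve can be generated in a few lines of Python (the article itself works in Excel and R; the mean and standard deviation used below are illustrative placeholders, not the published cohort values):

```python
import math

def likelihood(x, mean, sd):
    """Normal likelihood scaled so that its maximum is 1 (Equation 1)."""
    z = (x - mean) / sd
    return math.exp(-z ** 2 / 2)

# Sweep heights from 140 cm to 210 cm in 0.1 cm steps, as described above.
heights = [140 + 0.1 * i for i in range(701)]
women_curve = [likelihood(h, mean=165.0, sd=7.0) for h in heights]  # placeholder mean/SD
```

Repeating the sweep with the men's mean and standard deviation gives the second curve of Figure 1; the peak of each curve sits at 1, directly over its mean.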
Now imagine that we pick a person at random from the population of the 20 countries used for the data. We measure the height of this person at 180 cm. From its position in Figure 1, the vertical line, we can see that this is near the mean for men and quite far from the women's mean. We can calculate the likelihoods, which are the same values as where the vertical line cuts the women's and men's curves. Callout values are given for these as 0.096 and 0.978 respectively. These are merely the heights of the curves at those points, remembering that their maximum heights are 1. We can calculate each of these manually from the z value and Equation 1 given above. The z value for women is

z = (mean_women − 180)/SD_women = −2.164

The likelihood for women is then

L_w = exp(−(−2.164)²/2) = 0.096

The same for men, using their mean and standard deviation, gives z = −0.211, and the likelihood

L_m = exp(−(−0.211)²/2) = 0.978

The likelihood ratio (LR) for these, say the likelihood of the person being a man rather than a woman, is

LR_mw = L_m/L_w = 0.978/0.096 = 10.17

This means that, given the data (180 cm), this person is over 10 times more likely to be a man than a woman. If we invert the LR then we get LR_wm = 0.098, meaning that the person is less than a tenth as likely to be a woman as a man. Taking the natural logarithm of 10.17 gives us the support S of 2.3. For the inverse, with perfect symmetry, S_wm = −2.3. Consulting Table 1 we see that this would be regarded as moderately strong evidence in favour of the person being a man rather than a woman. This calculation flows naturally from the visualisation of the problem. We have two hypotheses and we have calculated which of these is more likely given our data (Goodman & Royall, 1988). British courts use the same LR scale, but ramped up, so that the court considers S = 4 as only 'moderate evidence', and S = 8.6 as 'strong evidence'. Therefore, compared with scientists, courts require more than twice the evidence to influence their judgements. Figure 2 shows a graphical representation of the different scales for LR and S.
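These hand calculations can be checked with a short Python sketch; the z values are those quoted in the text, so no knowledge of the underlying means and standard deviations is needed:

```python
import math

# z values for a 180 cm person, as given in the worked example
z_woman = -2.164
z_man = -0.211

L_w = math.exp(-z_woman ** 2 / 2)  # likelihood under 'woman', about 0.096
L_m = math.exp(-z_man ** 2 / 2)    # likelihood under 'man', about 0.978
LR_mw = L_m / L_w                  # about 10.2: man vs woman
S_mw = math.log(LR_mw)             # about 2.3: moderately strong evidence
```

Inverting the ratio flips the sign of the support, giving the symmetric S of roughly minus 2.3 for woman vs man.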
Figure 2. Graphical illustration of the different scales for LR and S. The LR scale on the left shows that all the LRs in favour of H2 are 'compressed' between 0 and 1. Above 1 the LR favours H1, proceeding up to infinity. On the right is the linear scale provided by the logarithmic transform of the LR, S. This has as its midpoint 0, representing no evidence either way. Above this are values in favour of H1, proceeding to +∞. Below it are negative values in favour of H2, proceeding to −∞. If the null hypothesis H0 is used, then this would replace H2.

An analysis using P values is less obvious, but let's try. We could start by specifying the null hypothesis that the 180 cm person is a woman. Using the women's z value of −2.164 gives us a tail probability of .015. Convention is to use a two-tailed test, so we need to double our P value. We can declare the result significant with P = .03, rejecting our null hypothesis that the person was a woman (i.e. decide the person is a man). There are a couple of puzzling things. First, it is unclear why we need to consider a tail region containing data we have not collected, such as women taller than 180 cm. Second, what happens if we got a short person with a height of 150 cm? The same analysis would again lead us to reject the null hypothesis, P = .038, but for what? The way to deal with this is to use a one-tailed test, so that only values greater than the women's mean can be considered as evidence against the height being a woman's. Since the likelihood analysis includes both hypotheses (woman vs man), the result for the same 150 cm person would be clear and unambiguous: it would provide extremely strong evidence, S = 4.8, that the person was a woman rather than a man. One interesting solution using P values is to convert them to so-called 'surprisal' values (Greenland, 2019). This means taking the logarithm base 2 of the inverse of the one-tailed P value.
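The tail probabilities quoted here can be reproduced from the standard normal cumulative distribution, which is available in Python through `math.erf` (a sketch; the z value is the one given in the text):

```python
import math

def lower_tail(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p_one = lower_tail(-2.164)  # one-tailed P, about .015
p_two = 2 * p_one           # conventional two-tailed P, about .03
```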
For our 150 cm person this would provide positive evidence for being a woman of 5.7 bits of information, while for the 180 cm person this would be 6.0 bits of information in favour of a man. For completeness, we could do a Bayesian analysis for the 180 cm person. We need to assume that there are equal numbers of men and women in the population, giving us even prior odds. The posterior odds then equal the LR of 10.17, which we convert to a probability. This gives us .91, which means that there is a 91% posterior probability of the person being a man. This is correct and perfectly legitimate, given of course that we know the prevalence beforehand. From this admittedly artificial example we can move to something more realistic. In the above example we assumed equal numbers of men and women in the population, so we were justified in calculating a posterior probability using the Bayesian approach. But what happens if we do not know the prevalence? Generally we do not know the prior probability of hypotheses we are interested in. For this reason it is problematic to use a simple Bayesian approach when analysing data from experiments, although there are more sophisticated semi-Bayes and empirical-Bayes approaches which can be used (Greenland, 2006).

Let's get real

Consider a more realistic example. We are measuring blood levels of a protein in a group of volunteers. Our procedure is to get a baseline measurement and then another measurement after an intervention. The intervention is supposed to increase the blood level of the protein. A previously published study by another research group showed that the intervention increases the level of the protein by 3 units. This is our previous value (hypothesis H2). For the intervention to be regarded as effective, an expert clinician says it should cause an increase of at least 2 units. This is our minimal effect size (hypothesis H1).
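Both conversions, P value to surprisal bits and LR to posterior probability under even prior odds, are one-liners; a Python sketch using the numbers from the text:

```python
import math

def surprisal_bits(p):
    """Greenland's surprisal: bits of information carried by a one-tailed P value."""
    return math.log2(1 / p)

bits_man = surprisal_bits(0.015)  # about 6.0 bits for the 180 cm person

# With even prior odds, the posterior odds equal the LR
LR = 10.17
posterior_man = LR / (1 + LR)     # about .91
```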
We carry out our study in 11 women, where the intervention causes a mean increase of 2.2 units with a standard deviation of 1.75 units. We use these data to produce a likelihood function for the mean difference from baseline, as shown in Figure 3. We have already plotted likelihood functions for the heights of men and women, using Equation 1, and noted that this can easily be done in Excel. To see how to get Figure 3, go to the How to create a likelihood function box.

How to create a likelihood function

In Excel, label the first row of three columns with Mean, Likelihood and t. In the first column enter −1 and −0.95, representing the mean increase in protein, as shown here: In the second and third columns type in the formulae exactly as shown above, pressing Enter after each one. Column C calculates t values, using Equation 2, referencing the Column A value in the same row (cell A2); the respective values for the mean and standard error are included. Column B calculates the likelihood, using Equation 3, referencing the t value calculated in Column C (C2); the degrees of freedom and N are within the formula. Next select the top two values in Column A and use the AutoFill feature to complete the sequence of values (each incrementing by 0.05) until you reach 5.0. Return to the top of the sheet, select the two cells containing formulae (B2 and C2), and use AutoFill to replicate the formulae to the end of the sheet (Row 122). Columns B and C will now contain the calculated likelihood and t values, and the first few, the middle few, and last few rows should look like this: Select the first two columns and click on Scatter plot from the INSERT tab. This will give you the basic likelihood plot, shown below left, that is used in Figure 3.
The same can be produced in the R statistics package using a single line of code, to give the plot below right:

curve((1 + ((2.2-x)/0.5276448530)^2/10)^-(11/2), xlim = c(-1,5), ylab = "Likelihood")

The shape of the likelihood function is identical to the sampling distribution of the mean, here centred on the observed mean of 2.2, except that its vertical axis maximum is, for convenience, 1. If we were doing a null hypothesis test then this distribution would be situated over the null value, here 0, and we would be calculating the tail area from 2.2 upwards. In the likelihood approach we situate the distribution over the observed mean of 2.2 units. The height of the curve at the different hypothesis values is indicated by the thin lines. The minimal effect size is shown at 2 units, with .924 shown on the left side. This is the likelihood for the minimal effect size. We also see a line for the published value of 3 units, with the likelihood of .320 shown to the right. There is a third likelihood, shown at the bottom left of the graph. This likelihood of .0039 is for the null of 0 units. The line hitting the curve is so low that it is difficult to see.

Figure 3. The likelihood function for the observed data. The function is centred on the mean value of 2.2, which is the MLE, indicated by the dashed vertical line. The fine lines hitting the curve represent the different hypothesis means. These are labelled with callouts for the null, the minimal and the previous values. The lines for the null value are so low that they are not visible. Each has a horizontal line from the curve to the nearest vertical axis, where the likelihood value is given. These represent their height as a proportion of 1 (the maximum). The thicker rectangular lines represent the likelihood interval for S = −2, which reaches e^−2 = .1353 on the vertical axis.
The values illustrated in Figure 3 are presented more fully in Table 2. The t value is calculated using the usual equation

t = (µ − x̄)/SE    (Equation 2)

The hypothetical mean is represented by µ, from which the observed mean x̄ is subtracted. The observant reader will notice a minor difference here from the usual formula, which subtracts the hypothetical mean from the observed mean. This is because in the likelihood approach the observed data are fixed and the hypotheses vary. The denominator in the equation is the usual standard error. The likelihoods in the 5th column of Table 2 are calculated from t with this general equation

L = (1 + t²/df)^(−N/2)    (Equation 3)

where df is the degrees of freedom for t, and N is the total sample size. For related, one-sample and paired designs, N = n. If there are two independent groups, N = n1 + n2. The 6th column of Table 2 gives the reciprocal of the likelihood. For example, this tells us that the mean value of 2.2 is 254.8 times more likely than the null value of 0. This is also known as the maximum likelihood ratio, and it resembles the P value in so far as it simply represents the evidence against the null hypothesis. Here it translates to a support of 5.5 in the last column, indicating more than extremely strong evidence (see Table 1). The frequentist analysis gets P = .002. The other S_1/L values tell us the strength of evidence for the mean versus the other hypotheses. Considering the minimal value (H1), there is close to 0 evidence (a support value of 0.1), which might be expected as the value is so close to the mean of the sample, 2.2. Finally, considering the previous finding of 3 (H2), the evidence is weak (1.1). These support values for reciprocal likelihoods will always be positive since the likelihoods are always ≤ 1. We can now compare these likelihoods by forming ratios of them. We can do this for any hypotheses we are interested in. Table 3 shows the calculations for the different hypotheses that we wish to compare.
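Equations 2 and 3 together generate the whole of Table 2; a Python sketch reproducing the likelihoods and supports from the study data:

```python
import math

n, obs_mean, sd = 11, 2.2, 1.75
se = sd / math.sqrt(n)            # standard error, about 0.5276
df, N = n - 1, n                  # one-sample design: N = n

def L(mu):
    """Likelihood of a hypothesised mean mu (Equations 2 and 3)."""
    t = (mu - obs_mean) / se      # hypotheses vary; the data are fixed
    return (1 + t ** 2 / df) ** (-N / 2)

L_null, L_min, L_prev = L(0), L(2), L(3)   # about .0039, .924 and .320
S_null = math.log(1 / L_null)              # about 5.5: MLE vs the null
```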
The top row of Table 3 gives the LR and S for the minimal effect size versus the null (which we can denote LR_10 and S_10 respectively), and the second row those for the previous value of 3 versus the null (LR_20 and S_20). Since these support values both comfortably exceed 4, we can see that, given the data, there is extremely strong evidence for both the minimal effect size and the previous value versus the null. The bottom row gives the previous value versus the minimal effect size (LR_21 and S_21). The negative support value represents evidence in favour of the minimal effect size versus the previous value, but the evidence is weak, as the LR is close to unity (see Table 1). This approach allows us simply to compare any two heights on the likelihood function using a ratio. This gives us an immediate sense of how likely one hypothesis is compared with another. We see from our data, for example, that the minimal effect value is 236 times more likely than the null value. The natural logarithm of the LR gives us the support S, which represents the strength of evidence; consult Table 1 for its interpretation. It is always useful to construct a likelihood interval (also known as a support interval), for the same reasons that confidence intervals are useful in frequentist statistics (Cumming & Calin-Jageman, 2017). A likelihood interval is also useful when we are not sure which particular hypotheses to compare. An S−2 likelihood interval gives us the range of values around the MLE that would produce evidence weaker than S = −2 (i.e. a likelihood above e^−2 = .1353) when compared against our observed mean of 2.2. The S−2 likelihood interval is shown in Figure 3, between 1.10 and 3.30, and corresponds closely, numerically, to the 95% confidence interval from 1.02 to 3.38. Using the evidential approach we are free to add more data to existing data. Let's say we added another 22 participants. Assume that the observed mean and standard deviation remain the same at 2.2 and 1.75 units.
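The likelihood ratios of Table 3 and the S−2 interval can be found numerically from the same likelihood function; a Python sketch in which a simple grid search stands in for an exact root-finder:

```python
import math

n, obs_mean = 11, 2.2
se = 1.75 / math.sqrt(n)

def L(mu):
    """Likelihood of a hypothesised mean mu for the protein study."""
    t = (mu - obs_mean) / se
    return (1 + t ** 2 / (n - 1)) ** (-n / 2)

LR_10 = L(2) / L(0)           # minimal effect vs null, about 236
S_10 = math.log(LR_10)        # about 5.5

# S-2 likelihood interval: all mu whose likelihood exceeds e^-2 = .1353
cutoff = math.exp(-2)
grid = [i / 1000 for i in range(0, 5001)]        # 0.000 to 5.000
inside = [mu for mu in grid if L(mu) > cutoff]
low, high = inside[0], inside[-1]                # about 1.10 and 3.30
```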
The likelihood function, like the sampling distribution for the mean, narrows as the sample size increases. This strengthens the existing extremely strong evidence for the minimal effect and the previous value versus the null, so that the support values are now in double figures; see Table 4. We also notice that the evidence in support of the previous value versus the minimal effect size is weakened. This suggests an exaggerated earlier finding, as is often the case (Alahdab et al., 2018). In fact, there is a linear increase in the strength of evidence as the sample size increases (3 times the data gives us approximately 3 times the S value). We can see this in Figure 4, which is plotted for the last of the hypothesis comparisons (2 v 1), starting at N = 5 and incrementing by 5 until 100.

Table 4. Likelihood ratios and support values when N = 33; for example, for the comparison 2 v 1, LR = 0.0498 and S = −3.00.

Figure 4. Sample size increases the evidence in a linear fashion. This plot represents the S values calculated when comparing H2 vs H1. The negative values for S are interpreted as evidence against H2 relative to H1, i.e. a weakening of evidence. Alternatively, the evidence for H1 relative to H2 is increasing. Using z rather than t would have produced a line that went through zero: no data = no evidence.

As indicated earlier, we would not be justified in adding further data to our initial sample if we used conventional frequentist statistics. This is tempting to do when the initial sample fails to produce P < .05 and the sample mean looks promising. If further data are added to existing data then eventually the investigator will always obtain P < .05, even if the null hypothesis is true. Indeed, continued long enough, any minuscule P value (P < .01 or .001, etc.) will be obtained 100% of the time. Such a process would produce an inferential error, specifically a Type I error.
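The linear growth of S with sample size is easy to verify; this Python sketch recomputes the 2 v 1 comparison at N = 11 and N = 33 (assuming, as in the text, that the mean and standard deviation stay the same):

```python
import math

def support_21(n, obs_mean=2.2, sd=1.75, h1=2.0, h2=3.0):
    """S for the previous value (h2) vs the minimal effect (h1) at sample size n."""
    se = sd / math.sqrt(n)
    def L(mu):
        t = (mu - obs_mean) / se
        return (1 + t ** 2 / (n - 1)) ** (-n / 2)
    return math.log(L(h2) / L(h1))

s_11 = support_21(11)   # about -1.06 with the original 11 participants
s_33 = support_21(33)   # about -3.00 with triple the data (Table 4)
```

Tripling the data roughly triples the support, in line with Figure 4.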
In frequentist statistics a second error can be made, a Type II error, which occurs if insufficient data are collected and the null hypothesis is false. This contrasts with the evidential approach, where no such errors are made. This is difficult to believe, and needs a little explanation. The evidential approach provides evidence, and this evidence can either be misleading or it can be weak (Royall, 1997, 2000). Misleading evidence is evidence that exceeds some specified strength (e.g. S > 2 or S < −2) but supports the wrong hypothesis. This can happen, but the probability of it happening is related to the S value obtained, being at most

e^−|S|

That is, the probability is exponentially related to the absolute value of S. Let us say that we obtained S = 3; then e^−3 = .05, meaning that the probability of misleading evidence would be less than or equal to 5%. The same probability of misleading evidence would be obtained for S = −3. Remember now that, as more data are collected, the S value changes linearly in proportion to the amount of data. Hence S would continue to increase beyond 2 or decrease below −2. The only exception is the unlikely situation where the observed mean lies exactly between the two hypothesis values used for the LR. If there were any doubt here, then a likelihood interval would settle the matter. Weak evidence means evidence that is not strong enough, i.e. −2 < S < 2. What do we do? Well, we just need to collect more data! If you think this evidential approach can easily be abused, consider the following. I have a preferred hypothesis H1. I am interested in obtaining support for it against H0. I start collecting data. After each additional data point I calculate S_10. Because of my bias, I will ignore any evidence in favour of H0, however strong it might be. I will only stop when I get sufficient evidence in favour of my preferred hypothesis H1.
If H0 is actually true, then the probability is greater than 1 − e^−S that my research would never end. Say that we required S_10 = 3; then the probability of carrying on forever would exceed 1 − e^−3 = .95. The stronger the evidence I require for my false pet theory, the less chance I have of ever getting it. This all seems better than using P values. Even if I follow the rules and do not add to my initial sample, the probability of making a Type I error is constant, regardless of the sample size I choose. With the evidential approach, the probability of misleading evidence decreases with the chosen sample size. Likelihood is a complete approach to statistics. It can provide evidence S for regression, correlation, ANOVA and nonparametric analyses, and any others you can think of. Categorical data are easily analysed. Consider the following COVID-19 data from South Korea (https://twitter.com/DrEricDing/status/1239226811185344517, accessed 29 March 2020):

              Dead    Alive    Total
Sex   Men       41     3095     3136
      Women     34     4992     5026
Total           75     8087     8162

The grand total of 8162 represents those who had tested positive for COVID-19. Fewer men than women tested positive (3136 vs 5026). Despite this imbalance, fewer women died. The usual frequentist test of association would give χ²(1) = 8.44, P = .004. The evidential approach calculates the support using the general equation with natural logarithms

S = Σ O ln(O/E)

where O are the observed and E the expected cell counts, giving S = 4.084 here. Referring to Table 1 we see that this represents extremely strong evidence for an interaction between sex and health status versus no association. We can obtain an exact support function which informs us of what degree of departure from the observed values could be expected given the data.

Figure 5. The support function for differences in the dead male cell counts. The support intervals for −2 and −4 are shown by dashed lines. The changes assume fixed marginal totals.
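The support for the association, and the χ² statistic for comparison, can both be computed directly from the 2 × 2 table; a Python sketch:

```python
import math

# Observed counts: [dead, alive] for men and women
obs = [[41, 3095],
       [34, 4992]]

row_totals = [sum(r) for r in obs]             # 3136, 5026
col_totals = [sum(c) for c in zip(*obs)]       # 75, 8087
grand = sum(row_totals)                        # 8162

S = 0.0     # support: sum of O * ln(O / E) over the four cells
chi2 = 0.0  # Pearson chi-squared, for comparison
for i in range(2):
    for j in range(2):
        e = row_totals[i] * col_totals[j] / grand  # expected under no association
        o = obs[i][j]
        S += o * math.log(o / e)
        chi2 += (o - e) ** 2 / e
# S is about 4.08 (extremely strong evidence); chi2 is about 8.44
```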
The term 'cell' here refers to one of the four values located within the body of the COVID-19 table shown above. Figure 5 shows the support function (log LR) for departures of cell counts, assuming that the marginal totals are fixed. For example, if there were 9 fewer male deaths, giving a cell count of 32, then this would result in moderate evidence for a difference from the observed data (the point falls below the dashed line for support −2). The expected number of male deaths, given the marginal totals (for the null model), is 28.8165. This is 12.1835 away from the observed 41, falling just below the dashed support interval line for −4. More precise calculation with the function gives us a support of −4.084. This is made positive, S = 4.084, since inversion of the likelihood, as we saw in Table 2, changes the sign of the support. This value agrees with our earlier calculation for the interaction. The same support function can be calculated for each of the 4 cells. They look similar because the marginal totals are fixed. We can produce other support functions for these data, such as relative risk and gamma. A popular statistic is the odds ratio. For these data we get an odds ratio of 1.945. The support curve is shown in Figure 6.

Figure 6. The support function for the odds ratio, which is maximal at the observed odds ratio of 1.945. The support intervals for −2 and −4 are shown by dashed lines. The function assumes fixed marginal totals.

Consistent with the analysis of the interaction for these data, the observed odds ratio compared with the null value of 1 gives the same S = 4.084. The other components that can be analysed give main effects of S = 220.8238 for sex and a massive (as expected) S = 5231.0812 for health status. These total 5455.9885, which is exactly the value obtained by an analysis where 8162/4 = 2040.5 is the expectation in each of the 4 cells.
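The observed odds ratio quoted above follows from the same table; a minimal Python check:

```python
# Odds of death for men vs women in the South Korean COVID-19 table
odds_men = 41 / 3095
odds_women = 34 / 4992
odds_ratio = odds_men / odds_women   # about 1.945, the maximum of Figure 6
```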
This total is constrained only by the grand total and therefore has 3 df. These analyses can be extended to tables of any size and dimension, as explored by log-linear analysis. The evidential approach requires a different way of thinking. We need to consider the weight of evidence for one hypothesis versus another, given our data. It does not involve probabilities, or the probability tail areas that only provide evidence against a single hypothesis. It does not involve a single threshold value such as .05, but a graded measure of evidence for two competing hypotheses, taking values from −∞ to +∞. Using likelihoods for statistical inference is not without its detractors. Perhaps the most vocal is Deborah Mayo, who has posted several blog pieces, this one appearing to show that the likelihood approach exaggerates evidence: https://errorstatistics.com/2014/11/25/how-likelihoodists-exaggerate-evidence-from-statistical-tests/ (accessed 9 May 2020). However, this is also a charge previously levelled against P values (Goodman, 1993; Goodman & Royall, 1988). For many, the evidential approach is so very different that it is hard to think clearly about it. It's as if we've been brainwashed into the statistical testing P value approach. This is even though the evidential approach is conceptually easier to grasp, and the calculations more straightforward to perform. Moreover, in our calculations we do not need to worry about look-up tables for critical values or exact integrals for P values. To mix metaphors: likelihood is the real deal; it does exactly what it says on the tin, nothing more, nothing less. You're thinking, "This evidential approach is grand, sign me up. When can I start?" Well, that's the problem. There is virtually no support for the approach in any of the standard software packages, including R, Minitab, SPSS, SAS, GraphPad Prism and Stata.
This means that calculations, though simple, have to be done by hand, in an Excel spreadsheet, or using a series of programming commands in a package like R. Moreover, the evidential approach hardly gets a mention in most statistics courses. Finally, there are relatively few books that explain how to use it. The best books are those by Edwards (1992) and Royall (1997), although neither is very accessible to the average researcher. Then there are Clayton & Hills (2013) for epidemiology and Aitkin & Taroni (2004) for forensic evidence, neither of which is an appropriate text for those doing laboratory experiments. An attempt has been made to provide a practical step-by-step guide for researchers (Cahusac, In Press).

Common statistical terminology

H0: The null hypothesis; typically represents no effect of a treatment on our measurements.

H1: The alternative or experimental hypothesis; represents some/any effect produced by a treatment on our measurements.

Standard deviation: Roughly speaking, the average variability of individual data points. It tells us how much the data vary. It is the square root of the variance.

Standard error: Like the standard deviation, but for a statistic such as a mean. It tells us how much a statistic, such as a mean, varies. It is the standard deviation of the sampling distribution (see below).

Sampling distribution: The distribution of a statistic, such as a mean, when repeated samples of a specified size are taken from a large population. This is best demonstrated by computer simulation. For a sampling distribution of the mean (which we are most often interested in), the sampling distribution becomes more normally distributed the larger the sample size. The standard deviation of the sampling distribution gives us the standard error.

z value: Obtained from the standard normal distribution, this statistic represents the number of standard deviations a value is from a specified mean.
If this concerns the sampling distribution, then it represents the number of standard errors from the specified mean. It requires the population standard deviation to be known.

t value: The commonest statistical test produces t values, which represent the number of standard errors a mean is from a null or specified value. We use t instead of z when the population standard deviation is not known and has to be estimated from the sample.

F value: This is obtained in an analysis of variance (ANOVA) and represents the variance ratio from two estimates of the population variance: the between-groups to the within-groups variance. If there are only two samples, then the square root of the F value will give the t value (and identical P values). More generally, F is used to assess the fit of data to a model, such as in regression.

χ²: This is the sum of independent squared standard normal (z) values, whose degrees of freedom depend upon how many terms there are. It pops up everywhere, and is related to the other statistics here (z, t and F), but is most commonly associated with categorical analyses.

R²: In a bivariate correlation, it is the correlation coefficient r squared. More generally it represents the proportion of variability (variance) explained by a model.

MLE: The maximum likelihood estimate is the value which, for a given model, makes the observed data most probable. It is therefore the value with the highest likelihood given the data. As the sample size increases to infinity it has desirable properties such as statistical consistency (it converges to the true value) and statistical efficiency (no other estimator is more efficient).

P value: What we obtain from a statistical test; it tells us the probability of obtaining our data, or more extreme data, assuming the null hypothesis is true. Often misunderstood.
References

Statistics and the Evaluation of Evidence for Forensic Scientists
Treatment Effect in Earlier Trials of Patients With Chronic Medical Conditions: A Meta-Epidemiologic Study
Statistics Notes: Absence of Evidence Is Not Evidence of Absence
Statistical Methods in Medical Research
Evidence-Based Statistics: An Introduction to the Evidential Approach from Likelihood Principle to Statistical Practice
Statistical Models in Epidemiology
Introduction to the New Statistics
The p-value fallacy and how to avoid it
A world beyond P: policies, strategies, tactics and advice
Are Mendel's Results Really Too Close?
Likelihood in Statistics
Statistical Methods and Scientific Inference
p Values, Hypothesis Tests, and Likelihood: Implications for Epidemiology of a Neglected Historical Debate
Evidence and Scientific Research
Bayesian perspectives for epidemiological research: I. Foundations and basic methods
Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values
On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I
Statistical Evidence: a Likelihood Paradigm
On the Probability of Observing Misleading Statistical Evidence