Philosophy of science and the replicability crisis

Felipe Romero
University of Groningen

Correspondence
Felipe Romero, Department of Theoretical Philosophy, Faculty of Philosophy, University of Groningen, Groningen, Netherlands. Email: c.f.romero@rug.nl

Received: 14 April 2019 | Revised: 26 July 2019 | Accepted: 12 August 2019
DOI: 10.1111/phc3.12633
Philosophy Compass. 2019;14:e12633. wileyonlinelibrary.com/journal/phc3
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes. © 2019 The Author. Philosophy Compass published by John Wiley & Sons Ltd.

Abstract
Replicability is widely taken to ground the epistemic authority of science. However, in recent years, important published findings in the social, behavioral, and biomedical sciences have failed to replicate, suggesting that these fields are facing a “replicability crisis.” For philosophers, the crisis should not be taken as bad news but as an opportunity to do work on several fronts, including conceptual analysis, history and philosophy of science, research ethics, and social epistemology. This article introduces philosophers to these discussions. First, I discuss precedents and evidence for the crisis. Second, I discuss methodological, statistical, and social-structural factors that have contributed to the crisis. Third, I focus on the philosophical issues raised by the crisis. Finally, I discuss several proposals for solutions and highlight the gaps that philosophers could focus on.

1 | INTRODUCTION

Replicability is widely taken to ground the epistemic authority of science: We trust scientific findings because experiments repeated under the same conditions produce the same results. Or so one would expect. However, in recent years, important published findings in the social, behavioral, and biomedical sciences have failed to replicate (i.e., when independent researchers repeat the original experiment, they do not obtain the original result). The failure rates are alarming, and the growing consensus in the scientific community is that these fields are facing a “replicability crisis.”

Why should we care? The replicability crisis undermines scientific credibility. This, of course, primarily affects scientists.
They should clean up their acts and revise entire research programs to reinforce their shaky foundations. However, more generally, the crisis affects all consumers of science. We can justifiably worry that scientific testimony might lead us astray if many findings that we trust unexpectedly fail to replicate later. And when we want to defend the epistemic value of science (e.g., against the increasing charges of partisanship in public and political discussions), it certainly does not help that the reliability of several scientific fields is in doubt. Additionally, for members of the public, the high replication failure rates are disappointing, as they suggest that scientists are wasting taxpayer funds.

For philosophers, the replicability crisis also raises pressing issues. First, we need to address deceptively simple questions, such as “What is a replication?” Second, the crisis also raises questions about the nature of scientific error and scientific progress. Although philosophers of science often stress the fallibility of science, they also expect science to be self-corrective. Nonetheless, the replicability crisis suggests that some portions of science may not be self-correcting, or, at least, not in the way in which philosophical theories would predict. In either case, we need to update our philosophical theories about error correction and scientific progress. Finally, the crisis also urges philosophers to engage in discussions to reform science. These discussions are happening in scientific venues, but philosophers' theoretical work (e.g., foundations of statistics) can contribute to them.

The purpose of this article is to introduce philosophers to the discussions about the replicability crisis. First, I introduce the replicability crisis, presenting important milestones and evidence that suggests that many fields are indeed in a crisis. Second, I discuss methodological, statistical, and social-structural factors that have contributed to the crisis. Third, I focus on the philosophical issues raised by the crisis. And finally, I discuss solution proposals, emphasizing the gaps that philosophers could focus on, especially in the social epistemology of science.

2 | WHAT IS THE REPLICABILITY CRISIS? HISTORY AND EVIDENCE

Philosophers (Popper, 1959/2002), methodologists (Fisher, 1926), and scientists (Heisenberg, 1975) take replicability to be the mark of scientific findings. As an often-cited remark by Popper (1959/2002) puts it, “non-replicable single occurrences are of no significance to science” (p. 64). Recent discussions focus primarily on the notion of direct replication, which refers roughly to “repetition of an experimental procedure” (Schmidt, 2009, p. 91).
Using this notion, we can state the following principle: Given an experiment E that produces some result F, F is a scientific finding only if in principle a direct replication of E produces F. That is, if we repeated the experiment, we should obtain the same result. Strictly speaking, it is impossible to repeat an experimental procedure exactly. Hence, direct replication is more usefully understood as an experiment whose design is identical to an original experiment's design in all factors that are supposedly causally responsible for the effect. Consider the following example from Gneezy, Keenan, and Gneezy (2014). The experiment E compares the likelihood of choosing to donate to a charity when the donor is informed (a) that the administrative costs to run the charity have already been covered or (b) that her contribution will cover such costs. F is the finding that donors are more likely to donate to a charity in the first situation. Imagine we want to replicate this finding directly (as Camerer et al., 2018, did). Changing the donation amount might make a difference; hence, the replication would not be direct, but whether we conduct the replication in a room with gray or white walls should be irrelevant.

A second notion that researchers often use is conceptual replication: “Repetition of a test of a hypothesis or a result of earlier research work with different methods” (Schmidt, 2009, p. 91). Conceptual replications are epistemically useful because they modify aspects of the original experimental design to test its generalizability to other contexts. For instance, a conceptual replication of Gneezy et al.'s (2014) experiment could further specify the goals of the charities in the vignettes, as these could influence the results as well. Additionally, methodologists distinguish replicability from a third notion: reproducibility (Patil, Peng, & Leek, 2016; Peng, 2011). This notion means obtaining the same numerical results when repeating the analysis using the original data and the same computer code. Some studies do not pass even this minimal standard.

Needless to say, these notions are controversial. Researchers disagree about how best to define them and about the epistemic import of the practices that they denote (see Section 4 for further discussion). For now, these notions are useful to introduce four precedents of the replicability crisis:

• Social priming controversy. In the early 2010s, researchers reported direct replication failures of Bargh, Chen, and Burrows's (1996) famous elderly-walking study in two (arguably better conducted) attempts (Doyen, Klein, Pichon, & Cleeremans, 2012; Pashler, Harris, & Coburn, 2011). Before the failures, Bargh's finding had been positively cited for years, it had been taught to psychology students, and it had inspired a large industry of “social priming” papers (e.g., many conceptual replications of Bargh's work). Several of these findings have also failed to replicate directly (Harris, Coburn, Rohrer, & Pashler, 2013; Klein et al., 2014; Pashler, Coburn, & Harris, 2012; Shanks et al., 2013).

• Bem's (2011) extrasensory perception studies. Bem claimed to show in nine experiments that people have extrasensory powers to perceive the future. His paper was published in a prestigious psychology journal (Bem, 2011). Although the finding persuaded very few scientists, the controversy engendered mistrust in the ways psychologists conduct their experiments because Bem used procedures and statistical tools that many social psychologists use.
(See Romero, 2017, for a discussion.)

• Amgen and Bayer Healthcare reports. Two often-cited papers reported that scientists from the biotech companies Amgen (Begley & Ellis, 2012) and Bayer Healthcare (Prinz, Schlange, & Asadullah, 2011) were only able to replicate a small fraction (11–20%) of landmark findings in preclinical research (e.g., oncology), which suggested that replicability is a pervasive problem in biomedical research.

• Studies on p-hacking and questionable research practices (QRPs). Several studies (Ioannidis, 2008; Simmons, Nelson, & Simonsohn, 2011; John, Loewenstein, & Prelec, 2012; Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014) showed how some practices that exploit the flexibility in data collection could lead to the production of false positives (see Section 3 for an explanation). These studies suggested that the published record across several fields could be polluted with nonreplicable research.

Although the precedents above suggested that there was something flawed in social and biomedical research, the more telling evidence for the crisis comes from multisite projects that assess replicability systematically. In psychology, the Many Labs projects (Ebersole et al., 2016; Klein et al., 2014; Open Science Collaboration, 2012) have studied a variety of findings and whether they replicate across multiple laboratories. Moreover, the Reproducibility Project (Open Science Collaboration, 2015) studied a random sample of published studies to estimate the replicability of psychology more generally. Similar projects have assessed the replicability of cancer research (Nosek & Errington, 2017), experimental economics (Camerer et al., 2016), and studies from the prominent journals Nature and Science (Camerer et al., 2018). These studies give us an unsettling perspective. The Reproducibility Project, in particular, suggests that only about a third of findings in psychology replicate.

Now, it is worth noting that the concern about replicability in the social sciences is not new. What authors call the replicability crisis started around 2010, but researchers had been voicing concerns about replicability long before. As early as the late 1960s and early 1970s, authors worried about the lack of direct replications (Ahlgren, 1969; Smith, 1970). In the late 1970s, the journal Replications in Social Psychology was launched (Campbell & Jackson, 1979) to address the problem that replication research was hard to publish, but it went out of print after just three issues. Later, in the 1990s, studies reported that editors and reviewers were biased against publishing replications (Neuliep & Crandall, 1990; Neuliep & Crandall, 1993). This history is instructive and triggers questions from the perspective of the history and philosophy of science. If researchers have neglected replication work systematically, is it any surprise that many published findings do not replicate? Also, why hasn't the concern about replicability led to sustainable changes?

3 | CAUSES OF THE REPLICABILITY CRISIS

Most likely, the replicability crisis is the result of the interaction of multiple methodological, statistical, and sociological factors (although authors often disagree about how much each factor contributes). Here I review the most discussed ones.
Arguably, one of the strongest contributing factors to the replicability crisis is publication bias, that is, using the outcome of a study (in particular, whether it succeeds in supporting its hypothesis, and especially if the hypothesis is surprising) as the primary criterion for publication. For users of Null Hypothesis Significance Testing (NHST), as are most fields affected by the crisis, publication bias results from making statistical significance a necessary condition for publication. This leads to what Rosenthal (1979) labeled the file-drawer problem in the late 1970s. By chance alone, a test of a false hypothesis is expected to come out statistically significant 5% of the time (following the standard convention in NHST). If journals only publish statistically significant results, then they contain the 5% of studies that show erroneous successes (false positives), whereas the other 95% of studies (true negatives) remain in the researchers' file drawers. This produces a misleading literature and biases meta-analytic estimates. Publication bias is even more worrisome when we consider that only a fraction of all the hypotheses that scientists test are true. In such a case, it is possible that most published findings are false (Ioannidis, 2005). Recently, methodologists have developed techniques to identify publication bias (Simonsohn, Nelson, & Simmons, 2014; van Aert, Wicherts, & van Assen, 2016).

Publication bias fuels a second contributing factor to the replicability crisis, namely, QRPs. Since statistical significance determines publication, scientists have incentives to deviate from good practice (sometimes even unconsciously) to achieve it. For instance, scientists anonymously admit that they engage in a host of QRPs (John et al., 2012), such as reporting only studies that worked. A particularly pernicious practice is p-hacking, that is, exploiting the flexibility of data collection to obtain statistical significance. This includes, for example, collecting more data or excluding data until you get your desired results. In an important computer simulation study, Simmons et al. (2011) show that a combination of p-hacking techniques can increase the false positive rate to 61%. QRPs and p-hacking are troublesome because (a) unlike clear instances of fraud, they are widespread, and (b) motivated reasoning can lead researchers to justify them (e.g., “I think that person did not quite understand the instructions of the experiment, so I should exclude her data.”).

Also related to publication bias, the proliferation of conceptual replications is a third factor that contributes to the replicability crisis. As discussed by Pashler and Harris (2012), the problem with conceptual replications lies in their interaction with publication bias. Suppose a scientist conducts a series of experiments to test a false theory T. Suppose he fails in all but one of his attempts, the only one that gets published. Then, a second scientist gets interested in the publication. She tries to test T in modified conditions in a series of conceptual replications, without replicating the original conditions. Again, she succeeds in only one of her attempts, which is the only one published. In this process, none of the replication failures gets published, given the file-drawer problem. But still, after some time, the literature will contain a diverse set of studies that suggest that T is a robust theory. In short, the proliferation of conceptual replications might misleadingly support theories.
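Before turning to further factors, the interaction of flexible data collection and selective publication described above can be made concrete with a small simulation. The sketch below is my own minimal illustration (not Simmons et al.'s original code; the function name and parameter values are hypothetical choices) of one p-hacking strategy, optional stopping: both groups are drawn from the same population, so the null hypothesis is true, yet repeatedly testing and adding participants until p < .05 pushes the false positive rate well above the nominal 5%. Publication bias then does the rest: the spurious successes get published, and the abandoned attempts stay in the file drawer.

```python
# Minimal sketch of p-hacking via optional stopping (not Simmons et al.'s code).
# Both groups come from the same population, so every "success" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def significant_with_peeking(n_start=20, n_max=100, step=10, alpha=0.05):
    """Add `step` observations per group and retest until p < alpha or n_max is reached."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            return True          # "finding" gets published
        if len(a) >= n_max:
            return False         # study lands in the file drawer
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))

runs = 5000
false_positives = sum(significant_with_peeking() for _ in range(runs))
print(f"False positive rate with optional stopping: {false_positives / runs:.2%}")
# Typically well above the nominal 5%, even though no effect exists.
```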
A fourth contributing factor to the replicability crisis is NHST itself. The argument can take two forms. On the one hand, scientists' statistical literacy regarding NHST is low. Already before the replicability crisis, authors argued that practicing scientists misinterpret p values (Cohen, 1990), consistently misunderstand the inferential logic of the method (Fidler, 2006), and confuse statistical significance with scientific import (Ziliak & McCloskey, 2008). Moreover, the American Statistical Association has recently and explicitly listed the misunderstanding of NHST as a cause of the crisis (Wasserstein & Lazar, 2016). On the other hand, there are concerns about the limitations of NHST. Importantly, in NHST, statistically nonsignificant results are typically inconclusive, so researchers cannot accept a null hypothesis (but see Machery, 2012, and Lakens, Scheel, & Isager, 2018). And if we cannot accept a null hypothesis, then it is harder to evaluate and publish failed replication attempts.

A fifth and arguably more fundamental factor that contributes to the replicability crisis is the reward system of science. A central component of the reward system of science is the priority rule (Merton, 1957), that is, the practice of rewarding only the first scientist who makes a discovery. This reward system discourages replication (Romero, 2017). The argument concerns the interaction between the priority rule and the peer-review system. In present-day science, scientists establish priority over a finding via peer-reviewed publication. However, because peer review is insufficient to determine whether a finding replicates, many findings are rewarded with publication regardless of their replicability. The reward system also contributes to the production of nonreplicable research by exerting high career pressures on researchers. They need to fill their CVs with exciting, positive results to sustain and advance their careers. This perverse incentive explains why many of them fall prey to QRPs, confirmation biases (Nuzzo, 2015), and post hoc hypothesizing (Kerr, 1998; Bones, 2012), leading to nonreplicable research.

4 | PHILOSOPHICAL ISSUES RAISED BY THE REPLICABILITY CRISIS

Psychologists acknowledge the need for philosophical work in the context of the replicability crisis: They are publishing a large number of papers with conceptual work inspired by the crisis. Some authors indeed voice the need for philosophy explicitly (Spellman, 2015, p. 894). Philosophers, with a few notable exceptions, have only recently joined these discussions. In this section, I review some of the more salient philosophical issues raised by the crisis and point out open research avenues.

The first set of philosophical issues triggered by the crisis concerns the very definition of replication. What is a replication? Methodologists and practicing scientists often use the notions of direct (i.e., “repetition of an experimental procedure,” Schmidt, 2009, p. 91) and conceptual (“repetition of a test of a hypothesis or a result of earlier research work with different methods,” Schmidt, 2009, p. 91) replication. Philosophers have made similar distinctions, albeit using different terminology (Cartwright, 1991; Radder, 1996). However, both notions are vague and require further specification.
Although the notion of direct replication is intuitive, strictly speaking, no experiment can repeat the original study because there are always unavoidable changes in the setting, even if they are small (e.g., changes in time, weather, location, and participants). One amendment, as suggested above, is to reserve the term direct replication for experiments whose design is identical to an original experiment's design in all factors that are supposedly causally responsible for the effect. The notion of conceptual replication is even vaguer. This notion denotes the practice of modifying an original experimental design to evaluate a finding's generalizability across laboratories, measurements, and contexts. Although this practice is fairly common, as researchers change an experiment's design, the resulting designs can be very different. These differences can lead researchers to disagree about what hypothesis the experiments are actually testing. Hence, labeling these experiments as replications can be controversial.

Authors have attempted to refine the definitions of replication to overcome the problems of the direct/conceptual dichotomy. One approach is to view the difference between the original experiment and the replication as a matter of degree. The challenge is then to specify the possible and acceptable ways in which replications can differ. For instance, Brandt et al. (2014) suggest the notion of “close” replication. For them, the goal should be to make replications as close as possible to the original while acknowledging the inevitable differences. Similarly, LeBel, Berger, Campbell, and Loving (2017) identify a replication continuum of five types of replications that are classified according to their relative methodological similarity to the original study. And Machery (2019a) argues that the direct/conceptual distinction is confused and defines replications as experiments that can resample several experimental components.

Having the right definition of replication is not only theoretically important but also practically pressing. Declaring that a finding fails to replicate depends on whether the replication attempt counts as a replication or not. In fact, the reaction of some scientists whose work fails to replicate is to emphasize that the replication attempts introduce substantive variations that explain the failures and to list a number of conceptual replications that support the underlying hypothesis (for examples of this response, see Carney, Cuddy, & Yap, 2015, and Schnall, 2014). The implicature in these responses is that the failed direct replication attempts are not genuine replications and the successful conceptual replications are.

The definitional questions trigger closely related epistemological questions. What is the epistemic function of replication? How essential are replications to further the epistemic goals of science? An immediate answer is that replications (i.e., direct or close replications) evaluate the reliability of findings (Machery, 2019a). So understood, conducting replications serves a crucial epistemic goal. But some authors disagree. For instance, Stroebe and Strack (2014) argue that direct replications are uninformative because they cannot be exact and suggest focusing on conceptual replications instead. Similarly, Leonelli (2018) argues that in some cases, the validation of results does not require direct/close replications and that nonreplicable research often has epistemic value.
And Feest (2019) argues that replication is only a very small part of what is necessary to improve psychological science and, hence, that the concerns about replicability are overblown. These remarks urge researchers to reconsider their focus on replication efforts.

Another pressing set of philosophical questions triggered by the replicability crisis concerns the topic of scientific self-correction. For an important tradition in philosophy, science has an epistemically privileged position not because it gives us truth right away but because in the long run it corrects its errors (Peirce, 1901/1958; Reichenbach, 1938). Authors call this idea the self-corrective thesis (SCT) (Laudan, 1981; Mayo, 2005).

(SCT) In the long run, the scientific method will refute false theories and find closer approximations to true theories.

In the context of modern science, we can refine SCT to capture the most straightforward mechanism of scientific self-correction, which involves replication, statistical inference, and meta-analysis.

(SCT*) Given a series of replications of an experiment, the meta-analytical aggregation of their effect sizes will converge on the true effect size (with a narrow confidence interval) as the length of the series of replications increases.

SCT* is theoretically plausible, but its truth depends on the social-structural conditions that implement it. First, most findings are never subjected to even one replication attempt (Makel, Plucker, & Hegarty, 2012). It is true that scientists have recently identified particular findings that do not replicate, but this is a tiny step in the direction of self-correction. If we trust the estimates of low replicability, these failures could be the tip of the iceberg, and the false positives under the surface may never be corrected. Second, the social-structural conditions in the fields affected by the crisis (which involve publication bias, confirmation bias, and limited resources) make the thesis false (Romero, 2016). Now, the falsity of SCT* does not entail that SCT is false, but it requires us to specify what other mechanisms could make SCT true.

We can see the concern about SCT as an instance of a broader tension between the theory and practice of science. The replicability crisis reveals a gap between our image of science, which includes the ideal of self-correction via replication, and the reality (Longino, 2015). We can view this gap in several ways. One possibility is that the replicability crisis proves that the ideal is normatively inadequate (i.e., “cannot” implies “not ought”). Hence, we have to change the ideal to close the gap, and this project requires philosophical work. Another possibility is that the ideal is adequate, and the gap is an implementation failure that results from bad scientists not doing their job. On this view, the gap is less philosophically significant and more a problem for science policymakers. In favor of the first possibility, however, it is worth stressing that many scientists succumb to practices that lead to nonreplicable research. That is, the gap is not due to a few bad apples but to systemic problems. This assessment invites social epistemological work.
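Returning to SCT*, the mechanism it describes can be made concrete with a small simulation. The sketch below is my own illustration under simplified assumptions (it is not Romero's 2016 model; the effect size, sample sizes, and the simple equal-weight pooling are hypothetical choices). With complete reporting, the pooled estimate from a growing series of replications approaches the true effect; if only statistically significant studies enter the literature, it stabilizes around an inflated value, which illustrates why publication bias threatens the self-corrective mechanism.

```python
# Sketch: pooling a growing series of replications, with and without publication bias.
# Simplified assumptions; not Romero's (2016) model. Equal-n studies are averaged,
# which coincides with a fixed-effect meta-analytic estimate in this setting.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
TRUE_EFFECT, N = 0.2, 30   # true standardized effect and per-group sample size

def one_study():
    """Return the estimated effect (mean difference) and its p value."""
    treat = rng.normal(TRUE_EFFECT, 1.0, N)
    control = rng.normal(0.0, 1.0, N)
    t, p = stats.ttest_ind(treat, control)
    return treat.mean() - control.mean(), p

def pooled_estimate(n_studies, publish_all=True):
    """Average the effect estimates that make it into the literature."""
    published = []
    for _ in range(n_studies):
        effect, p = one_study()
        if publish_all or p < 0.05:
            published.append(effect)
    return np.mean(published) if published else np.nan

for k in (10, 100, 1000):
    print(f"{k:5d} replications | full reporting: {pooled_estimate(k, True):.3f} | "
          f"significant only: {pooled_estimate(k, False):.3f}")
# With full reporting the pooled estimate approaches 0.2; with publication bias
# it stays substantially higher, no matter how many replications are run.
```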
The replicability crisis also raises questions about confirmation, specifically regarding the variety of evidence thesis (VET). This thesis states that, ceteris paribus, varied evidence (e.g., distinct experiments pointing to the same hypothesis from multiple angles) has higher confirmatory power than less varied evidence. (This idea is also discussed in philosophy under the labels of robustness analysis and triangulation.) VET has intuitive appeal and has been favorably appraised by philosophers (Wimsatt, 1981; see Landes, 2018, for a discussion). Take, for instance, the case for climate change, which we take to be robust as it incorporates evidence from a variety of different disciplines. Nonetheless, VET is not uncontroversial (Stegenga, 2009). In the context of the crisis, the virtues of VET need to be qualified, given the concern that conceptual replications have contributed to the problem (see Section 3). Since the 1990s, in line with VET, a model paper in psychology contains a series of distinct experiments testing the same hypothesis with conceptual replications. Although such a paper allegedly gives a robust understanding of the phenomenon, the conceptual replications in many cases have been conducted under the wrong conditions (e.g., confirmation bias, publication bias, and low statistical power) and are therefore not trustworthy (Schimmack, 2012). In these cases, having more direct replications (i.e., less varied evidence) could even be more epistemically desirable. Thus, the replicability crisis requires us to evaluate VET from a practical perspective and determine when conceptual replications confirm or mislead.

Another concern that the replicability crisis raises for philosophers has to do with epistemic trust. Science requires epistemic trust to be efficient (Wilholt, 2013). But how much trust is warranted? Scientists cannot check all the findings they rely on. If they did, science would be at best inefficient. However, in light of the replicability crisis, scientists cannot be content trusting the findings of their colleagues only because they are published. Epistemic trust can also lead consumers of nonreplicable research from other disciplines astray. For example, empirically informed philosophers, and specifically moral psychologists, have relied heavily on findings from social psychology. They also need to clean up their act. (See Machery & Doris, 2017, for suggestions on how to do this.)

Although the issues above are primarily epistemological, the replicability crisis also raises ethical questions that philosophers have yet to study. A first issue concerns the research integrity standards needed to facilitate replicability. A second issue concerns the ethics of replication itself. Since the first replication failures of social psychological effects in the early 2010s, the psychological community has witnessed a series of unfortunate exchanges. Original researchers have questioned the competence of replicators and even accused them of ill intent and bullying (Bohannon, 2014; Meyer & Chabris, 2014; Yong, 2012). What should we make of these battles? Although the scientific community has the right to criticize any published finding, replication failures can dramatically affect original researchers' careers (e.g., hiring and promotion). Replicators can make mistakes, too. In recent years, there has been a growing movement of scientists focused on checking the work of their colleagues. Although the crisis epistemically justifies their motivation, it is also fair to ask, who checks the checkers?

5 | WHAT TO DO?

The big remaining question is normative: What should we do? Because the crisis is likely the result of multiple contributing factors, there is a big market of proposals. I classify them into three camps: statistical reforms, methodological reforms, and social reforms.
I use this classification primarily to facilitate discussion. Indeed, few authors are strict reformists of just one camp; most agree that science needs more than one kind of reform. Nonetheless, authors also tend to emphasize the benefits of particular interventions (in particular, the statistical reformists). I discuss some of the most salient proposals from each camp.

5.1 | Statistical reforms

Statistical reformists are of two kinds. The first kind advocates replacing frequentist statistics (in particular, NHST). One alternative is to get rid of NHST completely and use descriptive statistics instead (Trafimow & Marks, 2015). A more prominent approach is Bayesian inference (Bernardo & Smith, 1994; Lee & Wagenmakers, 2013; Rouder, Speckman, Sun, Morey, & Iverson, 2009). The argument for Bayesian inference is foundational. The Bayesian researcher needs to be explicit about several assumptions in her or his tests—assumptions that remain under the hood of NHST inference (Romeijn, 2014; Sprenger, 2016). Additionally, Bayesian inference with Bayes factors (the most popular measure of evidence for Bayesian inference in psychology) gives the researcher a straightforward procedure to infer a null hypothesis. This is a great advantage when dealing with replication failures. In practice, however, authors disagree about how to specify the assumptions (in particular, the priors) needed to conduct a Bayes factor analysis.

The second kind of statistical reformist does not want to eliminate frequentist statistics but to change the way we use them. There are philosophical motivations for this sort of reform. Long-run error control is a valuable goal of statistical inference, and it is not clearly met outside frequentism. Hence, rather than replacing frequentist statistics, one may argue that we need to improve the way we use them (Mayo, 2018). Frequentist statistics has long offered tools in addition to p values that practitioners could incorporate. For instance, equivalence tests allow researchers to test for the absence of effects (Lakens, Scheel, & Isager, 2018). Another possibility is to move away from the dichotomous inferential approach of NHST and focus on estimating effect sizes and confidence intervals (Cumming, 2012; Cumming, 2014; Fidler, 2007).

There are also practical motivations to preserve frequentist statistics, and in particular NHST. For instance, Benjamin et al. (2018), in a 72-author paper, advocate changing the p value threshold from the conventional p < .05 to the stricter p < .005. Although the authors acknowledge the problems of NHST, they argue that such a change would solve many of the problems that lead to low replicability (e.g., by making it harder for p-hacking and QRPs to produce spurious positive results) and would be easy to implement. In response, another 88 authors argue for a more critical approach in which authors should be required to specify and justify the significance level that their project needs (Lakens et al., 2018).

Although philosophers are less invested in developing new statistical tools, they can contribute to these discussions in at least three ways: (a) evaluating the arguments and trade-offs involved in implementing statistical reforms (Machery, 2019b); (b) making foundational debates about statistical inference relevant and accessible to practitioners; and (c) studying how inference methods behave in different contexts, for example, by using computer simulations (Romero, 2016; Bruner & Holman, 2019; Romero & Sprenger, 2019).
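To make one of the frequentist tools mentioned above concrete, the sketch below runs a two one-sided tests (TOST) equivalence procedure for two independent groups, implemented directly from the Welch t-test formulas. It is my own minimal illustration rather than Lakens et al.'s code, and the equivalence bounds and simulated data are hypothetical. If both one-sided tests reject, the observed difference is statistically smaller than the smallest effect of interest, which gives a failed replication an interpretable positive conclusion instead of a merely inconclusive nonsignificant p value.

```python
# Minimal TOST equivalence test for two independent groups (Welch formulas).
# Illustrative sketch only; bounds and data are made up for the example.
import numpy as np
from scipy import stats

def tost_welch(x, y, low, high):
    """Two one-sided tests: is the mean difference inside (low, high)?"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    # Welch-Satterthwaite degrees of freedom
    df = se**4 / ((x.var(ddof=1) / len(x))**2 / (len(x) - 1)
                  + (y.var(ddof=1) / len(y))**2 / (len(y) - 1))
    p_lower = stats.t.sf((diff - low) / se, df)    # H0: true difference <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H0: true difference >= high
    return diff, max(p_lower, p_upper)             # overall equivalence p value

rng = np.random.default_rng(seed=3)
original_effect = 0.5                      # effect size claimed by the original study
replication_a = rng.normal(0.0, 1.0, 120)  # replication data with no true effect
replication_b = rng.normal(0.0, 1.0, 120)

diff, p_eq = tost_welch(replication_a, replication_b, -original_effect, original_effect)
print(f"Observed difference: {diff:.3f}, equivalence p value: {p_eq:.4f}")
# p_eq < .05 supports the claim that any remaining effect is smaller than the original one.
```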
5.2 | Methodological reforms

The methodological reformist proposes to improve scientific practices more generally by going beyond mere statistics. One type of reform is explicitly antistatistical. For instance, McShane, Gal, Gelman, Robert, and Tackett (2019) argue that publication decisions should treat statistical outcomes (e.g., p values, confidence intervals, and Bayes factors) as just one piece of information among other factors such as “related prior evidence, plausibility of mechanism, study design and data quality, real-world costs and benefits, and novelty of finding” (p. 235). In practice, however, it would be hard to implement alternatives such as this because editors and reviewers are used to relying on statistical thresholds as heuristics to make publication decisions.

The second type of methodological reform recommends making the scientific process more transparent. A popular movement with this aim is open science. The rationale of open science practices is to increase transparency by asking researchers to share a variety of products of their work, ranging from experimental designs to software and raw data. Open science is epistemically desirable (but see Levin & Leonelli, 2016). Specifically, in the context of the crisis, open science practices have the potential to increase replicability, as they greatly facilitate replication work by independent researchers.

The open science movement has also enthusiastically defended preregistration, that is, uploading a time-stamped, uneditable research plan to a public archive. A preregistration states the hypotheses to be tested, target sample sizes, and so on (see the sketch at the end of this section). Preregistration greatly constrains the researcher degrees of freedom that make QRPs and p-hacking work. When authors submit their work to a journal, reviewers and editors can verify whether the authors did what they planned.

Although preregistration increases transparency, we should not overstate its usefulness. First, preregistration does not fully counter publication bias, as it does not guarantee that findings will be reported (Chen et al., 2016). Second, preregistration cannot be straightforwardly implemented in some research domains (Tackett et al., 2017). Two refinements of preregistration are the Registered Reports (Chambers, 2013) and Registered Replication Reports (Simons, Holcombe, & Spellman, 2014) publication models. In these models, scientists submit a research proposal to a journal before data collection, and the proposal is evaluated on the basis of its methodological merits. The journal can give the proposal an in-principle acceptance, which means that the paper will be published regardless of its outcome (see Romero, 2018, for a discussion).

Various authors have proposed changing publication practices to address the problem that replications are not rewarded. As discussed in Section 2, having dedicated outlets for replication work has not worked in the past. The reason is likely that publishing replication work in secondary venues gives the impression that such work is not very important, and hence researchers would still relegate it. Instead, a more promising approach is opening the doors of prestigious journals to replication work (Cooper, 2016; Simons et al., 2014; Vazire, 2015).
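As promised above, here is a sketch of what a minimal preregistration pins down in advance. The plan and its field names are hypothetical (real preregistration templates, for example on public archives, are considerably more detailed), and the sample size is justified with a standard normal-approximation power formula for a two-sided, two-sample t test rather than any particular archive's required method.

```python
# Hypothetical, minimal preregistration sketch; real templates are more detailed.
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided, two-sample t test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

preregistration = {
    "hypothesis": "Donors give more when overhead costs are already covered.",
    "design": "Two-group between-subjects comparison",
    "primary_outcome": "Donation rate",
    "smallest_effect_of_interest": 0.35,      # standardized effect (Cohen's d)
    "target_n_per_group": n_per_group(0.35),  # about 129 with alpha=.05, power=.80
    "exclusion_criteria": "Participants who fail the comprehension check",
    "analysis_plan": "Two-sided t test at alpha = .05; no optional stopping",
}

for key, value in preregistration.items():
    print(f"{key}: {value}")
```

Fixing these choices before data collection is precisely what removes the degrees of freedom that optional stopping and post hoc exclusions exploit.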
5.3 | Social reforms

The social reformist argues that changes in statistics and methodology are insufficient to address the replicability crisis because they treat the symptoms and not the disease, namely, the defective social structures of contemporary science. For the social reformist, it is too optimistic to expect scientists to follow good practices (in particular, to do replication work) if the right incentives are not in place. This is because science today is a professionalized activity. As such, scientists are constrained not only by the ethos of science but also by more mundane and arguably more forceful pressures, such as the requirement to produce many novel findings to sustain a career and continue playing the game. Social reforms attempt to align career incentives with statistical and methodological expectations.

In particular, to incentivize replication work, multiple parties should intervene. Funding agencies can allocate funding specifically to replication projects (see Netherlands Organisation for Scientific Research, 2016, for an example). Universities and departments can create positions in which replication work is part of the responsibility of the researcher, and they can adapt promotion criteria according to quality metrics rather than raw publication numbers (Schönbrodt, Heene, Maier, & Zehetleitner, 2015). Such interventions would create conditions in which researchers do not perceive replication as second-class work.

Further questions that the social reformist asks concern the adequate design of epistemic institutions: What is the best way to divide cognitive labor to ensure that science produces not only novel findings but also replicable results? What are the different trade-offs in terms of speed and reliability if we incorporate replication work as an essential part of the research process? Should all scientists in a community engage in replication work or only a selected group? Some authors answer these questions by proposing different institutional arrangements (see Romero, 2018, for a discussion), and this is an area ripe for social epistemological investigation.

6 | CONCLUSION

In this paper, I have reviewed core issues in the discussions around the replicability crisis, including its history, causes, philosophical assessments, and proposed solutions. Many normative discussions about replicability focus on technical problems about statistical inference and experimental design. Philosophers with an interest in the foundations of statistical inference and confirmation theory can play a more active role in them. But the replicability crisis is not exclusively (and not primarily) a statistical problem. As I have reviewed, we still need to clarify concepts about replication, understand how different practices contribute to low replicability, and study how to intervene in the social structure of science. In these respects, the crisis demands work from the perspectives of the history and philosophy of science, social epistemology, and research ethics. That is, for philosophers, the crisis should not be taken as bad news but as an opportunity to update our theories and make them relevant to practice.

ACKNOWLEDGMENTS

I am grateful to Mike Dacey, Teresa Ai, the editor, and one anonymous reviewer for their useful comments on previous drafts.

ORCID

Felipe Romero https://orcid.org/0000-0002-0858-7243

WORKS CITED

Ahlgren, A. (1969). A modest proposal for encouraging replication. American Psychologist, 24, 471.
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71, 230–244. https://doi.org/10.1037/0022-3514.71.2.230
Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483, 531–533. https://doi.org/10.1038/483531a
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. https://doi.org/10.1037/a0021524
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Bohannon, J. (2014). Replication effort provokes praise—and ‘bullying’ charges. Science, 344, 788–789. https://doi.org/10.1126/science.344.6186.788
Bones, A. K. (2012). We knew the future all along: Scientific hypothesizing is much more accurate than other forms of precognition—A satire in one part. Perspectives on Psychological Science, 7, 307–309. https://doi.org/10.1177/1745691612441216
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., … van't Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217–224. https://doi.org/10.1016/j.jesp.2013.10.005
Bruner, J. P., & Holman, B. (2019). Self-correction in science: Meta-analysis, bias and social structure. Studies in History and Philosophy of Science Part A. https://doi.org/10.1016/j.shpsa.2019.02.001
Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351, 1433–1436. https://doi.org/10.1126/science.aaf0918
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z
Campbell, K. E., & Jackson, T. T. (1979). The role and need for replication research in social psychology. Replications in Social Psychology, 1(1), 3–14.
Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2015). Review and summary of research on the embodied effects of expansive (vs. contractive) nonverbal displays. Psychological Science, 26(5), 657–663. https://doi.org/10.1177/0956797614566855
Cartwright, N. (1991). Replicability, reproducibility and robustness: Comments on Harry Collins. History of Political Economy, 23(1), 143–155. https://doi.org/10.1215/00182702-23-1-143
Chambers, C. D. (2013). Registered Reports: A new publishing initiative at Cortex. Cortex, 49, 609–610. https://doi.org/10.1016/j.cortex.2012.12.016
Chen, R., Desai, N. R., Ross, J. S., Zhang, W., Chau, K. H., Wayda, B., … Krumholz, H. M. (2016). Publication and reporting of clinical trial results: Cross sectional analysis across academic medical centers. BMJ, 352, i637. https://doi.org/10.1136/bmj.i637
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304
Cooper, M. L. (2016). Editorial. Journal of Personality and Social Psychology, 110, 431–434. https://doi.org/10.1037/pspp0000033
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Multivariate Applications Book Series. New York, NY: Routledge.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7–29. https://doi.org/10.1177/0956797613504966
Doyen, S., Klein, O., Pichon, C.-L., & Cleeremans, A. (2012). Behavioral priming: It's all in the mind, but whose mind? PLoS ONE, 7(1), e29081. https://doi.org/10.1371/journal.pone.0029081
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82.
Feest, U. (2019). Why replication is overrated. Philosophy of Science, XXX. https://doi.org/10.1086/705451
Fidler, F. (2006). Should psychology abandon p values and teach CIs instead? Evidence-based reforms in statistics education. In Proceedings of the 7th International Conference on Teaching Statistics.
Fidler, F. (2007). From statistical significance to effect estimation: Statistical reform in psychology, medicine and ecology. New York: Routledge. https://doi.org/10.1080/13545700701881096
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture, 33, 503–513.
Gneezy, U., Keenan, E. A., & Gneezy, A. (2014). Avoiding overhead aversion in charity. Science, 346(6209), 632–635.
Harris, C. R., Coburn, N., Rohrer, D., & Pashler, H. (2013). Two failures to replicate high-performance-goal priming effects. PLoS ONE, 8, e72467. https://doi.org/10.1371/journal.pone.0072467
Heisenberg, W. (1975). The great tradition: End of an epoch? Encounter, 44(3), 52–58.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated. Epidemiology, 19, 640–648. https://doi.org/10.1097/EDE.0b013e31818131e7
Ioannidis, J. P. A., Munafò, M. R., Fusar-Poli, P., Nosek, B. A., & David, S. P. (2014). Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends in Cognitive Sciences, 18, 235–241. https://doi.org/10.1016/j.tics.2014.02.010
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. https://doi.org/10.1177/0956797611430953
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217. https://doi.org/10.1207/s15327957pspr0203_4
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahnik, Š., Bernstein, M. J., … Nosek, B. A. (2014). Investigating variation in replicability: A “Many Labs” replication project. Social Psychology, 45, 142–152. https://doi.org/10.1027/1864-9335/a000178
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
Landes, J. (2018). Variety of evidence. Erkenntnis, XXX, 1–41. https://doi.org/10.1007/s10670-018-0024-6
Laudan, L. (1981). Science and hypothesis: Historical essays on scientific methodology. Netherlands: Springer. https://doi.org/10.1007/978-94-015-7288-0
LeBel, E. P., Berger, D., Campbell, L., & Loving, T. J. (2017). Falsifiability is not optional. Journal of Personality and Social Psychology, 113, 254–261. https://doi.org/10.1037/pspi0000106
Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical course. Cambridge: Cambridge University Press.
Leonelli, S. (2018). Re-thinking reproducibility as a criterion for research quality. [Preprint]. http://philsci-archive.pitt.edu/14352/1/Reproducibility_2018_SL.pdf
Levin, N., & Leonelli, S. (2016). How does one “open” science? Questions of value in biological research. Science, Technology & Human Values, 42(2), 280–305.
Longino, H. (2015). The social dimensions of scientific knowledge. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Spring 2015 ed.). Stanford, CA: Stanford University.
Machery, E. (2012). Power and negative results. Philosophy of Science, 79(5), 808–820. https://doi.org/10.1086/667877
Machery, E. (2019a). What is a replication? [Preprint]
Machery, E. (2019b). The alpha war. Review of Philosophy and Psychology, XXX, 1–25.
Machery, E., & Doris, J. M. (2017). An open letter to our students: Doing interdisciplinary moral psychology. In B. G. Voyer & T. Tarantola (Eds.), Moral psychology: A multidisciplinary guide (pp. 119–143). USA: Springer. https://doi.org/10.1007/978-3-319-61849-4_7
Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How often do they really occur? Perspectives on Psychological Science, 7, 537–542. https://doi.org/10.1177/1745691612460688
Mayo, D. (2005). Peircean induction and the error-correcting thesis. Transactions of the Charles S. Peirce Society, 41, 299–319.
Mayo, D. (2018). Statistical inference as severe testing: How to get beyond the science wars. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781107286184
McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(sup1), 235–245. https://doi.org/10.1080/00031305.2018.1527253
Merton, R. K. (1957). Priorities in scientific discovery: A chapter in the sociology of science. American Sociological Review, 22, 635–659. https://doi.org/10.2307/2089193
Meyer, M. N., & Chabris, C. (2014, July 31). Why psychologists' food fight matters. Slate. Retrieved from http://www.slate.com/articles/health_and_science/science/2014/07/replication_controversy_in_psychology_bullying_file_drawer_effect_blog_posts.html
Netherlands Organisation for Scientific Research (2016, July 16). NWO makes 3 million available for Replication Studies pilot. Retrieved from https://www.nwo.nl/en/news-and-events/news/2016/nwo-makes-3-million-available-for-replication-studies-pilot.html
Neuliep, J. W., & Crandall, R. (1990). Editorial bias against replication research. Journal of Social Behavior and Personality, 5(4), 85–90.
Neuliep, J. W., & Crandall, R. (1993). Reviewer bias against replication research. Journal of Social Behavior and Personality, 8(6), 21–29.
Nosek, B. A., & Errington, T. M. (2017). Reproducibility in cancer biology: Making sense of replications. eLife, 6, e23383.
Nuzzo, R. (2015). How scientists fool themselves—and how they can stop. Nature, 526, 182–185. https://doi.org/10.1038/526182a
Open Science Collaboration (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7(6), 657–660. https://doi.org/10.1177/1745691612462588
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. https://doi.org/10.1126/science.aac4716
Pashler, H., Coburn, N., & Harris, C. R. (2012). Priming of social distance? Failure to replicate effects on social and food judgments. PLoS ONE, 7, e42510. https://doi.org/10.1371/journal.pone.0042510
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531–536. https://doi.org/10.1177/1745691612463401
Pashler, H., Harris, C. R., & Coburn, N. (2011). Elderly-related words prime slow walking. PsychFileDrawer. Retrieved from http://www.PsychFileDrawer.org/replication.php?attempt=MTU%3D
Patil, P., Peng, R. D., & Leek, J. T. (2016). A statistical definition for reproducibility and replicability. bioRxiv, XXX, 066803. https://doi.org/10.1101/066803
Peirce, C. S. (1958). The logic of drawing history from ancient documents. In A. W. Burks (Ed.), The collected papers of Charles Sanders Peirce (Vol. IV, pp. 89–107). Cambridge, MA: Belknap Press. (Original work published 1901)
Peng, R. D. (2011). Reproducible research in computational science. Science, 334, 1226–1227. https://doi.org/10.1126/science.1213847
Popper, K. R. (1959/2002). The logic of scientific discovery. Classics Series. London: Routledge.
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712. https://doi.org/10.1038/nrd3439-c1
Radder, H. (1996). In and about the world: Philosophical studies of science and technology. Albany, NY: State University of New York Press.
Reichenbach, H. (1938). Experience and prediction. Chicago, IL: University of Chicago Press.
Romeijn, J.-W. (2014). Philosophy of statistics. In E. Zalta (Ed.), The Stanford encyclopedia of philosophy. Retrieved from https://plato.stanford.edu/archives/sum2018/entries/statistics/
Romero, F. (2016). Can the behavioral sciences self-correct? A social epistemic study. Studies in History and Philosophy of Science Part A, 60, 55–69. https://doi.org/10.1016/j.shpsa.2016.10.002
Romero, F. (2017). Novelty vs. replicability: Virtues and vices in the reward system of science. Philosophy of Science, 84(5), 1031–1043.
Romero, F. (2018). Who should do replication labor? Advances in Methods and Practices in Psychological Science, 1(4), 516–537.
Romero, F., & Sprenger, J. (2019). Scientific self-correction: The Bayesian way. Retrieved from https://psyarxiv.com/daw3q/
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638–641. https://doi.org/10.1037/0033-2909.86.3.638
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237. https://doi.org/10.3758/PBR.16.2.225
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90–100. https://doi.org/10.1037/a0015108
Schnall, S. (2014). Further thoughts on replications, ceiling effects and bullying. Retrieved from http://www.psychol.cam.ac.uk/cece/blog
Schönbrodt, F., Heene, M., Maier, M., & Zehetleitner, M. (2015). The replication-/credibility-crisis in psychology: Consequences at LMU? Retrieved from https://osf.io/nptd9/
Shanks, D. R., Newell, B. R., Lee, E. H., Balakrishnan, D., Ekelund, L., Cenac, Z., … Moore, C. (2013). Priming intelligent behavior: An elusive phenomenon. PLoS ONE, 8, e56515. https://doi.org/10.1371/journal.pone.0056515
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632
Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014). An introduction to Registered Replication Reports at Perspectives on Psychological Science. Perspectives on Psychological Science, 9, 552–555. https://doi.org/10.1177/1745691614543974
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534–547. https://doi.org/10.1037/a0033242
Smith, N. C. (1970). Replication studies: A neglected aspect of psychological research. American Psychologist, 25, 970–975. https://doi.org/10.1037/h0029774
Spellman, B. A. (2015). A short (personal) future history of Revolution 2.0. Perspectives on Psychological Science, 10, 886–899. https://doi.org/10.1177/1745691615609918
Sprenger, J. (2016). Bayesianism vs. frequentism in statistical inference. In The Oxford handbook of probability and philosophy (pp. 185–209). UK: Oxford University Press.
Stegenga, J. (2009). Robustness, discordance, and relevance. Philosophy of Science, 75(5), 650–661.
Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59–71.
Tackett, J. L., Lilienfeld, S. O., Patrick, C. J., Johnson, S. L., Krueger, R. F., Miller, J. D., … Shrout, P. E. (2017). It's time to broaden the replicability conversation: Thoughts for and from clinical psychological science. Perspectives on Psychological Science, 12, 742–756. https://doi.org/10.1177/1745691617690042
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37, 1–2. https://doi.org/10.1080/01973533.2015.1012991
van Aert, R. C. M., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11, 713–729. https://doi.org/10.1177/1745691616650874
Vazire, S. (2015). Editorial. Social Psychological and Personality Science, 7, 3–7. https://doi.org/10.1177/1948550615603955
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
Wilholt, T. (2013). Epistemic trust in science. British Journal for the Philosophy of Science, 64(2), 233–253.
Wimsatt, W. C. (1981). Robustness, reliability, and overdetermination. In M. Brewer, & B. Collins (Eds.), Scientific inquiry and the social sciences (pp. 124–163). San Francisco: Jossey-Bass.
Yong, E. (2012, March 10). A failed replication attempt draws a scathing personal attack from a psychology professor [Web log post]. Discover Magazine Blog. Retrieved from http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/
Ziliak, S., & McCloskey, D. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor, MI: University of Michigan Press.

AUTHOR BIOGRAPHY
Felipe Romero is currently an assistant professor in the Department of Theoretical Philosophy at the University of Groningen. He obtained his PhD in the Philosophy-Neuroscience-Psychology program, Department of Philosophy, at Washington University in St. Louis (2016) and received a BSc (2006) and MSc (2008) in Computer Science and a BA in Philosophy (2007) from the University of Los Andes. His research falls in the domains of philosophy of science, social epistemology, and philosophy of cognitive science. He works primarily on understanding how the social organization of research activities affects their epistemic outcomes. His recent publications inform debates about replicability in the social and behavioral sciences.

How to cite this article: Romero F. Philosophy of science and the replicability crisis. Philosophy Compass. 2019;14:e12633. https://doi.org/10.1111/phc3.12633