
Unworkable and empty

Benjamin et (many) al recently proposed that the p-value threshold for declaring a (new) result “statistically significant” should be divided by 10, reducing it from 0.05 to 0.005. Lakens et (many) al responded by arguing “that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.” Amrhein and Greenland, on the one hand, and McShane et al, on the other, responded with suggestions that we simply abandon statistical significance entirely (McShane et al pdf; blog post). Trafimow et (many) al also argue against Benjamin et al’s proposal and against the scientific utility of p-values in general. A statistician named Crane recently wrote a narrower, more technical criticism of Benjamin et al, arguing that p-hacking (broadly construed) calls the substantive claims of Redefine Statistical Significance (RSS) into question.

The fifth author of RSS (Wagenmakers) and one of his PhD students (Gronau) recently posted an exceptionally disingenuous response to Crane’s paper. It’s exceptionally disingenuous for two reasons. First, Wagenmakers and Gronau simply ignore Crane’s argument, defending a component of RSS that Crane isn’t arguing against. Second, the proposal in RSS – to lower the p-value threshold at which (new) effects earn the label “statistically significant” from 0.05 to 0.005 – is explicitly non-Bayesian, even if it relies on some Bayesian reasoning; yet most of Wagenmakers and Gronau’s post consists of a fanciful metaphor in which Crane is directly attacking Bayesian statistics. The non-Bayesian nature of RSS is made clear in its fourth paragraph, which begins, “We also restrict our recommendation to studies that conduct null hypothesis significance tests.” To Wagenmakers and Gronau’s credit, they published Crane’s response at the end of the post.

So, why am I chiming in now? To point out that the original RSS proposal is unworkable as stated and, ultimately, essentially free of substantive content. I think Crane makes a pretty compelling case that, even working within the general framework that RSS seems to assume, the proposal won’t do what Benjamin et al claim it will do (e.g., reduce false positive rates and increase reproducibility by factors of two or more). But I don’t even think you need to dig into the technicalities the way Crane does to argue against RSS.
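To make the stakes concrete: in an idealized world of honestly reported tests, lowering the threshold mechanically cuts the false positive rate by a factor of ten. Here’s a minimal simulation sketch (my own, not anything from RSS or Crane’s paper), assuming two-sample t-tests of a true null effect with no p-hacking; Crane’s argument, as I read it, is that p-hacking breaks precisely this idealization.

```python
# A minimal sketch (mine, not from RSS or Crane): with honestly reported
# two-sample t-tests of a true null effect, the false positive rate
# simply tracks the alpha threshold. P-hacking breaks this idealization.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group = 20_000, 30

pvals = np.empty(n_studies)
for i in range(n_studies):
    a = rng.normal(0.0, 1.0, n_per_group)  # the null is true:
    b = rng.normal(0.0, 1.0, n_per_group)  # no group difference
    pvals[i] = stats.ttest_ind(a, b).pvalue

print(f"false positive rate at alpha = 0.05:  {np.mean(pvals < 0.05):.4f}")   # ~0.05
print(f"false positive rate at alpha = 0.005: {np.mean(pvals < 0.005):.4f}")  # ~0.005
```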

To be clear, I think Benjamin et al are correct to point out that a p-value of just less than 0.05 is “evidentially weak” (as Wagenmakers and Gronau describe it in the Bayesian Spectacles post). Be that as it may, the allegedly substantive proposal to redefine statistical significance is all but meaningless.
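If you want a concrete sense of what “evidentially weak” means here, one standard calibration is the Sellke–Bayarri–Berger upper bound on the Bayes factor; the RSS authors lean on calculations in the same family. A quick sketch of my own arithmetic, using only that published bound:

```python
# A back-of-the-envelope sketch (my own) using the Sellke-Bayarri-Berger
# bound: for p < 1/e, the Bayes factor in favor of the alternative over
# the null is at most 1 / (-e * p * ln p).
import math

def max_bayes_factor(p):
    """Upper bound on the Bayes factor for H1 over H0; valid for p < 1/e."""
    assert p < 1 / math.e
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.005):
    print(f"p = {p}: Bayes factor at most {max_bayes_factor(p):.1f}")
# p = 0.05 caps the Bayes factor near 2.5 (weak evidence by most
# conventions); p = 0.005 caps it near 14.
```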

Benjamin et al “restrict our recommendation to claims of discovery of new effects,” but they do not even begin to address what would or wouldn’t count as a new effect. Everyone agrees that exact replications are impossible. Even the most faithful replication of a psychology experiment will have, at the very least, a new sample of subjects. And, of course, even if you could get the original subjects to participate again, the experience of having participated once (along with everything else that has happened to them since participating) will have changed them, if only minimally. As it happens, psychology replications tend to differ in all sorts of other ways, too, often being carried out in different locations, with newly developed materials and changes to experimental protocols.

As you change bits and pieces of experiments, you eventually shift from doing a direct replication to doing a conceptual replication (for more on this distinction, see here, among many other places). It seems pretty clear to me that there’s no bright-line distinction between direct and conceptual replications.

How does this bear on the RSS proposal? I think you could make a pretty compelling case that conceptual replications should count as “new” effects. I strongly suspect that Benjamin et al would disagree, but I don’t know for sure, because, again, they haven’t laid out any criteria for what should count as new. Without doing so, the proposal cannot be implemented.

But it’s not clear to me that it’s worth doing this (undoubtedly difficult) work. Here’s a sentence from the second paragraph and a longer chunk from the second-to-last paragraph in RSS (emphasis mine):

Results that would currently be called significant but do not meet the new threshold should instead be called suggestive….

For research communities that continue to rely on null hypothesis significance testing, reducing the P value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility. We emphasize that this proposal is about standards of evidence, not standards for policy action nor standards for publication. Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods. This proposal should not be used to reject publications of novel findings with 0.005 < P < 0.05 properly labelled as suggestive evidence.

The proposal is explicitly not about policy action or publication standards. It is all and only about labels applied to statistical test results.

Researchers are now, and have always been, well within their rights and abilities not to consider $p \approx 0.05$ strong evidence of an effect. Anyone interested in (directly or conceptually) replicating an interesting original finding is free to do so only if the original finding meets their preferred standard of evidence, however stringent. (I note in passing that hardcore frequentists are exceedingly unlikely to be moved by Benjamin et al’s Bayesian argument for the evidential weakness of $p \approx 0.05$.)

To the extent that Benjamin et al are arguing that holding results to more stringent standards of evidence is researchers’ responsibility, I agree. But Benjamin et al are manifestly not just arguing that $p \approx 0.05$ is weak evidence. They are arguing that reserving the label “statistically significant” for $p \leq 0.005$ (and the label “suggestive” for $0.005 < p < 0.05$) will improve reproducibility and reduce false alarms.

The substance of the proposal, such as it is, is concerned entirely with changing how we use semi-esoteric statistical jargon to label different sets of test statistics.
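To put the point as starkly as possible, here is the proposal’s entire operative content as code (a glib sketch of my own, obviously not anyone’s published implementation):

```python
# A glib sketch (mine, not anyone's published implementation) of the
# proposal's entire operative content. Note that it assumes you have
# already decided the effect is "new" -- the part RSS never defines.
def rss_label(p: float) -> str:
    """Label the p-value of a claimed *new* effect under the RSS proposal."""
    if p <= 0.005:
        return "statistically significant"
    elif p < 0.05:
        return "suggestive"
    return "not significant"

print(rss_label(0.03))  # "suggestive" -- would have been "significant" before
```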

I agree with Benjamin et al that important research questions addressed with rigorous methods can merit publication. In fact, I would go even further and argue that publication decisions should be based entirely on how interesting the research questions are and how rigorous the methods used to answer the questions are. This is at the heart of the meta-analytic mind-set I discussed in an earlier post.