Deborah Mayo and Richard Morey recently posted a very interesting criticism of the diagnostic screening model of statistical testing (e.g., this, this). Mayo & Morey argue that this approach to criticism of null hypothesis significance testing (NHST) relies on a misguided hybrid of frequentist and Bayesian reasoning. The paper is worth reading in its entirety, but in this post I will focus narrowly on a non-central point that they make.

At the beginning of the section *Some Well-Known Fallacies* (p. 6), Mayo & Morey write:

> From the start, Fisher warned that to use p-values to legitimately indicate incompatibility (between data and a model), we need more than a single isolated low p-value: we must demonstrate an experimental phenomenon.

They then quote Fisher:

> [W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

It’s worth noting that in the Fisherian approach to NHST, p values provide a continuous measure of discrepancy from a null model (see also, e.g., the beginning of the second section in this commentary by Andrew Gelman). If a continuous measure of discrepancy is dichotomized (e.g., into the categories "statistically significant" and "not statistically significant"), the criterion for dichotomization is typically arbitrary; common p value thresholds like 0.05 and 0.01 are not generally given any kind of rationale, though there is a case for explicit justification of such thresholds.

(As a brief aside, in the Neyman-Pearson approach to NHST, dichotomization of test results is baked in from the start, and p values are not treated as continuous measures of discrepancy between model and data. A priori specification of *α* sets a hard, though still typically arbitrary, threshold. Here is a tutorial with detailed discussion of the differences between the Fisher and Neyman-Pearson approaches to NHST.)

The point here is that the Fisher quote above could (and probably should) be revised to say that we can describe a phenomenon as experimentally demonstrable when we know how to conduct an experiment that will rarely fail to produce a large discrepancy.

Clearly, in determining what counts as "large", we will run into some of the same problems that we run into in determining cutoffs for statistical significance. But focusing on discrepancies in the space in which our measurements are taken will force us to focus on "clinical significance" rather than "statistical significance". This will make it much easier to argue for or against any particular criterion (assuming you are okay with criteria for "large" vs "not large") or to take costs and benefits directly into account and use statistical decision theory (if you are not okay with such criteria).
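To make the contrast between the two criteria concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not something taken from Mayo & Morey or Fisher: a one-sample design with known standard deviation, a hypothetical true effect of 0.2, a sample size of 400, and a made-up "large" cutoff of 0.5 in the units of the measurement.

```python
import random
from statistics import NormalDist

def replicate(true_effect, sigma, n, n_reps, alpha=0.05, large=0.5, seed=1):
    """Simulate n_reps one-sample experiments with known sigma.

    Returns two proportions: the share of experiments with a two-sided
    z-test p value below alpha (null hypothesis: mean = 0), and the share
    whose observed mean is at least `large` (a hypothetical cutoff).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sigma / n ** 0.5  # standard error of the sample mean
    n_sig = 0
    n_large = 0
    for _ in range(n_reps):
        xbar = sum(rng.gauss(true_effect, sigma) for _ in range(n)) / n
        p = 2 * (1 - std_normal.cdf(abs(xbar / se)))
        n_sig += p < alpha
        n_large += xbar >= large
    return n_sig / n_reps, n_large / n_reps

# Illustrative values only: a small true effect measured with a big sample.
sig_rate, large_rate = replicate(true_effect=0.2, sigma=1.0, n=400, n_reps=2000)
print(f"share with p < 0.05:         {sig_rate:.3f}")
print(f"share with observed >= 0.5:  {large_rate:.3f}")
```

Under these assumed values, the experiment will rarely fail to give a statistically significant result, yet it will almost never produce a discrepancy that meets the "large" cutoff. The two versions of Fisher's criterion can come apart, which is the reason to argue about the cutoff in the space of the measurements.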

To be as clear as I can be, I’m not in favor of "banning" p values (or the statistics that would allow a reader to calculate an otherwise unreported p value, e.g., means and standard errors). If you are concerned with error control, the information provided by p values is important. But the interested reader can decide for him- or herself how to balance false alarms and misses. There is no need for the researcher reporting a study to declare a result "statistically significant" or not.

I wholeheartedly agree with Lakens et al. when they write:

> [W]e recommend that the label "statistically significant" should no longer be used. Instead, researchers should provide more meaningful interpretations of the theoretical or practical relevance of their results.

In reading and thinking (and writing, e.g., see here, here, here, and here) about statistical significance lately, I feel like some important implications of frequentist statistics have really clicked for me.

A while back, it occurred to me that, at least under certain interpretations of confidence intervals (CIs), it doesn’t make much sense to actually report CI limits. The coverage of CIs is a property of the *procedure* of constructing CIs, but any particular set of observed CI limits does not tell you anything either probabilistic or useful about the location of a "true" parameter value (scare quotes because it’s not at all obvious to me that the notion of a true parameter value is of much use).
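The claim that coverage belongs to the procedure can be checked with a short simulation. This is a rough sketch under assumptions of my own choosing (normal data, known standard deviation, a hypothetical true mean of 10), using only the Python standard library.

```python
import random
from statistics import NormalDist

def coverage(mu, sigma, n, n_reps, level=0.95, seed=2):
    """Share of simulated z-intervals (known sigma) that contain mu."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sigma / n ** 0.5
    hits = 0
    for _ in range(n_reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Any single interval either contains mu or it does not.
        hits += (xbar - half_width) <= mu <= (xbar + half_width)
    return hits / n_reps

# Illustrative values only.
rate = coverage(mu=10.0, sigma=2.0, n=25, n_reps=5000)
print(f"share of intervals containing mu: {rate:.3f}")
```

The printed share should sit close to the nominal 0.95, but that number describes the whole collection of intervals. Each individual interval in the loop simply does or does not contain the assumed mean.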

(Under a test-inversion interpretation of CIs, reporting a particular set of CI limits can be useful, since this indicates the range of parameter values that you cannot reject, given a particular test and statistical significance criterion. But, then again, the test-inversion interpretation is not without its own serious problems.)
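As a small sketch of the test-inversion reading, the code below scans a grid of candidate null values and keeps those that a two-sided z-test does not reject. The observed mean, standard deviation, sample size, and grid settings are all hypothetical values picked for illustration.

```python
from statistics import NormalDist

def z_test_p(xbar, mu0, sigma, n):
    """Two-sided p value for H0: mean = mu0, with known sigma."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def invert_test(xbar, sigma, n, alpha=0.05, grid_step=0.001, span=2.0):
    """Smallest and largest grid values of mu0 that are not rejected."""
    steps = int(round(2 * span / grid_step))
    candidates = [xbar - span + i * grid_step for i in range(steps + 1)]
    kept = [m for m in candidates if z_test_p(xbar, m, sigma, n) >= alpha]
    return min(kept), max(kept)

# Illustrative values only.
xbar, sigma, n = 0.3, 1.0, 100
lo, hi = invert_test(xbar, sigma, n)

# The usual closed-form 95% z-interval, for comparison.
z_crit = NormalDist().inv_cdf(0.975)
ci = (xbar - z_crit * sigma / n ** 0.5, xbar + z_crit * sigma / n ** 0.5)
print(f"not rejected: [{lo:.3f}, {hi:.3f}]")
print(f"95% CI:       [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Up to the resolution of the grid, the two ranges agree, which is all the test-inversion interpretation asserts: the limits mark the parameter values this particular test, at this particular criterion, fails to reject.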

Anyway, I bring all this up here at the end just to point out an important parallel between CIs and statistical significance. By the logic of frequentist probability – probabilities just *are* long-run relative frequencies of subsets of events – both CIs and statistical significance are only meaningful (and are only meaningfully probabilistic) across repeated replications. Given this, I am more and more convinced that individual research reports should focus on estimation (and quantification of uncertainty) rather than testing (and uncertainty laundering).

Strangely, and quite possibly incoherently, I think I may have convinced myself that frequentist statistical testing is inherently meta-analytic.