Truth, utility, and null hypotheses

Discussing criticisms of null hypothesis significance testing (NHST), McShane & Gal write (emphasis mine):

More broadly, statisticians have long been critical of the various forms of dichotomization intrinsic to the NHST paradigm such as the dichotomy of the null hypothesis versus the alternative hypothesis and the dichotomization of results into the different categories statistically significant and not statistically significant. … More specifically, the sharp point null hypothesis of $\theta = 0$ used in the overwhelming majority of applications has long been criticized as always false — if not in theory at least in practice (Berkson 1938; Edwards, Lindman, and Savage 1963; Bakan 1966; Tukey 1991; Cohen 1994; Briggs 2016); in particular, even were an effect truly zero, experimental realities dictate that the effect would generally not be exactly zero in any study designed to test it.

I really like this paper (and a similar one from a couple years ago), but this kind of reasoning has become one of my biggest statistical pet peeves. The fact that “experimental realities” generally produce non-zero effects in statistics calculated from samples is one of the primary motivations for the development of NHST in the first place. It is why, for example, the null hypothesis is typically – read always – expressed as a distribution of possible test statistics under the assumption of zero effect. The whole point is to evaluate how consistent an observed statistic is with a zero-effect probability model.
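A quick simulation makes this concrete. The sketch below is only an illustration, not anything from McShane & Gal. It assumes two groups drawn from the same normal population, so the true effect is exactly zero. The function names, the group size of 20, and the "observed" difference of 0.5 are all made up for the example.

```python
import random
import statistics


def simulate_null_differences(n_per_group, n_sims, seed=0):
    """Differences in sample means when the true effect is exactly zero."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        # Both groups come from the same N(0, 1) population: zero true effect.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diffs.append(statistics.mean(group_a) - statistics.mean(group_b))
    return diffs


def monte_carlo_p_value(observed_diff, null_diffs):
    """Share of null differences at least as extreme as the observed one."""
    extreme = sum(1 for d in null_diffs if abs(d) >= abs(observed_diff))
    return extreme / len(null_diffs)


null_diffs = simulate_null_differences(n_per_group=20, n_sims=5000)

# Count simulated studies whose observed difference is exactly zero.
n_exactly_zero = sum(1 for d in null_diffs if d == 0.0)

# Hypothetical observed difference, chosen only for illustration.
p_value = monte_carlo_p_value(0.5, null_diffs)

print("studies with an exactly zero difference:", n_exactly_zero)
print("mean of null differences:", round(statistics.mean(null_diffs), 3))
print("two-sided Monte Carlo p for an observed 0.5:", round(p_value, 3))
```

The simulated differences scatter around zero without landing on it, and the p value is just a statement about where an observed difference falls in that scatter. That is the sense in which the null is a distribution rather than the single number zero.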

Okay, actually, that’s not true. The point of statistical testing is to evaluate how consistent an observed statistic is with a probability model of interest. And this gets at the more important matter. I agree with McShane & Gal (and, I imagine, at least some of the people they cite) that the standard zero-effect null is probably not true in many cases, particularly in social and behavioral science.

The problem is not that this model is often false. A false model can be useful. (Insert Box’s famous quote about this here.) The problem is that the standard zero-effect model is very often not interesting or useful.

Assuming zero effect makes it (relatively) easy to derive a large number of probability distributions for various test statistics. Also, there are typically infinitely many alternative, non-zero hypotheses. So, fine, zero-effect null hypotheses provide non-trivial convenience and generality.

But this doesn’t make them scientifically interesting. And if they’re not scientifically interesting, it’s not clear that they’re scientifically useful.

In principle, we could use quantitative models of social, behavioral, and/or health-related phenomena as interesting and useful (though almost certainly still false) “null” models against which to test data or, per my preferences, to estimate quantitative parameters of interest. Of course, it’s (very) hard work to develop such models, and many academic incentives push pretty hard against the kind of slow, thorough work that would be required in order to do so.

Puzzle Pieces

[latexpage]

I don’t remember when or where I first encountered the distinction between (statistical) estimation and testing. I do remember being unsure exactly how to think about it. I also remember feeling like it was probably important.

Of course, in order to carry out many statistical tests, you have to estimate various quantities. For example, in order to do a t-test, you have to estimate group means and some version of pooled variation. Estimation is necessary for this kind of testing.

Also, at least some estimated quantities provide all of the information that statistical tests provide, plus some additional information. For example, a confidence interval around an estimated mean tells you everything that a single-sample t-test would tell you about that mean, plus a little more. For a given $\alpha$ value and sample size, both would tell you if the estimated mean is statistically significantly different from a reference mean $\mu_0$, but the $1 - \alpha$ CI gives you more: it defines the set of all of the reference means you would fail to reject (i.e., the confidence interval itself) and the set of all of the means you would reject (i.e., everything else, which for a two-sided interval is two disconnected pieces, one below the interval and one above it).
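The correspondence between the interval and the test can be checked directly. Here is a minimal sketch, assuming a two-sided test with $\alpha = 0.05$. The data are invented, the function names are hypothetical, and the critical value 2.262 is the usual table value for 9 degrees of freedom, hard-coded to keep the example dependency-free.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for H0: population mean equals mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error


def confidence_interval(sample, t_crit):
    """Two-sided CI for the mean, given a two-sided critical t value."""
    n = len(sample)
    half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
    center = statistics.mean(sample)
    return center - half_width, center + half_width


# Made-up data: n = 10, so df = 9.
sample = [4.1, 5.3, 6.0, 4.8, 5.5, 6.2, 5.1, 4.9, 5.8, 5.4]
T_CRIT = 2.262  # two-sided critical t for alpha = 0.05 and df = 9

low, high = confidence_interval(sample, T_CRIT)
print("95% CI:", round(low, 2), "to", round(high, 2))

# For each candidate reference mean, the test and the CI give the same answer.
for mu0 in [4.0, 4.8, 5.0, 5.3, 5.8, 6.5]:
    rejected = abs(one_sample_t(sample, mu0)) > T_CRIT
    outside_ci = mu0 < low or mu0 > high
    print(mu0, "rejected" if rejected else "not rejected", "| outside CI:", outside_ci)
```

Every reference mean inside the interval goes unrejected and every one outside it is rejected. A single t-test reports one of those comparisons, while the interval reports all of them at once.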

The point being that estimation and testing are not mutually exclusive.

I have recently read a few papers that have helped me clarify my thinking about this distinction. One is Francis (2012) The Psychology of Replication and Replication in Psychology. Another is Stanley & Spence (2014) Expectations for Replications: Are Yours Realistic? A third is McShane & Gal (2016) Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence.

One of the key insights from Francis’ paper is nicely illustrated in his Figure 1, which shows how smaller effect size estimates are inflated by the file drawer problem (non-publication of results that are not statistically significant) and data peeking (stopping data collection if a test is statistically significant, otherwise collecting more data, rinse, repeat).
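The file drawer half of this is easy to see in a toy simulation. What follows is a minimal sketch rather than Francis' actual analysis. It assumes each study estimates a mean from 20 observations with a known standard deviation of 1, and that only studies with a significant positive z statistic get "published". The function names, sample size, and effect sizes are all hypothetical, and data peeking is not simulated.

```python
import math
import random
import statistics


def run_study(true_effect, n, rng):
    """One study: sample mean of n draws from N(true_effect, 1), plus a z test."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    estimate = statistics.mean(sample)
    z = estimate * math.sqrt(n)  # known sd of 1, so the standard error is 1/sqrt(n)
    # "Significant" here means past the two-sided 0.05 cutoff, in the positive direction.
    return estimate, z > 1.96


def file_drawer(true_effect, n, n_studies, seed=1):
    """Return every estimate, and the subset that survives the significance filter."""
    rng = random.Random(seed)
    results = [run_study(true_effect, n, rng) for _ in range(n_studies)]
    all_estimates = [est for est, _ in results]
    published = [est for est, significant in results if significant]
    return all_estimates, published


for true_effect in [0.1, 0.3, 0.5, 0.8]:
    all_estimates, published = file_drawer(true_effect, n=20, n_studies=4000)
    print(
        "true:", true_effect,
        "| mean of all studies:", round(statistics.mean(all_estimates), 2),
        "| mean of published:", round(statistics.mean(published), 2),
        "| share published:", round(len(published) / len(all_estimates), 2),
    )
```

Averaging over all studies recovers the true effect, but averaging over only the published ones overshoots it, and the overshoot is largest when the true effect is small, because only the luckiest small-effect studies clear the filter.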

One of the key insights from Stanley & Spence is nicely illustrated in their Figure 4, which shows histograms of possible replication correlations for four different true correlations and a published correlation of 0.30, based on simulations of the effects of sampling and measurement error.
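A stripped-down version of this kind of simulation is short enough to sketch. This is not Stanley & Spence's code or design. It assumes a single true correlation of 0.30, a replication sample size of 80 (an arbitrary choice), bivariate normal data, and sampling error only, with no measurement error. The function names are hypothetical.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def replication_correlations(true_rho, n, n_replications, seed=2):
    """Sample correlations from repeated studies with the same true correlation."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_replications):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # y mixes x with independent noise so that corr(x, y) = true_rho in the population.
        ys = [true_rho * x + math.sqrt(1 - true_rho ** 2) * rng.gauss(0.0, 1.0) for x in xs]
        rs.append(pearson_r(xs, ys))
    return rs


rs = sorted(replication_correlations(true_rho=0.30, n=80, n_replications=2000))
low, high = rs[int(0.025 * len(rs))], rs[int(0.975 * len(rs)) - 1]

print("median replication r:", round(statistics.median(rs), 2))
print("middle 95% of replication r values:", round(low, 2), "to", round(high, 2))
```

Even with the true correlation fixed and nothing but sampling error in play, the replication correlations spread over a wide range, so a replication that comes in well below (or above) the published value is not surprising on its own.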

Finally, a key insight from McShane & Gal is nicely illustrated in their Table 1, which shows how the presence or absence of statistical significance can influence the interpretation of simple numbers.

This one requires a bit more explanation, so: Participants were given a summary of a (fake) cancer study in which people in Group A lived, on average, 8.2 months post-diagnosis and people in Group B lived, on average, 7.5 months. Along with this, a p value was reported, either 0.01 or 0.27. Participants were asked to determine (using much less obvious descriptions) if 8.2 > 7.5 (Option A), if 7.5 > 8.2 (Option B), if 8.2 = 7.5 (Option C), or if the relative magnitudes of 8.2 and 7.5 were impossible to determine (Option D). As their Table 1 shows, the reported p value had a very large effect on which option people chose.

Okay, so how do these fit together? And how do they relate to estimation vs testing?

Francis makes a compelling case against using statistical significance as a publication filter. McShane & Gal make a compelling case that dichotomization of evidence via statistical significance leads people to make absurd judgments about summary statistics. Stanley & Spence make a compelling case that it is very easy to fail to appreciate the importance of variation.

Stanley & Spence’s larger argument is that researchers should adopt a meta-analytic mind-set, in which studies, and statistics within studies, are viewed as data points in possible meta-analyses. Individual studies provide limited information, and they are never the last word on a topic. In order to ease synthesis of information across studies, statistics should be reported in detail. Focusing too much on statistical significance produces overly optimistic sets of results (e.g., Francis’ inflated effect sizes), creates absurd blind spots (e.g., McShane & Gal’s survey results), and gives the analyst more certainty than is warranted (as Andrew Gelman has written about many times, e.g., here).

Testing is concerned with statistical significance (or Bayesian analogs). The meta-analytic mind-set is concerned with estimation. And since estimation often subsumes testing (e.g., CIs and null hypothesis tests), a research consumer who really feels the need for a dichotomous conclusion can often glean just that from non-dichotomous reports.