Unworkable and empty

Benjamin et (many) al recently proposed that the p-value for declaring a (new) result “statistically significant” should be divided by 10, reducing it from 0.05 to 0.005. Lakens et (many) al responded by arguing that “researchers should transparently report and justify all choices they make when designing a study, including the alpha level.” Amrhein and Greenland, on the one hand, and McShane et al, on the other, responded with suggestions that we simply abandon statistical significance entirely (McShane et al pdf; blog post). Trafimow et (many) al also argue against Benjamin et al’s proposal and the scientific utility of p-values in general. A statistician named Crane recently wrote a narrower, more technical criticism of Benjamin et al, arguing that p-hacking (broadly construed) calls the substantive claims of Redefine Statistical Significance (RSS) into question.

The fifth author of RSS (Wagenmakers) and one of his PhD students (Gronau) recently posted an exceptionally disingenuous response to Crane’s paper. It’s exceptionally disingenuous for two reasons. First, Wagenmakers and Gronau just ignore Crane’s argument, defending a component of RSS that Crane isn’t arguing against. Second, the proposal in RSS – to lower the p-value threshold for calling (new) effects “statistically significant” from 0.05 to 0.005 – is explicitly non-Bayesian, even if it relies on some Bayesian reasoning, but most of Wagenmakers and Gronau’s post consists of a fanciful metaphor in which Crane is directly attacking Bayesian statistics. The non-Bayesian nature of RSS is made clear in the fourth paragraph, which begins with “We also restrict our recommendation to studies that conduct null hypothesis significance tests.” To Wagenmakers and Gronau’s credit, they published Crane’s response at the end of the post.

So, why am I chiming in now? To point out that the original RSS proposal is unworkable as stated and, ultimately, essentially free of substantive content. I think Crane makes a pretty compelling case that, even working within the general framework that RSS seems to assume, the proposal won’t do what Benjamin et al claim it will do (e.g., reduce false positive rates and increase reproducibility by factors of two or more). But I don’t even think you need to dig into the technicalities the way Crane does to argue against RSS.

To be clear, I think Benjamin et al are correct to point out that a p-value of just less than 0.05 is “evidentially weak” (as Wagenmakers and Gronau describe it in the Bayesian Spectacles post). Be that as it may, the allegedly substantive proposal to redefine statistical significance is all but meaningless.

Benjamin et al “restrict our recommendation to claims of discovery of new effects,” but they do not even begin to address what would or wouldn’t count as a new effect. Everyone agrees that exact replications are impossible. Even the most faithful replication of a psychology experiment will have, at the very least, a new sample of subjects. And, of course, even if you could get the original subjects to participate again, the experience of having participated once (along with everything else that has happened to them since participating) will have changed them, if only minimally. As it happens, psychology replications tend to differ in all sorts of other ways, too, often being carried out in different locations, with newly developed materials and changes to experimental protocols.

As you change bits and pieces of experiments, eventually you shift from doing a direct replication to doing a conceptual replication (for more on this distinction, see here, among many other places). It seems pretty clear to me that there’s no bright line distinction between direct and conceptual replications.

How does this bear on the RSS proposal? I think you could make a pretty compelling case that conceptual replications should count as “new” effects. I strongly suspect that Benjamin et al would disagree, but I don’t know for sure, because, again, they haven’t laid out any criteria for what should count as new. Without doing so, the proposal cannot be implemented.

But it’s not clear to me that it’s worth doing this (undoubtedly difficult) work. Here’s a sentence from the second paragraph and a longer chunk from the second-to-last paragraph in RSS (emphasis mine):

Results that would currently be called significant but do not meet the new threshold should instead be called suggestive….

For research communities that continue to rely on null hypothesis significance testing, reducing the P value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility. We emphasize that this proposal is about standards of evidence, not standards for policy action nor standards for publication. Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods. This proposal should not be used to reject publications of novel findings with 0.005 < P < 0.05 properly labelled as suggestive evidence.

The proposal is explicitly not about policy action or publication standards. It is all and only about labels applied to statistical test results.

Researchers are now, and have always been, well within their rights and abilities not to consider p \approx 0.05 strong evidence of an effect. Anyone interested in (directly or conceptually) replicating an interesting original finding is free to do so only if the original finding meets their preferred standard of evidence, however stringent. (I note in passing that hardcore frequentists are exceedingly unlikely to be moved by their Bayesian argument for the evidential weakness of p \approx 0.05.)

To the extent that Benjamin et al are arguing that using more stringent standards of evidence is also researchers’ responsibility, I agree. But Benjamin et al are manifestly not just arguing that p \approx 0.05 is weak evidence. They are arguing that reserving the label “statistically significant” for p \leq 0.005 (and the label “suggestive” for 0.005 < p < 0.05) will improve reproducibility and reduce false alarms.

The substance of the proposal, such as it is, is concerned entirely with changing how we use semi-esoteric statistical jargon to label different sets of test statistics.
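In fact, the operative content of the proposal fits comfortably in a few lines of code. This is my own illustrative sketch, and how to handle values falling exactly on the 0.005 and 0.05 boundaries is my assumption, since RSS speaks of 0.005 < P < 0.05:

```python
def rss_label(p: float) -> str:
    """The operative content of the RSS proposal: a relabeling of p-values.

    Boundary handling at exactly 0.005 and 0.05 is an assumption on my part.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if p <= 0.005:
        return "statistically significant"
    if p < 0.05:
        return "suggestive"
    return "not statistically significant"

print(rss_label(0.004))  # statistically significant
print(rss_label(0.03))   # suggestive
print(rss_label(0.2))    # not statistically significant
```

Nothing upstream of the label – design, data collection, analysis, publication – is touched; only the returned string changes.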

I agree with Benjamin et al that important research questions addressed with rigorous methods can merit publication. In fact, I would go even further and argue that publication decisions should be based entirely on how interesting the research questions are and how rigorous the methods used to answer the questions are. This is at the heart of the meta-analytic mind-set I discussed in an earlier post.

Truth, utility, and null hypotheses

Discussing criticisms of null hypothesis significance testing (NHST), McShane & Gal write (emphasis mine):

More broadly, statisticians have long been critical of the various forms of dichotomization intrinsic to the NHST paradigm such as the dichotomy of the null hypothesis versus the alternative hypothesis and the dichotomization of results into the different categories statistically significant and not statistically significant. … More specifically, the sharp point null hypothesis of \theta = 0 used in the overwhelming majority of applications has long been criticized as always false — if not in theory at least in practice (Berkson 1938; Edwards, Lindman, and Savage 1963; Bakan 1966; Tukey 1991; Cohen 1994; Briggs 2016); in particular, even were an effect truly zero, experimental realities dictate that the effect would generally not be exactly zero in any study designed to test it.

I really like this paper (and a similar one from a couple of years ago), but this kind of reasoning has become one of my biggest statistical pet peeves. The fact that “experimental realities” generally produce non-zero effects in statistics calculated from samples is one of the primary motivations for the development of NHST in the first place. It is why, for example, the null hypothesis is typically – read: always – expressed as a distribution of possible test statistics under the assumption of zero effect. The whole point is to evaluate how consistent an observed statistic is with a zero-effect probability model.

Okay, actually, that’s not true. The point of statistical testing is to evaluate how consistent an observed statistic is with a probability model of interest. And this gets at the more important matter. I agree with McShane & Gal (and, I imagine, at least some of the people they cite) that the standard zero-effect null is probably not true in many cases, particularly in social and behavioral science.

The problem is not that this model is often false. A false model can be useful. (Insert Box’s famous quote about this here.) The problem is that the standard zero-effect model is very often not interesting or useful.

Assuming zero effect makes it (relatively) easy to derive a large number of probability distributions for various test statistics. Also, there are typically an infinite number of alternative, non-zero hypotheses. So, fine, zero-effect null hypotheses provide non-trivial convenience and generality.
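A quick simulation (my own sketch, not from any of the papers discussed) makes that convenience concrete: under the zero-effect assumption, the null distribution of a difference in sample means is trivial to generate, and it is a spread of non-zero values centered on zero – exactly the “experimental realities” the criticism points to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups drawn from the SAME distribution: the true effect is exactly zero.
n, reps = 25, 10_000
diffs = np.array([
    rng.normal(size=n).mean() - rng.normal(size=n).mean()
    for _ in range(reps)
])

# "Experimental realities": the observed difference is essentially never zero...
print(np.mean(diffs == 0.0))      # ~0.0

# ...which is exactly what the null distribution describes: nonzero
# differences scattered around zero with a predictable spread.
print(diffs.mean(), diffs.std())  # close to 0 and to sqrt(2/n), respectively
```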

But this doesn’t make them scientifically interesting. And if they’re not scientifically interesting, it’s not clear that they’re scientifically useful.

In principle, we could use quantitative models of social, behavioral, and/or health-related phenomena as interesting and useful (though almost certainly still false) “null” models against which to test data or, per my preferences, to estimate quantitative parameters of interest. Of course, it’s (very) hard work to develop such models, and many academic incentives push pretty hard against the kind of slow, thorough work that would be required in order to do so.

Puzzle Pieces

I don’t remember when or where I first encountered the distinction between (statistical) estimation and testing. I do remember being unsure exactly how to think about it. I also remember feeling like it was probably important.

Of course, in order to carry out many statistical tests, you have to estimate various quantities. For example, in order to do a t-test, you have to estimate group means and some version of pooled variation. Estimation is necessary for this kind of testing.

Also, at least some estimated quantities provide all of the information that statistical tests provide, plus some additional information. For example, a confidence interval around an estimated mean tells you everything that a single-sample t-test would tell you about that mean, plus a little more. For a given \alpha value and sample size, both would tell you whether the estimated mean is statistically significantly different from a reference mean \mu_0, but the CI gives you more, defining the full set of reference means you would fail to reject (i.e., the confidence interval itself) and the set of means you would reject (i.e., everything outside it).
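That claim is easy to check numerically. Here is a sketch with made-up data (the sample and the grid of reference means are arbitrary choices of mine): a one-sample t-test at \alpha = 0.05 rejects a reference mean \mu_0 exactly when \mu_0 falls outside the 95% confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.4, scale=1.0, size=30)  # made-up sample

# 95% CI for the mean, based on the t distribution
ci_lo, ci_hi = stats.t.interval(0.95, df=len(x) - 1,
                                loc=x.mean(), scale=stats.sem(x))

# The CI encodes the outcome of every one-sample t-test at alpha = 0.05:
for mu0 in np.linspace(-1.0, 1.0, 201):
    p = stats.ttest_1samp(x, mu0).pvalue
    inside = ci_lo <= mu0 <= ci_hi
    assert (p < 0.05) == (not inside)
```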

The point being that estimation and testing are not mutually exclusive.

I have recently read a few papers that have helped me clarify my thinking about this distinction. One is Francis (2012) The Psychology of Replication and Replication in Psychology. Another is Stanley & Spence (2014) Expectations for Replications: Are Yours Realistic? A third is McShane & Gal (2016) Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence.

One of the key insights from Francis’ paper is nicely illustrated in his Figure 1, which shows how smaller effect size estimates are inflated by the file drawer problem (non-publication of results that are not statistically significant) and data peeking (stopping data collection if a test is statistically significant, otherwise collecting more data, rinse, repeat).
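The file-drawer component of that inflation is easy to reproduce in a toy simulation. This is my own setup, not Francis’ (the true effect size, sample size, and significance filter are arbitrary choices): keep only the “significant” results, and the surviving effect sizes overstate the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

true_d, n, reps = 0.2, 30, 20_000   # small true effect, modest samples
published = []
for _ in range(reps):
    a = rng.normal(true_d, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:   # the file drawer filter
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published.append((a.mean() - b.mean()) / pooled_sd)

# The "published" literature overstates the true effect:
print(np.mean(published))   # well above the true d of 0.2
```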

One of the key insights from Stanley & Spence is nicely illustrated in their Figure 4, which shows histograms of possible replication correlations for four different true correlations and a published correlation of 0.30, based on simulations of the effects of sampling and measurement error.
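A rough sketch in the same spirit (my own, and it ignores the measurement-error component that Stanley & Spence also simulate): draw repeated samples from a population whose true correlation is 0.30 and look at how widely the sample correlations scatter.

```python
import numpy as np

rng = np.random.default_rng(2)

def replication_rs(true_r, n, reps=5_000):
    """Sample correlations from bivariate normal data with correlation true_r."""
    cov = [[1.0, true_r], [true_r, 1.0]]
    rs = np.empty(reps)
    for i in range(reps):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        rs[i] = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    return rs

# Even if the true correlation really is 0.30, replications with n = 50 scatter widely:
rs = replication_rs(true_r=0.30, n=50)
print(np.quantile(rs, [0.025, 0.975]))  # roughly 0.02 to 0.53
```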

Finally, a key insight from McShane & Gal is nicely illustrated in their Table 1, which shows how the presence or absence of statistical significance can influence the interpretation of simple numbers.

This one requires a bit more explanation, so: Participants were given a summary of a (fake) cancer study in which people in Group A lived, on average, 8.2 months post-diagnosis and people in Group B lived, on average, 7.5 months. Along with this, a p value was reported, either 0.01 or 0.27. Participants were asked to determine (using much less obvious descriptions) if 8.2 > 7.5 (Option A), if 7.5 > 8.2 (Option B), if 8.2 = 7.5 (Option C), or if the relative magnitudes of 8.2 and 7.5 were impossible to determine (Option D). As their Table 1 shows, the reported p value had a very large effect on which option people chose.

Okay, so how do these fit together? And how do they relate to estimation vs testing?

Francis makes a compelling case against using statistical significance as a publication filter. McShane & Gal make a compelling case that dichotomization of evidence via statistical significance leads people to make absurd judgments about summary statistics. Stanley & Spence make a compelling case that it is very easy to fail to appreciate the importance of variation.

Stanley & Spence’s larger argument is that researchers should adopt a meta-analytic mind-set, in which studies, and statistics within studies, are viewed as data points in possible meta-analyses. Individual studies provide limited information, and they are never the last word on a topic. In order to ease synthesis of information across studies, statistics should be reported in detail. Focusing too much on statistical significance produces overly optimistic sets of results (e.g., Francis’ inflated effect sizes) and absurd blindspots (e.g., McShane & Gal’s survey results), and it gives the analyst more certainty than is warranted (as Andrew Gelman has written about many times, e.g., here).

Testing is concerned with statistical significance (or Bayesian analogs). The meta-analytic mind-set is concerned with estimation. And since estimation often subsumes testing (e.g., CIs and null hypothesis tests), a research consumer who really feels the need for a dichotomous conclusion can often glean just that from non-dichotomous reports.

Anti-trolley libertarianism?

This argument against the trolley problem (via) is amusing and makes some interesting points. I don’t totally buy the case that the trolley problem is useless (much less actively harmful), since I think that there are probably some important moral issues related to the action vs inaction distinction that the problem brings up, and that these are probably important for some of the society-level policy-related issues that the authors would prefer we all focus on.

The most amusing bit is a link to a comic making fun of the absurd variations on the trolley problem. Here’s one of the more interesting parts:

By thinking seriously about the trolley problem, i.e. considering what the scenario being described actually involves, we can see why it’s so limited as a moral thought experiment. It’s not just that, as the additional conditions grow, there are not any obvious right answers. It’s that every single answer is horrific, and wild examples like this take us so far afield from ordinary moral choices that they’re close to nonsensical.

I’m not completely convinced by the stretch that follows this, in which the authors argue that pondering these kinds of questions makes us more callous. Maybe it does, maybe it doesn’t. But I do think it’s worth pointing out that many, perhaps most, moral questions involve, on the one hand, uncertainty, and, on the other, both costs and benefits. Uncertainty is relevant because of risk aversion. Costs and benefits are relevant because of asymmetries in approach-avoidance conflicts (e.g., loss aversion). Moral questions involving choices of certain awfulness are inherently limited.

There are some other interesting bits, but the thing that really stood out to me was this, which reads like an argument for pretty hard core libertarianism (bolded emphasis mine):

The “who should have power over lives” question is often completely left out of philosophy lessons, which simply grant you the ability to take others’ lives and then instruct you to weigh them in accordance with your instincts as to who should live or die…. But what about situations where people are making high-level life-or-death decisions from a distance, and thus have the leisure to weigh the value of certain lives against the value of certain other lives? Perhaps the closest real-life parallels to the trolley problem are war-rooms, and areas of policy-making where “cost-benefit” calculuses are performed on lives. But in those situations, what we should often really be asking is “why does that person have that amount of power over others, and should they ever?” (answer: almost certainly not), rather than “given that X is in charge of all human life, whom should X choose to spare?” One of the writers of this article vividly recalls a creepy thought experiment they had to do at a law school orientation, based on the hypothetical that a fatal epidemic was ravaging the human population. The students in the room were required to choose three fictional people out of a possible ten to receive a newly-developed vaccine…. The groups were given biographies of the ten patients: some of them had unusual talents, some of them had dependents, some of them were children, and so on. Unsurprisingly, the exercise immediately descended into eugenics territory, as the participants, feeling that they had to make some kind of argument, began debating the worthiness of each patient and weighing their respective social utilities against each other. (It only occurred to one of the groups to simply draw lots, which would clearly have been the only remotely fair course of action in real life.) 
This is a pretty good demonstration of why no individual person, or small group of elites, should actually have decision-making authority in extreme situations like this: all examinations of who “deserves” to live rapidly become unsettling, as the decision-maker’s subjective judgments about the value of other people’s lives are given a false veneer of legitimacy through a dispassionate listing of supposedly-objective “criteria.”

Later in the essay, they ask:

Is being rich in a time of poverty justifiable? … Does capitalism unfairly exploit workers?

To which they answer “No” and “Yes,” so it’s clear that they’re not really making a case for libertarianism. Plus, the “about” page for Current Affairs has this unattributed quote placed fairly prominently:

“The Wall Street Journal of surrealistic left-wing policy journals.”

Then again, just below this, it has two more unattributed quotes:

“If Christopher Hitchens and Willy Wonka had edited a magazine together, it might have resembled Current Affairs.”

“The only sensible anarchist thinking coming out of contemporary print media.”

So maybe this apparent inconsistency is just wacky anarchism? Anyway, the whole essay is worth a read.

Off-Site Renewal, Episode 1

In addition to rebuilding this site, I am working on building my github presence. To that end, my github page is here. As noted on that page, it will mostly (exclusively?) be a place to organize Jupyter notebooks, like the one on odds ratios that’s already posted. I will be creating some repositories for research projects soon, too.

Malware & Restructuring

My old site (blog/content management system) was infected with malware in the recent past. I started the process of rebuilding the site, but, as you can see if you click on anything other than the CV link, I haven’t gotten very far.

I will be rebuilding and reorganizing shortly.