A frustrating discussion

I’m not totally sure what my opinion of Data Colada is. It’s an odd mix of thought-provoking and frustrating. The most recent example is yesterday’s post (by Uri Simonsohn) on the prejudice of (one particular type of) Bayesian t-test. Directly related, and similarly frustrating, is the response from one of the people (Jeff Rouder) who has been arguing for this particular Bayesian test.

(Disclaimer that’s less a disclaimer than a mildly amusing anecdote: Jeff Rouder and I once went to get a cup of coffee while were at a Math Psych meeting, and as we walked to the coffee shop, I mentioned a paper that I found interesting, incorrectly stating that it was a paper that he had (co-)authored, which he quickly pointed out was not the case. I can’t imagine he remembers this, but I felt like a knucklehead, and that moment of awkwardness has stuck with me. I have not yet had a chance to say anything stupid to Uri Simonsohn in person.)

There are quite a few issues one could dig into here, as I suppose would be obvious to anyone who’s been following the frequentist-v-Bayesian argument. In order to keep things orderly, I’ll focus on one small corner of the larger debate that’s playing an interesting role in this particular discussion.

Specifically, I’ll focus on the role of inferential errors in statistical reasoning. Simonsohn wants to know how often Rouder’s Bayesian t-test produces the incorrect result, so he ran some simulations. In the simulations, the true effect size is known, which allows us to evaluate the testing procedure in a way that we can’t when we don’t know the truth ahead of time. Simonsohn posted this figure to illustrate the results of his simulations:

The take-home point is that the Bayesian t-test in question erroneously supports the null hypothesis ~25% of the time when the effect size is “small.” It is perhaps worth noting in passing that the fact that power is 50% in these simulations means that the standard null hypothesis tests that Simonsohn favors are, by definition, wrong a full 50% of the time. Depending on what you do with the Bayesian “inconclusive” region, this is up to twice the error rate of the Bayesian test Simonsohn is arguing against.

Anyway, Rouder’s take on this is as follows:

Uri’s argument assumes that observed small effects reflect true small effects, as shown in his Demo 1 [the simulations illustrated above – NHS]. The Bayes factor is designed to answer a different question: What is the best model conditional on data, rather than how do statistics behave in the long run conditional on unknown truths?

Now, Rouder is right that Simonsohn seems to assume that observed (small) effects reflect true (small) effects, which is problematic. This isn’t obvious from the simulations, but Simonsohn’s second case makes it clear. Briefly, the second case is an(other) absurd Facebook study that found, with a sample size of ~7,000,000 (~6,000,000 + ~600,000), that 6 million people who saw pictures of friends voting (on Facebook) were 0.39% more likely to vote than 600k people that didn’t see pictures of friends voting (on Facebook). The result was statistically significant, but I don’t know that I’ve ever seen a better illustration of the difference between statistical and practical significance (I mean, aside from the other ridiculous Facebook study linked above). Who could possibly care that a 0.39% increase in voting was statistically significant? I’m having a hard time figuring out why anyone – authors, editors, or reviewers – would think this result is worth publishing. It’s just spectacularly meaningless. But I digress.


Simulations like those run by Simonsohn are useful, and I can’t figure out why so many Bayesians don’t like them. The point of simulations isn’t to understand how “statistics behave in the long run conditional on unknown truths,” it’s to leverage known truths to provide information about how much we can trust observed results when we don’t know the truth (i.e., in every non-simulated case).

That is, simulations are useful because they tell us about the probability that an inference about the state of the world is false given an observation, not because they tell us about the probability of sets of observations given a particular state of the world. I would have thought that Bayesians would recognize and appreciate this kind of distinction, given how similar it is to the distinction between the likelihood and the posterior in Bayes’ rule.

One argument for Bayesian reasoning is that frequentist reasoning can’t coherently assign probabilities to one-off events. I’ve never found this particularly compelling, in large part because I’m interested in statistics as applied to scientific research, which is to say that I’m interested in statistics as applied to cases in which we are explicitly concerned with prediction, replication, and anything other than one-off events. The point being that the frequentist properties of our statistical models are important. Or, put slightly differently, our models have frequentist properties, whether we’re Bayesian or not, and whether the models are Bayesian or not.

So, Simonsohn was right to run simulations to test Rouder’s Bayesian t-test, even if he was wrong to then act as if a statistically significant difference of 0.39% is any more meaningful than no effect at all, and even if he glosses over the fact that null hypothesis testing doesn’t really do any better than the Bayesian method he’s critiquing. And Rouder is right that it’s good to know what the best model is given a set of observations, even if he’s (also) wrong about what simulations tell us.

This entry was posted in SCIENCE!, statistical modeling. Bookmark the permalink.

Comments are closed.