Cosma Shalizi has a new post that takes the form of a (failed, as he describes it) dialogue expressing his frustration with a paper he was reviewing. If you are interested in statistical theory and how the statistics we use in research relate to the world, you should read the whole thing.

I’ve read it twice now, and may well go back to it again at some point. It’s thought provoking, for me in no small part because I like to use Bayesian model fitting software (primarily PyMC(3) these days), but I don’t think of myself as “a Bayesian,” by which I mean that I’m not convinced by the arguments I’ve read for Bayesian statistics being optimal, rational, or generally inherently better than frequentist statistics. I am a big fan of Bayesian *estimation*, for reasons I may go into another time, but I’m ambivalent about (much of) the rest of Bayesianism.

Which is not to say that I *am* convinced by arguments for any particular aspect of frequentist statistics, either. To be frank, for some time now, I’ve been in a fairly uncertain state with respect to how I think statistical models should, and do, relate to the world. Perhaps it’s a testament to my statistical training that I am reasonably comfortable with this uncertainty. But I’m not so comfortable with it that I want it to continue indefinitely.

So, my motivation for writing this post is to (at least partially) work through some of my thoughts on a small corner of this rather large topic. Specifically, I want to think through what properties of confidence and/or credible intervals are important and which are not, and how this relates to interpretation of reported intervals.

(I know that the more general notion is confidence/credible *set*, but everything I say here should apply to both, so I’ll stick with “interval” out of habit.)

Early in my time as a PhD student at IU, I took John Kruschke’s Intro Stats class. This was well before he started writing his book, so it was standard frequentist fare (though I will stress that, whatever one’s opinion on the philosophical foundations or everyday utility of the content of such a course, Kruschke is an excellent teacher).

I learned a lot in that class, and one of the most important things I learned was what I now think of as the only reasonable interpretation of a confidence interval. Or maybe I should say that it’s the best interpretation. In any case, it is this: a confidence interval gives you the range of values of a parameter that you cannot reject.

If I’m remembering correctly, this interpretation comes from David Cox, who wrote, in *Principles of Statistical Inference* (p. 40): “Essentially confidence intervals, or more generally confidence sets, can be produced by testing consistency with every possible value in ψ and taking all those values not ‘rejected’ at level c, say, to produce a 1 − c level interval or region.”
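To make the test-inversion idea concrete, here is a minimal sketch for a single binomial proportion. It is an illustration under assumptions of my own, not anything taken from Cox's book: the function names, the grid of candidate values, and the choice of an exact two-sided binomial test (summing the probabilities of all outcomes no more likely than the observed one) are all just one convenient way to do it.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(k_obs, n, p0):
    """Exact binomial test of H0: p = p0.

    Sums the probabilities of every outcome that is no more likely
    under H0 than the observed count (one common convention; others exist).
    """
    probs = [binom_pmf(k, n, p0) for k in range(n + 1)]
    p_obs = probs[k_obs]
    # Small relative tolerance guards against floating-point ties.
    return sum(pr for pr in probs if pr <= p_obs * (1 + 1e-9))


def confidence_set(k_obs, n, c=0.05, grid_size=1001):
    """All candidate values on a grid over [0, 1] that are NOT rejected at level c."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return [p0 for p0 in grid if two_sided_p_value(k_obs, n, p0) > c]


if __name__ == "__main__":
    kept = confidence_set(k_obs=7, n=20)
    print(f"not rejected at c = 0.05: roughly {min(kept):.3f} to {max(kept):.3f}")
```

The returned list is literally “the range of values of a parameter that you cannot reject,” up to the resolution of the grid. With this particular test the retained values are not guaranteed to be contiguous, which is one reason “set” is the more general notion.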

In Shalizi’s dialogue, A argues that the coverage properties of an interval over repetitions of an experiment are important. That is, what makes confidence intervals worth estimating is the fact that, if the underlying reality stays the same, the interval will contain the true value of the parameter in a proportion 1 − c of repetitions.
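Coverage in this sense is easy to check by simulation, where (unlike in a real experiment) we get to fix the “true” value ourselves. The sketch below is only an illustration: `p_true`, the sample size, the number of repetitions, and the use of the Wilson score interval are all my own choices, and any interval procedure could be swapped in.

```python
import random
from statistics import NormalDist


def wilson_interval(k, n, c=0.05):
    """Wilson score interval for a binomial proportion at nominal level 1 - c."""
    z = NormalDist().inv_cdf(1 - c / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * (p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) ** 0.5 / denom
    return center - half_width, center + half_width


def simulate_coverage(p_true, n, reps=10_000, c=0.05, seed=0):
    """Fraction of simulated repetitions whose interval contains p_true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # One "repetition of the experiment": n independent tosses.
        k = sum(rng.random() < p_true for _ in range(n))
        lower, upper = wilson_interval(k, n, c)
        hits += lower <= p_true <= upper
    return hits / reps


if __name__ == "__main__":
    coverage = simulate_coverage(p_true=0.3, n=50)
    print(f"proportion of intervals containing p_true: {coverage:.3f}")
```

Note that the simulation reports a property of the procedure across repetitions. Any single interval it generates either contains `p_true` or it doesn't.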

But the fact that confidence intervals have certain coverage properties does not provide a reason for reporting confidence intervals in any single, particular case. If I collect some data and estimate a confidence interval for some parameter based on that data, the long-run probability that the procedure I used will produce intervals that contain the true value of the parameter says *absolutely nothing* about whether the true value of the parameter is in the single interval I have my hands on right now.

Obviously, it’s good to understand the properties of the (statistical) procedures we use. But repetitions (i.e., direct, rather than conceptual, replications) of experiments are vanishingly rare in behavioral fields (e.g., communication sciences and disorders, where I am; second language acquisition, linguistics, and psychology, where I have, to varying extents, been in the past), so it’s not clear how relevant this kind of coverage is in practice.

More importantly, it’s not clear to me what “the true value of a parameter” means. The problem with this notion is easiest to illustrate with the old stand-by example of a random event, the coin toss.

Suppose we want to estimate the probability of “heads” for two coins. We could toss each coin *N* times, count the occurrences of “heads” for each coin, and then use our preferred frequentist or Bayesian statistical tools for estimating the “true” probability of “heads” for each, using whatever point and/or interval estimates we like to draw whatever inferences are relevant to our research question(s). Or we could remove essentially all of the randomness, per Diaconis et al.’s approach to the analysis of coin tosses (pdf).

The point being that, when all we do is toss the coins *N* times and count the “heads,” we ignore the underlying causes that determine whether the coins land “heads” or “tails.” Or maybe it’s better to say that we partition the set of factors determining how the coins land into those factors we care about and those we don’t care about. Our probability model – frequentist, Bayesian, or otherwise – is a model of the factors we don’t care about.

In this simple, and somewhat goofy, example, the factors we care about are just the identity of the coins (Coin A and Coin B) or maybe the categories the coins come from (e.g., nickels and dimes), while the factors we don’t care about are the physical parameters that Diaconis et al. analyzed in showing that coin tosses aren’t really random at all.
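Here is a toy sketch of that partition. To be clear, this is not the model Diaconis et al. analyzed. It is a deliberately crude stand-in in which the outcome is a deterministic function of two made-up physical quantities (a spin rate and a flight time, with ranges I invented), and the “probability of heads” only appears once we stop tracking those quantities and treat them as noise.

```python
import random


def lands_heads(spin_rate, flight_time):
    """Deterministic outcome of one toss in the toy model.

    The coin starts heads-up and shows heads again whenever it has
    completed an even number of half-turns before being caught.
    spin_rate is in revolutions per second, flight_time in seconds.
    """
    half_turns = int(2 * spin_rate * flight_time)
    return half_turns % 2 == 0


def heads_frequency(n_tosses, seed=0):
    """Relative frequency of heads when the physical factors are ignored.

    The ranges below are hypothetical; they stand in for "factors we
    don't care about" and are sampled rather than measured.
    """
    rng = random.Random(seed)
    heads = 0
    for _ in range(n_tosses):
        spin_rate = rng.uniform(30.0, 50.0)
        flight_time = rng.uniform(0.4, 0.6)
        heads += lands_heads(spin_rate, flight_time)
    return heads / n_tosses


if __name__ == "__main__":
    # Same physical inputs, same outcome, every time.
    print(lands_heads(40.0, 0.5), lands_heads(40.0, 0.5))
    # Ignore the inputs and a "probability" shows up.
    print(heads_frequency(10_000))
```

Nothing in `lands_heads` is random. The frequency that `heads_frequency` reports depends entirely on how the ignored factors are distributed, which is a modeling decision rather than a property of the coin.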

I don’t see how the notion of “true” can reasonably be applied to “value of the parameter” here. We might define “the true value of the parameter” as the value we would observe if we could partition all of the (deterministic) factors in the relevant way for all relevant coins and estimate the probabilities with very large values of *N*.

But the actual underlying process would still be deterministic. Perhaps this notion of “truth” here reflects a technical, atypical use of a common word (see, e.g., “significance” for evidence of such usage in statistics), but defining “truth” with respect to a set of decisions about which factors to ignore and which not to, how to model the ignored factors, and how to collect relevant data seems problematic to me. Whatever “truth” is, it doesn’t seem *a priori* reasonable for it to be defined in such an instrumental way.

The same logic applies to more complicated, more realistic cases, very likely exaggerated by the fact that we can’t fully understand, or even catalog, all of the factors influencing the data we observe. I’m thinking here of the kinds of behavioral experiments I do and that are at the heart of the “replication crisis” in psychology.

So, where does this leave us? My intuition is that it only really makes sense to interpret c{onfidence, redible} intervals with respect to whatever model we’re using, and treat them as sets of parameter values that are more or less consistent with whatever point estimate we’re using. Ideally, this gives us a measure of the precision of our estimate (or of our estimation procedure).

Ultimately, I think it’s best to give all of this the kind of instrumental interpretation described above (as long as we leave “truth” out of it). I like Bayesian estimation because it is flexible, allowing me to build custom models as the need arises, and I tend to think of priors in terms of regularization rather than rationality or subjective beliefs. But I’ll readily own up to the fact that my take on all this is, at least for now, far too hand-wavy to do much philosophically heavy lifting.