A post on Neuroskeptic yesterday reminded me of one of my biggest pet peeves in reporting the results of statistical tests. And just to be clear, the point here is not to pick on Neuroskeptic (whose blog I enjoy quite a bit), as this pet peeve is very widespread.

The issue is this: reporting statistical test results with small *p*-values as ‘highly (statistically) significant’. There is, of course, the conceptually closely related (and equally annoying) practice of calling non(-statistically)-significant results with *p*-values between 0.05 and 0.10 ‘marginally (statistically) significant’. Note that, for the purposes of this rant, I’m taking it as given that we’re operating in the world of frequentist statistics and null-hypothesis significance testing.

Okay, so I’m a linguist, which is to say that I understand the distinction between descriptive and prescriptive rules. If I were treating this topic as a linguist interested in language use in the reporting of statistical results, I could (maybe) find something interesting to say about when, how, and why researchers interpret statistical test results as ordinal or kinda-sorta continuous despite the fact that such results are dichotomous.

And, really, it’s probably not even that simple. I don’t recall ever reading that a test result was ‘just barely significant’ or ‘very non-significant’. Which is to say that my impression is that it’s much more common for reports of ‘highly’ and ‘marginally’ significant tests to be oriented toward confirming statistical significance of some sort. This result isn’t just significant, it’s super-duper, extra significant. That result isn’t actually significant, but it’s soooooo close, and it tried soooooo hard, let’s give it a cookie.

But I’m not treating this topic as a linguist interested in the language of statistical writing. I’m coming at it as a reader of social science research and a(n occasional) user and reporter of statistical tests. In science writing, prescriptive linguistic rules promoting precision and accuracy are appropriate. A test statistic either exceeds the (predetermined) criterion or it doesn’t, full stop. At best, saying that a test result is ‘highly significant’ or ‘marginally significant’ or that it ‘approached significance’ provides exactly no information of value above and beyond simply saying that the result was significant or not.

Plot and describe the data, the fitted model parameters, and the model’s predictions. Let the reader know how the data was collected and analyzed. Draw reasonable inferences. And if you’re going to use null hypothesis tests, say what the critical *p*-value is and then say whether or not each of your tests had a *p*-value above or below that critical value. Trust (or hope, maybe) that the reader can interpret the test results appropriately.

There are plenty of problems with p-values (see here, and here, and here, and here, for a start). Even so, they can be useful. But there’s no reason to imbue them with properties they simply do not have.

The P value as Fisher described it is a measure of discrepancy. So it can and must be interpreted on a continuous scale.

Hypothesis testing is based on a predetermined alpha value and is indeed binary, ensuring that the test will be wrong a fraction alpha of the time, given that the null hypothesis is true.
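A minimal simulation makes this point concrete (the one-sided z-test setup and the numbers here are my own illustration, not the commenter’s): when the null hypothesis is true, a binary decision rule with a predetermined alpha rejects a true null about alpha of the time.

```python
import random
from statistics import NormalDist

random.seed(42)
alpha = 0.05
# One-sided z-test: reject when the statistic exceeds the 95th
# percentile of the null model (a standard normal).
criterion = NormalDist(0, 1).inv_cdf(1 - alpha)

# Draw test statistics from the null model itself, so every
# rejection is a false alarm, and count how often we reject.
n_sims = 20_000
rejections = sum(random.gauss(0, 1) > criterion for _ in range(n_sims))
print(rejections / n_sims)  # close to alpha = 0.05
```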

The real problem is the combination of these two methods into one.

It can and must be interpreted on a continuous scale of what, though? Of degrees of reasonably extreme unlikelihood that said result was obtained by chance assuming the null hypothesis is true. Which doesn’t really mean much in terms of the decision to reject or accept the null, even if technically correct. So I take Noah’s point that the qualifying adjectives are silly, since they don’t correspond to this largely meaningless scale in any informative way. If you’ve decided to reject the null at 0.05, and you get a result less than 0.0001, it doesn’t strengthen your commitment to the alternative – it just affirms your prior suspicion.

“was obtained by chance” should of course be “is likely to occur”

Thanks for the mention!

I appreciate that when you do a statistical test, all you can strictly say about that test is that it’s significant or not. But the fact that the p value is 0.0001 rather than 0.049 surely tells us *something*. It tells us, for instance, that it would have passed a more stringent test, and some people might argue that a more stringent test would have been more appropriate… so doesn’t it provide us with useful information?

Sorry I haven’t joined in the discussion until now. I’ve been wanting to, and I’ve been thinking about what I want to say, but I’ve been very busy for the last few days, and so haven’t had a chance to write anything more.

One thought that occurs to me is that Gelman (the last ‘here’ link above) argues that even large differences between p-values frequently *aren’t* (statistically) meaningful. So, even if a small p-value outcome would have passed a more stringent test than a larger p-value outcome, this may or may not be as informative as it seems to be.

The biggest problem with p-values is, I think, linked to the goofiest parts of frequentist reasoning. A p-value tells you the probability of observing a test statistic equal to *or more extreme* than the observed statistic, under the assumption of (nearly) infinite independent replications of the same procedure. But why should we care about more extreme values than the one we’ve observed?

The piece of null-hypothesis testing that I think is reasonable is the bit where a criterial value of a test statistic is determined and compared to an observed statistic. The criterial value is chosen so that we can decide whether or not the observed statistic is sufficiently discrepant (to use yop/Fisher’s language) from the null model prediction to reject that model.

Now, the usual procedure is (allegedly) to decide that we only want to reject true nulls, say, 5% of the time. Once we have this number and a null hypothesis (say, a normal model with a mean of zero), our criterial statistic value is determined. So far, so good (more or less – there are also arguments about, e.g., how silly it is to worry about whether a mean is *exactly* equal to zero or not). It makes sense to worry about statistic values equal to or more extreme than a criterial value based on a model we want to evaluate and a satisfactory ‘false alarm’ rate.

There are two redundant ways to evaluate an observed statistic in this framework. We can compare our observed statistic directly to the criterial value. Or we can see if the p-value for our observed statistic is smaller than the p-value of the statistic at the criterial value. Both comparisons just tell us if we’ve passed our test or not.
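That redundancy is easy to check concretely. Here’s a minimal sketch; the one-sided z-test, alpha = 0.05, and the observed statistic are my illustrative assumptions:

```python
from statistics import NormalDist

null = NormalDist(mu=0, sigma=1)  # null model: standard normal
alpha = 0.05
criterion = null.inv_cdf(1 - alpha)  # criterial value, about 1.645

observed = 2.1  # an illustrative observed z statistic

# Way 1: compare the observed statistic directly to the criterial value.
reject_by_statistic = observed > criterion

# Way 2: compare the observed p-value to the p-value at the criterion
# (which is alpha, by construction).
p_value = 1 - null.cdf(observed)
reject_by_p = p_value < alpha

# The two comparisons always agree: both just say whether we passed
# the test or not.
assert reject_by_statistic == reject_by_p
```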

Whether or not we could have passed a more stringent test just gets us to a discussion of the general arbitrariness of p-values, I think. It’s easy to imagine that one person only trusts the tests that would have passed a test with a 1% alpha level, while another is fine with 5%, and a third requires 0.1%. It’s probably a good idea for people to state what the p-values are for just this reason. But having a bunch of different standards doesn’t make a difference between two p-values meaningful, I don’t think.
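To illustrate the arbitrariness point (the p-values and reader standards below are hypothetical): the same p-value passes or fails depending entirely on where each reader happens to draw the line.

```python
# Three hypothetical results and three hypothetical readers' standards.
p_values = {"result A": 0.0001, "result B": 0.02, "result C": 0.07}
alphas = {"requires 0.1%": 0.001, "requires 1%": 0.01, "requires 5%": 0.05}

# For each result, record each reader's binary verdict.
verdicts = {
    label: {reader: p < a for reader, a in alphas.items()}
    for label, p in p_values.items()
}
# result A passes every standard; result B passes only the 5% standard;
# result C passes none. The ordering of the p-values is real, but whether
# the difference between any two of them "matters" depends on the reader.
```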

Anyway, I hope it’s at least somewhat clear what I’m getting at. Thanks for reading and commenting, everyone.