The first link in this week’s “This week in stats” post (by Matt Asher) leads to a fairly silly rant (by a Wesley) about p-values. I feel like it deserves a quick fisking (only a partial one, because I don’t disagree with everything). I also want to reiterate Mr. Asher’s point that, whatever problems p-values have, no solutions are on offer here (though I know at least a dozen or so people who would argue against his claim that no one has come up with a satisfactory substitute for p-values). Anyway:
Wesley: P-values … can also be used as a data reduction tool but ultimately it reduces the world into a binary system: yes/no, accept/reject.
Noah: Given that p-values are but one part of a statistical analysis in the frequentist hypothesis testing tradition, I have a hard time seeing why this is so problematic. A calculated test statistic either exceeds a criterion or it doesn’t. This doesn’t tell the whole story of a data set, but it’s not meant to.
W: Below is a simple graph that shows how p-values don’t tell the whole story. Sometimes, data is reduced so much that solid decisions are difficult to make. The graph on the left shows a situation where there are identical p-values but very different effects.
N: I don’t understand what Wesley means when he links data reduction and decision-making difficulty, so I’ll leave that one alone. I’ll also not go into depth about why I think these graphs kind of stink (to mention maybe the worst thing about the graphs: they’re mostly just white space, with the actual numbers of interest huddled up against the [unnecessary] box outline).
Anyway, it’s not at all clear how the two “effects” in the left panel could be producing the same p-value. I can’t reproduce the plot, because the code from the post isn’t working when I try to run it: the variable effect.match is empty, since the simulation with the minimum CI difference isn’t in the set of p-values that match (i.e., logical indexing fails to produce a usable index). Contra my intuition when first seeing the graph, it is not illustrating a paired t-test but, rather, two single-sample t-tests. I gather that each red dot is a mean, each vertical line is the associated confidence interval, and the means are being compared to zero. Given that one (CI) line covers zero and the other does not, the p-values shouldn’t be the same: assuming both intervals are drawn at the same confidence level, an interval that excludes zero corresponds to a two-sided p-value below the matching alpha, and an interval that covers zero corresponds to a p-value above it.
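To make the link between intervals and p-values concrete, here is a minimal sketch in Python. It is not Wesley's code (which I couldn't run), and it swaps the t-test for a normal-approximation z-test so that it needs only the standard library; the same correspondence holds for a t-test when the interval uses the matching t critical value. The function name `z_test_summary` and the means and standard errors are made up for illustration.

```python
from statistics import NormalDist

def z_test_summary(mean, se, level=0.95):
    """Two-sided z-test of `mean` against zero, plus the matching CI.

    `mean` and `se` (standard error) are hypothetical summary numbers,
    not values taken from the original post.
    """
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    z = mean / se              # test statistic
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    crit = std_normal.inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    ci = (mean - crit * se, mean + crit * se)
    return p_value, ci

# Same mean, two different standard errors (illustrative values only).
for mean, se in [(1.0, 0.4), (1.0, 0.8)]:
    p_value, (low, high) = z_test_summary(mean, se)
    covers_zero = low <= 0 <= high
    print(f"mean={mean}, se={se}: p={p_value:.4f}, "
          f"CI=({low:.3f}, {high:.3f}), covers zero: {covers_zero}")
```

With these made-up numbers, the interval that excludes zero comes with a p-value below 0.05 and the interval that covers zero comes with one above it, so the two p-values cannot be equal.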
W: The graph on the right shows where the p-values are very different, and one is quite low, but the effects are the same.
N: I disagree that the effects are the same. Sure, the means are the same (by design), but the data illustrated on the right is much more variable than the data illustrated on the left.
W: P-values and confidence intervals have quite a bit in common and when interpreted incorrectly can be misleading.
N: I agree, but this is a pretty anodyne statement. Now back to the fisking.
W: Simply put a p-value is the probability of the observed data, or more extreme data, given the null hypothesis is true.
N: Close, but nope. A p-value is the probability of an observed or more extreme test statistic, not the data. It’s an important distinction, and it’s related to the conflation of “effects” with “means” and the different p-values for identical means with different variability around the means in the figure above.
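A small sketch of why the distinction matters, again in Python with only the standard library. The two samples below are invented for illustration (they are not from Wesley's simulation), and `one_sample_t` is just a helper name I made up. Both samples have the same mean, but the one-sample t statistic also depends on the spread and the sample size, and it is the t statistic whose null distribution (Student's t with n - 1 degrees of freedom) yields the p-value.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0=0.0):
    """One-sample t statistic for H0: population mean equals mu0."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    return (mean(sample) - mu0) / standard_error

# Hypothetical data: identical means, very different variability.
tight = [0.8, 0.9, 1.0, 1.1, 1.2]
wide = [-1.0, 0.0, 1.0, 2.0, 3.0]

for label, sample in [("tight", tight), ("wide", wide)]:
    print(f"{label}: mean={mean(sample):.2f}, sd={stdev(sample):.3f}, "
          f"t={one_sample_t(sample):.2f}")
```

The means agree, but the t statistics differ by a factor of ten (roughly 14.1 versus 1.4), so the p-values differ as well. Identical means do not make for identical test statistics.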
So, anyway, none of this is meant to imply that p-values don’t have limitations. Of course they do. And understanding these limitations is worthwhile. But posts like Wesley’s don’t, in my opinion, do much to foster such understanding.