Closing meta-arguments

I don’t blog much, but something about the reaction to the first Bret Stephens NYT column got under my skin enough to write two posts about it (1,2). I woke up this morning thinking about why.

At first glance, this all seems like a very minor kerfuffle, given the broader political context. I’m thinking in particular of the fact that the president is willfully ignorant, thin-skinned, highly impulsive, and has little to no regard for the constitutional foundations of the United States. This is much more important than widespread overreaction to a NY Times column, right?

Well, sure. But yet another voice shouting into the void about Trump’s shortcomings is extremely unlikely to contribute anything useful to the world. Plus, it’s easy to be concerned about both the broader political context and the specifics of a relatively minor controversy.

Most importantly for my purposes, I think this relatively minor controversy is representative of the sorry state of political discourse in general, and the sorry state of political discourse in general is directly related to the fact that we elected Donald Trump president.

In Authority and American Usage, one of my favorite David Foster Wallace essays (pdf), he argues for the value of a Democratic Spirit. He writes (on p. 72 of the book the linked pdf is scanned from, and sorry in advance for the length of the quote, but hopefully it will be obvious why it’s so long):

A Democratic Spirit is one that combines rigor and humility, i.e., passionate conviction plus a sedulous respect for the convictions of others. As any American knows, this is a difficult spirit to cultivate and maintain, particularly when it comes to issues you feel strongly about. Equally tough is a DS’s criterion of 100 percent intellectual integrity – you have to be willing to look honestly at yourself and at your motives for believing what you believe, and to do it more or less continually.

This kind of stuff is advanced US citizenship. A true Democratic Spirit is up there with religious faith and emotional maturity and all those other top-of-the-Maslow-Pyramid-type qualities that people spend their whole lives working on. A Democratic Spirit’s constituent rigor and humility and self-honesty are, in fact, so hard to maintain on certain issues that it’s almost irresistibly tempting to fall in with some established dogmatic camp and to follow that camp’s line on the issue and to let your position harden within the camp and become inflexible and to believe that the other camps are either evil or insane and to spend all your time and energy trying to shout over them.

I submit, then, that it is indisputably easier to be Dogmatic than Democratic, especially about issues that are both vexed and highly charged.

This essay is about language and usage, not climate change, and it was originally published (in a much shorter form) as Tense Present in Harper’s in 2001 and in the linked form in the book Consider the Lobster in 2005. (Since I’m trained, in part, as a linguist, I’m fully aware that there are… let’s say problems with some of Wallace’s ideas about language. Nonetheless, I think there is a lot of value in the essay, but if you want a long rebuttal, here’s languagehat “demolishing” Wallace at length.)

Linguistic specifics aside, it’s difficult to imagine a better diagnosis of the ills of modern political discourse than the quote above. The response to the Bret Stephens column was unabashedly, sometimes gleefully, Dogmatic.

With me so far? Good. Let’s turn to peer review. (Sorry.)

When I read reviews of my papers, I put a lot of effort into maintaining a Democratic Spirit. As Wallace says, it’s not easy. Sometimes my initial reaction to a reviewer’s comments is intensely emotional. I can get remarkably worked up. I’ve developed strategies to prevent this from influencing my response to the reviewers, the main one being that I read the reviews, then I set them aside for a few days, and then I only begin responding and revising a paper after a more dispassionate re-read. It takes real effort to let my initial emotional response dissipate, to let the less important issues settle out and the more important issues clarify.

This allows me to approach the process of revising and resubmitting papers with the rigor, humility, and self-honesty that constitute a Democratic Spirit.

In the interest of rigor, humility, and (self-)honesty, though, I have to admit that I do this more for the sake of my own mental health than for the sake of Democratic Spirit. I stumbled onto this approach to responding to my peers’ reviews after I got really absurdly worked up about a particular reviewer’s responses to a recent paper of mine (possibly not 100% current arXiv pdf).

For any number of reasons, it’s typical to not sit down and revise a paper immediately when you get the reviews. I wrote that paper over a very short period of time, largely in response to two papers, the latter of which I reviewed (non-anonymously), only to find that my reviews were, as far as I can tell, completely ignored. The first author of those two papers reviewed the recent paper of mine, and I put aside pretty much everything else I was working on each time I got his (and the other) reviews back.

The utility of putting reviews aside for a while before responding became very clear to me with this paper. I would normally do this just out of necessity, because there’s always a large number of other things to deal with. But my heightened response to these particular reviews made it clear to me that I would benefit personally from a deliberate delay.

Okay, so, at a personal level, I have good reasons to act this way. But this experience also made it very clear to me that the personal benefit of calming down before responding improves the scientific quality of a paper. It allows me to maintain a high level of rigor, humility, and self-honesty. It allows me to approach the whole process in a Democratic Spirit.

The effects on the scientific quality of my work illustrate, indirectly at least, that there are interesting Kantian implications here. Maybe I’m way off base, but it seems to me that you could make a case for this general approach to responding to peer review as a categorical imperative. Would the modern scientific enterprise be sustainable if responding immediately and emotionally were a universal rule governing researchers’ behavior?

Okay, maybe modern science wouldn’t totally collapse, but it’s easy to see that a universal rule to approach the process in a Democratic Spirit would work out just fine.

How does this relate to the reaction to the Bret Stephens column? Scott Alexander recently made a pretty strong case that the Democratic Spirit has been eroding for a long time in the realm of political discourse. It seems inarguable that the rapid churn of the news cycle and the widespread use of social media make it even more difficult than it normally is to maintain a Democratic Spirit.

The response to the Stephens column is, I submit, symptomatic, and emblematic, of this general failure of the Democratic Spirit in the US (and probably elsewhere).

The Gizmodo piece I linked to in my last post is, by current standards, pretty even-handed and reasonable. But it is also distinctly lacking in rigor (see my last post for some illustrations of this). Similarly, the third segment in yesterday’s Culture Gabfest was one of the more reasonable, calm, and collected discussions I’ve heard about this controversy, yet it included fairly effusive praise for the Matthews piece that got me started writing about this. It also included a description of the first part of Stephens’ column as being all about scoring points with conservatives by illustrating how much “we all hate Hillary,” claims that the second part of the column insidiously employs post-modernist philosophy to argue that nothing is truly knowable (in climate science in particular), and it ended with calls for Stephens to be fired.

Trump got elected because millions of people were willing to overlook his numerous, serious flaws and vote for him. I didn’t vote for Trump, and I very much wish he hadn’t won, and I can’t claim to understand the myriad reasons people had for voting for him. Probably lots of Trump voters would never have been persuaded not to vote for him, but I imagine quite a few had to hold their nose in order to do so. And the kinds of failures of the Democratic Spirit exhibited in the reactions to Stephens’ column may well have provided a convenient excuse for some number of these voters to vote against smug liberalism rather than voting affirmatively for whatever they thought Trump represents. (There is no small irony in the fact that Stephens’ column was trying to make more or less this exact point with respect to climate science and environmental activism.)

Perhaps unfortunately, the only idea I have for how to combat this apparently global failure of the Democratic Spirit is to simply insist on maintaining it locally myself.

Posted in language, SCIENCE! | Comments Off on Closing meta-arguments

A probably reasonable reaction

I stand by my argument that calling Bret Stephens a “denier” is asinine.

I had never heard of Bret Stephens before reading the Slate piece at issue in that post. I am not defending Bret Stephens’ record in general. But the reaction to this particular column, very little of which invokes his prior record, continues to seem ridiculous. I can’t read anyone’s mind, so I don’t know why so many people seem to have so badly misread Stephens’ column, but I continue to be kind of astonished by the whole hullabaloo.

Let’s recap what I would have (naïvely) taken to be the obvious points of Stephens’ essay, as written by Stephens (note that this first quote has been changed, since he originally wrote that the temperature change was only in the northern hemisphere – see the correction at the end of the column):

  • “…while the modest (0.85 degrees Celsius, or about 1.5 degrees Fahrenheit) warming of the earth since 1880 is indisputable, as is the human influence on that warming, much else that passes as accepted fact is really a matter of probabilities. That’s especially true of the sophisticated but fallible models and simulations by which scientists attempt to peer into the climate future.”
  • “Claiming total certainty about the science traduces the spirit of science and creates openings for doubt whenever a climate claim proves wrong.”

To recap the recap, with respect to the science, I read Stephens as saying that anthropogenic climate change is true without a doubt, but that the magnitude and exact nature of the changes we should expect to see is less certain. I don’t see how you could read him as saying anything else.

With respect to the advocacy, I read Stephens as saying that too much certainty about the science can provide a handy excuse for skeptics to cast doubt on even the parts of climate science that we are most certain about. This seems to be a pretty unobjectionable point to me.

Gizmodo has a post summarizing the situation. It doesn’t help me see the response to Stephens as more reasonable (emphasis in original):

As several scientists contacted by Gizmodo explained, it is reasonable to be skeptical about the exact magnitude, timing and breadth of the impacts of climate change, and the appropriate societal response. In fact, few in the scientific community would claim certitude about the impacts, as Stephens suggests.

But the existential threat itself? That’s undeniable.

I don’t understand this at all (put aside the dog-whistley invocation of denial). The first bit seems entirely consistent with Stephens’ claim about the effects of climate change being probabilistic rather than utterly certain. But if it is reasonable to be uncertain about “the exact magnitude, timing, and breadth” of the effects of climate change, how is it undeniable that climate change is an “existential threat”?

I ask this question with total sincerity. These two statements are not consistent with one another. It cannot be simultaneously reasonable to be skeptical about how bad the effects of climate change will be and unreasonable to argue that the threat might be less than existential.

Here’s another bit from the Gizmodo post:

Richard Alley, a climate scientist at Penn State University, pointed out that in his area of research, glaciology and Antarctic ice sheet dynamics, the uncertainty boils down to whether the world will have to prepare for three feet of sea level rise over the coming century, or more than ten.

“The impacts of warming may be slightly better or worse than we expect. Or much worse,” he told Gizmodo. “Averaging over all possible futures, more uncertainty makes the costs much higher than if we were certain we would get the most-likely IPCC projections.”

Again, this is completely consistent with a claim that the effects of climate change are probabilistic, not certain. In fact, the quote that begins the second paragraph just is a claim that the effects of climate change are probabilistic, not certain. (Tangentially, as bad as 3-10 feet of sea level rise will be, especially for the large number of often very poor people who live in coastal regions around the world, it’s not an existential threat to the planet or humanity as a whole.)

One more quote from the Gizmodo piece, if only to be clear that I’m not just trying to cherry pick bits that support my argument:

“Reasonable people can disagree about the best way to avoid the dangers,” Oreskes told Gizmodo. “We can also disagree about exactly how bad things are going to get. But there is no substantive, reasonable, evidence-based argument that climate change is not a substantial danger. To suggest otherwise is to misrepresent the current state of knowledge.”

This is also clearly consistent with claims about probabilistic effects, and I would emphasize that claiming that the effects of climate change represent a “substantial danger” is different from claiming (with total certainty) that they represent an “existential threat.” Multiple feet of sea level rise is indeed a substantial danger, particularly to people who live near the ocean, and particularly if these people are poor.

But once again already one more time, the claim is manifestly, plainly not that climate change isn’t real, nor that it won’t have any negative effects. The claim is, again obviously if you read Stephens’ column, that there is uncertainty about the effects of climate change.


As I mentioned in my last post, I’m far from an expert in climate science. So, while I’m glad that Gizmodo reached out to some actual climate scientists to ask what they thought, I figured it would also be worth checking more direct sources of the science. As far as I can tell, these are also entirely consistent with the argument that the effects are probabilistic.

Here are explicitly probabilistic temperature projections. Here are explicitly probabilistic precipitation projections. Here are explicitly probabilistic projections about droughts. Look in my last post for links to explicitly probabilistic claims about hurricanes and climate change.

Granted, these aren’t links to the primary scientific literature on climate change. But I’m skeptical (additional dog-whistling unintentional) that if I had the time and energy to really dig into this, I would find non-probabilistic predictions about the effects of climate change.

This whole thing leaves me kind of despondent about the state of scientific and political discourse, even putting aside the civil discourse garbage disposal that is Twitter (link from the Gizmodo piece, since I try to just avoid Twitter anymore). It doesn’t help that it’s consistent with a much broader, rather long-term change in how people deal with political disagreements.

Posted in SCIENCE! | Comments Off on A probably reasonable reaction

An empty epithet

This essay on Slate is astonishing. Perhaps it’s naïve of me to be astonished, but there you go. I am astonished.

The very short version: Susan Matthews, Slate’s Science Editor, writes that the NY Times’ new columnist Bret Stephens’ first column is “nothing more than textbook denialism.” In the column, Stephens writes, among other things, that “[a]nyone who has read the 2014 report of the Intergovernmental Panel on Climate Change knows that, while the modest (0.85 degrees Celsius, or about 1.5 degrees Fahrenheit) warming of the Northern Hemisphere since 1880 is indisputable, as is the human influence on that warming, much else that passes as accepted fact is really a matter of probabilities.”

To recap, and to emphasize the absurdity: Stephens describes anthropogenic global warming as indisputable, and Matthews calls him a climate change denier.

I have long disliked the “denier” epithet. It is fine, of course, to apply the label to someone who actually denies something. But more or less from the get go, there has been “mission creep,” with “denier” doing extra duty as a way to discredit people who agree that climate change is real and caused by human activity but who disagree with one or another implication of these facts.

For example, Judith Curry, who agrees that climate change is real and caused by human activity, and whose CV (pdf) shows that she is a climate scientist with an impressive set of climate-science credentials, but who argues that there is substantial uncertainty with respect to the magnitude of the changes we should expect in the future. So she’s a climate science denier.

For another example, Matt Ridley, who agrees that climate change is real and caused by human activity, but who argues that future changes are likely to be slow and modest, describing himself as a lukewarmer. So he’s a climate change denier.

My point here is not to argue for or against Curry’s or Ridley’s positions (though given Curry’s credentials, I’m very much inclined to give her the benefit of the doubt). My point here is to illustrate how empty the “denier” epithet is.

But it’s still kind of amazing to see something as brazen as Matthews’ use of it in her Slate piece.

Again, just because I kind of can’t get my head wrapped around it, Stephens says that human-caused climate change is indisputable – his word, and it’s quoted by Matthews, who also writes that “Technically, he doesn’t get any facts wrong” – and this is “textbook denialism.”

As far as I can tell, Matthews has just badly misread Stephens’ column. She takes his (alleged) point to be that simply explaining the facts won’t change anyone’s mind, but then asserts that his (real) point is that climate change skeptics have a point. She writes that “He is telling readers that the experts’ wrongness during the 2016 election is a good justification for doubting other established facts,” and that “he’s telling his readers that their decision not to trust the entire institution of science that supports the theory of climate change might actually be reasonable.” She conflates the reality of climate change and the much less certain dangers of climate change. She writes that Stephens’ logic implies that “…the only way to be reasonable about this topic is to give in to those who are unreasonable about it,” and that Stephens is “spewing complete bullshit.”

It seems obvious to me that Stephens’ column is actually about how too much certainty can plant seeds of doubt when overly confident claims turn out to be wrong. It’s about how the motivations behind policy advocacy can justifiably be questioned. It’s about how tossing epithets like “climate change denier” around is counterproductive.

Do you want to know why I think that this is what his column is about? It’s because that’s what he says it’s about:

Claiming total certainty about the science traduces the spirit of science and creates openings for doubt whenever a climate claim proves wrong. Demanding abrupt and expensive changes in public policy raises fair questions about ideological intentions. Censoriously asserting one’s moral superiority and treating skeptics as imbeciles and deplorables wins few converts.

Just to put some icing on the cake, the very next sentence in Stephens’ column is (emphasis mine): “None of this is to deny climate change or the possible severity of its consequences.” Matthews somehow misses the important distinction between the justifiable certainty of anthropogenic climate change as a phenomenon, on the one hand, and the substantially less certain consequences of climate change, on the other.

To pick a topic that I happen to have read a bit about (to be clear, this is not to imply that I have read all that much – I am very much not an expert on this), some quick googling provides pretty strong evidence that the relationship between hurricanes and climate change involves a lot of uncertainty. See, for example, this from NOVA, or this from NOAA, or this from the Union of Concerned Scientists, or this Washington Post column, or this post from Judith Curry.

The piece from the Union of Concerned Scientists nicely illustrates the role of uncertainty with quotes like this (emphasis mine): “Recent research in this area suggests that hurricanes in the North Atlantic region have been intensifying over the past 40 years.”

Since I am a scientist myself, I feel pretty well-qualified to comment on the use of a word like “suggests” in this context. It means they aren’t certain. That’s how I use it when I write scientific papers, and that’s the only reasonable way to read it here.

To be as clear as possible, I’m not making any claims about the relationship between climate change and hurricane frequency or intensity. As I wrote above, I’m not an expert on this. I’m just trying to illustrate that, even if we accept that climate change is real and that human activity is influencing climate change, it is very, very easy to find evidence that there can be, and is, substantial uncertainty about important related issues, issues like the (possible) danger of (certain) climate change.

The fact that the Science Editor at Slate is unable or unwilling to recognize this is, to come full circle, kind of astonishing.

Posted in SCIENCE! | Comments Off on An empty epithet

New page, new notebook

I added a new page, and on that page, I added a notebook (in html format). I suppose I could have provided the same basic information in a blog post, but I like the Jupyter Notebook format, so I figured I would try something new.

If you don’t care to read the linked notebook, here’s the very short version: odds ratios provide unambiguous ordinal information and essentially no useful interval- or ratio-scale information. Because of this, I think they make lousy measures of effect size.
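One way to see the point, sketched here under my own assumptions rather than taken from the linked notebook: hold the odds ratio fixed and vary the baseline probability. The function name `apply_odds_ratio` and the baseline values are hypothetical, chosen only for illustration.

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Return the probability obtained by scaling the baseline odds."""
    odds = p_baseline / (1 - p_baseline)   # probability -> odds
    new_odds = odds * odds_ratio           # apply the odds ratio
    return new_odds / (1 + new_odds)       # odds -> probability


# Same odds ratio (2.0), three illustrative baselines.
for p0 in (0.01, 0.5, 0.9):
    p1 = apply_odds_ratio(p0, 2.0)
    print(f"baseline {p0:.2f} -> {p1:.3f} (difference {p1 - p0:.3f})")
```

The direction of the change is the same in every row (the ordinal information), but the size of the change in probability depends heavily on the baseline, which is roughly why a single odds ratio says so little about magnitude.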

Posted in uncategorized | Comments Off on New page, new notebook

Probability can be hard

On Probably Overthinking It, Allen Downey poses the following probability question:

Suppose I have a six-sided die that is red on 2 sides and blue on 4 sides, and another die that’s the other way around, red on 4 sides and blue on 2.

I choose a die at random and roll it, and I tell you it came up red. What is the probability that I rolled the second die (red on 4 sides)?  And if I do it again, what’s the probability that I get red again?

He provides links to a Jupyter notebook with his answer, but I’m going to write my answer here before I read the notebook.

The first part of the problem is a pretty straightforward application of Bayes’ rule/signal detection theory. Denote the four-blue-sides die B, the four-red-sides die R, and the observation of a red side red. Then we have:

    \begin{align*}\Pr(red|R) &= 2/3\\\Pr(red|B) &= 1/3\end{align*}

Assuming that by “choose a die at random” we mean that \Pr(R) = \Pr(B) = 1/2, then we can plug these into Bayes’ rule to get:

    \begin{align*} \Pr(R|red) &= \frac{\Pr(red|R)\Pr(R)}{\Pr(red|R)\Pr(R) + \Pr(red|B)\Pr(B)}\\ &= \frac{\frac{2}{3}\times\frac{1}{2}}{\frac{2}{3}\times\frac{1}{2} + \frac{1}{3}\times\frac{1}{2}}\\ &= \frac{\frac{1}{3}}{\frac{1}{3} + \frac{1}{6}}\\ &= \frac{2}{3} \end{align*}

Note that this implies that \Pr(B|red) = 1/3.

It’s maybe a bit trickier to figure out the next part, but, unless I’m mistaken, it’s not all that tricky:

    \begin{align*} \Pr(red|red) &= \Pr(red|R)\Pr(R|red) + \Pr(red|B)\Pr(B|red)\\ &= \frac{2}{3}\times\frac{2}{3} + \frac{1}{3}\times\frac{1}{3}\\ &= \frac{4}{9} + \frac{1}{9}\\ &= \frac{5}{9} \end{align*}

Of course, you’ll have to take my word for it that I haven’t yet looked at Downey’s notebook. I don’t know if my answer agrees with his or not, but I can’t think of any reason why the equations above are wrong.
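As a sanity check on the arithmetic above, here is a minimal sketch that redoes both calculations with exact fractions. The variable names are mine, and it assumes the same equal priors and the "roll the same die again" reading used in the equations.

```python
from fractions import Fraction

# Equal prior probability of picking either die.
prior = {"R": Fraction(1, 2), "B": Fraction(1, 2)}
# Probability of rolling red with each die.
p_red = {"R": Fraction(2, 3), "B": Fraction(1, 3)}

# Bayes' rule: posterior over dice given one observed red.
evidence = sum(p_red[d] * prior[d] for d in prior)
posterior = {d: p_red[d] * prior[d] / evidence for d in prior}

# Probability of red on a second roll of the same die.
p_red_again = sum(p_red[d] * posterior[d] for d in prior)

print(posterior["R"])  # 2/3
print(p_red_again)     # 5/9
```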

I’m going to publish this post to put it on the record, then read the notebook, and report back as needed.

Addendum: Well, I maybe misread the problem. In Downey’s “Scenario A”, by “do it again” he means “pick a die at random and roll it.” In his “Scenario B”, he means what I originally interpreted it to mean, namely “take the die that produced red the first time and roll it a second time.”

I’m happy to see that his simulations and my analytic solution agree when the word problem is interpreted the same way.
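The two readings of "do it again" can also be compared with a small simulation. This is my own sketch, not Downey's code; the function name `simulate`, the seed, and the number of trials are arbitrary choices.

```python
import random


def simulate(n_trials=200_000, same_die=True, seed=0):
    """Estimate Pr(second roll is red | first roll was red)."""
    rng = random.Random(seed)
    p_red = {"R": 4 / 6, "B": 2 / 6}
    first_red = 0
    both_red = 0
    for _ in range(n_trials):
        die = rng.choice("RB")            # pick a die at random
        if rng.random() < p_red[die]:     # first roll came up red
            first_red += 1
            if not same_die:
                die = rng.choice("RB")    # pick a die at random again
            if rng.random() < p_red[die]:
                both_red += 1
    return both_red / first_red


same = simulate(same_die=True)     # reroll the die that produced red
fresh = simulate(same_die=False)   # pick a die at random again, then roll
print(f"same die: {same:.3f}, fresh random die: {fresh:.3f}")
```

With the same die rolled twice, the estimate should land near 5/9. When a die is picked at random again, the first red carries no information about the second roll, so the estimate should land near 1/2.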

It seems to me that part of what makes probability hard, when it is hard, is translating ambiguous words into unambiguous mathematical statements.

Posted in mildly informative filler, probabiity | Comments Off on Probability can be hard

Reflections on Reasons for Reduced Rates of Replicability

Scott Alexander links to an interesting paper by Richard Kunert that aims to test two plausible explanations for the low replication rates in the Open Science Framework (OSF) project aimed at estimating the reproducibility of psychological science. The paper is short and open access, and I would argue that it’s worth reading, even if, as I’ll describe below, it’s flawed.

The logic of Kunert’s paper is as follows:

If psychological findings are highly context-dependent, and if variation in largely unknown contextual factors drives low rates of replication, then paper-internal (mostly conceptual) replications should correlate with independent replication success. The idea is that in conceptual replications (which constitute most paper-internal replications), contextual factors are intentionally varied, so effects that show up in repeated conceptual replications should be robust to the smaller amount of contextual variation found in independent, direct (i.e., not conceptual) replications. Kunert calls this the unknown moderator account. Crucially, Kunert argues that the unknown moderator account predicts that the studies with internal replications will be replicated more successfully than are studies without internal replications.

On the other hand, if low replication rates are driven by questionable research practices – optional stopping, the file drawer effect, post-hoc hypothesizing – then studies with internal replications will not be replicated more (or less) successfully than are studies without internal replications.

Kunert analyzes p values and effect sizes in the OSF replication studies. Here’s his Figure 1, illustrating that (a) there’s not much difference between studies with internal replications and those without, and (b) what difference there is points toward studies with internal replications having lower rates of replication (as measured by statistically significant findings in the independent replications, see left panel) and greater reductions in effect size (right panel):

[Kunert’s Figure 1: replication p values (left), effect size reduction (right)]

I think there’s a flaw in the reasoning behind the unknown moderator account, though. Specifically, I don’t think the unknown moderator account predicts a difference in replication rates between studies with and without internal replications.

The logic underlying the prediction is that if internal replications, then successful independent replications. But modus ponens does not license the conclusion that if no internal replications, then not successful independent replications. Studies without internal replications could lack internal replications for any number of reasons. In order for the unknown moderator account to predict a difference in independent replication rates between studies with and without internal replications, the absence of internal replications has to directly reflect less robust effects. Kunert doesn’t make a case for this, and it’s not clear to me what such a case would be or if it could be made.

So, the unknown moderator account is, I think, consistent with equal independent replication rates (on average) across studies with and without internal replications.

It’s possible, for example, that the unknown moderator account is true while all of the OSF studies probed (approximately) equally robust effects, with only a subset of them including internal replications. Or while some proportion of the findings from the OSF studies without internal replications are as robust as the findings from the studies with internal replications, while the rest are not.

The upshot is that the unknown moderator account predicts equal or greater independent replication rates for studies with internal replications than for those without. Given this, I think it’s noteworthy that Kunert reports lower replication rates and greater reductions in effect size for studies with internal replications than for those without. These effects aren’t statistically signif… er, don’t have sufficiently large Bayes factors or sufficiently shifted posterior distributions to license particularly strong conclusions, but they do both point in the direction that is inconsistent with even my modified version of the unknown moderator account.

I think the unknown moderator account probably also predicts greater variation in independent replication rates for studies without internal replications than in those with. I’m not sure if this prediction holds or not, but based on Kunert’s Figure 1, it doesn’t seem likely.

It’s also worth remembering that modus tollens implies that if not successful independent replications, then no internal replications. So the unknown moderator account also predicts that sufficiently low independent replication rates should correspond to studies without internal replications.

But it’s not at all clear what “sufficiently low” means here. The replication rates in the OSF project that Kunert analyzed seem pretty low to me (and the whole point of Kunert’s analysis is to test two explanations for such low rates), but I have no idea if they’re low enough to confirm this prediction.

And, of course, this logic is just as asymmetrical as the logic described above, since the presence of successful independent replications is consistent with the presence and the absence of internal replications. If even the low rates reported in the OSF project count as successful replication, then we can’t really infer much from approximately equal rates across these two categories of psychology study.
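The asymmetry in the conditional logic discussed above can be checked mechanically with a truth table. This is only a sketch of the propositional skeleton, with A standing for "the study has internal replications" and B for "the independent replication succeeds"; it says nothing about the actual data.

```python
from itertools import product


def implies(a, b):
    """Material conditional: a -> b."""
    return (not a) or b


# All truth assignments in which the premise A -> B holds.
rows = list(product([False, True], repeat=2))
premise_rows = [(a, b) for a, b in rows if implies(a, b)]

# Inverse: not A -> not B (no internal replications -> failed replication).
inverse_follows = all(implies(not a, not b) for a, b in premise_rows)

# Contrapositive: not B -> not A (failed replication -> no internal replications).
contrapositive_follows = all(implies(not b, not a) for a, b in premise_rows)

print(inverse_follows)         # False
print(contrapositive_follows)  # True
```

The inverse fails in the row where A is false and B is true, which is exactly the case of a study without internal replications that nonetheless replicates independently.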

Posted in SCIENCE!, statistical decision making | Comments Off on Reflections on Reasons for Reduced Rates of Replicability

Python > R > Matlab

Contrary to what the title may seem to imply, I’m not making any grand proclamations here. Rather, inspired by a discussion with a friend and co-author on Facebook this morning, I’m going to note one fairly common data analysis case in which Python (NumPy) behaves in a totally straightforward manner, R in a similar but slightly less straightforward manner, and Matlab in an annoying and not particularly straightforward manner.

The case is the calculation of means (or other functions) along specified axes of multidimensional arrays.

In IPython (with pylab invoked), you specify the axis along which you want to apply the function of interest, and, at least in my opinion, you get output arrays that are exactly the shape you would expect. If you have a 3 \times 4 \times 5 array and you calculate the mean along the first axis, you get a 4 \times 5 array, and analogously for the second and third axes:

In [1]: X = random.normal(size=(3,4,5))

In [2]: X.mean(axis=0).shape
Out[2]: (4, 5)

In [3]: X.mean(axis=1).shape
Out[3]: (3, 5)

In [4]: X.mean(axis=2).shape
Out[4]: (3, 4)

In R, you get similarly sensible results, but you have to specify the axes along which you don’t want to apply the function (which I find much more confusing than the Python approach shown above):

> X = array(rnorm(60),dim=c(3,4,5))
> dim(apply(X,c(2,3),mean))
[1] 4 5
> dim(apply(X,c(1,3),mean))
[1] 3 5
> dim(apply(X,c(1,2),mean))
[1] 3 4

And here’s what Matlab does:

>> X = random(makedist('Normal'),3,4,5);
>> size(mean(X,1))
ans =
     1     4     5
>> size(mean(X,2))
ans =
     3     1     5
>> size(mean(X,3))
ans =
     3     4

Ugh. Matlab keeps the reduced dimension (as a singleton) if you apply the function on the first or second axis, but drops it if you apply it on the third axis, because trailing singleton dimensions beyond the second are silently discarded. This is annoying.
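For comparison, NumPy can produce either shape convention on request, and does so consistently across axes. The following is a minimal sketch, assuming plain NumPy (imported as np) rather than the pylab namespace used in the session above; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
X = rng.normal(size=(3, 4, 5))

# Default behavior: the reduced axis disappears, whichever axis it is.
default_shapes = [X.mean(axis=k).shape for k in range(3)]

# keepdims=True: the reduced axis is kept as a length-1 axis, for every axis.
kept_shapes = [X.mean(axis=k, keepdims=True).shape for k in range(3)]

print(default_shapes)
print(kept_shapes)
```

The point of the comparison is that the choice between dropping and keeping the reduced axis is explicit and does not depend on which axis is reduced.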

So, in summary, I continue to be happy using Python for almost everything, R for most of what’s left over, and Matlab only very rarely anymore.


On the interpretation of interval estimates

Cosma Shalizi has a new post that takes the form of a (failed, as he describes it) dialogue expressing his frustration with a paper he was reviewing. If you are interested in statistical theory and how the statistics we use in research relate to the world, you should read the whole thing.

I’ve read it twice now, and may well go back to it again at some point. It’s thought-provoking, for me in no small part because I like to use Bayesian model fitting software (primarily PyMC(3) these days), but I don’t think of myself as “a Bayesian,” by which I mean that I’m not convinced by the arguments I’ve read for Bayesian statistics being optimal, rational, or generally inherently better than frequentist statistics. I am a big fan of Bayesian estimation, for reasons I may go into another time, but I’m ambivalent about (much of) the rest of Bayesianism.

Which is not to say that I am convinced by arguments for any particular aspect of frequentist statistics, either. To be frank, for some time now, I’ve been in a fairly uncertain state with respect to how I think statistical models should, and do, relate to the world. Perhaps it’s a testament to my statistical training that I am reasonably comfortable with this uncertainty. But I’m not so comfortable with it that I want it to continue indefinitely.

So, my motivation for writing this post is to (at least partially) work through some of my thoughts on a small corner of this rather large topic. Specifically, I want to think through what properties of confidence and/or credible intervals are important and which are not, and how this relates to interpretation of reported intervals.

(I know that the more general notion is confidence/credible set, but everything I say here should apply to both, so I’ll stick with “interval” out of habit.)

Early in my time as a PhD student at IU, I took John Kruschke’s Intro Stats class. This was well before he started writing his book, so it was standard frequentist fare (though I will stress that, whatever one’s opinion on the philosophical foundations or everyday utility of the content of such a course, Kruschke is an excellent teacher).

I learned a lot in that class, and one of the most important things I learned was what I now think of as the only reasonable interpretation of a confidence interval. Or maybe I should say that it’s the best interpretation. In any case, it is this: a confidence interval gives you the range of values of a parameter that you cannot reject.

If I’m remembering correctly, this interpretation comes from David Cox, who wrote, in Principles of Statistical Inference (p. 40): “Essentially confidence intervals, or more generally confidence sets, can be produced by testing consistency with every possible value in ψ and taking all those values not ‘rejected’ at level c, say, to produce a 1 − c level interval or region.”
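The test-inversion recipe in the Cox quote can be sketched for a binomial proportion. This is only an illustration under stated assumptions: the data (14 heads in 20 tosses), the grid of candidate values, and the particular exact two-sided test (summing the probabilities of all outcomes no more likely than the observed one) are hypothetical choices, not anything taken from Cox or from Kruschke's class.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k_obs, n, p0):
    """Exact two-sided p-value under p0: total probability of all outcomes
    that are no more likely than the observed count."""
    p_obs = binom_pmf(k_obs, n, p0)
    return sum(
        binom_pmf(k, n, p0)
        for k in range(n + 1)
        if binom_pmf(k, n, p0) <= p_obs * (1 + 1e-9)  # small tolerance for float ties
    )

def interval_by_test_inversion(k_obs, n, c=0.05, grid_size=1001):
    """Test every candidate value on a grid and keep those not rejected at level c.
    Returns the range spanned by the retained values."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    kept = [p0 for p0 in grid if two_sided_p_value(k_obs, n, p0) > c]
    return min(kept), max(kept)

# Hypothetical data: 14 heads in 20 tosses.
low, high = interval_by_test_inversion(k_obs=14, n=20)
print(f"values not rejected at c = 0.05 span [{low:.3f}, {high:.3f}]")
```

Read this way, the interval is just a bookkeeping device for a family of tests, which is the interpretation described above.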

In Shalizi’s dialogue, A argues that the coverage properties of an interval over repetitions of an experiment are important. Which is to say that what makes confidence intervals worth estimating is the fact that, if the underlying reality stays the same, in the proportion 1-c of repetitions, the interval will contain the true value of the parameter.

But the fact that confidence intervals have certain coverage properties does not provide a reason for reporting confidence intervals in any single, particular case. If I collect some data and estimate a confidence interval for some statistic based on that data, the expected long run probability that the procedure I used will produce intervals that contain the true value of a parameter says absolutely nothing about whether the true value of the parameter is in the single interval I have my hands on right now.
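The long-run coverage property can be made concrete with a small simulation. The sketch below assumes a deliberately simple setup that is not in the dialogue: normally distributed data with a known standard deviation, a z-based interval for the mean, and hypothetical values for the sample size and number of repetitions.

```python
import random
from statistics import NormalDist

def coverage(true_mu=0.0, sigma=1.0, n=25, c=0.05, n_reps=4000, seed=1):
    """Fraction of simulated repetitions whose 1 - c interval contains true_mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - c / 2)   # two-sided critical value
    half_width = z * sigma / n ** 0.5     # known-sigma interval half-width
    hits = 0
    for _ in range(n_reps):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        m = sum(sample) / n
        # Each single interval either contains true_mu or it does not.
        if m - half_width <= true_mu <= m + half_width:
            hits += 1
    return hits / n_reps

rate = coverage()
print(f"proportion of intervals containing the true mean: {rate:.3f}")
```

The proportion is a property of the procedure across repetitions. Inside the loop, each individual interval simply does or does not contain the value, which is all we ever get to see in a single study.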

Obviously, it’s good to understand the properties of the (statistical) procedures we use. But repetitions (i.e., direct, rather than conceptual, replications) of experiments are vanishingly rare in behavioral fields (e.g., communication sciences and disorders, where I am; second language acquisition, linguistics, and psychology, where I have, to varying extents, been in the past), so it’s not clear how relevant this kind of coverage is in practice.

More importantly, it’s not clear to me what “the true value of a parameter” means. The problem with this notion is easiest to illustrate with the old stand-by example of a random event, the coin toss.

Suppose we want to estimate the probability of “heads” for two coins. We could toss each coin N times, observe k_i occurrences of “heads” for the i-th coin, and then use our preferred frequentist or Bayesian statistical tools for estimating the “true” probability of “heads” for each, using whatever point and/or interval estimates we like to draw whatever inferences are relevant to our research question(s). Or we could remove essentially all of the randomness, per Diaconis et al.’s approach to the analysis of coin tosses (pdf).

The point being that, when all we do is toss the coins N times and observe k_i “heads,” we ignore the underlying causes that determine whether the coins land “heads” or “tails.” Or maybe it’s better to say that we partition the set of factors determining how the coins land into those factors we care about and those we don’t care about. Our probability model – frequentist, Bayesian, or otherwise – is a model of the factors we don’t care about.

In this simple, and somewhat goofy, example, the factors we care about are just the identity of the coins (Coin A and Coin B) or maybe the categories the coins come from (e.g., nickels and dimes), while the factors we don’t care about are the physical parameters that Diaconis et al. analyzed in showing that coin tosses aren’t really random at all.

I don’t see how the notion of “true” can reasonably be applied to “value of the parameter” here. We might define “the true value of the parameter” as the value we would observe if we could partition all of the (deterministic) factors in the relevant way for all relevant coins and estimate the probabilities with very large values of N.

But the actual underlying process would still be deterministic. Perhaps this notion of “truth” here reflects a technical, atypical use of a common word (see, e.g., “significance” for evidence of such usage in statistics), but defining “truth” with respect to a set of decisions about which factors to ignore and which not to, how to model the ignored factors, and how to collect relevant data seems problematic to me. Whatever “truth” is, it doesn’t seem a priori reasonable for it to be defined in such an instrumental way.

The same logic applies to more complicated, more realistic cases, very likely exaggerated by the fact that we can’t fully understand, or even catalog, all of the factors influencing the data we observe. I’m thinking here of the kinds of behavioral experiments I do and that are at the heart of the “replication crisis” in psychology.

So, where does this leave us? My intuition is that it only really makes sense to interpret c{onfidence, redible} intervals with respect to whatever model we’re using, and treat them as sets of parameter values that are more or less consistent with whatever point estimate we’re using. Ideally, this gives us a measure of the precision of our estimate (or of our estimation procedure).

Ultimately, I think it’s best to give all of this the kind of instrumental interpretation described above (as long as we leave “truth” out of it). I like Bayesian estimation because it is flexible, allowing me to build custom models as the need arises, and I tend to think of priors in terms of regularization rather than rationality or subjective beliefs. But I’ll readily own up to the fact that my take on all this is, at least for now, far too hand-wavy to do much philosophically heavy lifting.


Sensitivity to mis-specification

I’ve encountered two potentially problematic uses of sensitivity and specificity recently. One is simply an egregious error. The other is a combination of, on the one hand, an oversimplification of the relationship between these and the area under a receiver operating characteristic curve and, on the other, an illustration of one of their important limitations.

Just in case you don’t want to read the wikipedia entry linked above, here are some quick definitions. Suppose you have a test for diagnosing a disease. The sensitivity of the test is the proportion of people with the disease that the test correctly identifies as having the disease (i.e., the hit rate, henceforth H). The specificity of the test is the proportion of people without the disease that the test correctly identifies as not having the disease (i.e., the correct rejection rate, henceforth CR).

H and CR are useful measures, to be sure, but they obscure some important properties of diagnostic tests (and of binary classifiers in probabilistic decision making in general). Rather than H and CR, we can (and should) think in terms of d’ – the “distance” between the “signal” and “noise” classes – and c – the response criterion. Here’s a figure from an old post to illustrate:

[Figure: optimal_model]

In this illustration, the x-axis is the strength of evidence for disease according to our test. The red curve illustrates the distribution of evidence values for healthy people, and the blue curve illustrates the distribution of evidence values for people with the disease. The vertical dashed/dotted lines are possible response criteria. So, in this case, d’ would be \frac{\mu_2 - \mu_1}{\sigma}, where \sigma is some measure of the variability in the two distributions. It is useful to define c as the signed distance of the response criterion with respect to the crossover point of the two distributions. I’ll note in passing that I’m eliding a number of important details here for the sake of simplicity (e.g., the assumption of equal variances in the two distributions, the assumption of normality in same), which I’ll come back to below.

H and CR are determined by d’ and c. H is defined as the integral of the blue curve to the right of the criterion, and CR as the integral of the red curve to the left of the criterion.

So, one important property of a binary classifier that H and CR obscure but that d’ and c illuminate is the fact that, for a given d’, as you shift c, H and CR trade off with one another. Shift c leftward, and H increases while CR decreases. Shift c rightward, and H decreases while CR increases. In the figure above, you can see how the areas under the red and blue curves differ for the dashed and dotted vertical lines – H is lower and CR higher for the dotted line than for the dashed line.

Another important property of a binary classifier is that, as you increase d’, either H increases, CR increases, or both increase, depending on where the response criterion is. In the above figure, if we increased the distance between \mu_1 and \mu_2 (without changing the variances of the distributions) by shifting \mu_1 to the left by some amount \delta and by shifting \mu_2 to the right by \delta, both H and CR would increase.

The egregious error I encountered is in the “Clinical Decision Analysis Regarding Test Selection” section of the ASHA technical report on (Central) Auditory Processing Disorder, or (C)APD (I’ll quote the report as it currently stands – I plan to email someone at ASHA to point this out, after which, I hope it will be fixed):

The sensitivity of a test is the ratio of the number of individuals with (C)APD detected by the test compared to the total number of subjects with (C)APD within the sample studied (i.e., true positives or hit rate). Specificity refers to the ability to identify correctly those individuals who do not have the dysfunction. The specificity of the test is the ratio of normal individuals (who do not have the disorder) who give negative responses compared to the total number of normal individuals in the sample studied, whether they give negative or positive responses to the test (i.e., 1 – sensitivity rate). Although the specificity of a test typically decreases as the sensitivity of a test increases, tests can be constructed that offer high sensitivity adequate for clinical use without sacrificing a needed degree of specificity.

The egregious error is in stating that CR is equal to 1-H. As illustrated above, it’s not.
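A quick numerical check of the relationships described above, as a sketch under the simplifying assumptions already noted (equal-variance Gaussian distributions). The unit variance, the placement of the two means at -d'/2 and +d'/2 so that the crossover sits at zero, and the particular d' and c values are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def hit_and_correct_rejection(d_prime, c):
    """Equal-variance Gaussian model with unit sigma.
    Noise mean is -d'/2, signal mean is +d'/2, crossover point is 0,
    and c is the criterion's signed distance from the crossover."""
    h = 1 - STD_NORMAL.cdf(c - d_prime / 2)  # signal mass to the right of the criterion
    cr = STD_NORMAL.cdf(c + d_prime / 2)     # noise mass to the left of the criterion
    return h, cr

# Illustrative values: shift c at fixed d', then increase d' at fixed c.
for d_prime, c in [(1.5, -0.5), (1.5, 0.0), (1.5, 0.5), (2.5, 0.0)]:
    h, cr = hit_and_correct_rejection(d_prime, c)
    print(f"d'={d_prime}, c={c:+.1f}: H={h:.3f}, CR={cr:.3f}, 1-H={1 - h:.3f}")
```

In each printed row the CR column and the 1-H column differ: 1 - H is the miss rate, a property of the diseased group, while CR is a property of the healthy group.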

The oversimplification-and-limitation-illustration was in Part II of a recent Slate Star Codex (SSC) post. Here’s the oversimplification:

AUC is a combination of two statistics called sensitivity and specificity. It’s a little complicated, but if we assume it means sensitivity and specificity are both 92% we won’t be far off.

AUC here means “area under the curve,” or, as I called it above, area under the receiver operating characteristic curve (or AUROC). Here’s a good Stack Exchange answer describing how AUROC relates to H and CR, and here’s a figure from that answer:


The x axis in this figure is 1-CR, and the y axis is H. The basic idea here is that the ROC is the curve produced by sweeping across all possible values of c for a test with a given d’. If we set c as far to the right as we can, we get H = 0 and CR = 1, so 1-CR = 0 (i.e., we’re in the bottom left of the ROC figure). As we shift c leftward, H and 1-CR increase. Eventually, when c is as far to the left as we can go, H = 1 and 1-CR = 1 (i.e., we’re at the top right of the ROC figure).

The AUROC can give you a measure of something like d’ without requiring, e.g., assumptions of equal variance Gaussian distributions for your two classes. Generally speaking, as d’ increases, so does AUROC.
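The sweep described above can be sketched directly. This assumes the equal-variance Gaussian model with unit variance (exactly the assumption that AUROC lets you avoid in practice), so that the numerically integrated area can be compared with the known closed form for that model, Φ(d'/√2). The grid size and sweep range are arbitrary choices.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def roc_points(d_prime, n_steps=2001, span=8.0):
    """Sweep the criterion c from far right (+span) to far left (-span)
    and return (1 - CR, H) pairs, ordered by increasing 1 - CR."""
    points = []
    for i in range(n_steps):
        c = span - 2 * span * i / (n_steps - 1)
        h = 1 - STD_NORMAL.cdf(c - d_prime / 2)
        one_minus_cr = 1 - STD_NORMAL.cdf(c + d_prime / 2)
        points.append((one_minus_cr, h))
    return points

def auroc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

for d_prime in (0.0, 1.0, 2.0, 3.0):
    numeric = auroc(roc_points(d_prime))
    closed_form = STD_NORMAL.cdf(d_prime / sqrt(2))
    print(f"d'={d_prime}: AUROC (sweep) = {numeric:.4f}, closed form = {closed_form:.4f}")
```

Each AUROC value summarizes the whole curve, that is, every (H, CR) pair the test can produce, rather than any one pair.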

So, the oversimplification in the quote above consists in the fact that the AU(RO)C does not correspond to single values of H and CR.

Which brings us to the illustration of the limitations of H and CR. To get right to the point, H and CR don’t take the base rate of the disease into account. Let’s forget about the conflation of AUROC and H and CR and just assume we have H = CR = 0.92. Per the example in the SSC post, if you have a 0.075 proportion of positive cases, H and CR are problematic: you have 92% accuracy, but less than half of the people identified by the test as diseased actually have the disease!
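The arithmetic behind that claim is short enough to spell out. The function names below are hypothetical; the numbers (H = CR = 0.92, base rate 0.075) are the ones used above.

```python
def accuracy(h, cr, base_rate):
    """Overall proportion of correct classifications."""
    return h * base_rate + cr * (1 - base_rate)

def positive_predictive_value(h, cr, base_rate):
    """Proportion of positive test results that are true positives."""
    true_pos = h * base_rate
    false_pos = (1 - cr) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

acc = accuracy(0.92, 0.92, 0.075)
ppv = positive_predictive_value(0.92, 0.92, 0.075)
print(f"accuracy = {acc:.3f}, positive predictive value = {ppv:.3f}")
```

With these inputs the positive predictive value comes out at roughly 0.48, because the 8% false alarm rate applies to the much larger healthy group.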

The appropriate response here is to shift the criterion to take the base rate (and the costs and benefits of each combination of true disease state and test outcome) into account. Given how often Scott Alexander (i.e., the SSC guy) argues for Bayesian reasoning and utility maximization, I am a bit surprised (and chagrined) he didn’t go into this, but the basic idea is to respond “disease!” if the following inequality holds:

    \begin{equation*} \frac{Pr(x|+)}{Pr(x|-)} \geq \frac{(U_{--} - U_{+-})Pr(-)}{(U_{++} - U_{-+})Pr(+)} \end{equation*}

Here, Pr(x|+) is the probability of a particular strength of evidence of disease given that the disease is present, Pr(x|-) is the probability given that the disease is not present, Pr(+) and Pr(-) are the prior probabilities of the disease being present or not, respectively, and U_{or} is the utility of test outcome o and reality r (e.g., U_{+-} is the cost of the test indicating “disease” while in reality the disease is not present).

The basic idea here is that the relative likelihood of a given piece of evidence when the disease is present vs when it’s absent needs to exceed the ratio on the right. By “piece of evidence” I mean something like the raw, pre-classification score on our test, which corresponds to the position on the x axis in the first figure above.

The ratio on the right takes costs and benefits into account and weights them with the prior probability of the presence or absence of the disease. We can illustrate a simple case by setting U_{--} = U_{++} = 1 and U_{+-} = U_{-+} = 0 and just focusing on prior probabilities. In this case, the inequality is just:

    \begin{equation*} \frac{Pr(x|+)}{Pr(x|-)} \geq \frac{Pr(-)}{Pr(+)} \end{equation*}

In the SSC case, Pr(+) = 0.075 and Pr(-) = 0.925, so the criterion for giving a “disease!” diagnosis should be 0.925/0.075 ≈ 12.3. That is, we should only diagnose someone as having the disease if the observed evidence is 12+ times as likely given the presence of the disease as it is given the absence of the disease.
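The threshold, and where it puts the criterion, can be sketched as follows. The threshold function just evaluates the right-hand side of the inequality above. Everything after that is an added assumption for illustration: an equal-variance, unit-variance Gaussian model (in which the likelihood ratio at a point x, measured from the crossover, is exp(d'·x)), and a d' chosen so that an unshifted criterion gives H = CR = 0.92.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()

def likelihood_ratio_threshold(p_pos, u_pp=1.0, u_nn=1.0, u_pn=0.0, u_np=0.0):
    """Right-hand side of the decision inequality.
    u_pn stands for U_{+-} (test says disease, reality is no disease);
    u_np stands for U_{-+} (test says no disease, reality is disease)."""
    p_neg = 1 - p_pos
    return ((u_nn - u_pn) * p_neg) / ((u_pp - u_np) * p_pos)

def criterion_location(beta, d_prime):
    """Equal-variance Gaussian model: LR(x) = exp(d' * x),
    so LR(x) >= beta exactly when x >= ln(beta) / d'."""
    return log(beta) / d_prime

beta = likelihood_ratio_threshold(0.075)
d_prime = 2 * STD_NORMAL.inv_cdf(0.92)   # ASSUMPTION: d' implied by H = CR = 0.92 at c = 0
c = criterion_location(beta, d_prime)

h = 1 - STD_NORMAL.cdf(c - d_prime / 2)
cr = STD_NORMAL.cdf(c + d_prime / 2)
print(f"threshold = {beta:.2f}, criterion shift = {c:.2f}, H = {h:.3f}, CR = {cr:.3f}")
```

Under these assumptions the criterion moves rightward, trading some H for a higher CR. Making misses costly (for example, a negative value for u_np) lowers the threshold and moves the criterion back to the left.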


Pondering proportional pizza prices

In my last post, I illustrated how much better a deal it is to get large pizzas rather than medium or small pizzas from Dewey’s. A friend pointed out that I hadn’t taken crust into account in that analysis. I dismissed the idea at first, thinking, incorrectly, that it wouldn’t matter. I dismissed it in part because I like to eat the crust, and so don’t tend to think of it as qualitatively different than the rest of the pizza.

As it happens, it matters for the analysis, since it actually makes small and medium pizzas even worse deals. A one inch crust is, proportionally, 33%, 28%, or 22% of the area of a small, medium, or large, respectively.
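The crust proportions follow from comparing the area of the inner (topped) circle with the area of the whole pizza. In the sketch below, the diameters are an assumption: they are not stated in this post, and were chosen because they reproduce the percentages quoted above.

```python
# ASSUMPTION: diameters in inches, not stated in this post; picked because
# they reproduce the quoted 33% / 28% / 22% crust proportions.
DIAMETERS = {"small": 11, "medium": 13, "large": 17}
CRUST_WIDTH = 1.0  # inches, per the text

def crust_fraction(diameter, crust_width=CRUST_WIDTH):
    """Fraction of a round pizza's area taken up by an outer ring of crust."""
    inner_diameter = diameter - 2 * crust_width
    # Areas scale with diameter squared, so the pi / 4 factors cancel.
    return 1 - (inner_diameter / diameter) ** 2

fractions = {size: crust_fraction(d) for size, d in DIAMETERS.items()}
for size, frac in fractions.items():
    print(f"{size}: {frac:.0%} crust")
```

Because the crust width is fixed while the diameter grows, the crust takes a smaller share of larger pizzas, which is why accounting for it favors the large even more.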

Here is a graph showing the square inches per dollar as a function of number of toppings, taking a one inch crust into account:


And here’s a graph showing dollars per square inch as a function of number of toppings, taking a one inch crust into account (see previous post for plots that don’t take the crust into account):


By taking the crust into account, we see that the large is an even better deal than before. It’s also (very, very nearly) interesting to note that the crossover between smalls and mediums has shifted leftward a couple toppings. Without taking crust into account, smalls were a better deal than mediums for 0, 1, or 2 toppings, but with a one inch crust, small just barely beats medium, and then only if you get a plain cheese pizza.

In addition, here’s a plot showing square inches per dollar for small and medium pizzas relative to square inches per dollar for a large:


For no toppings, you get ~60% as many square inches per dollar for either small or medium as you do for a large. This ratio stays fairly constant for mediums, but drops substantially for smalls, approaching 50% for 10 gourmet toppings.

And, finally, here’s a plot showing dollars per square inch for small and medium pizzas relative to dollars per square inch for a large:


With respect to dollars per square inch, you spend ~150% for a no-topping small or medium relative to a large. The ratio stays more or less constant for mediums, while it increases quite a bit for smalls. If, for some reason, you decided to buy a 10-gourmet-topping small pizza, you’d be spending almost twice as much per square inch as you would if you bought a 10-gourmet-topping large.

I have way more real work to do than you might think.

