Multivariate normal CDF values in Python

I was very happy to realize recently that a subset of Alan Genz’s multivariate normal CDF functions are available in Scipy. I first learned of Dr. Genz’s work when I started using the mnormt R package, which includes a function called sadmvn that gives very precise, and very accurate, multivariate normal CDF values very quickly.

In case you don’t know, this is quite an achievement, since there is not a closed form solution. I’ve spent far too much time reading strange, complicated papers found in the deepest recesses of google (i.e., the third page of search results) that claim to provide fast, accurate approximations to multivariate normal CDFs. As far as I can tell, none of these claims hold any water. None other than Genz’s, anyway.

Okay, so Scipy has two relevant functions, but they’re kind of buried, and it might not be obvious how to use them (at least if you don’t know to look at Genz’s Fortran documentation). So, for the benefit of others (and myself, in case I need a refresher), here’s where they are and how to use them.

First, where. In the Scipy stats library, there is a chunk of compiled Fortran code called I’ve copied it here, just in case it disappears from Scipy someday. Should that come to pass, and should you want this file, just save that ‘plain text’ file and rename it and you should be good to go.

Otherwise, if you’ve got Scipy, you can just do this:

from scipy.stats import mvn

Now, mvn will have three methods, two of which – mvndst and mvnun – are what we’re looking for here.

The first works like this:

error,value,inform = mvn.mvndst(lower,upper,infin,correl,...)

Which is to say that it takes, as arguments, lower and upper limits of integration, ‘infin’ (about which more shortly), and correl (as well as some optional arguments). This is, in turn, to say that it assumes that your multivariate normal distribution is centered at the origin and that you’ve normalized all the variances.
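If your distribution has a non-zero mean or non-unit variances, you can standardize the problem yourself before handing it over. Here is a minimal sketch of that step, using the bounds, means, and covariance matrix from the example later in this post. The function name standardize_problem is my own invention (it is not part of Scipy), and the sketch deliberately stops short of packing the correlations into whatever layout correl expects; check Genz's documentation for that.

```python
import numpy as np

def standardize_problem(lower, upper, mean, cov):
    """Rescale integration limits to z-scores and convert cov to a correlation matrix.

    Hypothetical helper, not a Scipy function.
    """
    lower, upper, mean, cov = (np.asarray(a, dtype=float)
                               for a in (lower, upper, mean, cov))
    sd = np.sqrt(np.diag(cov))        # per-dimension standard deviations
    z_lower = (lower - mean) / sd     # limits in standard-deviation units
    z_upper = (upper - mean) / sd
    corr = cov / np.outer(sd, sd)     # unit diagonal, correlations off-diagonal
    return z_lower, z_upper, corr

z_lo, z_up, R = standardize_problem([-10, -10], [.1, -.2],
                                    [-.3, .17],
                                    [[1.2, .35], [.35, 2.1]])
print(z_lo, z_up)
print(R)
```

The probability of the original box under the original distribution equals the probability of the rescaled box under the zero-mean, unit-variance distribution with correlation matrix R.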

This function is straightforward to use, except for, perhaps, the ‘infin’ argument. From Genz’s documentation:

*     INFIN  INTEGER, array of integration limits flags:
*           if INFIN(I) < 0, Ith limits are (-infinity, infinity);
*           if INFIN(I) = 0, Ith limits are (-infinity, UPPER(I)];
*           if INFIN(I) = 1, Ith limits are [LOWER(I), infinity);
*           if INFIN(I) = 2, Ith limits are [LOWER(I), UPPER(I)].

Which is to say that you put a negative number in if you want, on dimension I, to integrate from -Inf to Inf, 0 if you want to integrate from -Inf to your designated upper bound, 1 if you want to integrate from your designated lower bound to Inf, and 2 if you want to use both of your designated bounds.
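To make the flag convention concrete, here is a small sketch that builds the flags from a pair of limit lists. The helper name infin_flags is hypothetical (it is not in Scipy or in Genz's code), and it assumes you mark unbounded limits with float('-inf') and float('inf').

```python
import math

def infin_flags(lower, upper):
    """Return Genz-style INFIN flags for paired lower/upper limits.

    Hypothetical helper: assumes unbounded limits are given as -inf / +inf.
    """
    flags = []
    for lo, hi in zip(lower, upper):
        lo_inf = math.isinf(lo) and lo < 0
        hi_inf = math.isinf(hi) and hi > 0
        if lo_inf and hi_inf:
            flags.append(-1)   # (-infinity, infinity)
        elif lo_inf:
            flags.append(0)    # (-infinity, UPPER]
        elif hi_inf:
            flags.append(1)    # [LOWER, infinity)
        else:
            flags.append(2)    # [LOWER, UPPER]
    return flags

inf = float("inf")
flags = infin_flags([-inf, -inf, 0.5, -1.0], [inf, 0.1, inf, 1.0])
print(flags)
```

For the dimensions flagged 0 or 1 you would still need to pass some placeholder number for the unused limit, since the arrays have to be filled with something.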

Also from Genz’s documentation:

*     INFORM INTEGER, termination status parameter:
*          if INFORM = 0, normal completion with ERROR < EPS;
*          if INFORM = 1, completion with ERROR > EPS and MAXPTS 
*                         function vaules used; increase MAXPTS to
*                         decrease ERROR;
*          if INFORM = 2, N > 500 or N < 1.

Here, N seems to be the number of dimensions (and is the first argument in Genz’s MVNDST Fortran function, but is not in the similar/corresponding R or Python functions).  In any case, it’s the 0 and 1 that seem most informative, and the MAXPTS variable is one of the optional arguments I mentioned above.

The other function allows for non-zero means and covariance (as opposed to correlation) matrices, but it doesn’t, technically speaking, allow for integration to or from +/-Infinity:

value,inform = mvn.mvnun(lower,upper,means,covar,...)

As it happens, and as shouldn’t be too surprising, you can give it large magnitude bounds and get essentially the same answer. As long as you’re sufficiently far away from the mean (meaning as long as you’re more than a few standard deviation units away), the difference between the +/-Inf bound and the finite bound will only show up quite a few decimal places into your answer.

If you’ve got numpy imported as np, you could, for example, do this:

In [54]: low = np.array([-10, -10])

In [55]: upp = np.array([.1, -.2])

In [56]: mu = np.array([-.3, .17])

In [57]: S = np.array([[1.2,.35],[.35,2.1]])

In [58]: p,i = mvn.mvnun(low,upp,mu,S)

In [59]: p
Out[59]: 0.2881578675080012

With more extreme values for low, we get essentially the same answer (with a difference only showing up in the 12th decimal place):

In [60]: low = np.array([-20, -20])

In [61]: p,i = mvn.mvnun(low,upp,mu,S)

In [62]: p
Out[62]: 0.2881578675091007

Still more extreme values don’t change it at all:

In [63]: low = np.array([-100, -100])

In [64]: p,i = mvn.mvnun(low,upp,mu,S)

In [65]: p
Out[65]: 0.2881578675091007
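One way to see why a lower bound of -10 is already far enough out: the probability you lose by swapping -Inf for a finite lower bound can't be more than the sum of the marginal tail probabilities below each bound (a simple union bound). Here is a rough sketch of that check using only the standard library, with the means and variances from the example above. The helper norm_cdf is my own, not a Scipy function.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via erfc, which stays accurate far out in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

low = [-10.0, -10.0]
mu = [-0.3, 0.17]
variances = [1.2, 2.1]   # the diagonal of S

# marginal probability of falling below each finite lower bound
tails = [norm_cdf((l - m) / math.sqrt(v)) for l, m, v in zip(low, mu, variances)]
bound = sum(tails)       # upper bound on the probability lost by truncating
print(tails, bound)
```

The bound comes out on the order of 1e-12, almost all of it from the second dimension, which is in line with the difference in the 12th decimal place seen above.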

All of this is important to me because I’m working on building a Bayesian GRT (e.g.) model in PyMC, and I’m hoping I’ll be able to use this function to get fast and accurate probabilities, given a set of mean and covariance parameters.

Posted in Python, R, statistical modeling | Comments Off

Visualizing confusion matrices

For my Current Topics in Communication Sciences course, we read So & Best’s 2010 paper on non-native tone perception. I won’t go much into the paper’s implications for language or speech, though these are interesting and worth thinking about. Rather, I want to focus (as is my wont) on data analysis and illustrating some models and data analysis tools that I wish were used more often in this kind of research.

The data analysis in So & Best is not so good. The data consist of confusion matrices for Cantonese, Japanese, and English native listeners’ identification of Mandarin tones. The first part of the analysis focuses on ‘tone sensitivity’ and uses A’, a non-parametric measure of perceptual sensitivity, which I assume is (something like) the A’ defined by Grier (So & Best cite a textbook rather than a paper, so I don’t know for sure how they calculated A’).

It’s probably good that they use A’ rather than d’, given that, for each tone, they’re lumping all three incorrect responses together, which certainly violates pretty much all of the assumptions underlying Gaussian signal detection theory (though, not surprisingly, A’ has a downside, too). But, then, even if they’ve avoided violating one set of assumptions by using A’, A’ values violate pretty much all of the assumptions underlying ANOVA, as do the confusion counts they (also) crank through the ANOVA machine.

Which brings me to my point, namely that there are better statistical tools for analyzing confusion matrices. Some such tools are pretty standard, plug-and-play models like log-linear analysis (a.k.a. multiway frequency analysis). Others are less standard and perhaps less easy to use, but they are far superior with respect to providing insight and understanding of the patterns in the data.

I’ve written about the Similarity Choice Model (SCM) before; as I wrote in the linked post:

In the SCM, the probability of giving response r to stimulus s is (where \beta_r is the bias to give response r and \eta_{sr} is the similarity between stimulus s and stimulus r, and N is the number of stimuli):

(1)   \begin{equation*}p_{sr} = \frac{\beta_r\eta_{sr}}{\sum_{i=1}^{N}\beta_i\eta_{si}}\end{equation*}

You might want to use this model because it has convenient, closed-form solutions for the similarity and bias parameters:

(2)   \begin{align*}\eta_{sr} &= \sqrt{\frac{p_{sr}p_{rs}}{p_{ss}p_{rr}}}\\ \quad \\\beta_r &= \frac{1}{\sum_{k=1}^{N}\sqrt{\frac{p_{rk}p_{kk}}{p_{kr}p_{rr}}}}\end{align*}
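As a quick sanity check on these formulas, here is a sketch that generates confusion probabilities from made-up parameters using equation (1) and then recovers those parameters with equation (2). The function names scm_predict and scm_estimate and the parameter values are invented for illustration. The sketch assumes a symmetric similarity matrix with ones on the diagonal and biases that sum to one.

```python
import numpy as np

def scm_predict(eta, beta):
    """Equation (1): rows are stimuli, columns are responses."""
    unnorm = eta * beta[np.newaxis, :]
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def scm_estimate(P):
    """Equation (2): closed-form similarity and bias estimates."""
    d = np.diag(P)
    eta = np.sqrt(P * P.T / np.outer(d, d))
    # ratio[r, k] = sqrt(p_rk * p_kk / (p_kr * p_rr)), which equals beta_k / beta_r
    ratio = np.sqrt(P * d[np.newaxis, :] / (P.T * d[:, np.newaxis]))
    beta = 1.0 / ratio.sum(axis=1)
    return eta, beta

# made-up parameters, for illustration only
eta_true = np.array([[1.0, 0.3, 0.1],
                     [0.3, 1.0, 0.5],
                     [0.1, 0.5, 1.0]])
beta_true = np.array([0.5, 0.3, 0.2])

P = scm_predict(eta_true, beta_true)
eta_hat, beta_hat = scm_estimate(P)
print(np.allclose(eta_hat, eta_true), np.allclose(beta_hat, beta_true))
```

With noise-free probabilities the round trip is exact (up to floating point). With real data, the estimates just describe whatever the observed proportions happen to be.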

To illustrate the ease and utility of using the SCM, I estimated parameters for the confusion matrices reported by So & Best (in R and Python – note that the Python code is in a .txt file, not a .py file, since my website host seems to think I’m up to no good if I try to use the latter).

Here’s the data file assumed by the Python script (the matrices are hard-coded in the R script), if you want to play along at home. The first four rows contain the confusion matrix for Cantonese listeners, the next four for Japanese listeners, and the last four for English listeners.

Both the R and Python scripts do essentially the same thing. I’ve been using Python more than R lately for various reasons, so that’s what I’ll focus on here.

I wrote a function that adjusts for any zeros in the confusion matrices, then calculates response bias, similarities, and distances (d_{ij} = \sqrt{-\log(\eta_{ij})}) for an input matrix:

def sbd(Mt):
    # number of stimulus/response categories (4 tones here)
    n = Mt.shape[0]
    # zeros are bad: add a small constant to every cell
    Mt = Mt + .01
    # renormalize so each row sums to 1
    for ri in range(n):
        Mt[ri,:] = Mt[ri,:]/np.sum(Mt[ri,:])
    # initialize similarity matrix
    St = np.zeros((n,n))
    # calculate similarities
    for ri in range(n):
        for ci in range(n):
            St[ri,ci] = np.sqrt(Mt[ri,ci]*Mt[ci,ri]/(Mt[ri,ri]*Mt[ci,ci]))
    # distances
    Dt = np.abs(np.sqrt(-np.log(St)))
    # bias
    Bt = np.zeros(n)
    for ri in range(n):
        Bk = np.zeros(n)
        for ki in range(n):
            Bk[ki] = np.sqrt(Mt[ri,ki]*Mt[ki,ki]/(Mt[ki,ri]*Mt[ri,ri]))
        Bt[ri] = 1/np.sum(Bk)
    # rescale so the biases sum to 1
    Bt = Bt/np.sum(Bt)
    return St, Dt, Bt

I also wrote a function for calculating predicted confusion probabilities for a given set of parameters so that I could see how well the model fits the data:

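def scm(s,b):
    # number of responses (and stimuli)
    nr = len(b)
    Mp = np.zeros((nr,nr))
    for ri in range(nr):
        # unnormalized response strengths: bias times similarity
        for ci in range(nr):
            Mp[ri,ci] = b[ci]*s[ri,ci]
        # normalize so each row of predicted probabilities sums to 1
        Mp[ri,:] = Mp[ri,:]/np.sum(Mp[ri,:])
    return Mp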

You can look at either script to see how to use the estimated similarities/distances to fit hierarchical clustering or MDS models.
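For a rough idea of what the MDS step involves, here is a sketch of classical (Torgerson) MDS in plain numpy. This is not necessarily the variant used in either script, and the function name classical_mds and the toy coordinates are made up for illustration. It takes a symmetric distance matrix, like the Dt returned by sbd, and returns coordinates in k dimensions.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed points in k dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]       # indices of the k largest eigenvalues
    vals_k = np.clip(vals[order], 0, None)   # guard against tiny negative values
    return vecs[:, order] * np.sqrt(vals_k)

# toy check: four made-up points in the plane
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D_hat))
```

The recovered coordinates only match the originals up to rotation, reflection, and translation, which is why the check compares pairwise distances rather than the coordinates themselves.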

Here’s a plot showing the observed and predicted confusion probabilities (closer to the diagonal = better fit):

[Figure: so_best_predobs]

Overall, the fit seems to be pretty good, though it’s not perfect. There aren’t any really huge discrepancies between the observed and predicted probabilities. For whatever reason, the model seems to fit the Cantonese listeners’ data best, with the largest discrepancies for a couple data points from the Japanese listeners.

As a side note, the plots generated by the R script look a bit better than the plots generated by the Python script, but I’ve been fiddling with R plots for a few years now, while I’m still figuring out how to do this kind of thing with Python. I’m pretty happy with Python so far, though I’d really like to be able to remove the box and just have x- and y-axes. But I digress…

Here are dendrograms for the hierarchical cluster models fit to the estimated distances. In each of these, the y-axis indicates (estimated) distance, with the relative heights of the clusters indicating the relative dissimilarity of the tones (indicated by the numbers at the bottom):

[Figures: so_best_cant_dend, so_best_jpns_dend, so_best_engl_dend]


There are two obvious things to note about these plots. The most obvious similarity is that tones 1 and 4, on the one hand, and tones 2 and 3, on the other, form the bottom two clusters for each language group, indicating that 1 and 4 are more similar (less distant) to one another than either is to tone 2 or 3, and vice versa. The most obvious difference is that tones 1 and 4 are more similar than tones 2 and 3 for the English listeners, but tones 2 and 3 are more similar than tones 1 and 4 for the Japanese and Cantonese listeners. (This would be more obvious if I knew how to make the labels and colors consistent across these plots, so that 1 and 4 were always on the right and constituted the red cluster, with 2 and 3 on the left in the green cluster, but, again, I’m still figuring all this out in Python, so this will have to do for now.) It’s also pretty clear that the 1-4 and 2-3 clusters are less similar for the Cantonese listeners than for either other group.

Here’s an MDS plot with all three groups’ data presented together (the letters and colors indicate the language group, and the numbers indicate the tones):



As with the cluster analysis, it’s clear that tones 1 and 4 pattern together as do tones 2 and 3. The differences in 1-4 vs 2-3 similarity across groups are also evident here.

Finally, here’s a plot of each group’s bias parameters (colors are as in the MDS plot, tones are, again, indicated by the numbers on the x-axis):

[Figure: so_best_bias]

I was a bit surprised by how similar the bias parameters are for all three groups, though there are some potentially interesting differences. The English listeners mostly just didn’t want to label anything “4”, while the Japanese listeners seemed to be more biased toward “1” and, to a greater degree, “2” responses than either “3” or “4” responses. The Cantonese listeners exhibit a similar pattern, though with a weaker bias toward “2” responses.

Okay, so where does all this leave us? None of what I’ve done here actually provides statistical tests of any patterns in the data, though the SCM (and related models) can be elaborated and rigorous tests can be carried out by constraining similarities and biases in various ways and comparing fitted models. And, as mentioned above, log-linear analysis is a better out-of-the-box method for analyzing this kind of data than is ANOVA.

Statistical tests aside, though, I would argue that the SCM and clustering/scaling methods are far better than the presentation and visualization of the data presented by So & Best. The SCM allows us to look separately at pairwise similarity between stimuli and response bias, but it also allows us to readily generate easy to interpret figures that illustrate patterns that are not at all obvious in the raw data (or in other figures; e.g., I find So & Best’s Figure 3 to be rather difficult to interpret or draw any kind of generalization from).

As much as I’d like tools like these to be more widely used, I’m not terribly hopeful that they will be any time soon. But I’ll keep promoting them anyway.

Posted in statistical description, statistical graphics, statistical modeling | Comments Off

The Demarcation Game

About a week and a half ago, I posted a summary of Larry Laudan’s essay The Demise of the Demarcation Problem in order to set the stage for some posts on the first few essays in a new book on The Philosophy of Pseudoscience, edited by Massimo Pigliucci and Maarten Boudry. This post addresses the first chapter, The Demarcation Problem, A (Belated) Response to Laudan, Mr. Pigliucci’s (non-editorial) contribution (the entirety of which seems to be available via the ‘look inside’ feature on Amazon, for what it’s worth).

All of this will make more sense if you have read the original Laudan essay already. If you can’t find a copy of that, you could read my earlier post. And if you don’t feel like bothering with that, here’s an extremely concise summary: How and why we label things science, non-science, or pseudo-science isn’t philosophically interesting, whereas it is philosophically interesting to (try to) understand how, when, and why claims about the world are epistemically warranted and how, when, and why this or that methodology licenses such claims.

Of course, it’s more complicated than that, but this is the nub. The labeling problem might be semantically or sociologically interesting, but it doesn’t have much, if any, bearing on whether or not certain claims about the world are belief-worthy.

To elaborate a bit more on Laudan’s position, he outlines three metaphilosophical points that constrain any putative demarcation criterion. First, it should accurately label paradigmatic cases of science, non-science, or pseudo-science by virtue of the epistemic and/or methodological features that science has and that its complement does not, and it should be precise enough so that this labeling can actually be carried out. Second, it should supply necessary and sufficient conditions for appropriate application of such labels. Third, because it will have potentially important social and political consequences, it should be especially compelling.

Pigliucci’s essay starts with a nice discussion of Popper’s take on demarcation, noting its relation to the problem of induction as well as some of the well-known problems with it. He follows the section on Popper with a discussion of Laudan’s recounting of the history of demarcation and a discussion of Laudan’s metaphilosophical interlude. He finishes with a sketch of how a new demarcation project might proceed.

The whole essay, though, simply assumes that the labeling problem is important (and that it’s closely related to problems that are more obviously of interest in philosophy of science; more below). Because Pigliucci assumes this, he doesn’t go about making a case for it, and so he doesn’t really engage with the main thrust of Laudan’s essay.

In addition, everything kind of goes off the rails after the introductory discussion of Popper. Throughout his discussion of Laudan’s essay, Pigliucci conflates issues that should be kept distinct, presents ideas that are consistent with (or even identical to) Laudan’s position as arguments against him, and misreads Laudan in ways that are, at best, lazy, and, at worst, willfully negligent.

And as good as it is, even the section on Popper has a humdinger of specious reasoning. Pigliucci writes:

Regardless of whether one agrees with Popper’s analysis of demarcation, there is something profoundly right about the contrasts he sets up between relativity theory and psychoanalysis or Marxist history: anyone who has had even a passing acquaintance with both science and pseudoscience cannot but be compelled to recognize the same clear difference that struck Popper as obvious. I maintain in this essay that, as long as we agree that there is indeed a recognizable difference between, say, evolutionary biology on the one hand and creationism on the other, then we must also agree that there are demarcation criteria – however elusive they may be at first glance.

It seems to me that the most obvious recognizable difference between evolutionary biology and creationism is that the former has mountains of evidence supporting its claims while the latter has none (and, in fact, has mountains of evidence against it). That is, these two examples are recognizably different with respect to their epistemic warrant, whether we label them both science or not. I’ll even go one better and say that it is only by placing evolutionary biology and creationism on the same playing field that we know that one is well-supported and the other is not.

Aside from this, as noted above, the section on Popper is pretty good. On the other hand, the section on Laudan’s history of demarcation is kind of a mess. Pigliucci implies that Laudan’s conclusion that the demarcation project has failed is inconsistent with the idea that philosophy makes progress. But, of course, Laudan doesn’t have a problem with progress in philosophy; in The Demise of the Demarcation Problem, he writes that “cognitive progress is not unique to the ‘sciences.’ Many disciplines (e.g., literary criticism, military strategy, and perhaps even philosophy) can claim to know more about their respective domains than they did 50 or 100 years ago.” In fact, I would guess that Laudan considers his essay at least a small contribution to the progress of philosophy.

A few pages later, Pigliucci laments that Laudan “reads this history [of Mill and Whewell’s treatment of induction] in an entirely negative fashion” and complains that these philosophers’ works “are milestones in our understanding of inductive reasoning and the workings of science, and to dismiss them as “ambiguous” and “embarrassing” is both presumptuous and a disservice to philosophy as well as to science.” Of course, Laudan’s point in this essay is that, with respect to demarcation, such negativity is justified. In other works, Laudan has quite a lot to say about the role of induction in the history and philosophy of science, but, somehow, Pigliucci forgets the scope of the essay in question and isn’t aware of Laudan’s other treatments of this history.

But misreading Laudan as being presumptuous, overly negative, and inconsistent with philosophical progress are fairly minor problems.

As mentioned above, Pigliucci conflates various issues that should be kept distinct. He conflates probability and reliability of scientific hypotheses (p. 14), quoting Laudan’s claim that “several nineteenth century philosophers of science” responded to fallibilism “by suggesting that scientific opinions were more probable or more reliable than non-scientific ones,” and then snarkily noting that “surely Laudan is not arguing that scientific “opinion” is not more probable than “mere” opinion. If he were, we should count him amongst postmodern epistemic relativists, a company that I am quite sure he would eschew.”

Of course, one can coherently and reasonably reject the philosophical position that scientific ideas are evaluated in terms of their probability without immediately becoming an epistemic relativist; see, e.g., Deborah Mayo. Or see the other option that Pigliucci himself included in his quote from Laudan, namely reliability.

Pigliucci later conflates theory comparison and demarcation (p. 15) as well as theories and the proponents of theories (p. 16).

Ultimately, the only real substance of Pigliucci’s argument against Laudan boils down to the idea that necessary and sufficient conditions for demarcation are outdated, since sciences are related by family resemblances, and science is a “cluster concept.”

Given that Laudan describes science as having substantial “epistemic heterogeneity,” I don’t imagine he would take issue with the application of family resemblances to science as a category. But whereas Laudan takes this as an indication that any (epistemic or methodological) demarcation project is futile, Pigliucci wants it to be the basis of demarcation.

This idea as applied to science is illustrated in Figures 1.1 and 1.3 in the Pigliucci essay (1.2 illustrates ‘games’ and family resemblance). Figure 1.1 is presented in relation to Laudan’s first metaphilosophical point, but Pigliucci adds a lot more structure than is implied by Laudan’s ‘paradigmatic cases’, and neither the figure nor the corresponding text does much in the way of stating any key epistemic or methodological features of science or of providing a precise set of demarcation criteria:

Figure 1.3 provides more substance, but in so doing, it also nicely (if implicitly) illustrates the distinction between the philosophically uninteresting labeling problem and the sorts of issues that are at the heart of philosophy of science:

It seems to me that the real work of philosophy of science (and the kind of thing explicitly endorsed by Laudan in his essay) consists of carefully defining things like ‘theoretical understanding’ and ‘empirical knowledge’ and then figuring out how, and to what degree, different theories provide these. If we can define this kind of space and then precisely locate fields of inquiry in it, what does it matter if we call this field “science” but not that one?

To reiterate the point made above, it might be semantically or sociologically interesting to figure out how (people’s intuitions about) these labels work, but it’s superfluous to the substance of philosophy of science. Pigliucci assumes that the labeling problem is both interesting and closely tied to issues like theoretical understanding and empirical knowledge, but he doesn’t make any kind of case for it. If anything, his invocation of fuzzy logic and fuzzy set theory militates against the importance of the labeling problem, since if you’re arguing for the utility of gradient membership in the set science, you’ve pretty much given up on analogous discrete labels.

Pigliucci ends his essay with “reasonable answers to Laudan’s three “metaphilosophical” questions” (Pigliucci uses scare quotes around “metaphilosophical” throughout, claiming not to understand why the “meta” prefix is necessary):

(1) What conditions of adequacy should a proposed demarcation criterion satisfy?

A viable demarcation criterion should recover much (though not necessarily all) of the intuitive classification of sciences and pseudosciences generally accepted by practicing scientists and many philosophers of science, as illustrated in figure 1.1.

(2) Is the criterion under consideration offering necessary or sufficient conditions, or both, for scientific status?

Demarcation should not be attempted on the basis of a small set of individually necessary and jointly sufficient conditions because “science” and “pseudoscience” are inherently Wittgensteinian family resemblance concepts (fig. 1.2). A better approach is to understand them via a multidimensional continuous classification based on degrees of theoretical soundness and empirical support (fig. 1.3), an approach that, in principle, can be made rigorous by the use of fuzzy logic and similar instruments.

(3) What actions or judgments are implied by the claim that a certain belief or activity is “scientific” or “unscientific”?

Philosophers ought to get into the political and social fray raised by discussions about the value (or lack thereof) of both science and pseudoscience. This is what renders philosophy of science not just an (interesting) intellectual exercise, but a vital contribution to critical thinking and evaluative judgment in the broader society.

Pigliucci’s answer to (1) is just a subset of Laudan’s answer to (1).

Pigliucci’s answer to (2) conflates the labeling problem with philosophically interesting problems, as discussed above, and all but gives the demarcation game away. Recall that Laudan’s motivation for requiring necessity and sufficiency in a demarcation criterion is that necessity alone does not allow us to label something scientific and sufficiency alone does not allow us to label something non-scientific. Whatever “multidimensional continuous classification” is, if it’s not providing necessary and sufficient conditions, it’s going to make classification errors. This isn’t fatal to a classification scheme, of course (see, e.g., detection theory), but Pigliucci doesn’t seem to be thinking of demarcation as noisy classification; there’s no mention of classification error or how it might be minimized, for example. And it bears noting that, for all his talk of fuzzy boundaries and gradient set membership and the like, Pigliucci seems to be ready, able, and willing to definitively classify different fields as established science, proto-science, soft science, or pseudo-science – his Figures 1.1 and 1.3 look to me to have rather sharp boundaries.

Pigliucci’s answer to (3) is pretty much exactly Laudan’s answer to (3). Bizarrely, Pigliucci writes (on p. 20) that “I also markedly disagree with Laudan in answer to his question 3,” and then he quotes Laudan saying, in part:

Philosophers should not shirk from the formulation of a demarcation criterion merely because it has these judgmental implications associated with it. Quite the reverse, philosophy at its best should tell us what is reasonable to believe and what is not. But the value-loaded character of the term “science” (and its cognates) in our culture should make us realize that the labeling of a certain activity as “scientific” or “unscientific” has social and political ramifications which go well beyond the taxonomic task of sorting beliefs into two piles.

Pigliucci seems oddly confused about (3), stating one page later that “there simply is no way, nor should there be, for the philosopher to make arguments to the rest of the world concerning what is or is not reasonable to believe without not just having, but wanting political and social consequences.” Aside from the “wanting” bit, this is pretty much what he just quoted Laudan saying.

It’s maybe also worth noting, contra both Laudan and Pigliucci, that it’s not that difficult to think of cases in which science tells us something about what we should believe with essentially no attendant social or political consequences. To mention just the most prominent recent example, the big results from the LHC in 2012 tell us we should probably believe in (some version of) the Higgs boson, but for the vast, vast majority of humanity, this work has exactly no substantive social or political consequences at all.

To sum up, Chapter 1 of TPoP does little, if anything, to establish that Laudan was wrong in declaring the demarcation problem dead. I’ll come back to some of the subsequent chapters in the near future.

Posted in philosophy of science | 3 Comments

Where are the numbers?

The unfortunate death of Philip Seymour Hoffman seems to have been a catalyst for lots of panicky articles about the heroin epidemic, and from the (admittedly small) sample of articles I’ve seen, calling it an “epidemic” is absurd hyperbole.

I heard a report on NPR last night with statements like “This is not the first time heroin use has skyrocketed in the United States,” descriptions of heroin “flooding” across the border from Mexico, and declarations of “stark” statistics.

In the NPR piece, there’s just one number you can call a statistic, and it’s presented without enough context to allow it to support the overblown claims:

“If you look at just the raw statistics,” he says, “over the last four or five years, heroin deaths went up 45 percent.”

Of course, you could have a 45% increase if there were 100 deaths before and 145 more recently. Or if there were 100,000,000 deaths before and 145,000,000 more recently. The percentage increase alone tells us exactly nothing about whether or not there is an epidemic of heroin (ab)use in the US.

This morning, I saw that Kottke links to two stories, neither of which does much better, numbers-wise. The PBS News Hour story says things like

GIL KERLIKOWSKE, U.S. Drug Policy Director: It is a serious problem. We are seeing an increase. I think the concern is always that data usually lacks one or two or sometimes three years, depending on what the survey or what the measure is. But I can tell you, in my travels across the country, and I spoke to the national narcotics officers today at lunch, there is no question we are seeing a resurgence of heroin.

The only numbers that have even a hint of a chance at supporting claims of a serious problem or a heroin resurgence come later, when Mr. Kerlikowske mentions 22 deaths in Western Pennsylvania from heroin laced with other opiates. Of course, his point here isn’t to support the claim of a resurgence, it’s to point out that black market drugs can vary wildly in their contents. In any case, since we don’t have any idea how many people normally die from heroin in Western Pennsylvania, 22 deaths is an utterly uninformative statistic.

The article about the Vermont Governor’s State of the State speech on drug abuse gives us the hardest, and most interpretable, numbers I’ve seen, and they don’t make a good case for claims of an epidemic or floods of heroin washing over the country:

Last year, he said, nearly twice as many people here died from heroin overdoses as the year before. Since 2000, Vermont has seen an increase of more than 770 percent in treatment for opiate addictions, up to 4,300 people in 2012.

The bit about twice as many deaths is, as noted above, not very informative. But the statement that 4,300 people received treatment for opiate addictions (opiates being a superset of heroin, please note) in 2012 finally gives us something to work with.

A quick google search tells us that the population of Vermont is 626,011, so that 770% increase up to 4,300 people in treatment makes a grand total of about 0.7% – that’s roughly seven tenths of one percent – of the population of Vermont.

In the article about Vermont’s Governor, there’s a link to another New York Times article, and it gives us similar numbers for a few more states:

Heroin killed 21 people in Maine last year, three times as many as in 2011, according to the state’s Office of Substance Abuse and Mental Health Services. New Hampshire recorded 40 deaths from heroin overdoses last year, up from just 7 a decade ago. In Vermont, the Health Department reported that 914 people were treated for heroin abuse last year, up from 654 the year before, an increase of almost 40 percent.

Maine has 1,329,000 people, so 0.002% – two thousandths of one percent – of the population died due to heroin in 2012. New Hampshire has 1,321,000 people, so 0.003% – three thousandths of one percent – of the population there overdosed on heroin and died in 2012. The Vermont numbers in this article show that the people receiving treatment for heroin specifically make up less than 25% of the people receiving treatment for opiates (0.14% of the population, if you do the math).
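The same kind of check works for the numbers in this second article. This is just a sketch using the counts and populations quoted above; the per-100,000 rate is an extra way of expressing the same fractions, not something either article reports, and I’m assuming the 914 heroin treatment figure can be set against the 4,300 opiate treatment figure even though they come from different articles.

```python
# (heroin deaths, state population) as quoted above.
heroin_deaths = {
    "Maine": (21, 1_329_000),
    "New Hampshire": (40, 1_321_000),
}

# Fraction of each state's population that died from heroin.
rates = {
    state: deaths / population
    for state, (deaths, population) in heroin_deaths.items()
}
for state, fraction in rates.items():
    print(f"{state}: {fraction:.4%} of the population, "
          f"or {fraction * 100_000:.1f} per 100,000")

# Vermont treatment figures from the two articles.
vermont_population = 626_011
treated_opiates = 4_300
treated_heroin = 914

heroin_share_of_treatment = treated_heroin / treated_opiates
heroin_share_of_population = treated_heroin / vermont_population
print(f"heroin share of opiate treatment: {heroin_share_of_treatment:.1%}")
print(f"heroin treatment as share of population: {heroin_share_of_population:.2%}")
```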

This is not an epidemic. It’s tragic for the people involved, of course, and I would very much prefer a world in which no one died of heroin overdoses. I can’t imagine the pain of being addicted to heroin, or of having a loved one struggle with or lose their life to such an addiction.

But simply saying it’s an epidemic doesn’t make it true. Nor should it convince anyone that prohibition is the solution to the problem.

Update: Somehow, I had missed this other relevant post by Jacob Sullum, which makes pretty much the same point I made, with similar numbers from other unreliable sources.

Posted in statistical description | 1 Comment

A partially problematic paragraph

I just read an interesting new paper (Hoekstra, et al., 2014) on how people – even those with substantial training in inferential statistics – consistently misinterpret confidence intervals (CIs). Reading this paper got me thinking about CIs in general, and I’ll probably return to this paper and topic again soon, but for now I want to highlight the antepenultimate paragraph of the paper (I’m irrationally happy that the paragraph I want to talk about really did just happen to be the third from the last).

In discussing the possibility of treating CIs as Bayesian credible intervals, Hoekstra, et al., write (with bracketed numbers inserted by me and corresponding to my subsequent notes):

First, treating frequentist and Bayesian intervals as interchangeable is ill-advised and leads to bad “Bayesian” thinking [1]. Consider, for instance, the frequentist logic of rejecting a parameter value if it is outside a frequentist CI. This is a valid frequentist procedure with well-defined error rates within a frequentist decision-theoretic framework [2]. However, some Bayesians have adopted the same logic (e.g., Kruschke, Aguinis, & Joo, 2012; Lindley, 1965): They reject a value as not credible if the value is outside a Bayesian credible interval [3]. There are two problems with this approach. First, it is not a valid Bayesian technique; it has no justification from Bayes’s theorem (Berger, 2006) [4]. Second, it relies on so-called “noninformative” priors, which are not valid prior distributions [5]. There are no valid Bayesian prior distributions that will yield correspondence with frequentist CIs (except in special cases), and thus inferences resulting from treating CIs as credible intervals must be incoherent [6]. Confusing a CI for a Bayesian interval leaves out a critically important part of Bayesian analysis—choosing a prior—and, as a result, leaves us with a non-Bayesian technique that researchers believe is Bayesian [7].


[1] The immediately preceding discussion isn’t about treating frequentist and Bayesian intervals as interchangeable, it’s about, as mentioned above, treating CIs as Bayesian intervals. Even if you were to treat all frequentist CIs as Bayesian intervals, this still leaves open the possibility that some Bayesian intervals are not frequentist CIs. Not a big error, granted, but this is a basic logical issue: if A implies B, this does not imply that A and B are equivalent.

[2] It’s kind of nice to see hardcore Bayesians give a straightforward description of null hypothesis significance testing that grants that the approach has at least some positive properties (e.g., “well-defined error rates”). I mean, there’s a parenthetical later that describes NHST as “pernicious,” but still.

[3] If this isn’t kosher, maybe a different name would be better for the Bayesian analogs to CIs? Less snarkily, I’m sincerely curious how the authors propose discrete decisions be made based on Bayesian data analysis. What function does a credible interval have if it doesn’t tell us something about parameter (or predicted data) values we should consider inconsistent with our data and model? Even if we treat a credible interval’s measure of uncertainty with regard to a parameter as explicit (and primary), we can’t escape the fact that we’re implicitly (and perhaps secondarily) specifying a set of incredi… er, not credible values, too.

[4] I don’t know what all Berger says in the cited 2006 paper, but I find it hard to believe that even the most hardcore Bayesians do all and only that which is justified by Bayes’ theorem.

[5] This is nonsense. Or at least needs to be spelled out in more detail and justified. If Hoekstra, et al., are defining “noninformative” priors as invalid, then, sure, but then it’s just a tautology. If “noninformative” means something else, then a statement this strong needs to be given some support rather than just asserted. I suppose it’s possible they’re thinking along the lines of Andrew Gelman on this issue, but, for various reasons, I would guess not (e.g., the less-hardcore-Bayesian-than-some position he takes in this paper).

[6] Speaking of incoherent, in the preceding paragraph, the authors write that “frequentist CIs can be numerically identical to Bayesian credible intervals, if particular prior distributions are used.” These must be very special cases indeed if they logically imply incoherence. As with the bit about “noninformative” priors, this kind of assertion needs to be backed up, if only with citations, preferably with a brief discussion in situ. The paper is only eight pages long, after all.

[7] If we can assume, for a moment, that we’re dealing with a coherent case in which a CI and a credible interval are identical, then it’s not clear to me why treating the CI as equivalent to the credible interval is so bad. Why is the act of choosing the prior so important, if, in at least some cases, you can arrive at the same conclusions (with respect to whatever inferences you can or cannot draw from estimated intervals – see [3] above) without having explicitly chosen a prior? Or, from a different angle, if the two are equivalent, and it’s only certain priors that induce this equivalence, haven’t you implicitly chosen a prior by constructing the CI in the first place?
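To make the “coherent case” in [6] and [7] concrete, here’s a minimal sketch (mine, not anything from Hoekstra, et al.) of the standard textbook example: estimating a normal mean when the standard deviation is known. With a conjugate normal prior that is very nearly flat, the 95% credible interval matches the usual 95% CI to many decimal places; with an informative prior, it doesn’t. The data, the known sigma, and the prior settings are all made up for illustration.

```python
import random
from statistics import NormalDist, fmean

def frequentist_ci(data, sigma, level=0.95):
    """Usual z-based CI for a normal mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    xbar = fmean(data)
    half_width = z * sigma / len(data) ** 0.5
    return xbar - half_width, xbar + half_width

def credible_interval(data, sigma, prior_mean, prior_sd, level=0.95):
    """Central credible interval under a conjugate normal prior on the mean."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(data) / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * fmean(data))
    posterior = NormalDist(post_mean, post_var ** 0.5)
    return (posterior.inv_cdf(0.5 - level / 2),
            posterior.inv_cdf(0.5 + level / 2))

# Simulated data: 25 draws from a normal with mean 1 and known sd 2.
rng = random.Random(0)
sigma = 2.0
data = [rng.gauss(1.0, sigma) for _ in range(25)]

print("frequentist CI:         ", frequentist_ci(data, sigma))
print("nearly flat prior:      ", credible_interval(data, sigma, 0.0, 1e6))
print("informative prior (0.5):", credible_interval(data, sigma, 0.0, 0.5))
```

The point of the sketch is only that the two intervals can coincide numerically; whether the nearly flat prior counts as “valid” is exactly what’s at issue in [5].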

Okay, so the antepenultimate paragraph in this paper is pretty awful. The paper as a whole is interesting, though, so I’ll try to come back to it (and some general thoughts about CIs as a statistical tool) soon.

Posted in SCIENCE!, statistical modeling | Comments Off

Resuscitation of the Demarcation Problem?

An edited volume on (and called) The Philosophy of Pseudoscience (TPoP) came out last year. I would be hard pressed to think of a topic better suited to get me to pony up a few bucks and then spend time reading and thinking about something not directly related to my work.

The basic idea motivating the book is that Larry Laudan was wrong (or at least premature) in announcing The Demise of the Demarcation Problem. The Demarcation Problem is, in case you don’t know, the problem of differentiating between science and non-science or between science and pseudoscience (and maybe also between non-science and pseudoscience). As it happens, I’ve blogged about this Laudan essay before, though in a way that isn’t directly addressed at this new book, so in this post, I’ll review the basics of Laudan’s argument. I’ll follow up in later posts with reviews of the first few chapters of TPoP.

Spoiler alert: I remain unconvinced by the arguments in TPoP that Laudan is wrong. In order to see why (I think) they’re wrong, it will be useful to make reference to Laudan’s original essay. I’ve got it as a hard copy of a book chapter, and I can’t find it in freely-available digital format (here it is behind a rather pricey paywall, and here is most, but not all, of it on google books), so I will cover the main points made in a highly abbreviated format (and organized using a subset of Laudan’s section headers).

The Old Demarcationist Tradition

Ancient concerns with knowledge/reality vs opinion/appearance led to Aristotle arguing that scientific knowledge is certain, deals in causes, and follows logically from first principles, which are themselves derived directly from sensory input. Thus, Laudan writes that, according to Aristotle, “science is distinguished from opinion and superstition by the certainty of its principles; it is marked off from the crafts by its comprehension of first causes.” (p. 212)

In the 17th century, the latter criterion fell out of favor. Many of the people we think of as the founding fathers of modern science (e.g., Galileo, Newton) explicitly repudiated the idea that a science necessarily addresses causes. By the 19th century, certainty was discarded, as well: “the unambiguous implication of fallibilism is that there is no difference between knowledge and opinion: within a fallibilist framework, scientific belief turns out to be just a species of the genus opinion.” (p. 213)

With certainty and causation no longer able to demarcate science from non-science, folks turned to methodology to (try to) do the job. In order for methodology to do the job, philosophers had to establish that there is a single, unified scientific method and that this method is epistemically better than other, non-scientific methods. Attempts to establish the unity and epistemic superiority of method led to disagreement about what the one, true method is and ambiguity or outright falsity with respect to whether or not practicing scientists actually employ any particular proposed method.

A Metaphilosophical Interlude

Laudan says that we should ask three questions (quoted verbatim from the essay):

  1. What conditions of adequacy should a proposed demarcation criterion satisfy?
  2. Is the criterion under consideration offering necessary or sufficient conditions, or both, for scientific status?
  3. What actions or judgments are implied by the claim that a certain belief or activity is ‘scientific’ or ‘nonscientific’?

With respect to #1, Laudan argues that a demarcation criterion should (a) accord with common usage of the label ‘science’ – it should capture the paradigmatic cases of science and non-science, regardless of how it deals with more difficult, borderline cases; (b) identify the epistemic and/or methodological properties that science has and that non-science does not; and (c) be precise enough so that we can, in fact, use it to demarcate science and non-science.

With respect to #2, Laudan argues that a demarcation criterion must provide both necessary and sufficient conditions. Necessary conditions alone will only allow us to say if something isn’t science, but do not allow us to say if something is, while sufficient conditions alone only allow us to say if something is science, but do not allow us to determine what is not scientific.

With respect to #3, Laudan points out that, because it will have numerous social, political, and, in general, practical implications, “any demarcation criterion we intend to take seriously should be especially compelling.”

The New Demarcationist Tradition

More recent efforts to develop demarcation criteria have focused on what Laudan calls potential epistemic scrutability rather than actual epistemic warrant. Verificationist, falsificationist, and various approaches based on testability, well-testedness, the production of surprising predictions, and so forth, all serve very poorly as demarcation criteria for various reasons, including their frequent failure to serve as both necessary and sufficient conditions for demarcation. Plenty of obviously scientific claims aren’t verifiable, and plenty of obviously non-scientific claims are; plenty of obviously scientific claims aren’t falsifiable, and plenty of obviously non-scientific claims are. And so on.

It’s worth quoting from Laudan’s conclusion at length (emphasis in the original):

Some scientific theories are well-tested; some are not. Some branches of science are presently showing high rates of growth; others are not. Some scientific theories have made a host of successful predictions of surprising phenomena; some have made few if any such predictions. Some scientific hypotheses are ad hoc; others are not. Some have achieved a ‘consilience of inductions’; others have not. (Similar remarks could be made about several nonscientific theories and disciplines.) The evident epistemic heterogeneity of the activities and beliefs customarily regarded as scientific should alert us to the probable futility of seeking an epistemic version of a demarcation criterion. Where, even after detailed analysis, there appear to be no epistemic invariants, one is well advised not to take their existence for granted. But to say as much is in effect to say that the problem of demarcation… is spurious, for that problem presupposes the existence of just such invariants.

Laudan ends the essay with a brief discussion of the sorts of things that philosophy of science should be focused on, given his dismissal of the demarcation problem. The last couple pages are available in google books, but I’ll quote him here again:

It remains as important as it ever was to ask questions like: When is a claim well confirmed? When can we regard a theory as well tested? What characterizes cognitive progress?

He is, at least in part, making a semantic argument that the class of things appropriately labeled ‘science’ (and its counterparts in the classes of things labeled ‘non-science’ and ‘pseudoscience’) just isn’t particularly (philosophically) interesting. One last quote, from the concluding paragraph:

…we have managed to conflate two quite distinct questions: What makes a belief well founded (or heuristically fertile)? And what makes a belief scientific? The first set of questions is philosophically interesting and possibly even tractable; the second question is both uninteresting and, judging by its checkered past, intractable.

It’s interesting to note, given his conclusions here, that Laudan has more recently been focusing on legal epistemology, which is to say that he’s been pursuing his ‘first set of questions’ in a non-scientific area. It was also interesting to note that the essay immediately following the Demise essay in the book I have (a critique of a judicial decision about teaching creationism in Arkansas schools) kind of foreshadows this move. But I digress.

Next up: summaries and discussion of the first few essays in TPoP.

Posted in philosophy of science | Comments Off

More on (moron!) bad pop science

Blogging clearly isn’t my priority lately (I have some great posts planned, though, I swear!), but nothing beats dumb pop science writing for generating a quick post. I’m sure I’ve missed some since my last post on the topic a month ago, but I just saw a new example (via the science subreddit) that I can’t resist responding to.

From the pop science article:

Although more research is necessary, the results suggest that spirituality or religion may protect against major depression by thickening the brain cortex and counteracting the cortical thinning that would normally occur with major depression.

And from the Results and Conclusions and Relevance sections of the publicly available journal page for the publication:

… We note that these findings are correlational and therefore do not prove a causal association between [religious/spiritual] importance and cortical thickness….

A thicker cortex associated with a high importance of religion or spirituality may confer resilience to the development of depressive illness in individuals at high familial risk for major depression, possibly by expanding a cortical reserve that counters to some extent the vulnerability that cortical thinning poses for developing familial depressive illness.

It’s a subtle issue, I know, but there is a logical distinction between, on the one hand, spirituality protecting against depression by making brains thick and, on the other, thick brains protecting against depression and simultaneously being statistically associated with (self-reported importance of) spirituality.

Posted in SCIENCE! | Comments Off

On bad pop science

I just love this kind of writing about abstruse, abstract physics for a lay audience:

A team of physicists has provided some of the clearest evidence yet that our Universe could be just one big projection.

In 1997, theoretical physicist Juan Maldacena proposed that an audacious model of the Universe in which gravity arises from infinitesimally thin, vibrating strings could be reinterpreted in terms of well-established physics. The mathematically intricate world of strings, which exist in nine dimensions of space plus one of time, would be merely a hologram: the real action would play out in a simpler, flatter cosmos where there is no gravity.


In two papers posted on the arXiv repository, Yoshifumi Hyakutake of Ibaraki University in Japan and his colleagues now provide, if not an actual proof, at least compelling evidence that Maldacena’s conjecture is true.


Neither of the model universes explored by the Japanese team resembles our own, Maldacena notes.


Posted in SCIENCE! | Comments Off

A quick pfisking

The first link in this week’s “This week in stats” (by Matt Asher) post leads to a fairly silly rant (by a Wesley) about p-values. I feel like it deserves a quick (but partial, because I don’t disagree with everything) fisking, in addition to reiterating the point made by Mr. Asher that, whatever problems p-values have, no solutions are on offer here (though I know at least a dozen or so people who would argue against his claim that no one has come up with a satisfactory substitute to p-values). Anyway:

Wesley: P-values … can also be used as a data reduction tool but ultimately it reduces the world into a binary system: yes/no, accept/reject.

Noah: Given that p-values are but one part of a statistical analysis in the frequentist hypothesis testing tradition, I have a hard time seeing why this is so problematic. A calculated test statistic either exceeds a criterion or it doesn’t. This doesn’t tell the whole story of a data set, but it’s not meant to.

W: Below is a simple graph that shows how p-values don’t tell the whole story. Sometimes, data is reduced so much that solid decisions are difficult to make. The graph on the left shows a situation where there are identical p-values but very different effects.

N: I don’t understand what Wesley means when he links data reduction and decision-making difficulty, so I’ll leave that one alone. I’ll also not go into depth about why I think these graphs kind of stink (to mention maybe the worst thing about the graphs: they’re mostly just white space, with the actual numbers of interest huddled up against the [unnecessary] box outline).

Anyway, it’s not at all clear how the two “effects” in the left panel could be producing the same p-value (and the code from the post isn’t working when I try to run it – the variable effect.match is empty, since the simulation with the minimum CI difference isn’t in the set of p-values that match, i.e., logical indexing fails to produce a usable index – so I can’t reproduce the plot). Contra my intuition when first seeing the graph, it is not illustrating a paired t-test, but, rather, two single-sample t-tests. I gather that each of these red dots is illustrating a mean, and each vertical line is illustrating an associated confidence interval, and that the means are being compared to zero. Given that one (CI) line covers zero and the other does not, the p-values shouldn’t be the same.
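That last point is just the usual duality between a two-sided test and the matching confidence interval: the 95% CI excludes zero exactly when p < 0.05. Since I can’t run Wesley’s code, here’s a minimal sketch of my own. It swaps in a z-test with a known standard error for the single-sample t-tests in the post (so it runs with only the standard library), and the means and standard errors are made up.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_summary(mean, std_error, alpha=0.05):
    """Two-sided z-test of mean == 0, plus the matching (1 - alpha) CI."""
    z = mean / std_error
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    half_width = STD_NORMAL.inv_cdf(1 - alpha / 2) * std_error
    ci = (mean - half_width, mean + half_width)
    covers_zero = ci[0] <= 0 <= ci[1]
    return p_value, ci, covers_zero

# Same mean, two different (hypothetical) standard errors.
for mean, std_error in [(1.0, 0.4), (1.0, 0.6)]:
    p_value, ci, covers_zero = z_test_summary(mean, std_error)
    print(f"mean={mean}, se={std_error}: p={p_value:.4f}, "
          f"CI=({ci[0]:.3f}, {ci[1]:.3f}), covers zero: {covers_zero}")
```

If one interval covers zero and the other doesn’t, the two p-values have to fall on opposite sides of 0.05, so they can’t be identical.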

W: The graph on the right shows where the p-values are very different, and one is quite low, but the effects are the same.

N: I disagree that the effects are the same. Sure, the means are the same (by design), but the data illustrated on the right is much more variable than the data illustrated on the left.

W: P-values and confidence intervals have quite a bit in common and when interpreted incorrectly can be misleading.

N: I agree, but this is a pretty anodyne statement. Now back to the fisking.

W: Simply put a p-value is the probability of the observed data, or more extreme data, given the null hypothesis is true.

N: Close, but nope. A p-value is the probability of an observed or more extreme test statistic, not the data. It’s an important distinction, and it’s related to the conflation of “effects” with “means” and the different p-values for identical means with different variability around the means in the figure above.
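To see why the distinction matters, here’s a small sketch with two made-up samples that share a mean but differ in spread. The one-sample t statistic (testing against zero) divides the mean by its estimated standard error, so identical means can give very different test statistics, and hence very different p-values. The samples and the function name are mine, purely for illustration.

```python
from math import sqrt
from statistics import fmean, stdev

def one_sample_t(sample, null_mean=0.0):
    """t statistic: (sample mean - null mean) / (sample sd / sqrt(n))."""
    n = len(sample)
    return (fmean(sample) - null_mean) / (stdev(sample) / sqrt(n))

# Two hypothetical samples with the same mean and different variability.
tight = [0.8, 1.0, 1.2, 0.9, 1.1]
noisy = [-1.0, 3.0, 0.0, 2.0, 1.0]

for name, sample in [("tight", tight), ("noisy", noisy)]:
    print(f"{name}: mean={fmean(sample):.2f}, sd={stdev(sample):.3f}, "
          f"t={one_sample_t(sample):.2f}")
```

Both samples have a mean of 1, but the tighter one yields a t statistic about ten times larger, which is the sense in which the “effects” in the right-hand panel are not the same.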

So, anyway, none of this is meant to imply that p-values don’t have limitations. Of course they do. And understanding these limitations is worthwhile. But posts like Wesley’s don’t, in my opinion, do much to foster such understanding.

Posted in statistical description, statistical graphics, statistical modeling | Comments Off

Something old, something new, two somethings under review

I’ve recently revised a manuscript that had been posted on my CV page for a while. It’s a rather technical paper on optimal response selection and model mimicry in Gaussian general recognition theory. In its previous state, it was all and only technical information on these topics. Now, thanks to the urging (and encouraging) of my co-author, it has an introduction and conclusion that actually relate the technical information to a (slightly) wider body of work. It’s here, if you’re interested.

I’ve also recently revised a manuscript that, until today, had never been exposed to public scrutiny. It’s from work I did while at CASL on individual differences in non-native perceptual abilities and how these abilities relate to second language learning of (slightly) higher-level linguistic structure. It feels very nice to finally have it written up and ready for public consumption. It’s here, if you’re interested.

Both papers are under review at, I hope, suitable journals. I mean, the first is under review at pretty much the only journal I can even begin to imagine it being published in. I’m reasonably confident that it will, eventually, be published there. On the other hand, I could see the second fitting in okay in a number of different journals. The one I submitted it to is fairly high-profile (and high impact factor!), though, so it would be nice to get it published there.

That’s all for now.

Posted in SCIENCE! | Comments Off