A critical assessment of GRT wIND

I’ve blogged about GRT (General Recognition Theory) at least a few times before (e.g., here, here, here). The third of those links goes to an introductory GRT post (#1 of an unknown number of planned posts). Here’s the super-abbreviated version: GRT is a multidimensional extension of signal detection theory (SDT). The simplest GRT model consists of four bivariate Gaussian densities (perceptual distributions) and two decision bounds. These are used to model identification confusion patterns for stimuli consisting of a factorial combination of two levels on each of two dimensions (e.g., square or octagon on a shape dimension, red or purple on a hue dimension).

Robin Thomas and I have done some theoretical work with this model recently. More specifically, we wrote a paper in which we described how failures of decisional separability (i.e., decision bounds that are not parallel to the coordinate axes) are not identifiable in the 2 \times 2 model. The basic issue is that, with linear bounds, affine transformations of the modeled perceptual space can always map a model with failure of DS onto a model in which DS holds. (This is illustrated in the second ‘here’ link above.)

More recently, Soto, et al. developed a multilevel GRT model (called GRT wIND – GRT with Individual Differences) and claimed to have solved the identifiability problem that Robin and I detailed (pdf). Still more recently, Soto and Ashby wrote a second paper with GRT wIND (pdf), which I reviewed (twice) for the journal Cognition. Yesterday, I got an email (as a member of the Society for Mathematical Psychology) announcing an R package for fitting GRT wIND (and carrying out various other GRT-related analyses).

The paper I reviewed has since been accepted for publication, and I confess some measure of frustration with the fact that Soto and Ashby (and the editor who handled the paper for Cognition) seem to have completely ignored the comments and suggestions from my second (and signed) review.

One key issue I raised was a concern that the GRT wIND model is over-parameterized, i.e., that given a data set, the parameters are not uniquely identifiable. In part motivated by my frustration, and in (larger) part because I actually want to know if my concern is valid or not, and in part because they made the model fitting software available, I decided to try to validate (or not) my concern about over-parameterization in GRT wIND.

Some GRT wIND background first: The group level model consists of four bivariate Gaussian densities. Each has a mean vector and a covariance matrix. The individual subject level for subject k consists of four bivariate Gaussian densities with the same means and transformed covariance matrices as well as two decision bounds, the latter possibly not parallel with the coordinate axes.

If the group level covariance matrix corresponding to stimulus A_i,B_j (i.e., the i^{th} level of the A dimension and the j^{th} level of the B dimension) is given by:

    \begin{equation*}\mathbf{\Sigma}_{A_i,B_j} = \left[\begin{array}{cc}\sigma_{xx}&\sigma_{xy}\\ \sigma_{yx}&\sigma_{yy}\end{array} \right]\end{equation*}

then the corresponding individual level matrix for subject k is given by:

    \begin{equation*}\mathbf{\Sigma}_{k,A_i,B_j} = \left[\begin{array}{cc}\frac{\sigma_{xx}}{\kappa_k \lambda_k}&\frac{\sigma_{xy}}{\kappa_k\sqrt{\lambda_k(1-\lambda_k)}}\\ \frac{\sigma_{yx}}{\kappa_k\sqrt{\lambda_k(1-\lambda_k)}}&\frac{\sigma_{yy}}{\kappa_k (1-\lambda_k)}\end{array} \right]\end{equation*}

The two additional parameters - \kappa and \lambda - scale the subject’s space overall (\kappa) and scale the two dimensions relative to one another (\lambda). As \kappa increases, the marginal variances decrease, thereby increasing the salience of the stimuli and reducing the number of (predicted) errors. As \lambda increases, the marginal variance on the x axis decreases and the marginal variance on the y axis increases, shifting the relative salience of the two dimensions. As Soto, et al., note, equivalent changes in response probabilities could be obtained by leaving the covariance matrices alone and shifting the means of the group level densities, instead.

As a brief aside, it’s interesting (to me) to note that you can represent the individual level matrix as a linearly-transformed group level matrix:

    \begin{eqnarray*}\mathbf{\Sigma}_{k,A_i,B_j} &=& \mathbf{T\Sigma}_{A_i,B_j}\mathbf{T}^t\\ &=&\left[\begin{array}{cc}\frac{1}{\sqrt{\kappa_k\lambda_k}}&0\\ 0 &\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\end{array} \right]\left[\begin{array}{cc}\sigma_{xx}&\sigma_{xy}\\ \sigma_{yx}&\sigma_{yy}\end{array} \right]\left[\begin{array}{cc}\frac{1}{\sqrt{\kappa_k\lambda_k}}&0\\ 0 &\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\end{array} \right]\end{eqnarray*}

The point here being that the Silbert & Thomas results concerning linear transformations and identifiability of decisional separability apply to the individual subject level models in GRT wIND. Soto, et al., say that the individual subjects share a perceptual space, but in reality they share some, but not all, properties of the group level perceptual space. In GRT wIND, expansion/contraction of the space and stretching/shrinking the group space along one or another coordinate axis are acceptable transformations, but rotations, shears, etc…, are not. So, GRT wIND solves the decisional separability identifiability issue if this distinction is valid for a given stimulus set and experimental subject group, but it doesn’t solve the problem in general.
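The aside above can be checked numerically. The following is a minimal sketch, not code from the GRT wIND package: the function names and the example values for the group level covariance matrix, \kappa, and \lambda are made up for illustration. It builds the individual level matrix once from the element-wise formula and once as \mathbf{T\Sigma T}^t, then compares the two.

```python
import numpy as np

def individual_cov(sigma, kappa, lam):
    """Individual-level covariance from the element-wise formula in the post."""
    (sxx, sxy), (syx, syy) = sigma
    off = kappa * np.sqrt(lam * (1.0 - lam))
    return np.array([[sxx / (kappa * lam), sxy / off],
                     [syx / off, syy / (kappa * (1.0 - lam))]])

def individual_cov_via_T(sigma, kappa, lam):
    """Same matrix, written as T @ sigma @ T.T with a diagonal T."""
    T = np.diag([1.0 / np.sqrt(kappa * lam),
                 1.0 / np.sqrt(kappa * (1.0 - lam))])
    return T @ sigma @ T.T

# Hypothetical example values (not taken from any fitted model).
sigma = np.array([[1.5, 0.4],
                  [0.4, 0.8]])
kappa, lam = 2.0, 0.3

elementwise = individual_cov(sigma, kappa, lam)
transformed = individual_cov_via_T(sigma, kappa, lam)
print(np.allclose(elementwise, transformed))
```

Because \mathbf{T} is diagonal, it can only stretch or shrink the two axes, which is the sense in which rotations and shears are excluded.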

But I digress.

A little more background about GRT: In some of the oldest papers on GRT, there are mentions of the fact that the means and marginal variances of the perceptual distributions are not both uniquely identifiable (this is closely related to the note above about scaling the variances vs the means in GRT wIND). I long assumed that this was true, but to the best of my knowledge, no one had proven it. I figured out that an affine transformation can be applied to each perceptual distribution (with arbitrary means and covariance matrices) in a GRT model to force it to have unit variances and appropriately shifted means (where by “appropriately” I mean that the shift in the means exactly balances the scaled variances, producing the same predicted response probabilities for a given set of decision bounds, assuming decisional separability holds; see above regarding the identifiability of decisional separability). More specifically, if you have decision criteria c_x and c_y, the following transformation gives you unit variances and appropriately shifted means:

    \begin{equation*} \mathbf{T} = \left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right], \qquad \Delta = \left[\begin{matrix}c_x -\frac{c_x}{\sqrt{\sigma_{xx}}}\\c_y - \frac{c_y}{\sqrt{\sigma_{yy}}}\end{matrix}\right] \end{equation*}

Application of this transformation produces a new covariance matrix \mathbf{R} = \mathbf{T}\Sigma\mathbf{T}^T and new mean vector \boldsymbol{\eta} = \mathbf{T}\boldsymbol{\mu} + \Delta. The transformed covariance matrix is the correlation matrix:

    \begin{eqnarray*} \mathbf{T}\Sigma\mathbf{T}^T &=& \left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\left[\begin{matrix}\sigma_{xx} & \sigma_{xy}\\ \sigma_{yx} & \sigma_{yy} \end{matrix} \right]\left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\\ &=&\left[\begin{matrix}\frac{\sigma_{xx}}{\sqrt{\sigma_{xx}\sigma_{xx}}} & \frac{\sigma_{xy}}{\sqrt{\sigma_{xx}\sigma_{yy}}}\\\frac{\sigma_{yx}}{\sqrt{\sigma_{xx}\sigma_{yy}}} & \frac{\sigma_{yy}}{\sqrt{\sigma_{yy}\sigma_{yy}}}\end{matrix}\right]\\ &=& \left[\begin{matrix}1 & \rho_{xy} \\ \rho_{yx} & 1\end{matrix}\right] \end{eqnarray*}

And the transformed mean vector is shifted by the signed distance (in standard deviation units) between the mean and the decision criterion (on each dimension):

    \begin{eqnarray*} \boldsymbol{\eta} &=& \mathbf{T}\boldsymbol{\mu} + \Delta\\ &=& \left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\left[\begin{matrix}\mu_x\\\mu_y\end{matrix}\right] + \left[\begin{matrix}c_x -\frac{c_x}{\sqrt{\sigma_{xx}}}\\c_y - \frac{c_y}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\\ &=& \left[\begin{matrix}c_x + \frac{\mu_x-c_x}{\sqrt{\sigma_{xx}}}\\c_y + \frac{\mu_y-c_y}{\sqrt{\sigma_{yy}}}\end{matrix}\right] \end{eqnarray*}

Note that this transformation is invertible. The upshot of this is that a standard 2 \times 2 GRT model can be scaled arbitrarily. Or, to repeat the point made above, the means and marginal variances are not both identifiable.
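Here is a small numerical sketch of that claim for a single perceptual distribution, assuming decisional separability (criteria parallel to the axes at c_x and c_y). The function name and the example mean, covariance, and criterion values are hypothetical. It applies the transformation and compares the probability mass below both criteria, plus the two marginal probabilities, before and after. Those three numbers determine all four response-region probabilities.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def unit_variance_transform(mu, sigma, c):
    """Apply eta = T mu + Delta and R = T Sigma T^T for criteria c = (c_x, c_y)."""
    mu, sigma, c = (np.asarray(v, dtype=float) for v in (mu, sigma, c))
    sd = np.sqrt(np.diag(sigma))
    T = np.diag(1.0 / sd)
    delta = c - c / sd
    return T @ mu + delta, T @ sigma @ T.T

# Hypothetical example values.
mu = np.array([1.0, -0.5])
sigma = np.array([[2.0, 0.6],
                  [0.6, 0.5]])
c = np.array([0.3, 0.2])

eta, R = unit_variance_transform(mu, sigma, c)

# Joint probability of falling below both criteria.
p_orig = multivariate_normal(mean=mu, cov=sigma).cdf(c)
p_new = multivariate_normal(mean=eta, cov=R).cdf(c)

# Marginal probabilities of falling below each criterion.
m_orig = norm.cdf(c, loc=mu, scale=np.sqrt(np.diag(sigma)))
m_new = norm.cdf(c, loc=eta, scale=1.0)

print(p_orig, p_new)
print(m_orig, m_new)
```

The match follows from the fact that the transformation preserves both the correlation and the signed distance, in standard deviation units, between each mean and its criterion.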

So, what does this all have to do with GRT wIND? My intuition was that this identifiability issue applies to GRT wIND, too. I didn’t feel like working through all the algebra to prove that this is the case, so I did the next best thing. Specifically, I simulated individual subject data using two GRT wIND models. In one, the group level perceptual means describe a square centered on the origin of the space with sides of length four, and the group level covariance matrices have randomly generated variances and correlations. In the second model, I applied the “unit-variance” transformation to the model (with the coordinate axes as the ‘decision criteria’).

I also randomly generated a set of 50 \kappa, \lambda, \phi, and \omega values, where \kappa and \lambda are the GRT wIND parameters described above, and \phi specifies the angle between c_y and the x axis, and \omega specifies the angle between c_x and c_y:

Linear failure of decisional separability

The following figure illustrates the mean (red), median (blue), and maximum (green) absolute error between the predicted identification confusion probabilities for the two models:

[Figure: grt_wind_rescale_error]

And here are the \phi and \omega values:

[Figure: grt_wind_noDS_angles]

And here are the \lambda and \kappa values:

[Figure: grt_wind_lam_kap]

And here is the R script that runs the simulations and generates these plots.

I’ve run the simulation a number of times, and, while sometimes the maximum errors are larger (~.15 or thereabouts), the two models consistently produce extremely similar, in many cases essentially identical, confusion matrices.

I take this as strong support for my intuition that GRT wIND is over-parameterized. A mathematical proof would be better, of course, but simulation results provide relevant evidence. Now, I’m not totally sure why there is any deviation between the models, though there is no correlation between any of the randomly generated parameters and the error. But because there is some error, the simulations don’t provide the kind of airtight case a proof would. Nonetheless, it is all but unimaginable to me that the true parameters of the two models could be recovered by fitting sets of confusion matrices that are this similar to one another.

As it happens, I would like to fit GRT wIND models to simulated data to test how accurately known parameters are recovered and to come at the problem from the opposite direction (i.e., instead of seeing if different parameter values can generate essentially identical data, see if different parameter values can provide essentially identical model fits to identical data). Alas, in the R package, the function grt_wind_fit crashes R, and the function grt_wind_fit_parallel returns an error. I’ve opened up an issue on the GitHub repository, so maybe I’ll try again when/if I can get the code to work on my computer(s). And, I suppose, depending on how some of my colleagues and collaborators react, I may also get to work on a more rigorous mathematical treatment of the matter.

Posted in SCIENCE!, statistical modeling | Comments Off on A critical assessment of GRT wIND

Vox lies with statistics

Okay, maybe lies is too strong a word, but Vox’s German Lopez sure does a good job illustrating numerical dishonesty in this piece on marijuana use in Colorado and Washington since recreational legalization.

The headline is “Marijuana use rises in states with legalization,” and here are the first two take-home bullet points:

  • Past-month marijuana use in Colorado and Washington state rose following the legalization of personal possession, according to new data from a federal survey.
  • Nationwide, Americans reported smaller percentage point increases in marijuana use than people in Colorado or Washington.

Okay, so that’s maybe interesting, but you can’t really interpret increases in Colorado and Washington without knowing something about what happened in other states, none of which legalized marijuana.

Here’s the data. If you bother to look at it, you’ll see that there were also statistically-significant, similar-magnitude increases in the District of Columbia, Georgia, Maine, Maryland, Michigan, Missouri, and New Hampshire.

And this points to a problem with the comparison between Colorado and Washington, on the one hand, and the nation as a whole, on the other. It seems a priori very likely that marijuana use varies fairly substantially across states, and the data confirm this. A single pair of numbers for the whole country obscures this variation.

Nothing Lopez writes in the article is technically wrong. It’s true that past-month use increased following legalization in Colorado and Washington. It’s true that these were the first states to legalize marijuana in 2012. It’s true that use increased less in the nation as a whole than in either state.

But it’s also true that use increased in a number of states without legalization.

Obviously, I don’t know why Lopez wrote this article the way he did. Whether or not it was written with the intention to deceive, it exploits the careless reader’s inclination to conflate correlation and causation.

The fact that the data is directly accessible means that if you care to look, you will quickly see how meaningless the article really is. But it also means that it’s very easy to take what is written at face value under the assumption that it would be ridiculous to link to data that undermines the whole point of the article.


Posted in SCIENCE!, statistical description | Comments Off on Vox lies with statistics

Yglesias makes an innumerate (non-)funny

Yesterday, Matt Yglesias tried to point out the absurdity of a Joe Klein column by turning the tables with the numbers Klein invoked, thereby mimicking and mocking Klein’s argument.

Yglesias quotes Klein:

Blacks represent 13% of the population but commit 50% of the murders; 90% of black victims are murdered by other blacks. The facts suggest that history is not enough to explain this social disaster.

Yglesias then links to the relevant FBI data and points out that:

Back in 2011, the most recent year for which data is available, a staggering 83 percent of white murder victims were killed by fellow Caucasians.

He uses this to (jokingly) make the exaggerated case that white-on-white murder is out of control, that white violence is a social disaster, and so on.

Now, I understand what he’s trying to do here, but, either because of innumeracy or willful negligence, his whole jokey approach fails miserably, at least as long as the reader isn’t innumerate.

Note which parts of the Klein quote above have analogs in the Yglesias quote and which parts do not. The key to Klein’s argument is that “Blacks represent 13% of the population but commit 50% of the murders,” but Yglesias only invokes the white analog to Klein’s “90% of black victims are murdered by other blacks.”

Yglesias’ essay depends on the reader not noticing that the relevant analog to the important part of Klein’s argument is that whites represent 72% of the population and commit (approximately) 50% of the murders (based on the FBI data linked above).

To be clear, this is not intended as a defense of (or an argument against) Klein. It’s just a response to Yglesias’ shiftiness (or ignorance) and his willingness to exploit readers’ possible innumeracy to make his point. It’s an unfunny failure of an argument.

Posted in statistical description | Comments Off on Yglesias makes an innumerate (non-)funny

Truly random

In today’s new What If?, Randall Munroe discusses, in part, randomly aiming a laser at the sky, writing “if you aimed in a truly random direction, you would have an almost 50% chance of hitting the Earth.”

This is one of those cases wherein my training in linguistics and my training in statistics conflict with one another. Kind of like how people use “a fraction of” to indicate a small fraction of something, despite the fact that, for example, 999,999/1,000,000 is a fraction, too. As is, for that matter, 10/3. So, saying that this year’s budget deficit is a fraction of last year’s isn’t, technically speaking, informative. It certainly does not imply that this year’s deficit is a small proportion of last year’s, though this is pretty much always what people mean when they say things like this.

So, my linguistics training lets me understand that the phrase is used to mean “a small fraction,” and that’s fine. But my statistics (and assorted math) training just can’t let it go.

I have similar issues with using a phrase like “truly random” to mean “uniformly distributed.” Clearly, Munroe means the latter, but a laser could be pointed randomly as if it were governed by a Gaussian distribution, which, with sufficiently small variance(s) and an appropriate mean (vector), would produce a very small chance of hitting the Earth.
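To make the distinction concrete, here is a rough simulation sketch. The setup is my own simplification, not anything from What If?: “hitting the Earth” is treated as pointing below the horizontal on flat ground, and the non-uniform case is a Gaussian jitter (standard deviation 0.1 per coordinate, an arbitrary choice) around straight up. Both sets of directions are random, but only the first is uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Uniformly distributed directions: normalize isotropic Gaussian vectors.
v = rng.standard_normal((n, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)
frac_down_uniform = np.mean(v[:, 2] < 0)

# Random but not uniform: small Gaussian jitter around straight up.
w = np.array([0.0, 0.0, 1.0]) + 0.1 * rng.standard_normal((n, 3))
w /= np.linalg.norm(w, axis=1, keepdims=True)
frac_down_concentrated = np.mean(w[:, 2] < 0)

print(frac_down_uniform, frac_down_concentrated)
```

The uniform case comes out near one half, while the concentrated case essentially never points downward.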

Again, my linguistics training tells me that it’s fine to use “truly random” to mean “uniformly distributed” in regular old language, but my statistics training just can’t let this kind of thing go.

It’s particularly hard to digest when it comes from a mathematically sophisticated writer in a semi-technical setting.



Posted in language, mildly informative filler | Comments Off on Truly random

What counts as an upset?

Five Thirty Eight has been discussing the World Cup and their predictions based on what I can only assume is a fancy-pants Bayesian statistical model (done in Excel, natch).

A few days ago, Nate Silver his own self wrote a post on this topic. In trying to give it a counter-intuitive hook, the post got a little too cute for its own good. The headline sums up what’s wrong, but I understand that headlines are often written separately from the article proper, so we might forgive statistical silliness up top. But the same statistical silliness makes an appearance in the body. Silver writes:

For instance, the probability of correctly identifying the winners in each of the first four knockout matches — Brazil over Chile, Colombia over Uruguay, the Netherlands over Mexico and Costa Rica over Greece — was about 23 percent, or one chance in 4.3. And the chance of going 12 for 12, as the FiveThirtyEight favorites have done so far, is just one in 75.

It’s an upset, in other words, when all the favorites prevail. On average, we’d have expected three or four upsets through this point in the knockout round.

How interesting! They’re favorites, but it’s still, somehow, a (huge) upset when they prevail! What gives?

What gives is that Silver is comparing a single outcome – all favorites prevail – to all other possible outcomes. With 12 games (8 in the first round, 4 in the second), there are 2^{12} = 4096 possible outcomes.

(As often happens to me with counting problems like this, I came up with that answer fairly quickly, then immediately doubted that the assumptions that led me to this answer were correct. Specifically, I was worried that I was somehow counting second round outcomes that were ruled out by first round outcomes (e.g., Brazil winning in the second round after losing in the first round). I’m pretty sure 2^{12} is right, though, and we can spell out a slightly simplified version to see why.

Suppose the semi-finals feature, say, Germany vs Brazil in one game and Argentina vs Holland in the other. Call this ‘round one’ and call the final ‘round two.’ Using my logic, there should be 2^3 = 8 possible outcomes in these three games, and indeed there are. Listing all possible combinations of winners: (1) Brazil, Argentina, Brazil; (2) Brazil, Argentina, Argentina; (3) Brazil, Holland, Brazil; (4) Brazil, Holland, Holland; (5) Germany, Argentina, Germany; (6) Germany, Argentina, Argentina; (7) Germany, Holland, Germany; (8) Germany, Holland, Holland.

We could, in theory, list out all the possibilities for the rounds with 8 and 4 games – the real rounds one and two – to get the number of possible outcomes stated above, 2^{8+4} = 2^{12}.)
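The simplified three-game case can also be enumerated mechanically. This is just an illustrative sketch of the counting argument: the winner of the final is only ever drawn from the two semi-final winners, so no impossible outcome gets counted.

```python
from itertools import product

semi_1 = ("Brazil", "Germany")
semi_2 = ("Argentina", "Holland")

outcomes = []
for w1, w2 in product(semi_1, semi_2):
    # The final can only be won by one of the two semi-final winners.
    for champion in (w1, w2):
        outcomes.append((w1, w2, champion))

for outcome in outcomes:
    print(outcome)

n_outcomes = len(outcomes)
print(n_outcomes, 2 ** 3, 2 ** 12)
```

Each game contributes a factor of two no matter who plays in it, which is why the count for the 8-game and 4-game rounds is 2^{12}.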

Okay, so, when Silver says that the teams that have won so far were all favorites, he is saying that the single outcome we’ve observed is the single most probable outcome. If it weren’t the most probable single outcome, then at least one team that won wouldn’t have been a favorite.

On the other hand, when he says that it’s an upset that all the favorites have won, he’s saying that this single most probable outcome is less probable than the sum of the probabilities of all of the other possible outcomes.

Given that there are 4095 other possible outcomes, the favorites would have to have been favored to an absurd degree for the sum total of all other outcomes to be less probable.
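A back-of-the-envelope sketch of “an absurd degree,” under two simplifying assumptions that are mine rather than FiveThirtyEight’s: the 12 matches are independent, and every favorite has the same win probability p. Then all favorites win with probability p^{12}, and this beats the combined probability of every other outcome only when p^{12} > 0.5. The same arithmetic gives the typical per-match probability implied by Silver’s quoted one-in-75 figure.

```python
n_games = 12

# Per-match win probability needed for "all favorites win" to beat everything else.
p_threshold = 0.5 ** (1 / n_games)

# Typical (geometric mean) per-match probability implied by a 1-in-75 chance.
p_implied = (1 / 75) ** (1 / n_games)

print(round(p_threshold, 3), round(p_implied, 3))
```

Under these assumptions each favorite would need to win roughly 94% of the time, whereas the quoted figure implies favorites winning about 70% of the time.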

The point being that it’s not particularly interesting to compare the single outcome of all favorites winning to every possible other outcome. It gets you a catchy headline, I guess, but it doesn’t provide any useful insight into how the tournament is playing out.

Posted in SCIENCE!, statistical description | Comments Off on What counts as an upset?

Partial progress plotting PyMC properties

As noted yesterday, I figured out how to sample PyMC chains in parallel recently. I’ve also been working on some plotting functions to assess chain convergence and autocorrelation, goodness of model fit, and to illustrate a fitted (multilevel GRT) model.

In assessing chain convergence, you can calculate the Gelman-Rubin statistic \hat{R}, which is, in essence, an F-like statistic providing an ANOVA-like test of any overall differences between the values in your chains. PyMC has a number of useful diagnostic tools built in, including a gelman_rubin method. It also has methods for calculating and plotting autocorrelation functions and histograms of chains, but for some reason, with the model I’m dealing with, the out-of-the-box plotting functions aren’t working well for me. So, I wrote my own, and for each parameter, it produces a figure like this (which is illustrating \hat{R}, autocorrelation (using the statsmodels package time series analysis acf function), and each of three chains’ distributions of values for a parameter called mu_mx):
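For reference, here is a minimal sketch of the textbook \hat{R} calculation, comparing between-chain and within-chain variance. It is not PyMC’s gelman_rubin implementation, and it may differ from it in detail. The function name and the simulated chains are made up for illustration.

```python
import numpy as np

def gelman_rubin(chains):
    """Textbook R-hat for an array of shape (m chains, n samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
mixed = rng.standard_normal((3, 2000))             # three chains, same target
stuck = mixed + np.array([[0.0], [1.0], [2.0]])    # chains centered in different places

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values near 1 indicate that the chains agree with one another, while values well above 1 indicate that they have not converged to the same distribution.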



I’m pretty happy with it, even if there are still some kinks to be worked out (e.g., the x-axis label for the top left panel is clipped by the bottom left panel; for some parameters, the tick labels under the histogram overlap).

I also wrote a function to plot observed and predicted (identification-confusion) response probabilities for the model I’m working with:

[Figure: predobs_example]

The vertical error bars indicate 95% highest density intervals, with the circles indicating the mean posterior value (on the y-axis) and the observed value (on the x-axis). The closer the symbols to the diagonal line, the better the fit.
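One common way to get a highest density interval from posterior samples is to take the shortest interval that contains the desired share of the sorted samples. The sketch below assumes a unimodal posterior. The function name and test samples are hypothetical, and this is not necessarily how the plotting function described here computes its intervals.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))             # number of samples inside the interval
    widths = x[k - 1:] - x[:n - k + 1]     # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.default_rng(3)
lo_norm, hi_norm = hdi(rng.standard_normal(50_000))
lo_exp, hi_exp = hdi(rng.exponential(1.0, 50_000))
print(lo_norm, hi_norm)
print(lo_exp, hi_exp)
```

For a symmetric posterior this closely matches the equal-tailed interval. For a skewed one, as in the exponential example, the interval shifts toward the mode.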

Finally, I wrote a function to illustrate the fitted model for each individual subject (the small panels) and the group-level model (the big panel). These two figures are for models fit to the non-speech auditory perception data presented in my 2009 paper with Lentz and Townsend (see my CV for more details, if you’re so inclined). This one illustrates the multilevel GRT model fit to the frequency-by-duration, broadband noise stimuli:


And this one illustrates the model fit to the pitch-by-timbre (F0-by-spectral-prominence-location) data:

[Figure: pxt_ml_fit]

Again, there are probably some ways these could be improved, but overall I’m quite happy with how they look. I’m using my recently-acquired ellipse-plotting knowledge for its intended purpose (for the purpose I intended for it, anyway), which is nice, and it’s very easy to make this kind of multi-panel plot with matplotlib, which is also nice.

It feels good to have this model functioning in a way that others could, in principle, use. The two papers I’ve published using this model presented analyses done with WinBUGS, and the only way I could figure out how to get WinBUGS to fit this model was by feeding it a giant array of pre-calculated bivariate normal CDF values and using trilinear interpolation. The model works, but it’s ungainly. I don’t think I would want anyone else to mess with it, and I don’t imagine others would particularly want to do so. This isn’t a recipe for reproducible research.

I never could figure out how to get this model to work in JAGS, which is a superior, cross-platform version of BUGS. JAGS would just hang when I tried to feed it the giant CDF array (BUGS would always seem like it was hanging, but if I went away for long enough, it would have a model fit for me – no such luck with JAGS).

I haven’t tried very hard (at all, really) to get this model working with Stan, though I did request multivariate normal CDFs as a feature (it’s on the to-do list, so maybe I’ll come back to work on this model in Stan later on). Given that it’s functioning in PyMC, I don’t feel much need to get it working in Stan right now.



Posted in Python, statistical graphics, statistical modeling | Comments Off on Partial progress plotting PyMC properties

Parallel processing of PyMC models in IPython

I’ve been learning how to use PyMC to fit Bayesian models over the last few months, and recently I decided I wanted to figure out how to sample separate chains in parallel on different cores on whatever machine I’m using.

I poked around a bit, looking at various multi-processing Python packages, and then I found this IPython notebook that walks through a fairly simple example using parallel processing machinery in IPython.

It’s easy to run multiple chains of a PyMC model in serial by writing a for loop and calling a model’s sample method once for each chain (see here for details about model fitting in PyMC). But if you want lots of samples in each chain, or if it takes a long time to get even a small number of samples from your model, it seems very inefficient to sample your chains in serial.

Okay, so here’s a very quick and dirty rundown of how to run chains in parallel.

First, you need IPython and PyMC installed, and you need a model script that does everything (i.e., reads in data, defines your model, writes your samples to disk, etc…). Let’s call your model script model_script.py.

The IPython notebook linked above seems to make some of this more complicated than it needs to be, in my opinion, importing PyMC for each core, then pushing the data to each core, and only then feeding the model script to the cores. If your script works when you call it using Python from the command line, it should work as described below.

Let’s say you want three chains. Open one terminal window and, after the prompt ($), type:

$ ipcluster start -n 3

Now, open another terminal window (or tab) and start IPython. Once that’s up and running, type:

In [1]: from IPython.parallel import Client

In [2]: client = Client()

In [3]: direct = client[:]

In [4]: direct.block = True

In [5]: model_script = open('model_script.py').read()

In [6]: direct.execute(model_script)

I don’t have a particularly deep understanding of what all is going on here, but I know it works. For the time being, that’s good enough for me. You can, and I probably should (and maybe even will, at some point), read all about ipcluster and various associated tools here.

Anyway, assuming, as mentioned above, that your model_script.py saves your samples for you (rather than, e.g., only keeping them in working memory for use in IPython), you should be good to go.

For what it’s worth, I ran into a bit of a problem when I first got this working. Specifically, the separate cores were creating databases with identical names in rapid succession, so only the last core to create a database was actually saving anything.

I fixed this by using something that I joked about when I first learned of its existence, namely Unix time, which is the number of seconds that have elapsed since the beginning of January 1, 1970.

I was fiddling with the time package in Python, trying to get unique time stamps to add to the database names (since I can’t figure out how to feed each core a unique string to add to its database name), and realized that time.time() returns a number with a few decimal places. I tried adding the following to the database names, under the assumption that somewhere in the neighborhood of the fourth, fifth, or sixth decimal place of Unix time there would be measurable delays between the creation of the databases from the different cores.

clock_string = str(int(time.time()*1000000))

It works, and it looks like I could even shave a couple zeros off and still get distinct names.
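A possible alternative, sketched here under the assumption that each engine started by ipcluster runs as its own operating-system process on the same machine: the process ID is then already a unique string per core, so it can go into the database name alongside the time stamp. The function name, prefix, and file extension are hypothetical.

```python
import os
import time

def unique_db_name(prefix="samples", ext=".pickle"):
    """Database name combining the process ID with a microsecond time stamp."""
    stamp = int(time.time() * 1_000_000)
    return f"{prefix}_{os.getpid()}_{stamp}{ext}"

name = unique_db_name()
print(name)
```

With the process ID in the name, uniqueness no longer depends on the engines reaching the naming line at measurably different times.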

Posted in Python, statistical modeling | 1 Comment

Bivariate normal ellipse plotting in Python

When I decided to start using Python for data analysis and visualization, I was a bit worried about the latter, as I had become fairly fond of the flexibility of the R base graphics. The more I use matplotlib, though, the happier I am with it.

Case in point: today I figured out how to plot ellipses for bivariate normal densities. In much of my work with general recognition theory (GRT), I have focused on a fairly restricted set of models. Specifically, I have focused on cases in which the marginal variances of the modeled perceptual distributions are not identifiable (well, technically speaking, the issue is that the marginal variances and means are not both identifiable).

In general, a useful way to illustrate GRT models is by taking a bird’s-eye view and looking down at contours describing the modeled distributions, like so:

Phonetic Trapezoid

In this kind of simple case, you can just pick a level above the plane at which to figuratively slice the modeled densities, and you get contours that are comparable across the perceptual distributions.

However, in certain GRT models and associated data sets (e.g., in which you have more than two response levels on each dimension), the marginal variances are identifiable (along with the means). In this case, slicing each density at the same height produces bad graphics, since densities with larger marginal variances are more spread out and lower down than otherwise comparable densities with smaller marginal variances. Hence, slicing at a height sufficiently high up off the base plane can, counterintuitively, produce smaller ellipses for densities with larger marginal variances.
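The counterintuitive behavior of height-based slicing is easy to reproduce. This sketch uses the fact that a bivariate normal density at squared Mahalanobis distance d^2 from the mean equals its peak height times exp(-d^2/2), so the contour at height h is an ellipse with a closed-form area. The function name, the two covariance matrices, and the slicing heights are arbitrary illustrative choices.

```python
import numpy as np

def contour_area_at_height(cov, h):
    """Area enclosed by the density contour at height h (0 if h is above the peak)."""
    root_det = np.sqrt(np.linalg.det(cov))
    peak = 1.0 / (2.0 * np.pi * root_det)
    if h >= peak:
        return 0.0
    r2 = -2.0 * np.log(h / peak)       # squared Mahalanobis radius of the contour
    return np.pi * r2 * root_det

narrow = np.eye(2)          # unit marginal variances
wide = 2.0 * np.eye(2)      # larger marginal variances

area_narrow_high = contour_area_at_height(narrow, 0.07)
area_wide_high = contour_area_at_height(wide, 0.07)
area_narrow_low = contour_area_at_height(narrow, 0.01)
area_wide_low = contour_area_at_height(wide, 0.01)

print(area_narrow_high, area_wide_high)
print(area_narrow_low, area_wide_low)
```

At the higher slice the wider density yields the smaller ellipse, and at the lower slice the ordering flips back to what you would expect.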

The solution is to plot ellipses that enclose a specified volume (i.e., for which the integral over the region specified by the ellipse takes a particular value) rather than at a specified height.

The code I wrote to plot the height-based figures calculates values for the bivariate normal densities on a grid, then uses built-in ‘contour’ functions to plot the ellipses for a given height. To plot based on volume, this approach would be either very ugly or totally non-functional. Thankfully, I was able to find (and adapt) code that takes an analytical approach based on the density mean and covariance parameters and uses some nice built-in matplotlib features. The adaptation consists mostly of switching from specifying the number of standard deviations to specifying the volume, based on the nice description of the relationships between density parameters and volumes given here.
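The volume-based approach rests on a standard fact: for a bivariate normal, the squared Mahalanobis distance from the mean follows a chi-square distribution with 2 degrees of freedom, so the squared radius enclosing a given volume is chi2.ppf(volume, 2), which in two dimensions reduces to -2 ln(1 - volume). The sketch below checks the closed form and runs a quick Monte Carlo confirmation. The covariance matrix is an arbitrary example.

```python
import numpy as np
from scipy.stats import chi2

volume = 0.5
r2_closed_form = -2.0 * np.log(1.0 - volume)
r2_chi2 = chi2.ppf(volume, 2)

# Monte Carlo check with an arbitrary covariance matrix.
rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
x = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
d2 = np.einsum("ij,jk,ik->i", x, np.linalg.inv(cov), x)   # squared Mahalanobis distances
frac_inside = np.mean(d2 <= r2_chi2)

print(r2_closed_form, r2_chi2, frac_inside)
```

About half of the simulated points fall inside the ellipse, regardless of the marginal variances or the correlation.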

Here’s a simple example figure (not from a fitted model, just from some semi-randomly chosen covariance and mean values), with each ellipse enclosing half of the volume of each bivariate normal density, and with unit marginal variances for the density located at the origin:


Here’s the adapted code:

def plot_cov_ellipse(cov, pos, volume=.5, ax=None, fc='none', ec=[0,0,0], a=1, lw=2):
    """
    Plots an ellipse enclosing *volume* based on the specified covariance
    matrix (*cov*) and location (*pos*). The remaining keyword arguments set
    the appearance of the ellipse patch artist.

        cov : The 2x2 covariance matrix to base the ellipse on
        pos : The location of the center of the ellipse. Expects a 2-element
            sequence of [x0, y0].
        volume : The volume inside the ellipse; defaults to 0.5
        ax : The axis that the ellipse will be plotted on. Defaults to the 
            current axis.
        fc, ec, a, lw : Face color, edge color, alpha, and line width of
            the ellipse.

    Returns the ellipse patch artist.
    """

    import numpy as np
    from scipy.stats import chi2
    import matplotlib.pyplot as plt
    from matplotlib.patches import Ellipse

    def eigsorted(cov):
        # Eigenvalues and eigenvectors, sorted from largest to smallest
        vals, vecs = np.linalg.eigh(cov)
        order = vals.argsort()[::-1]
        return vals[order], vecs[:,order]

    if ax is None:
        ax = plt.gca()

    vals, vecs = eigsorted(cov)
    # Angle (in degrees) of the major axis relative to the x-axis
    theta = np.degrees(np.arctan2(*vecs[:,0][::-1]))

    kwrg = {'facecolor':fc, 'edgecolor':ec, 'alpha':a, 'linewidth':lw}

    # Width and height are "full" widths, not radius
    width, height = 2 * np.sqrt(chi2.ppf(volume,2)) * np.sqrt(vals)
    ellip = Ellipse(xy=pos, width=width, height=height, angle=theta, **kwrg)

    # Draw the ellipse on the axis and hand it back to the caller
    ax.add_artist(ellip)
    return ellip

Posted in Python, statistical graphics

Why no log toggle in Google Ngrams?

Josh sent me a link to an interesting blog post about whether or not we’re in a tech bubble. As Josh pointed out to me, it’s interesting not because it answers that question, but, rather, because it presents a clear and concise description of one way that low interest rates can distort prices.

Instead of any of that, I’m going to talk about the recency illusion and statistical graphics. The author of that post uses the verb “incent.” My first thought when I read that was to be mildly annoyed at a buzzword-y, business-speak verbed noun.


But then I remembered the recency illusion, after which I remembered the Google Ngram viewer, so I looked up “incent,” “incentivize,” and “incentive”:

Three Words


Not terribly helpful, so I got rid of “incentive” and found that “incent” has a fairly respectable history (i.e., my initial reaction to it was, in fact, a case of the recency illusion), while my intuition that “incentivize” is buzzword-y business-speak was pretty much correct:

Two Words


Which brings me to the point probed in the title of this post. Why is there not an option to toggle a log transformation of the y-axis on the Google Ngram viewer? It would be very helpful when comparing words that differ by multiple orders of magnitude in frequency.

Posted in language, statistical graphics

Multivariate normal CDF values in Python

I was very happy to realize recently that a subset of Alan Genz’s multivariate normal CDF functions are available in Scipy. I first learned of Dr. Genz’s work when I started using the mnormt R package, which includes a function called sadmvn that gives very precise, and very accurate, multivariate normal CDF values very quickly.

In case you don’t know, this is quite an achievement, since there is not a closed form solution. I’ve spent far too much time reading strange, complicated papers found in the deepest recesses of google (i.e., the third page of search results) that claim to provide fast, accurate approximations to multivariate normal CDFs. As far as I can tell, none of these claims hold any water. None other than Genz’s, anyway.

Okay, so Scipy has two relevant functions, but they’re kind of buried, and it might not be obvious how to use them (at least if you don’t know to look at Genz’s Fortran documentation). So, for the benefit of others (and myself, in case I need a refresher), here’s where they are and how to use them.

First, where. In the Scipy stats library, there is a chunk of compiled Fortran code called mvn.so. I’ve copied it here, just in case it disappears from Scipy someday. Should that come to pass, and should you want this file, just save that ‘plain text’ file and rename it mvn.so and you should be good to go.

Otherwise, if you’ve got Scipy, you can just do this:

from scipy.stats import mvn

Now, mvn will have three methods, two of which – mvndst and mvnun – are what we’re looking for here.

The first works like this:

error,value,inform = mvn.mvndst(lower,upper,infin,correl,...)

Which is to say that it takes, as arguments, lower and upper limits of integration, ‘infin’ (about which more shortly), and correl (as well as some optional arguments). This is, in turn, to say that it assumes that your multivariate normal distribution is centered at the origin and that you’ve normalized all the variances.

This function is straightforward to use, except for, perhaps, the ‘infin’ argument. From Genz’s documentation:

*     INFIN  INTEGER, array of integration limits flags:
*           if INFIN(I) < 0, Ith limits are (-infinity, infinity);
*           if INFIN(I) = 0, Ith limits are (-infinity, UPPER(I)];
*           if INFIN(I) = 1, Ith limits are [LOWER(I), infinity);
*           if INFIN(I) = 2, Ith limits are [LOWER(I), UPPER(I)].

Which is to say that you put a negative number in if you want, on dimension I, to integrate from -Inf to Inf, 0 if you want to integrate from -Inf to your designated upper bound, 1 if you want to integrate from your designated lower bound to Inf, and 2 if you want to use both of your designated bounds.
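As a way of keeping the flag values straight, here is a minimal sketch that translates a set of INFIN flags into explicit integration limits, following the rules quoted above. limits_from_infin is a hypothetical helper written for illustration; it is not part of Scipy or of Genz's code, and the example bounds are arbitrary.

```python
import math

def limits_from_infin(lower, upper, infin):
    """Translate Genz-style INFIN flags into explicit (low, high) pairs.
    Hypothetical helper for illustration; not part of scipy."""
    limits = []
    for lo, hi, flag in zip(lower, upper, infin):
        if flag < 0:
            limits.append((-math.inf, math.inf))  # both given limits ignored
        elif flag == 0:
            limits.append((-math.inf, hi))        # only the upper limit used
        elif flag == 1:
            limits.append((lo, math.inf))         # only the lower limit used
        elif flag == 2:
            limits.append((lo, hi))               # both limits used
        else:
            raise ValueError("INFIN flags must be negative, 0, 1, or 2")
    return limits

example = limits_from_infin([-1.0, 0.0, 0.5], [1.0, 2.0, 3.0], [2, 0, 1])
print(example)  # [(-1.0, 1.0), (-inf, 2.0), (0.5, inf)]
```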

Also from Genz’s documentation:

*     INFORM INTEGER, termination status parameter:
*          if INFORM = 0, normal completion with ERROR < EPS;
*          if INFORM = 1, completion with ERROR > EPS and MAXPTS 
*                         function vaules used; increase MAXPTS to
*                         decrease ERROR;
*          if INFORM = 2, N > 500 or N < 1.

Here, N seems to be the number of dimensions (and is the first argument in Genz’s MVNDST Fortran function, but is not in the similar/corresponding R or Python functions).  In any case, it’s the 0 and 1 that seem most informative, and the MAXPTS variable is one of the optional arguments I mentioned above.

The other function allows for non-zero means and covariance (as opposed to correlation) matrices, but it doesn’t, technically speaking, allow for integration to or from +/-Infinity:

value,inform = mvn.mvnun(lower,upper,means,covar,...)

As it happens, and as shouldn’t be too surprising, you can give it large magnitude bounds and get essentially the same answer. As long as you’re sufficiently far away from the mean (meaning as long as you’re more than a few standard deviation units away), the difference between the +/-Inf bound and the finite bound will only show up quite a few decimal places into your answer.

If you’ve got numpy imported as np, you could, for example, do this:

In [54]: low = np.array([-10, -10])

In [55]: upp = np.array([.1, -.2])

In [56]: mu = np.array([-.3, .17])

In [57]: S = np.array([[1.2,.35],[.35,2.1]])

In [58]: p,i = mvn.mvnun(low,upp,mu,S)

In [59]: p
Out[59]: 0.2881578675080012

With more extreme values for low, we get essentially the same answer (with a difference only showing up in the 12th decimal place):

In [60]: low = np.array([-20, -20])

In [61]: p,i = mvn.mvnun(low,upp,mu,S)

In [62]: p
Out[62]: 0.2881578675091007

Still more extreme values don’t change it at all:

In [63]: low = np.array([-100, -100])

In [64]: p,i = mvn.mvnun(low,upp,mu,S)

In [65]: p
Out[65]: 0.2881578675091007
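The size of the difference between the first two results can be anticipated from the marginal distributions alone. Here is a minimal sketch, using the means and variances from the example above, that computes how much probability each marginal puts below the lower bound of -10. normal_cdf is a small helper defined here for illustration, and the calculation ignores the correlation, so it only gives the order of magnitude of what the finite bound cuts off.

```python
import math

def normal_cdf(x, mean, var):
    """Univariate normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-(x - mean) / math.sqrt(2.0 * var))

mu = [-0.3, 0.17]        # means from the example above
variances = [1.2, 2.1]   # diagonal of S from the example above

# Probability mass below the finite lower bound of -10 on each dimension
tails = [normal_cdf(-10.0, m, v) for m, v in zip(mu, variances)]
print(tails)
```

The second dimension's tail is on the order of 1e-12, the same order as the difference between the results for lower bounds of -10 and -20, while the first dimension's tail is many orders of magnitude smaller.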

All of this is important to me because I’m working on building a Bayesian GRT (e.g.) model in PyMC, and I’m hoping I’ll be able to use this function to get fast and accurate probabilities, given a set of mean and covariance parameters.
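For getting probabilities from a set of mean and covariance parameters, the public scipy.stats.multivariate_normal interface is another route to the same kind of calculation. Here is a minimal sketch, assuming a Scipy version that provides multivariate_normal.cdf. It integrates from -Inf up to the given point on every dimension, so it corresponds to the large-magnitude lower bounds used above rather than to arbitrary finite rectangles.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Same upper bounds, means, and covariance matrix as in the example above
upp = np.array([.1, -.2])
mu = np.array([-.3, .17])
S = np.array([[1.2, .35], [.35, 2.1]])

# CDF evaluated at the upper bounds, i.e., integrating from -Inf on each dimension
p_public = multivariate_normal.cdf(upp, mean=mu, cov=S)
print(p_public)  # should agree with the mvnun results to several decimal places
```

Any difference from the mvnun values should be within the integration routine's error tolerance.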

Posted in Python, R, statistical modeling