Sensitivity to mis-specification

I’ve encountered two potentially problematic uses of sensitivity and specificity recently. One is simply an egregious error. The other is a combination of, on the one hand, an oversimplification of the relationship between these and the area under a receiver operating characteristic curve and, on the other, an illustration of one of their important limitations.

Just in case you don’t want to read the wikipedia entry linked above, here are some quick definitions. Suppose you have a test for diagnosing a disease. The sensitivity of the test is the proportion of people with the disease that the test correctly identifies as having the disease (i.e., the hit rate, henceforth H). The specificity of the test is the proportion of people without the disease that the test correctly identifies as not having the disease (i.e., the correct rejection rate, henceforth CR).

H and CR are useful measures, to be sure, but they obscure some important properties of diagnostic tests (and of binary classifiers in probabilistic decision making in general). Rather than H and CR, we can (and should) think in terms of d’ – the “distance” between the “signal” and “noise” classes – and c – the response criterion. Here’s a figure from an old post to illustrate:optimal_model

In this illustration, the x-axis is the strength of evidence for disease according to our test. The red curve illustrates the distribution of evidence values for healthy people, and the blue curve illustrates the distribution of evidence values for people with the disease. The vertical dashed/dotted lines are possible response criteria. So, in this case, d’ would be \displaystyle{\frac{\mu_2 - \mu_1}{\sigma}}, where \sigma is some measure of the variability in the two distributions. It is useful to define c as the signed distance of the response criterion with respect to the crossover point of the two distributions. I’ll note in passing that I’m eliding a number of important details here for the sake of simplicity (e.g., the assumption of equal variances in the two distributions, the assumption of normality in same), which I’ll come back to below.

H and CR are determined by d’ and c. H is defined as the integral of the blue curve to the right of the criterion, and CR as the integral of the red curve to the left of the criterion.

So, one important property of a binary classifier that H and CR obscure but that d’ and c illuminate is the fact that, for a given d’, as you shift c, H and CR trade off with one another. Shift c leftward, and H increases while CR decreases. Shift c rightward, and H decreases while CR increases. In the figure above, you can see how the areas under the red and blue curves differ for the dashed and dotted vertical lines – H is lower and CR higher for the dotted line than for the dashed line.

Another important property of a binary classifier is that, as you increase d’, either H increases, CR increases, or both increase, depending on where the response criterion is. In the above figure, if we increased the distance between \mu_1 and \mu_2 (without changing the variances of the distributions) by shifting \mu_1 to the left by some amount \delta and by shifting mu_2 to the right by \delta, both H and CR would increase.

The egregious error I encountered is in the “Clinical Decision Analysis Regarding Test Selection” section of the ASHA technical report on (Central) Auditory Processing Disorder, or (C)APD (I’ll quote the report as it currently stands – I plan to email someone at ASHA to point this out, after which, I hope it will be fixed):

The sensitivity of a test is the ratio of the number of individuals with (C)APD detected by the test compared to the total number of subjects with (C)APD within the sample studied (i.e., true positives or hit rate). Specificity refers to the ability to identify correctly those individuals who do not have the dysfunction. The specificity of the test is the ratio of normal individuals (who do not have the disorder) who give negative responses compared to the total number of normal individuals in the sample studied, whether they give negative or positive responses to the test (i.e., 1 – sensitivity rate). Although the specificity of a test typically decreases as the sensitivity of a test increases, tests can be constructed that offer high sensitivity adequate for clinical use without sacrificing a needed degree of specificity.

The egregious error is in stating that CR is equal to 1-H. As illustrated above, it’s not.

The oversimplification-and-limitation-illustration was in Part II of a recent Slate Star Codex (SSC) post. Here’s the oversimplification:

AUC is a combination of two statistics called sensitivity and specificity. It’s a little complicated, but if we assume it means sensitivity and specificity are both 92% we won’t be far off.

AUC here means “area under the curve,” or, as I called it above, area under the receiver operating characteristic curve (or AUROC). Here’s a good Stack Exchange answer describing how AUROC relates to H and CR, and here’s a figure from that answer:


The x axis in this figure is 1-CR, and the y axis is H. The basic idea here is that the ROC is the curve produced by sweeping across all possible values of c for a test with a given d’. If we set c as far to the right as we can, we get H = 0 and CR = 1, so 1-CR = 0 (i.e., we’re in the bottom left of the ROC figure). As we shift c leftward, H and 1-CR increase. Eventually, when c is as far to the left as we can go, H = 1 and 1-CR = 1 (i.e., we’re at the top right of the ROC figure).

The AUROC can give you a measure of something like d’ without requiring, e.g., assumptions of equal variance Gaussian distributions for your two classes. Generally speaking, as d’ increases, so does AUROC.

So, the oversimplification in the quote above consists in the fact that the AU(RO)C does not correspond to single values of H and CR.

Which brings us to the illustration of the limitations of H and CR. To get right to the point, H and CR don’t take the base rate of the disease into account. Let’s forget about the conflation of AUROC and H and CR and just assume we have H = CR = 0.92. Per the example in the SSC post, if you have a 0.075 proportion of positive cases, H and CR are problematic: you have 92% accuracy, but less than half of the people identified by the test as diseased actually have the disease!

The appropriate response here is to shift the criterion to take the base rate (and the costs and benefits of each combination of true disease state and test outcome) into account. Given how often Scott Alexander (i.e., the SSC guy) argues for Bayesian reasoning and utility maximization, I am a bit surprised (and chagrined) he didn’t go into this, but the basic idea is to respond “disease!” if the following inequality holds:

    \begin{equation*} \frac{Pr(x|+)}{Pr(x|-)} \geq \frac{(U_{--} - U_{+-})Pr(-)}{(U_{++} - U_{-+})Pr(+)} \end{equation*}

Here, Pr(x|+) is the probability of a particular strength of evidence of disease given that the disease is present, Pr(x|-) is the probability given that the disease is not present, Pr(+) and Pr(-) are the prior probabilities of the disease being present or not, respectively, and U_{or} is the utility of test outcome o and reality r (e.g., U_{+-} is the cost of the test indicating “disease” while in reality the disease is not present).

The basic idea here is that the relative likelihood of a given piece of evidence when the disease is present vs when it’s absent needs to exceed the ratio on the right. By “piece of evidence” I mean something like the raw, pre-classification score on our test, which corresponds to the position on the x axis in the first figure above.

The ratio on the right takes costs and benefits into account and weights them with the prior probability of the presence or absence of the disease. We can illustrate a simple case by setting U_{--} = U_{++} = 1 and U_{+-} = U_{-+} = 0 and just focusing on prior probabilities. In this case, the inequality is just:

    \begin{equation*} \frac{Pr(x|+)}{Pr(x|-)} \geq \frac{Pr(-)}{Pr(+)} \end{equation*}

In the SSC case, Pr(+) = 0.075 and Pr(-) = 0.925, so the criterion for giving a “disease!” diagnosis should be \approx 12. That is, we should only diagnose someone as having the disease if the strength of evidence given the presence of the disease is 12+ times the strength of evidence given the absence of the disease.

Posted in statistical decision making | Comments Off on Sensitivity to mis-specification

Pondering proportional pizza prices

In my last post, I illustrated how much better a deal it is to get large pizzas rather than medium or small pizzas from Dewey’s. A friend pointed out that I hadn’t taken crust into account in that analysis. I dismissed the idea at first, thinking, incorrectly, that it wouldn’t matter. I dismissed it in part because I like to eat the crust, and so don’t tend to think of it as qualitatively different than the rest of the pizza.

As it happens, it matters for the analysis, since it actually makes small and medium pizzas even worse deals. A one inch crust is, proportionally, 33%, 28%, or 22% of the area of a small, medium, or large, respectively.

Here is a graph showing the square inches per dollar as a function of number of toppings, taking a one inch crust into account:


And here’s a graph showing dollars per square inch as a function of number of toppings,  taking a one inch crust into account (see previous post for plots that don’t take the crust into account):


By taking the crust into account, we see that the large is an even better deal than before. It’s also (very, very nearly) interesting to note that the crossover between smalls and mediums has shifted leftward a couple toppings. Without taking crust into account, smalls were a better deal than mediums for 0, 1, or 2 toppings, but with a one inch crust, small just barely beats medium, and then only if you get a plain cheese pizza.

In addition, here’s a plot showing square inches per dollar for small and medium pizzas relative to square inches per dollar for a large:


For no toppings, you get ~60% as many square inches per dollar for either small or medium as you do for a large. This ratio stays fairly constant for mediums, but drops substantially for smalls, approaching 50% for 10 gourmet toppings.

And, finally, here’s a plot showing dollars per square inch for small and medium pizzas relative to dollars per square inch for a large:


With respect to dollars per square inch, you spend ~150% for a no-topping small or medium relative to a large. The ratio stays more or less constant for mediums, while it increases quite a bit for smalls. If, for some reason, you decided to buy a 10-gourmet-topping small pizza, you’d be spending almost twice as much per square inch as you would if you bought a 10-gourmet-topping large.

I have way more real work to do than you might think.


Posted in mildly informative filler, SCIENCE! | Comments Off on Pondering proportional pizza prices

On the price of pizza

I’ve got lots of interesting things to blog about. I’ve been running some interesting perceptual experiments, I’m working on adapting and extending some data analysis methods, I’ve been teaching a class on statistical signal processing, and I’ve read a number of essays and articles in which thought-provoking, seemingly wrong things were written.

So, naturally, this will be about the cost of pizza at a local pizzeria.

Sometimes we get pizza from a local place called Dewey’s. Normally, we get a couple of large pizzas. One time recently, though, I decided to get three medium pizzas, thinking that this would give us more or less the same amount of food while allowing us to maximize the variety of toppings. It occurred to me then (though after I had placed the order) that I should figure out which pizza size is the best deal. Tonight, we got Dewey’s again, and I remembered to work this out before ordering.

On the “create your own” page, they give the sizes (in diameter), the base price for each size, and the cost per topping for regular and gourmet toppings.

The sizes are 11″, 13″, and 17″.

The base prices are $8.95, $13.45, and $15.95.

Each regular topping costs $1.50, $1.75, or $2.00, and each gourmet topping costs $1.75, $2.00, or $2.25.

If r indicates the radius, b indicates the base price, n indicates the number of toppings, and c indicates the cost per topping, the cost of a pizza expressed as the number of square inches per dollar is then given by \displaystyle{\frac{\pi r^2}{b+nc}}, which is a non-linear function of n, and the cost expressed as the number of dollars per square inch is the reciprocal of this, \displaystyle{\frac{b+nc}{\pi r^2}=\frac{b}{\pi r^2} + \frac{c}{\pi r^2}n}, which is a linear function of n.

This difference is very nearly interesting.

Anyway, I wrote a short Python script to calculate and plot costs for each size and topping type from 0 to 10 toppings.

Here’s the graph for square inches per dollar:


And here’s the graph for dollars per square inch:


The most obvious result here is that large pizzas are by far the best deal. You get way more pizza per dollar (or, if you prefer, you spend fewer dollars per unit pizza) with a large than with a small or medium.

It’s also kind of interesting that medium pizzas are the worst deal if you’re getting 2 or fewer toppings, but that small pizzas are the worst deal if you’re getting 3 or more toppings (comparing within topping class).

Finally, given the multiplicative role that each additional topping plays, it’s not surprising that the more toppings you get, the bigger the difference between regular and gourmet toppings.

So, it looks like it’ll be only large pizzas from Dewey’s for us in the future.

Posted in Python, SCIENCE! | Comments Off on On the price of pizza

A frustrating discussion

I’m not totally sure what my opinion of Data Colada is. It’s an odd mix of thought-provoking and frustrating. The most recent example is yesterday’s post (by Uri Simonsohn) on the prejudice of (one particular type of) Bayesian t-test. Directly related, and similarly frustrating, is the response from one of the people (Jeff Rouder) who has been arguing for this particular Bayesian test.

(Disclaimer that’s less a disclaimer than a mildly amusing anecdote: Jeff Rouder and I once went to get a cup of coffee while were at a Math Psych meeting, and as we walked to the coffee shop, I mentioned a paper that I found interesting, incorrectly stating that it was a paper that he had (co-)authored, which he quickly pointed out was not the case. I can’t imagine he remembers this, but I felt like a knucklehead, and that moment of awkwardness has stuck with me. I have not yet had a chance to say anything stupid to Uri Simonsohn in person.)

There are quite a few issues one could dig into here, as I suppose would be obvious to anyone who’s been following the frequentist-v-Bayesian argument. In order to keep things orderly, I’ll focus on one small corner of the larger debate that’s playing an interesting role in this particular discussion.

Specifically, I’ll focus on the role of inferential errors in statistical reasoning. Simonsohn wants to know how often Rouder’s Bayesian t-test produces the incorrect result, so he ran some simulations. In the simulations, the true effect size is known, which allows us to evaluate the testing procedure in a way that we can’t when we don’t know the truth ahead of time. Simonsohn posted this figure to illustrate the results of his simulations:

The take-home point is that the Bayesian t-test in question erroneously supports the null hypothesis ~25% of the time when the effect size is “small.” It is perhaps worth noting in passing that the fact that power is 50% in these simulations means that the standard null hypothesis tests that Simonsohn favors are, by definition, wrong a full 50% of the time. Depending on what you do with the Bayesian “inconclusive” region, this is up to twice the error rate of the Bayesian test Simonsohn is arguing against.

Anyway, Rouder’s take on this is as follows:

Uri’s argument assumes that observed small effects reflect true small effects, as shown in his Demo 1 [the simulations illustrated above – NHS]. The Bayes factor is designed to answer a different question: What is the best model conditional on data, rather than how do statistics behave in the long run conditional on unknown truths?

Now, Rouder is right that Simonsohn seems to assume that observed (small) effects reflect true (small) effects, which is problematic. This isn’t obvious from the simulations, but Simonsohn’s second case makes it clear. Briefly, the second case is an(other) absurd Facebook study that found, with a sample size of ~7,000,000 (~6,000,000 + ~600,000), that 6 million people who saw pictures of friends voting (on Facebook) were 0.39% more likely to vote than 600k people that didn’t see pictures of friends voting (on Facebook). The result was statistically significant, but I don’t know that I’ve ever seen a better illustration of the difference between statistical and practical significance (I mean, aside from the other ridiculous Facebook study linked above). Who could possibly care that a 0.39% increase in voting was statistically significant? I’m having a hard time figuring out why anyone – authors, editors, or reviewers – would think this result is worth publishing. It’s just spectacularly meaningless. But I digress.


Simulations like those run by Simonsohn are useful, and I can’t figure out why so many Bayesians don’t like them. The point of simulations isn’t to understand how “statistics behave in the long run conditional on unknown truths,” it’s to leverage known truths to provide information about how much we can trust observed results when we don’t know the truth (i.e., in every non-simulated case).

That is, simulations are useful because they tell us about the probability that an inference about the state of the world is false given an observation, not because they tell us about the probability of sets of observations given a particular state of the world. I would have thought that Bayesians would recognize and appreciate this kind of distinction, given how similar it is to the distinction between the likelihood and the posterior in Bayes’ rule.

One argument for Bayesian reasoning is that frequentist reasoning can’t coherently assign probabilities to one-off events. I’ve never found this particularly compelling, in large part because I’m interested in statistics as applied to scientific research, which is to say that I’m interested in statistics as applied to cases in which we are explicitly concerned with prediction, replication, and anything other than one-off events. The point being that the frequentist properties of our statistical models are important. Or, put slightly differently, our models have frequentist properties, whether we’re Bayesian or not, and whether the models are Bayesian or not.

So, Simonsohn was right to run simulations to test Rouder’s Bayesian t-test, even if he was wrong to then act as if a statistically significant difference of 0.39% is any more meaningful than no effect at all, and even if he glosses over the fact that null hypothesis testing doesn’t really do any better than the Bayesian method he’s critiquing. And Rouder is right that it’s good to know what the best model is given a set of observations, even if he’s (also) wrong about what simulations tell us.

Posted in SCIENCE!, statistical modeling | Comments Off on A frustrating discussion

Book Review: IPython Notebook Essentials, Ch. 1

On a Google+ statistics group, sometime around the new year, I saw an offer for a free book on IPython Notebooks in exchange for a review of the book. I wanted to learn about IPython Notebooks, so I offered to write a review. The spring semester had started before I got my copy of the book, and what I should have foreseen happened: I got too busy to read the book and write the review. Well, I’m making a good faith effort to fulfill my side of the deal now. Better late than never, right?

The book is IPython Notebook Essentials, written by L. Felipe Martins, and published by Packt Publishing. I’ve read Chapters 1 and 2 so far. Both are clear and well-written, with only a few minor problems, and those possibly as much due to my own proclivities as anything else.

Chapter 1 covers the basics of installing a suitable scientific python package (e.g., Anaconda, which I’m a big, big fan of), creating a notebook, and executing some basic code in that notebook. The code given is a bit goofy, implementing a model of change in coffee temperature as a function of time and initial temperature, but introductory code in this kind of book is often fairly ridiculous.

I was already reasonably familiar with Python before getting this book, so it’s a bit difficult for me to judge the pedagogical effectiveness of the code in Chapter 1 for someone with little to no programming experience. Of course, it’s not meant to be a comprehensive introduction to Python, so this kind of intro has to fit into kind of an awkward space. It can’t be too boring to someone with some (or a lot of) experience, but it can’t be too confusing or complicated for less experienced programmers. This book does a pretty good job balancing the needs of a potentially wide variety of reader experience levels, I think.

One small issue I have with the code is that the convention for variable naming uses a lot of full words and underscores. The benefit of this is, of course, that the names are unambiguous and you mostly know what variables are just by looking at their names. The downside is that the code often spans multiple lines in the book, which can make it slightly more difficult to read. I tend to err very much in the other direction in my own code, giving variables short, but sometimes confusing, names. So, my taste runs a bit against the convention in this book. I don’t think the author should have gone as far as I often do, but I think some of the code snippets in the book would look nicer and be a bit easier to read if the variable names had been a bit shorter.

Okay, so Chapter 1 is fine. It’s not spectacular, but it does its job okay. I’ll give it B.

Posted in Python | Comments Off on Book Review: IPython Notebook Essentials, Ch. 1

More thoughts on GRT wIND

Since yesterday’s post on GRT wIND, I’ve been thinking quite a bit about the model. I am now reasonably confident I know why the two models I simulated didn’t produce identical data, my intuition still tells me that the model is over-parameterized, and I am now back to feeling like GRT wIND doesn’t solve the problem of identifiability of failures of decisional separability. I say “back to” because, when I first heard about the model, my initial thought was that it didn’t solve this problem. I thought about the model some more, and my uncertainty about this issue increased a good bit, but now that I’ve thought about it even more, I’m back to where I started.

However, I don’t want this to be a purely negative effort, so I will use my discussion of why I’ve come full circle with respect to GRT wIND to illustrate one of the most valuable, and (possibly) counterintuitive, aspects of mathematical modeling. Here’s the punchline, in case you don’t want to wade through the technical details below: mathematical modeling allows us to formulate and ask incisive questions. More below, though please note that if you haven’t read yesterday’s post, this one isn’t going to make much sense.

As I discussed yesterday, there are a number of identifiability issues with the model for the standard 2 \times 2 GRT task. The task is identification of stimuli defined by the factorial combination of two levels on each of two dimensions (e.g., red square, purple square, red octagon, purple octagon), and the model consists of bivariate Gaussian perceptual distributions corresponding to the stimuli and two decision bounds that define response regions corresponding to each possible response (in the standard task, there is a unique response for each stimulus).

One issue, as mentioned above, is that failures of decisional separability are, in general, not identifiable. Recall that decisional separability is defined as decision bounds that are parallel to the coordinate axes. In the simplest case, you have linear bounds. If you have a failure of decisional separability with linear bounds, you can apply a sequence of linear transformations to produce an empirically equivalent model in which decisional separability holds. You can see some more details about this here, and you can read the proof in my 2013 Psychonomic Bulletin & Review paper with Robin Thomas (that’s a link to a paywall, so if you want a copy, just email me at noahpoah at any of a number of different domains, including the obvious one, given where you’re reading this).

Another issue is that the means and marginal variances of the perceptual distributions are not simultaneously identifiable. If you have a model with decisional separability and arbitrary marginal variances in the perceptual distributions, it’s easy to transform each distribution so that each marginal variance is one and the means are shifted to preserve the predicted response probabilities. I showed how this works in yesterday’s post.

I contend that the first issue applies to GRT wIND (contra the claims of Soto, et al., in the paper that introduced the model), and given the first issue, the second issue seems tied up in the over-parameterization problem I can’t seem to let go of.

As I discussed yesterday, GRT wIND has a group-level set of perceptual distributions with arbitrary mean vectors, marginal variances, and correlations (with one distribution fixed at the origin and with unit marginal variances). The individual-level model for subject k has a scaled version of the group level model for its perceptual distributions and two linear decision bounds. As I mentioned yesterday, the scaling can be formalized as a linear transformation of the group-level covariance matrix \Sigma_{A_i,B_j} (where, e.g., A_i = “red” and B_j = square):

    \begin{eqnarray*}\mathbf{\Sigma}_{k,A_i,B_j} &=& \mathbf{T\Sigma}_{A_i,B_j}\mathbf{T}^t\\ &=&\left[\begin{array}{cc}{\frac{1}{\sqrt{\kappa_k\lambda_k}}&0\\ 0 &\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\end{array} \right]\left[\begin{array}{cc}{\sigma_{xx}&\sigma_{xy}\\\sigma_{yx}&\sigma_{yy}\end{array} \right]\left[\begin{array}{cc}{\frac{1}{\sqrt{\kappa_k\lambda_k}}&0\\ 0 &\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\end{array} \right]\end{eqnarray*}

It occurred to me today that you can formalize this as a (possibly illicit) affine transformation that transforms the covariance matrix as desired while leaving the perceptual distribution’s mean vector unchanged:

    \begin{equation*}\mathbf{T} + \mathbf{(I-T)}\boldsymbol{\mu} = \left[\begin{array}{cc}{\frac{1}{\sqrt{\kappa_k\lambda_k}}&0\\ 0 &\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\end{array} \right] + \left[\begin{array}{c}\left(1-\frac{1}{\sqrt{\kappa_k\lambda_k}}\right)\mu_x \\\left(1-\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\right)\mu_y\end{array} \right]\end{equation*}

This is possibly illicit because it seems like it could cause some problems to include the mean in the transformation. But this is beside the main point.

The main point being that each individual subject has a transformation that scales the whole space (\kappa) and scales the two dimensions with respect to one another (\lambda), and, together with the two decision bounds for subject k, these produce a standard 2 \times 2 Gaussian GRT model. Hence, the model for subject k can be transformed to ensure that decisional separability holds, and once decisional separability holds, the marginal variances and means of each distribution can be scaled and shifted so that all marginal variances are one. Put slightly differently, in addition to the scaling transformation described above, subject k can also have rotation and shear transformations to enforce decisional separability and the unit-variance transformation to force all the action onto the means.

There’s no mathematical rationale for allowing each subject to have \kappa– and \lambda-based scaling but not rotations, shear transformations, and unit variances. Rotation, shear transformation, and scaling to unit variance all produce models that are empirically equivalent to the model pre-transformation. All of which is to say that for every GRT wIND model with failures of decisional separability, there is an empirically equivalent model in which decisional separability holds. Ergo, GRT wIND does not, by itself, solve the identifiability problems discussed in Silbert & Thomas (2013).

Of course, there may be psychological reasons for allowing one type of transformation and not another. In my second review of the more recent GRT wIND paper, I suggested one such line of reasoning. Specifically, I suggested that scaling transformations could be allowed while rotations and shear transformations were not by invoking the notion of primary perceptual dimensions. The basic argument along these lines would be that subjects share a common set of primary dimensions, and so may vary with respect to the overall salience of the stimulus set (\kappa) and the salience of one dimension relative to the other (\lambda), but not with respect to how the perceptual distributions are aligned with the coordinate axes in perceptual space (rotations, shear transformations).

This approach doesn’t rule out re-scaling to enforce unit variances, and I strongly suspect that this is linked to the over-parameterization problem I (still) think the model has, but I’m going to leave this as an open issue for now, since I want to get back to the positive message I’m trying desperately to focus on (and, if I’m being honest, because I started working on trying to prove that this problem exists, and the elements of the scaled, rotated, and sheared covariance matrices were getting kind of ridiculous to work with).

To repeat the punchline from above: mathematical modeling allows us to formulate and ask incisive questions. In this case, it allows us to formulate and ask incisive questions about which kinds of transformations of perceptual space are consistent with human cognition and behavior. More to the point, it allows us to ask this about (absolute and relative) scaling, rotation, and shear transformations.

There has long been compelling evidence that scaling can provide a good account of at least some perceptual behavior (e.g., Nosofsky’s 1986 and 1987 papers on selective attention). There is also some evidence that rotation of at least some dimensions causes substantial changes in behavior (e.g., a 1990 paper by Melara & Marks). I’m not aware of any research on shear transformations of perceptual space, but given that scaling and rotation have been investigated with some success, shear transformation seem very likely to be amenable to empirical and theoretical inquiry.

In the end, then, mathematical analysis (of GRT, in this case) allows us to see where we might productively aim future efforts. It enables us to ask very specific questions about scaling, rotation, and shear transformations. Perhaps unfortunately, though, because of the empirical equivalence of the (rotation and shear) transformed and untransformed GRT models, it does not (yet) give us the tools to answer these questions.

Of course, if this case is at all illustrative of a general state of affairs, working out new mathematical tools to answer these questions seems very likely to allow us to formulate and ask any number of new, and as of yet unanticipated, incisive questions.

Lather, rinse, repeat, ad nauseam.

Posted in statistical modeling | Comments Off on More thoughts on GRT wIND

A critical assessment of GRT wIND

I’ve blogged about GRT (General Recognition Theory) at least a few times before (e.g., here, here, here). The third of those links goes to an introductory GRT post (#1 of an unknown number of planned posts). Here’s the super-abbreviated version: GRT is a multidimensional extension of signal detection theory (SDT). The simplest GRT model consists of four bivariate Gaussian densities (perceptual distributions) and two decision bounds. These are used to model identification confusion patterns for stimuli consisting of a factorial combination of two levels on each of two dimensions (e.g., square or octagon on a shape dimension, red or purple on a hue dimension).

Robin Thomas and I have done some theoretical work with this model recently. More specifically, we wrote a paper in which we described how failures of decisional separability (i.e., decision bounds that are not parallel to the coordinate axes) are not identifiable in the 2 \times 2 model. The basic issue is that, with linear bounds, affine transformations of the modeled perceptual space can always map a model with failure of DS onto a model in which DS holds. (This is illustrated in the second ‘here’ link above.)

More recently, Soto, et al. developed a multilevel GRT model (called GRT wIND – GRT with Individual Differences) and claimed to have solved the identifiability problem that Robin and I detailed (pdf). Still more recently, Soto and Ashby wrote a second paper with GRT wIND (pdf), which I reviewed (twice) for the journal Cognition. Yesterday, I got an email (as a member of the Society for Mathematical Psychology) announcing an R package for fitting GRT wIND (and carrying out various other GRT-related analyses).

The paper I reviewed has since been accepted for publication, and I confess some measure of frustration with the fact that Soto and Ashby (and the editor who handled the paper for Cognition) seem to have completely ignored the comments and suggestions from my second (and signed) review.

One key issue I raised was a concern that the GRT wIND model is over-parameterized, i.e., that given a data set, the parameters are not uniquely identifiable. In part motivated by my frustration, and in (larger) part because I actually want to know if my concern is valid or not, and in part because they made the model fitting software available, I decided to ry to validate (or not) my concern about over-parameterization in GRT wIND.

Some GRT wIND background first: The group level model consists of four bivariate Gaussian densities. Each has a mean vector and a covariance matrix. The individual subject level for subject k consists of four bivariate Gaussian densities with the same means and transformed covariance matrices as well as two decision bounds, the latter possibly not parallel with the coordinate axes.

If the group level covariance matrix corresponding to stimulus A_i,B_j (i.e., the i^{th} level of the A dimension and the j^{th} level of the B dimension) is given by:

    \begin{equation*}\mathbf{\Sigma}_{A_i,B_j} = \left[\begin{array}{cc}{\sigma_{xx}&\sigma_{xy}\\\sigma_{yx}&\sigma_{yy}\end{array} \right]\end{equation*}

then the corresponding individual level matrix for subject k is given by:

    \begin{equation*}\mathbf{\Sigma}_{k,A_i,B_j} = \left[\begin{array}{cc}{\frac{\sigma_{xx}}{\kappa_k \lambda_k}&\frac{\sigma_{xy}}{\kappa_k\sqrt{\lambda_k(1-\lambda_k)}}\\\ \frac{\sigma_{yx}}{\kappa_k\sqrt{\lambda_k(1-\lambda_k)}}&\frac{\sigma_{yy}}{\kappa_k (1-\lambda_k)}\end{array} \right]\end{equation*}

The two additional parameters - \kappa and \lambda - scale the subject’s space overall (\kappa) and scale the two dimensions relative to one another (\lambda). As \kappa increases, the marginal variances decrease, thereby increasing the salience of the stimuli and reducing the number of (predicted) errors. As \lambda increases, the marginal variance on the x axis decreases and the marginal variance on the y axis increases, shifting the relative salience of the two dimensions. As Soto, et al., note, equivalent changes in response probabilities could be obtained by leaving the covariance matrices alone and shifting the means of the group level densities, instead.

As a brief aside, it’s interesting (to me) to note that you can represent the individual level matrix as a linearly-transformed group level matrix:

    \begin{eqnarray*}\mathbf{\Sigma}_{k,A_i,B_j} &=& \mathbf{T\Sigma}_{A_i,B_j}\mathbf{T}^t\\ &=&\left[\begin{array}{cc}{\frac{1}{\sqrt{\kappa_k\lambda_k}}&0\\ 0 &\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\end{array} \right]\left[\begin{array}{cc}{\sigma_{xx}&\sigma_{xy}\\\sigma_{yx}&\sigma_{yy}\end{array} \right]\left[\begin{array}{cc}{\frac{1}{\sqrt{\kappa_k\lambda_k}}&0\\ 0 &\frac{1}{\sqrt{\kappa_k(1-\lambda_k)}}\end{array} \right]\end{eqnarray*}

The point here being that the Silbert & Thomas results concerning linear transformations and identifiability of decisional separability apply to the individual subject level models in GRT wIND. Soto, et al., say that the individual subjects share a perceptual space, but in reality they share some, but not all, properties of the group level perceptual space. In GRT wIND, expansion/contraction of the space and stretching/shrinking the group space along one or another coordinate axis are acceptable transformations, but rotations, shears, etc…, are not. So, GRT wIND solves the decisional separability identifiability issue if this distinction is valid for a given stimulus set and experimental subject group, but it doesn’t solve the problem in general.

But I digress.

A little more background about GRT: In some of the oldest papers on GRT, there are mentions of the fact that the means and marginal variances of the perceptual distributions are not both uniquely identifiable (this is closely related to the note above about scaling the variances vs the means in GRT wIND). I long assumed that this was true, but to the best of my knowledge, no one had proven it. I figured out that an affine transformation can be applied to each perceptual distribution (with arbitrary means and covariance matrices) in a GRT model to force it to have unit variances and appropriately shifted means (where by “appropriately” I mean that the shift in the means exactly balances the scaled variances, producing the same predicted response probabilities for a given set of decision bounds, assuming decisional separability holds, which see above with regard to identifiability). More specifically, if you have decision criteria c_x and c_y, the following transformation gives you unit variances and appropriately shifted means:

    \begin{equation*} \mathbf{T} + \Delta = \left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right] + \left[\begin{matrix}c_x -\frac{c_x}{\sqrt{\sigma_{xx}}}\\c_y - \frac{c_y}{\sqrt{\sigma_{yy}}}\end{matrix}\right] \end{equation*}

Application of this transformation produces a new covariance matrix \mathbf{R} = \mathbf{T}\Sigma\mathbf{T}^T and new mean vector \boldsymbol{\eta} = \mathbf{T}\boldsymbol{\mu} + \Delta. The transformed covariance matrix is the correlation matrix:

    \begin{eqnarray*} \mathbf{T}\Sigma\mathbf{T}^T &=& \left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\left[\begin{matrix}\sigma_{xx} & \sigma_{xy}\\ \sigma_{yx} & \sigma_{yy} \end{matrix} \right]\left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\\ &=&\left[\begin{matrix}\frac{\sigma_{xx}}{\sqrt{\sigma_{xx}\sigma_{xx}}} & \frac{\sigma_{xy}}{\sqrt{\sigma_{xx}\sigma_{yy}}}\\\frac{\sigma_{yx}}{\sqrt{\sigma_{xx}\sigma_{yy}}} & \frac{\sigma_{yy}}{\sqrt{\sigma_{yy}\sigma_{yy}}}\end{matrix}\right]\\ &=& \left[\begin{matrix}1 & \rho_{xy} \\ \rho_{yx} & 1\end{matrix}\right] \end{eqnarray*}

And the transformed mean vector is shifted by the signed distance (in standard deviation units) between the mean and the decision criterion (on each dimension):

    \begin{eqnarray*} \boldsymbol{\eta} &=& \mathbf{T}\boldsymbol{\mu} + \Delta\\ &=& \left[\begin{matrix}\frac{1}{\sqrt{\sigma_{xx}}} & 0\\ 0 & \frac{1}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\left[\begin{matrix}\mu_x\\\mu_y\end{matrix}\right] + \left[\begin{matrix}c_x -\frac{c_x}{\sqrt{\sigma_{xx}}}\\c_y - \frac{c_y}{\sqrt{\sigma_{yy}}}\end{matrix}\right]\\ &=& \left[\begin{matrix}c_x + \frac{\mu_x-c_x}{\sqrt{\sigma_{xx}}}\\c_y + \frac{\mu_y-c_y}{\sqrt{\sigma_{yy}}}\end{matrix}\right] \end{eqnarray*}

Note that this transformation is invertible. The upshot of this is that a standard 2 \times 2 GRT model can be scaled arbitrarily. Or, to repeat the point made above, the means and marginal variances are not both identifiable.

So, what does this all have to do with GRT wIND? My intuition was that this identifiability issue applies to GRT wIND, too. I didn’t feel like working through all the algebra to prove that this is the case, so I did the next best thing. Specifically, I simulated individual subject data using two GRT wIND models. In one, the group level perceptual means describe a square centered on the origin of the space with sides of length four, and the group level covariance matrices have randomly generated variances and correlations. In the second model, I applied the “unit-variance” transformation to the model (with the coordinate axes as the ‘decision criteria’).

I also randomly generated a set of 50 \kappa, \lambda, \phi, and \omega values, where \kappa and \lambda are the GRT wIND parameters described above, and \phi specifies the angle between c_y and the x axis, and \omega specifies the angle between c_x and c_y:

Linear failure of decisional separability
Linear failure of decisional separability

The following figure illustrates the mean (red), median (blue), and maximum (green) absolute error between the predicted identification confusion probabilities for the two models:

grt_wind_rescale_error And here are the \phi and \omega values:grt_wind_noDS_angles

And here are the \lambda and \kappa values:

grt_wind_lam_kapAnd here is the R script that runs the simulations and generates these plots.

I’ve run the simulation a number of times, and, while sometimes the maximum errors are larger (~.15 or thereabouts), the two models consistently produce extremely similar, in many cases essentially identical, confusion matrices.

I take this as strong support for my intuition that GRT wIND is over-parameterized. A mathematical proof would be better, of course, but simulation results provide relevant evidence. Now, I’m not totally sure why there is any deviation between the models, though there is no correlation between any of the randomly generated parameters and the error. But because there is some error, the simulations don’t provide the kind of airtight case a proof would. Nonetheless, it is all but unimaginable to me that the true parameters of the two models could be recovered by fitting sets of confusion matrices that are this similar to one another.

As it happens, I would like to fit GRT wIND models to simulated data to test how accurately known parameters are recovered and to come at the problem from the opposite direction (i.e., instead of seeing if different parameter values can generate essentially identical data, see if different parameter values can provide essentially identical model fits to identical data). Alas, in the R package, the function grt_wind_fit crashes R, and the function grt_wind_fit_parallel returns an error. I’ve opened up an issue on the GitHub repository, so maybe I’ll try again when/if I can get the code to work on my computer(s). And, I suppose, depending on some of my colleagues and collaborators react, I may also get to work on a more rigorous mathematical treatment of the matter.

Posted in SCIENCE!, statistical modeling | Comments Off on A critical assessment of GRT wIND

Vox lies with statistics

Okay, maybe lies is too strong a word, but Vox‘s German Lopez sure does a good job illustrating numerical dishonesty in this piece on marijuana use in Colorado and Washington since recreational legalization.

The headline is “Marijuana use rises in states with legalization,” and here are the first two take-home bullet points:

  • Past-month marijuana use in Colorado and Washington state rose following the legalization of personal possession, according to new data from a federal survey.
  • Nationwide, Americans reported smaller percentage point increases in marijuana use than people in Colorado or Washington.

Okay, so that’s maybe interesting, but you can’t really interpret increases in Colorado and Washington without knowing something about what happened in other states, none of which legalized marijuana.

Here’s the data. If you bother to look at it, you’ll see that there were also statistically-significant, similar-magnitude increases in the District of Columbia, Georgia, Maine, Maryland, Michigan, Missouri, and New Hampshire.

And this points to a problem with the comparison between Colorado and Washington, on the one hand, and the nation as a whole, on the other. It seem a priori very likely that marijuana use varies fairly substantially across states, and the data confirm this. A single pair of numbers for the whole country obscures this variation.

Nothing Lopez writes in the article is technically wrong. It’s true that past-month use increased following legalization in Colorado in Washington. It’s true that these were the first states to legalize marijuana in 2012. It’s true that use increased less in the nation as a whole than in either state.

But it’s also true that use increased in a number of states without legalization.

Obviously, I don’t know why Lopez wrote this article the way he did. Whether or not it was written with the intention to deceive, it exploits the careless reader’s inclination to conflate correlation and causation.

The fact that the data is directly accessible means that if you care to look, you will quickly see how meaningless the article really is. But it also means that it’s very easy to take what is written at face value under the assumption that it would be ridiculous to link to data that undermines the whole point of the article.


Posted in SCIENCE!, statistical description | Comments Off on Vox lies with statistics

Yglesias makes an innumerate (non-)funny

Yesterday, Matt Yglesias tried to point out the absurdity of a Joe Klein column by turning the tables with the numbers Klein invoked, thereby mimicking and mocking Klein’s argument.

Yglesias quotes Klein:

Blacks represent 13% of the population but commit 50% of the murders; 90% of black victims are murdered by other blacks. The facts suggest that history is not enough to explain this social disaster.

Yglesias then links to the relevant FBI data and points out that:

Back in 2011, the most recent year for which data is available, a staggering 83 percent of white murder victims were killed by fellow Caucasians.

He uses this to (jokingly) make the exaggerated case that white-on-white murder is out of control, that white violence is a social disaster, and so on.

Now, I understand what he’s trying to do here, but, either because of innumeracy or willful negligence, his whole jokey approach fails miserably, at least as long as the reader isn’t innumerate.

Note which parts of the Klein quote above have analogs in the Yglesias quote and which parts do not. The key to Klein’s argument is that “Blacks represent 13% of the population but commit 50% of the murders,” but Yglesias only invokes the white analog to Klein’s “90% of black victims are murdered by other blacks.

Yglesias’ essay depends on the reader not noticing that the relevant analog to the important part of Klein’s argument is that whites represent 72% of the population and commit (approximately) 50% of the murders (based on the FBI data linked above).

To be clear, this is not intended as a defense of (or an argument against) Klein. It’s just a response to Yglesias’ shiftiness (or ignorance) and his willingness to exploit readers’ possible innumeracy to make his point. It’s an unfunny failure of an argument.

Posted in statistical description | Comments Off on Yglesias makes an innumerate (non-)funny

Truly random

In today’s new What If?, Randall Munroe discusses, in part, randomly aiming a laser at the sky, writing “if you aimed in a truly random direction, you would have an almost 50% chance of hitting the Earth.”

This is one of those cases wherein my training in linguistics and statistics conflict with one another. Kind of like how people use “a fraction of” to indicate a small fraction of something, despite the fact that, for example, 999,999/1,000,000 is a fraction, too. As is, for that matter, 10/3. So, saying that this year’s budget deficit is a fraction of last year’s isn’t, technically speaking, informative. It certainly does not imply that this year’s deficit is a small proportion of last year’s, though this is pretty much always what people mean when they say things like this.

So, my linguistics training lets me understand that the phrase is used to mean “a small fraction,” and that’s fine. But my statistics (and assorted math) training just can’t let it go.

I have similar issues with using a phrase like “truly random” to mean “uniformly distributed.” Clearly, Munroe means the latter, but a laser could be pointed randomly as if it were governed by a Gaussian distribution, which, with sufficiently small variance(s) and an appropriate mean (vector), would produce a very small chance of hitting the Earth.

Again, my linguistics tells me that it’s fine to use “truly random” to mean “uniformly distributed” in regular old language, but my statistics training just can’t let this kind of thing go.

It’s particularly hard to digest when it comes from a mathematically sophisticated writer in a semi-technical setting.



Posted in language, mildly informative filler | Comments Off on Truly random