Suppose I have a six-sided die that is red on 2 sides and blue on 4 sides, and another die that’s the other way around, red on 4 sides and blue on 2.

I choose a die at random and roll it, and I tell you it came up red. What is the probability that I rolled the second die (red on 4 sides)? And if I do it again, what’s the probability that I get red again?

He provides links to a Jupyter notebook with his answer, but I’m going to write my answer here before I read the notebook.

The first part of the problem is a pretty straightforward Bayes’ rule/signal detection theory. Denote the four-blue-sides die , the four-red-sides die , and the observation of a red side . Then we have:

Assuming that by “choose a die at random” we mean that , then we can plug these into Bayes’ rule to get:

Note that this implies that .

It’s maybe a bit trickier to figure out the next part, but, unless I’m mistaken, it’s not all that tricky:

Of course, you’ll have to take my word for it that I haven’t yet looked at Downey’s notebook. I don’t know if my answer agrees with his or not, but I can’t think of any reason why the equations above are wrong.

I’m going to publish this post to put it on the record, then read the notebook, and report back as needed.

Addendum: Well, I maybe misread the problem. In Downey’s “Scenario A”, by “do it again” he means “pick a die at random and roll it.” In his “Scenario B”, he means what I originally interpreted it to mean, namely “take the die that produced the first time and roll it a second time.”

I’m happy to see that his simulations and my analytic solution agree when the word problem is interpreted the same.

It seems to me that part of what makes probability hard, when it is hard, is translating ambiguous words into unambiguous mathematical statements.

]]>The logic of Kunert’s paper is as follows:

If psychological findings are highly context-dependent, and if variation in largely unknown contextual factors drives low rates of replication, then paper-internal (mostly conceptual) replications should correlate with independent replication success. The idea is that in conceptual replications (which constitute most paper-internal replications), contextual factors are intentionally varied, so effects that show up in repeated conceptual replications should be robust to the smaller amount of contextual variation found in independent, direct (i.e., not conceptual) replications. Kunert calls this the unknown moderator account. Crucially, Kunert argues that the unknown moderator account predicts that the studies with internal replications will be replicated more successfully than are studies without internal replications.

On the other hand, if low replication rates are driven by questionable research practices – optional stopping, the file drawer effect, post-hoc hypothesizing – then studies with internal replications will not be replicated more (or less) successfully than are studies without internal replications.

Kunert analyzes *p* values and effect sizes in the OSF replication studies. Here’s his Figure 1, illustrating that (a) there’s not much difference between studies with internal replications and those without, and (b) what difference there is points toward studies with internal replications having lower rates of replication (as measured by statistically significant findings in the independent replications, see left panel) and greater reductions in effect size (right panel):

I think there’s a flaw in the reasoning behind the unknown moderator account, though. Specifically, I don’t think the unknown moderator account predicts a difference in replication rates between studies with and without internal replications.

The logic underlying the prediction is that *if* internal replications, *then* successful independent replications. But modus ponens does not license the conclusion that if *no* internal replications, then *not* successful independent replications. Studies without internal replications could lack internal replications for any number of reasons. In order for the unknown moderator account to predict a difference in independent replication rates between studies with and without internal replications, the absence of internal replications has to directly reflect less robust effects. Kunert doesn’t make a case for this, and it’s not clear to me what such a case would be or if it could be made.

So, the unknown moderator account is, I think, consistent with equal independent replication rates (on average) across studies with and without internal replications.

It’s possible, for example, that the unknown moderator account is true while all of the OSF studies probed (approximately) equally robust effects, with only a subset of them including internal replications. Or while some proportion of the findings from the OSF studies without internal replications are as robust as the findings from the studies with internal replications, while the rest are not.

The upshot is that the unknown moderator account predicts *equal or greater* independent replication rates for studies with internal replications than for those without. Given this, I think it’s noteworthy that Kunert reports lower replication rates and greater reductions in effect size for studies with internal replications than for those without. These effects aren’t statistically signif… er, don’t have sufficiently large Bayes’ factors or sufficiently shifted posterior distributions to license particularly strong conclusions, but they do both point in the direction that is inconsistent with even my modified version of the unknown moderator effect.

I think the unknown moderator account probably also predicts greater variation in independent replication rates for studies without internal replications than in those with. I’m not sure if this prediction holds or not, but based on Kunert’s Figure 1, it doesn’t seem likely.

It’s also worth remembering that modus ponens also implies that if *not* successful independent replications, then *no* internal replications. So the unknown moderator account also predicts that sufficiently low independent replication rates should correspond to studies without internal replications.

But it’s not at all clear “sufficiently low” means here. The replication rates in the OSF project that Kunert analyzed seem pretty low to me (and the whole point of Kunert’s analysis to test two explanations for such low rates), but I have no idea if they’re low enough to confirm this prediction.

And, of course, this logic is just as asymmetrical as the logic described above, since the *presence* of successful independent replications is consistent with the presence *and* the absence of internal replications. If even the low rates reported in the OSF project count as successful replication, then we can’t really infer much from approximately equal rates across these two categories of psychology study.

The case is the calculation of means (or other functions) along specified axes of multidimensional arrays.

In IPython (with pylab invoked), you specify the axis along which you want to apply the function of interest, and, at least in my opinion, you get output arrays that are exactly the shape you would expect. If you have a array and you calculate the mean along the first axis, you get a array, and analogously for the second and third axes:

In [1]: X = random.normal(size=(3,4,5)) In [2]: X.mean(axis=0).shape Out[2]: (4, 5) In [3]: X.mean(axis=1).shape Out[3]: (3, 5) In [4]: X.mean(axis=2).shape Out[4]: (3, 4)

In R, you get similarly sensible results, but you have to specify the axes along which you *don’t* want to apply the function (which I find much more confusing than the Python approach shown above):

> X = array(rnorm(60),dim=c(3,4,5)) > dim(apply(X,c(2,3),mean)) [1] 4 5 > dim(apply(X,c(1,3),mean)) [1] 3 5 > dim(apply(X,c(1,2),mean)) [1] 3 4

And here’s what Matlab does:

>> X = random(makedist('Normal'),3,4,5); >> size(mean(X,1)) ans = 1 4 5 >> size(mean(X,2)) ans = 3 1 5 >> size(mean(X,3)) ans = 3 4

Ugh. For some reason, Matlab preserves the dimension if you apply the function on the first or second axis, but drops it if you apply it on the third (or greater) axis. This is annoying.

So, in summary, I continue to be happy using Python for almost everything, R for most of what’s left over, and Matlab only very rarely anymore.

]]>I’ve read it twice now, and may well go back to it again at some point. It’s thought provoking, for me in no small part because I like to use Bayesian model fitting software (primarily PyMC(3) these days), but I don’t think of myself as “a Bayesian,” by which I mean that I’m not convinced by the arguments I’ve read for Bayesian statistics being optimal, rational, or generally inherently better than frequentist statistics. I am a big fan of Bayesian *estimation*, for reasons I may go into another time, but I’m ambivalent about (much of) the rest of Bayesianism.

Which is not to say that I *am* convinced by arguments for any particular aspect of frequentist statistics, either. To be frank, for some time now, I’ve been in a fairly uncertain state with respect to how I think statistical models should, and do, relate to the world. Perhaps it’s a testament to my statistical training that I am reasonably comfortable with this uncertainty. But I’m not so comfortable with it that I want it to continue indefinitely.

So, my motivation for writing this post is to (at least partially) work through some of my thoughts on a small corner of this rather large topic. Specifically, I want to think through what properties of confidence and/or credible intervals are important and which are not, and how this relates to interpretation of reported intervals.

(I know that the more general notion is confidence/credible *set*, but everything I say here should apply to both, so I’ll stick with “interval” out of habit.)

Early in my time as a PhD student at IU, I took John Kruschke’s Intro Stats class. This was well before he started writing his book, so it was standard frequentist fare (though I will stress that, whatever one’s opinion on the philosophical foundations or everyday utility of the content of such a course, Kruschke is an excellent teacher).

I learned a lot in that class, and one of the most important things I learned was what I now think of as the only reasonable interpretation of a confidence interval. Or maybe I should say that it’s the best interpretation. In any case, it is this: a confidence interval gives you the range of values of a parameter that you cannot reject.

If I’m remembering correctly, this interpretation comes from David Cox, who wrote, in Principles of Statistical Inference (p. 40) “Essentially confidence intervals, or more generally confidence sets, can be produced by testing consistency with every possible value in ψ and taking all those values not ‘rejected’ at level c, say, to produce a 1 − c level interval or region.”

In Shalizi’s dialogue, A argues that the coverage properties of an interval over repetitions of an experiment are important. Which is to say that what makes confidence intervals worth estimating is the fact that, if the underlying reality stays the same, in the proportion 1-c of repetitions, the interval will contain the true value of the parameter.

But the fact that confidence intervals have certain coverage properties does not provide a reason for reporting confidence intervals in any single, particular case. If I collect some data and estimate a confidence interval for some statistic based on that data, the expected long run probability that the procedure I used will produce intervals that contain the true value of a parameter says *absolutely nothing* about whether the true value of the parameter is in the single interval I have my hands on right now.

Obviously, it’s good to understand the properties of the (statistical) procedures we use. But repetitions (i.e., direct, rather than conceptual, replications) of experiments are vanishingly rare in behavioral fields (e.g., communication sciences and disorders, where I am; second language acquisition, linguistics, and psychology, where I have, to varying extents, been in the past), so it’s not clear how relevant this kind of coverage is in practice.

More importantly, it’s not clear to me what “the true value of a parameter” means. The problem with this notion is easiest to illustrate with the old stand-by example of a random event, the coin toss.

Suppose we want to estimate the probability of “heads” for two coins. We could toss each coin times, observe occurrences of “heads” for the coin, and then use our preferred frequentist or Bayesian statistical tools for estimating the “true” probability of “heads” for each, using whatever point and/or interval estimates we like to draw whatever inferences are relevant to our research question(s). Or we could remove essentially all of the randomness, per Diaconis, et al’s approach to the analysis of coin tosses (pdf).

The point being that, when all we do is toss the coins *N* times and observe “heads,” we ignore the underlying causes that determine whether the coins land “heads” or “tails.” Or maybe it’s better to say that we partition the set of factors determining how the coins land into those factors we care about and those we don’t care about. Our probability model – frequentist, Bayesian, or otherwise – is a model of the factors we don’t care about.

In this simple, and somewhat goofy, example, the factors we care about are just the identity of the coins (Coin A and Coin B) or maybe the categories the coins come from (e.g, nickels and dimes), while the factors we don’t care about are the physical parameters that Diaconis, et al, analyzed in showing that coin tosses aren’t really random at all.

I don’t see how the notion of “true” can reasonably be applied to “value of the parameter” here. We might define “the true value of the parameter” as the value we would observe if we could partition all of the (deterministic) factors in the relevant way for all relevant coins and estimate the probabilities with very large values of *N*.

But the actual underlying process would still be deterministic. Perhaps this notion of “truth” here reflects a technical, atypical use of a common word (see, e.g., “significance” for evidence of such usage in statistics), but defining “truth” with respect to a set of decisions about which factors to ignore and which not to, how to model the ignored factors, and how to collect relevant data seems problematic to me. Whatever “truth” is, it doesn’t seem *a priori* reasonable for it to be defined in such a instrumental way.

The same logic applies to more complicated, more realistic cases, very likely exaggerated by the fact that we can’t fully understand, or even catalog, all of the factors influencing the data we observe. I’m thinking here of the kinds of behavioral experiments I do and that are at the heart of the “replication crisis” in psychology.

So, where does this leave us? My intuition is that it only really makes sense to interpret c{onfidence, redible} intervals with respect to whatever model we’re using, and treat them as sets of parameter values that are more or less consistent with whatever point estimate we’re using. Ideally, this gives us a measure of the precision of our estimate (or of our estimation procedure).

Ultimately, I think it’s best to give all of this the kind of instrumental interpretation described above (as long as we leave “truth” out of it). I like Bayesian estimation because it is flexible, allowing me to build custom models as the need arises, and I tend to think of priors in terms of regularization rather than rationality or subjective beliefs. But I’ll readily own up to the fact that my take on all this is, at least for now, far too hand-wavy to do much philosophically heavy lifting.

]]>Just in case you don’t want to read the wikipedia entry linked above, here are some quick definitions. Suppose you have a test for diagnosing a disease. The sensitivity of the test is the proportion of people with the disease that the test correctly identifies as having the disease (i.e., the hit rate, henceforth H). The specificity of the test is the proportion of people without the disease that the test correctly identifies as not having the disease (i.e., the correct rejection rate, henceforth CR).

H and CR are useful measures, to be sure, but they obscure some important properties of diagnostic tests (and of binary classifiers in probabilistic decision making in general). Rather than H and CR, we can (and should) think in terms of *d’* – the “distance” between the “signal” and “noise” classes – and *c* – the response criterion. Here’s a figure from an old post to illustrate:

In this illustration, the x-axis is the strength of evidence for disease according to our test. The red curve illustrates the distribution of evidence values for healthy people, and the blue curve illustrates the distribution of evidence values for people with the disease. The vertical dashed/dotted lines are possible response criteria. So, in this case, *d’* would be , where is some measure of the variability in the two distributions. It is useful to define *c* as the signed distance of the response criterion with respect to the crossover point of the two distributions. I’ll note in passing that I’m eliding a number of important details here for the sake of simplicity (e.g., the assumption of equal variances in the two distributions, the assumption of normality in same), which I’ll come back to below.

H and CR are determined by *d’* and *c*. H is defined as the integral of the blue curve to the right of the criterion, and CR as the integral of the red curve to the left of the criterion.

So, one important property of a binary classifier that H and CR obscure but that *d’* and *c* illuminate is the fact that, for a given *d’*, as you shift *c*, H and CR trade off with one another. Shift *c* leftward, and H increases while CR decreases. Shift *c* rightward, and H decreases while CR increases. In the figure above, you can see how the areas under the red and blue curves differ for the dashed and dotted vertical lines – H is lower and CR higher for the dotted line than for the dashed line.

Another important property of a binary classifier is that, as you increase *d’*, either H increases, CR increases, or both increase, depending on where the response criterion is. In the above figure, if we increased the distance between and (without changing the variances of the distributions) by shifting to the left by some amount and by shifting to the right by , both H and CR would increase.

The egregious error I encountered is in the “Clinical Decision Analysis Regarding Test Selection” section of the ASHA technical report on (Central) Auditory Processing Disorder, or (C)APD (I’ll quote the report as it currently stands – I plan to email someone at ASHA to point this out, after which, I hope it will be fixed):

The sensitivity of a test is the ratio of the number of individuals with (C)APD detected by the test compared to the total number of subjects with (C)APD within the sample studied (i.e., true positives or hit rate). Specificity refers to the ability to identify correctly those individuals who do not have the dysfunction. The specificity of the test is the ratio of normal individuals (who do not have the disorder) who give negative responses compared to the total number of normal individuals in the sample studied, whether they give negative or positive responses to the test (i.e., 1 – sensitivity rate). Although the specificity of a test typically decreases as the sensitivity of a test increases, tests can be constructed that offer high sensitivity adequate for clinical use without sacrificing a needed degree of specificity.

The egregious error is in stating that CR is equal to 1-H. As illustrated above, it’s not.

The oversimplification-and-limitation-illustration was in Part II of a recent Slate Star Codex (SSC) post. Here’s the oversimplification:

AUC is a combination of two statistics called sensitivity and specificity. It’s a little complicated, but if we assume it means sensitivity and specificity are both 92% we won’t be far off.

AUC here means “area under the curve,” or, as I called it above, area under the *receiver operating characteristic* curve (or AUROC). Here’s a good Stack Exchange answer describing how AUROC relates to H and CR, and here’s a figure from that answer:

The x axis in this figure is 1-CR, and the y axis is H. The basic idea here is that the ROC is the curve produced by sweeping across all possible values of *c* for a test with a given *d’*. If we set *c* as far to the right as we can, we get H = 0 and CR = 1, so 1-CR = 0 (i.e., we’re in the bottom left of the ROC figure). As we shift *c* leftward, H and 1-CR increase. Eventually, when *c* is as far to the left as we can go, H = 1 and 1-CR = 1 (i.e., we’re at the top right of the ROC figure).

The AUROC can give you a measure of something like *d’* without requiring, e.g., assumptions of equal variance Gaussian distributions for your two classes. Generally speaking, as *d’* increases, so does AUROC.

So, the oversimplification in the quote above consists in the fact that the AU(RO)C does not correspond to single values of H and CR.

Which brings us to the illustration of the limitations of H and CR. To get right to the point, H and CR don’t take the base rate of the disease into account. Let’s forget about the conflation of AUROC and H and CR and just assume we have H = CR = 0.92. Per the example in the SSC post, if you have a 0.075 proportion of positive cases, H and CR are problematic: you have 92% accuracy, but less than half of the people identified by the test as diseased actually have the disease!

The appropriate response here is to shift the criterion to take the base rate (and the costs and benefits of each combination of true disease state and test outcome) into account. Given how often Scott Alexander (i.e., the SSC guy) argues for Bayesian reasoning and utility maximization, I am a bit surprised (and chagrined) he didn’t go into this, but the basic idea is to respond “disease!” if the following inequality holds:

Here, is the probability of a particular strength of evidence of disease given that the disease is present, is the probability given that the disease is not present, and are the prior probabilities of the disease being present or not, respectively, and is the utility of test outcome and reality (e.g., is the cost of the test indicating “disease” while in reality the disease is not present).

The basic idea here is that the relative likelihood of a given piece of evidence when the disease is present vs when it’s absent needs to exceed the ratio on the right. By “piece of evidence” I mean something like the raw, pre-classification score on our test, which corresponds to the position on the x axis in the first figure above.

The ratio on the right takes costs and benefits into account and weights them with the prior probability of the presence or absence of the disease. We can illustrate a simple case by setting and and just focusing on prior probabilities. In this case, the inequality is just:

In the SSC case, and , so the criterion for giving a “disease!” diagnosis should be . That is, we should only diagnose someone as having the disease if the strength of evidence given the presence of the disease is 12+ times the strength of evidence given the absence of the disease.

]]>As it happens, it matters for the analysis, since it actually makes small and medium pizzas even *worse* deals. A one inch crust is, proportionally, 33%, 28%, or 22% of the area of a small, medium, or large, respectively.

Here is a graph showing the square inches per dollar as a function of number of toppings, taking a one inch crust into account:

And here’s a graph showing dollars per square inch as a function of number of toppings, taking a one inch crust into account (see previous post for plots that don’t take the crust into account):

By taking the crust into account, we see that the large is an even better deal than before. It’s also (very, very nearly) interesting to note that the crossover between smalls and mediums has shifted leftward a couple toppings. Without taking crust into account, smalls were a better deal than mediums for 0, 1, or 2 toppings, but with a one inch crust, small just barely beats medium, and then only if you get a plain cheese pizza.

In addition, here’s a plot showing square inches per dollar for small and medium pizzas relative to square inches per dollar for a large:

For no toppings, you get ~60% as many square inches per dollar for either small or medium as you do for a large. This ratio stays fairly constant for mediums, but drops substantially for smalls, approaching 50% for 10 gourmet toppings.

And, finally, here’s a plot showing dollars per square inch for small and medium pizzas relative to dollars per square inch for a large:

With respect to dollars per square inch, you spend ~150% for a no-topping small or medium relative to a large. The ratio stays more or less constant for mediums, while it increases quite a bit for smalls. If, for some reason, you decided to buy a 10-gourmet-topping small pizza, you’d be spending almost twice as much per square inch as you would if you bought a 10-gourmet-topping large.

I have way more real work to do than you might think.

]]>

So, naturally, this will be about the cost of pizza at a local pizzeria.

Sometimes we get pizza from a local place called Dewey’s. Normally, we get a couple of large pizzas. One time recently, though, I decided to get three medium pizzas, thinking that this would give us more or less the same amount of food while allowing us to maximize the variety of toppings. It occurred to me then (though after I had placed the order) that I should figure out which pizza size is the best deal. Tonight, we got Dewey’s again, and I remembered to work this out before ordering.

On the “create your own” page, they give the sizes (in diameter), the base price for each size, and the cost per topping for regular and gourmet toppings.

The sizes are 11″, 13″, and 17″.

The base prices are $8.95, $13.45, and $15.95.

Each regular topping costs $1.50, $1.75, or $2.00, and each gourmet topping costs $1.75, $2.00, or $2.25.

If indicates the radius, indicates the base price, indicates the number of toppings, and indicates the cost per topping, the cost of a pizza expressed as the number of square inches per dollar is then given by , which is a non-linear function of , and the cost expressed as the number of dollars per square inch is the reciprocal of this, , which is a linear function of .

This difference is very nearly interesting.

Anyway, I wrote a short Python script to calculate and plot costs for each size and topping type from 0 to 10 toppings.

Here’s the graph for square inches per dollar:

And here’s the graph for dollars per square inch:

The most obvious result here is that large pizzas are by far the best deal. You get way more pizza per dollar (or, if you prefer, you spend fewer dollars per unit pizza) with a large than with a small or medium.

It’s also kind of interesting that medium pizzas are the worst deal if you’re getting 2 or fewer toppings, but that small pizzas are the worst deal if you’re getting 3 or more toppings (comparing within topping class).

Finally, given the multiplicative role that each additional topping plays, it’s not surprising that the more toppings you get, the bigger the difference between regular and gourmet toppings.

So, it looks like it’ll be only large pizzas from Dewey’s for us in the future.

]]>(Disclaimer that’s less a disclaimer than a mildly amusing anecdote: Jeff Rouder and I once went to get a cup of coffee while were at a Math Psych meeting, and as we walked to the coffee shop, I mentioned a paper that I found interesting, incorrectly stating that it was a paper that he had (co-)authored, which he quickly pointed out was not the case. I can’t imagine he remembers this, but I felt like a knucklehead, and that moment of awkwardness has stuck with me. I have not yet had a chance to say anything stupid to Uri Simonsohn in person.)

There are quite a few issues one could dig into here, as I suppose would be obvious to anyone who’s been following the frequentist-v-Bayesian argument. In order to keep things orderly, I’ll focus on one small corner of the larger debate that’s playing an interesting role in this particular discussion.

Specifically, I’ll focus on the role of inferential errors in statistical reasoning. Simonsohn wants to know how often Rouder’s Bayesian t-test produces the incorrect result, so he ran some simulations. In the simulations, the true effect size is known, which allows us to evaluate the testing procedure in a way that we can’t when we don’t know the truth ahead of time. Simonsohn posted this figure to illustrate the results of his simulations:

The take-home point is that the Bayesian t-test in question erroneously supports the null hypothesis ~25% of the time when the effect size is “small.” It is perhaps worth noting in passing that the fact that power is 50% in these simulations means that the standard null hypothesis tests that Simonsohn favors are, by definition, wrong a full 50% of the time. Depending on what you do with the Bayesian “inconclusive” region, this is up to twice the error rate of the Bayesian test Simonsohn is arguing against.

Anyway, Rouder’s take on this is as follows:

Uri’s argument assumes that observed small effects reflect true small effects, as shown in his Demo 1

[the simulations illustrated above – NHS]. The Bayes factor is designed to answer a different question: What is the best model conditional on data, rather than how do statistics behave in the long run conditional on unknown truths?

Now, Rouder is right that Simonsohn seems to assume that observed (small) effects reflect *true* (small) effects, which is problematic. This isn’t obvious from the simulations, but Simonsohn’s second case makes it clear. Briefly, the second case is an(other) absurd Facebook study that found, with a sample size of ~7,000,000 (~6,000,000 + ~600,000), that 6 million people who saw pictures of friends voting (on Facebook) were 0.39% more likely to vote than 600k people that didn’t see pictures of friends voting (on Facebook). The result was statistically significant, but I don’t know that I’ve ever seen a better illustration of the difference between statistical and practical significance (I mean, aside from the other ridiculous Facebook study linked above). Who could possibly care that a 0.39% increase in voting was statistically significant? I’m having a hard time figuring out why anyone – authors, editors, or reviewers – would think this result is worth publishing. It’s just spectacularly meaningless. But I digress.

Ahem.

Simulations like those run by Simonsohn are useful, and I can’t figure out why so many Bayesians don’t like them. The point of simulations isn’t to understand how “statistics behave in the long run conditional on unknown truths,” it’s to leverage known truths to provide information about how much we can trust observed results when we don’t know the truth (i.e., in every non-simulated case).

That is, simulations are useful because they tell us about the probability that an inference about the state of the world is false given an observation, not because they tell us about the probability of sets of observations given a particular state of the world. I would have thought that Bayesians would recognize and appreciate this kind of distinction, given how similar it is to the distinction between the likelihood and the posterior in Bayes’ rule.

One argument for Bayesian reasoning is that frequentist reasoning can’t coherently assign probabilities to one-off events. I’ve never found this particularly compelling, in large part because I’m interested in statistics as applied to scientific research, which is to say that I’m interested in statistics as applied to cases in which we are explicitly concerned with prediction, replication, and anything other than one-off events. The point being that the frequentist properties of our statistical models are important. Or, put slightly differently, our models *have* frequentist properties, whether we’re Bayesian or not, and whether the models are Bayesian or not.

So, Simonsohn was right to run simulations to test Rouder’s Bayesian t-test, even if he was wrong to then act as if a statistically significant difference of 0.39% is any more meaningful than no effect at all, and even if he glosses over the fact that null hypothesis testing doesn’t really do any better than the Bayesian method he’s critiquing. And Rouder is right that it’s good to know what the best model is given a set of observations, even if he’s (also) wrong about what simulations tell us.

]]>The book is IPython Notebook Essentials, written by L. Felipe Martins, and published by Packt Publishing. I’ve read Chapters 1 and 2 so far. Both are clear and well-written, with only a few minor problems, and those possibly as much due to my own proclivities as anything else.

Chapter 1 covers the basics of installing a suitable scientific python package (e.g., Anaconda, which I’m a big, big fan of), creating a notebook, and executing some basic code in that notebook. The code given is a bit goofy, implementing a model of change in coffee temperature as a function of time and initial temperature, but introductory code in this kind of book is often fairly ridiculous.

I was already reasonably familiar with Python before getting this book, so it’s a bit difficult for me to judge the pedagogical effectiveness of the code in Chapter 1 for someone with little to no programming experience. Of course, it’s not meant to be a comprehensive introduction to Python, so this kind of intro has to fit into kind of an awkward space. It can’t be too boring to someone with some (or a lot of) experience, but it can’t be too confusing or complicated for less experienced programmers. This book does a pretty good job balancing the needs of a potentially wide variety of reader experience levels, I think.

One small issue I have with the code is that the convention for variable naming uses a lot of full words and underscores. The benefit of this is, of course, that the names are unambiguous and you mostly know what variables are just by looking at their names. The downside is that the code often spans multiple lines in the book, which can make it slightly more difficult to read. I tend to err very much in the other direction in my own code, giving variables short, but sometimes confusing, names. So, my taste runs a bit against the convention in this book. I don’t think the author should have gone as far as I often do, but I think some of the code snippets in the book would look nicer and be a bit easier to read if the variable names had been a bit shorter.

Okay, so Chapter 1 is fine. It’s not spectacular, but it does its job okay. I’ll give it B.

]]>However, I don’t want this to be a purely negative effort, so I will use my discussion of why I’ve come full circle with respect to GRT wIND to illustrate one of the most valuable, and (possibly) counterintuitive, aspects of mathematical modeling. Here’s the punchline, in case you don’t want to wade through the technical details below: mathematical modeling allows us to formulate and ask incisive questions. More below, though please note that if you haven’t read yesterday’s post, this one isn’t going to make much sense.

As I discussed yesterday, there are a number of identifiability issues with the model for the standard GRT task. The task is identification of stimuli defined by the factorial combination of two levels on each of two dimensions (e.g., red square, purple square, red octagon, purple octagon), and the model consists of bivariate Gaussian perceptual distributions corresponding to the stimuli and two decision bounds that define response regions corresponding to each possible response (in the standard task, there is a unique response for each stimulus).

One issue, as mentioned above, is that failures of decisional separability are, in general, not identifiable. Recall that decisional separability is defined as decision bounds that are parallel to the coordinate axes. In the simplest case, you have linear bounds. If you have a failure of decisional separability with linear bounds, you can apply a sequence of linear transformations to produce an empirically equivalent model in which decisional separability holds. You can see some more details about this here, and you can read the proof in my 2013 Psychonomic Bulletin & Review paper with Robin Thomas (that’s a link to a paywall, so if you want a copy, just email me at noahpoah at any of a number of different domains, including the obvious one, given where you’re reading this).

Another issue is that the means and marginal variances of the perceptual distributions are not simultaneously identifiable. If you have a model with decisional separability and arbitrary marginal variances in the perceptual distributions, it’s easy to transform each distribution so that each marginal variance is one and the means are shifted to preserve the predicted response probabilities. I showed how this works in yesterday’s post.

I contend that the first issue applies to GRT wIND (contra the claims of Soto, et al., in the paper that introduced the model), and given the first issue, the second issue seems tied up in the over-parameterization problem I can’t seem to let go of.

As I discussed yesterday, GRT wIND has a group-level set of perceptual distributions with arbitrary mean vectors, marginal variances, and correlations (with one distribution fixed at the origin and with unit marginal variances). The individual-level model for subject has a scaled version of the group level model for its perceptual distributions and two linear decision bounds. As I mentioned yesterday, the scaling can be formalized as a linear transformation of the group-level covariance matrix (where, e.g., “red” and square):

It occurred to me today that you can formalize this as a (possibly illicit) affine transformation that transforms the covariance matrix as desired while leaving the perceptual distribution’s mean vector unchanged:

This is possibly illicit because it seems like it could cause some problems to include the mean in the transformation. But this is beside the main point.

The main point being that each individual subject has a transformation that scales the whole space () and scales the two dimensions with respect to one another (), and, together with the two decision bounds for subject , these produce a standard Gaussian GRT model. Hence, the model for subject can be transformed to ensure that decisional separability holds, and once decisional separability holds, the marginal variances and means of each distribution can be scaled and shifted so that all marginal variances are one. Put slightly differently, in addition to the scaling transformation described above, subject can also have rotation and shear transformations to enforce decisional separability and the unit-variance transformation to force all the action onto the means.

There’s no mathematical rationale for allowing each subject to have – and -based scaling but not rotations, shear transformations, and unit variances. Rotation, shear transformation, and scaling to unit variance all produce models that are empirically equivalent to the model pre-transformation. All of which is to say that for every GRT wIND model with failures of decisional separability, there is an empirically equivalent model in which decisional separability holds. Ergo, GRT wIND does not, by itself, solve the identifiability problems discussed in Silbert & Thomas (2013).

Of course, there may be *psychological* reasons for allowing one type of transformation and not another. In my second review of the more recent GRT wIND paper, I suggested one such line of reasoning. Specifically, I suggested that scaling transformations could be allowed while rotations and shear transformations were not by invoking the notion of primary perceptual dimensions. The basic argument along these lines would be that subjects share a common set of primary dimensions, and so may vary with respect to the overall salience of the stimulus set () and the salience of one dimension relative to the other (), but not with respect to how the perceptual distributions are aligned with the coordinate axes in perceptual space (rotations, shear transformations).

This approach doesn’t rule out re-scaling to enforce unit variances, and I strongly suspect that this is linked to the over-parameterization problem I (still) think the model has, but I’m going to leave this as an open issue for now, since I want to get back to the positive message I’m trying desperately to focus on (and, if I’m being honest, because I started working on trying to prove that this problem exists, and the elements of the scaled, rotated, and sheared covariance matrices were getting kind of ridiculous to work with).

To repeat the punchline from above: mathematical modeling allows us to formulate and ask incisive questions. In this case, it allows us to formulate and ask incisive questions about which kinds of transformations of perceptual space are consistent with human cognition and behavior. More to the point, it allows us to ask this about (absolute and relative) scaling, rotation, and shear transformations.

There has long been compelling evidence that scaling can provide a good account of at least some perceptual behavior (e.g., Nosofsky’s 1986 and 1987 papers on selective attention). There is also some evidence that rotation of at least some dimensions causes substantial changes in behavior (e.g., a 1990 paper by Melara & Marks). I’m not aware of any research on shear transformations of perceptual space, but given that scaling and rotation have been investigated with some success, shear transformation seem very likely to be amenable to empirical and theoretical inquiry.

In the end, then, mathematical analysis (of GRT, in this case) allows us to see where we might productively aim future efforts. It enables us to ask very specific questions about scaling, rotation, and shear transformations. Perhaps unfortunately, though, because of the empirical equivalence of the (rotation and shear) transformed and untransformed GRT models, it does not (yet) give us the tools to answer these questions.

Of course, if this case is at all illustrative of a general state of affairs, working out new mathematical tools to answer *these* questions seems very likely to allow us to formulate and ask any number of new, and as of yet unanticipated, incisive questions.

Lather, rinse, repeat, ad nauseam.

]]>