This is follow-up number two (here’s follow-up number one) to a post describing and discussing the Bayesian model presented in Feldman, et al. (2009) The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. The model is described in detail in the other posts (and, obviously, in the paper), so I’m not going to do so again here. If you care enough to read a(nother) long post about this model and the claims made about it, you should maybe go back to one of the older posts and re-acquaint yourself with the model and the perceptual magnet effect, which the model is explicitly intended to account for.

The core idea at the heart of this work is that the perceptual magnet effect is produced directly by optimal (Bayesian) computational inference. In this post, I’m going to discuss what it means for a perceptual model to be optimal, whether the Feldman, et al., model is computational rather than representational and algorithmic, and how these two issues are intertwined. In so doing, I’ll touch on some other models (perceptual, econometric, linguistic) and associated claims about optimality and levels of analysis.

For quite some time, I’ve seen references to “Marr’s levels of analysis,” but I hadn’t actually read anything substantive on the issue. So the next thing I want to do here is quote Feldman, et al., invoking Marr, quote Marr himself (from his 1982 book *Vision*), and talk about the distinction between the computational and representational-algorithmic levels. From the Feldman, et al., paper:

We take a novel approach to modeling the perceptual magnet effect, complementary to previous models that have explored how the effect might be algorithmically and neurally implemented. In the tradition of rational analysis proposed by Marr (1982) and J. R. Anderson (1990), we consider the abstract computational problem posed by speech perception and show that the perceptual magnet effect emerges as part of the optimal solution to this problem.

[p. 753]

Other models give explanations of how the effect might occur but do not address the question of why it should occur. Our rational model fills these gaps by providing a mathematical formalization of the perceptual magnet effect at Marr’s (1982) computational level, considering the goals of the computation and the logic by which these goals can be achieved.

[p. 757]

And here’s Marr on the computational and representational-algorithmic levels of analysis (I’m ignoring the hardware implementation level):

Its [the computational theory] important features are (1) that it contains separate arguments about what is computed and why and (2) that the resulting operation is defined uniquely by the constraints it has to satisfy. In the theory of visual processes, the underlying task is to reliably derive properties of the world from images of it… The second level of the analysis of a process, therefore, involves choosing two things: (1) a *representation* for the input and for the output of the process and (2) an *algorithm* by which the transformation may actually be accomplished.

[p. 23]

At one extreme, the top level, is the abstract computational theory of the device, in which the performance of the device is characterized as a mapping from one kind of information to another, the abstract properties of this mapping are defined precisely, and its appropriateness and adequacy for the task at hand are demonstrated. In the center is the choice of representation for the input and output and the algorithm to be used to transform one into the other.

[p. 24]

Marr illustrates the distinction with a discussion of addition and supermarket cash registers, both describing what cash registers do and why – they use arithmetic addition because commutativity, associativity, inverses, and adding zero correspond to various properties of cash register use (here’s Marr’s cash register discussion, if you’re interested in reading the whole thing). You could have a cash register that uses Roman numerals, base 2, or base 10, but it would still need to add numbers and obey these constraints.
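To make the cash register point concrete, here is a minimal sketch (mine, not Marr’s) of one computation carried out over two representations. The function name `add_digits` and the digit-string encoding are assumptions made purely for illustration. The algorithm is schoolbook carry addition, and only the base changes.

```python
def add_digits(a: str, b: str, base: int) -> str:
    """Schoolbook addition on digit strings in a given base (2 through 10)."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, out = 0, []
    # Work from the least significant digit, propagating the carry.
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, base)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return "".join(reversed(out)).lstrip("0") or "0"


# The same computational-level constraints hold in either representation.
for base, (x, y, zero) in {2: ("101", "11", "0"), 10: ("19", "7", "0")}.items():
    assert add_digits(x, y, base) == add_digits(y, x, base)  # commutativity
    assert add_digits(x, zero, base) == x                    # adding zero
```

The constraints checked at the bottom are the computational-level description. Everything inside `add_digits` is representation and algorithm.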

How does this correspond to the Bayesian perceptual magnet model? Here’s Feldman, et al., in the Theoretical Overview of the Model:

The goal of listeners, in perceiving a speech sound, is to recover the phonetic detail of a speaker’s target production. They infer this target production using the information that is available to them from the speech signal and their prior knowledge of phonetic categories….

Phonetic categories are defined in the model as distributions of speech sounds…. Because there are several factors that speakers might intend to convey, and each factor can cause small fluctuations in acoustics, we assume that the combination of these factors approximates a Gaussian, or normal, distribution. Phonetic categories in the model are thus Gaussian distributions of target speech sounds. Categories may differ in the location of their means, or prototypes, and in the amount of variability they allow. In addition, categories may differ in frequency, so that some phonetic categories are used more frequently in a language than others….

In the speech sound heard by listeners, the information about the target production is masked by various types of articulatory, acoustic, and perceptual noise. The combination of these noise factors is approximated through Gaussian noise, so that the speech sound heard is normally distributed around the speaker’s target production.

Formulated in this way, speech perception becomes a statistical inference problem. When listeners perceive a speech sound, they can assume it was generated by selecting a target production from a phonetic category and then generating a noisy speech sound based on the target production. Listeners hear the speech sound and know the structure and location of phonetic categories in their native language. Given this information, they need to infer the speaker’s target production.

[p. 757]

As far as I can tell, only the first paragraph of the Theoretical Overview deals with the computational level of analysis. The use of Gaussian distributions and statistical inference implies an input representation – real numbers – and the output “target productions” – also real numbers – are arrived at (algorithmically) by applying Bayes’ rule and calculating the expected value of the target given the stimulus (see the paper or my earlier model descriptions).
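To show how algorithmic this is, here is a minimal sketch of the calculation, assuming Gaussian categories that share one variance and Gaussian noise, as in the quoted overview. The function names and any parameter values you pass in are hypothetical, not taken from the paper. Within a category, the expected target is a variance-weighted average of the stimulus and the category mean. Across categories, those averages are weighted by each category’s posterior probability.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def expected_target(s, means, cat_var, noise_var, priors=None):
    """Posterior mean of the target given stimulus s (assumed equal category variances)."""
    priors = priors or [1 / len(means)] * len(means)
    # Marginally, the stimulus is normal around each category mean with
    # variance cat_var + noise_var.
    weights = [p * normal_pdf(s, m, cat_var + noise_var) for p, m in zip(priors, means)]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    # Within a category, the expected target shrinks the stimulus toward the mean.
    shrunk = [(cat_var * s + noise_var * m) / (cat_var + noise_var) for m in means]
    return sum(p * t for p, t in zip(posteriors, shrunk))
```

With category means at -2 and 2 and unit variances, a stimulus at 0.5 yields an expected target pulled toward the nearer mean. That pull is the magnet effect in miniature.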

The issue of optimality is, I think, intrinsically tied to the algorithmic level of analysis. The word optimal (or optimally) occurs 29 times in the Feldman, et al., paper (28 if we exclude the title), but it’s never defined. Okay, sure, “optimal” means “best,” but best at what? Minimizing incorrect classifications (or associated costs)? This is the definition of optimal that I’m familiar with, at least with respect to classification models.

Consider a basic signal detection model, for example, as illustrated below (and note that the following illustration of optimal classification is taken from pages 53-54 of Fukunaga’s Introduction to Statistical Pattern Recognition). A stimulus is mapped onto the real number line, producing, over the course of many presentations, normal perceptual distributions with density $p(x \mid c_1)$ for class 1 (the red curve below) and with density $p(x \mid c_2)$ for class 2 (the blue curve below). For any given observation $x$, the response rule that produces the minimal probability of error says to always choose the class with the highest posterior probability, calculated (using Bayes’ rule) as:

$$P(c_i \mid x) = \frac{p(x \mid c_i)\,P(c_i)}{\sum_j p(x \mid c_j)\,P(c_j)}$$

Assuming for the sake of simplicity that each category has the same prior probability, the boundary between the classes for this rule is indicated by the dashed vertical line at the crossover between the two distributions. Any perceptual effect to the left of this line is more likely to have come from class 1 than class 2, so the optimal response is “class 1.” And the probability of incorrectly saying “class 1” for a member of class 2 is the area under the blue curve to the left of the boundary, labeled A. Similarly, the probability of saying “class 2” for a member of class 1 is given by the area under the red curve to the right of the boundary, here given by B + C. Hence, the total error probability is A + B + C.

Now consider a different response rule, as illustrated by the dotted line to the right of the dashed line. With this rule, the probability of saying “class 2” to a class 1 input is C, the probability of saying “class 1” to a class 2 input is A + B + D, and the total error probability is A + B + C + D. Since D is positive, the probability of an error with this rule is larger than with the first rule, and since any rule other than the first produces some such positive D, the first rule is optimal.
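The comparison of the two rules can be checked numerically. This is a minimal sketch assuming two equal-variance normal distributions with the class 1 mean below the class 2 mean. The function names and the particular means used in the example are my own choices, not Fukunaga’s. The errors are weighted by the priors, which with equal priors just halves the summed areas and doesn’t change which criterion wins.

```python
import math


def normal_cdf(x, mean, sd):
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


def error_probability(criterion, mean1, mean2, sd, prior1=0.5):
    """Error rate when responding "class 1" left of the criterion (assumes mean1 < mean2)."""
    class1_called_2 = 1 - normal_cdf(criterion, mean1, sd)  # red area right of criterion
    class2_called_1 = normal_cdf(criterion, mean2, sd)      # blue area left of criterion
    return prior1 * class1_called_2 + (1 - prior1) * class2_called_1


crossover = error_probability(1.0, mean1=0.0, mean2=2.0, sd=1.0)  # dashed line
shifted = error_probability(1.5, mean1=0.0, mean2=2.0, sd=1.0)    # dotted line
assert crossover < shifted
```

Moving the criterion in either direction away from the crossover raises the error rate, which is the sense of “optimal” at issue here.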

It’s straightforward to incorporate differences in each category’s prior probability and in the costs of errors and benefits of correct responses. In the simple case depicted above, differences in priors and cost-benefit ratios will produce shifts in the optimal criterion.
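For the equal-variance case, the size of the shift can be written down directly. This is a sketch in my own notation, under some simplifying assumptions: class means $\mu_1 < \mu_2$, common variance $\sigma^2$, priors $P_1$ and $P_2$, and $C_i$ the cost of misclassifying a member of class $i$, with correct responses costing nothing. The criterion that minimizes expected cost sits where $P_1 C_1\, p(x \mid c_1) = P_2 C_2\, p(x \mid c_2)$. Taking logs and solving for $x$ gives:

$$x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_2 - \mu_1}\,\ln\frac{P_1 C_1}{P_2 C_2}$$

With equal priors and equal costs, the log term vanishes and the criterion sits at the midpoint between the means. Making class 1 more probable, or more costly to miss, pushes the criterion toward class 2, so that “class 1” responses become more frequent.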

What does all this have to do with the Bayesian perceptual magnet model? All but one mention of optimality in the article simply asserts that the model is, in fact, optimal. In footnote 1, though, they say (about the expected value of the target given the stimulus) that “The expectation is optimal if the penalty for misidentifying a stimulus increases with squared distance from the target.” There’s a response rule implicit in this statement, one that allows for calculation of error probability and cost. But unfortunately (from my perspective), they don’t go into any more detail than this.
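The footnote’s claim is easy to illustrate, and so is its dependence on the choice of penalty. This is a minimal sketch using a made-up three-point posterior over targets. The values, probabilities, and function names are all hypothetical. Under squared-error loss the best estimate is the posterior mean, while under absolute-error loss it is the posterior median.

```python
def expected_loss(estimate, values, probs, loss):
    return sum(p * loss(v - estimate) for v, p in zip(values, probs))


def best_estimate(values, probs, loss, candidates):
    return min(candidates, key=lambda a: expected_loss(a, values, probs, loss))


values = [0.0, 1.0, 4.0]   # hypothetical candidate targets
probs = [0.6, 0.2, 0.2]    # hypothetical posterior probabilities
candidates = [i / 100 for i in range(401)]

posterior_mean = sum(v * p for v, p in zip(values, probs))
best_squared = best_estimate(values, probs, lambda d: d * d, candidates)
best_absolute = best_estimate(values, probs, abs, candidates)

assert abs(best_squared - posterior_mean) < 0.01  # squared loss picks the mean
assert best_absolute < best_squared               # absolute loss picks the median
```

The same posterior yields different “optimal” estimates under different penalties, so a claim of optimality needs its loss function stated.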

Also unfortunately, no one runs experiments in which people actually identify “target productions.” They elicit similarity judgments or discrimination rates between items, or they get categorization responses across items. Which is to say that the target identification of the model is never tested directly. Relative distances between modeled target percepts are compared to scaled similarity judgments from earlier experimental work, and discrimination rates are modeled by incorporating some additional structure in the model. In both cases, additional structure is needed to compare the model to data – a linear model in the former, and a “sameness” threshold in the latter. (A bit more on this below.)

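In order to model categorization rates, though, the model doesn’t require any additional structure. It turns out a logistic function for the probability of “category 1” responses falls out directly (see Appendix B in the article for details, and see either the article or my earlier posts for a full explanation of all the parameters). For two categories with means $\mu_{c_1}$ and $\mu_{c_2}$, equal variance $\sigma_c^2$, equal prior probability, and noise variance $\sigma_S^2$, it takes the form:

$$p(c_1 \mid S) = \frac{1}{1 + e^{-gS + b}}, \qquad g = \frac{\mu_{c_1} - \mu_{c_2}}{\sigma_c^2 + \sigma_S^2}, \qquad b = \frac{\mu_{c_1}^2 - \mu_{c_2}^2}{2(\sigma_c^2 + \sigma_S^2)}$$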
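As a sanity check on the “falls out directly” claim, here is a minimal sketch that compares the Bayes’ rule posterior for two Gaussian categories against a logistic function of the stimulus. It assumes equal priors and one shared total variance (category variance plus noise variance). The function names and the gain and bias symbols `g` and `b` are my labels for the sketch.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior_c1(s, mu1, mu2, total_var):
    """Posterior probability of category 1 from Bayes' rule, equal priors."""
    like1 = normal_pdf(s, mu1, total_var)
    like2 = normal_pdf(s, mu2, total_var)
    return like1 / (like1 + like2)


def logistic_c1(s, mu1, mu2, total_var):
    """The same probability written as a logistic function of s."""
    g = (mu1 - mu2) / total_var                # gain: slope of the log odds in s
    b = (mu1 ** 2 - mu2 ** 2) / (2 * total_var)  # bias: sets the 50% point
    return 1 / (1 + math.exp(-g * s + b))


for s in (-3.0, -0.5, 0.0, 1.2, 4.0):
    assert abs(posterior_c1(s, -1.0, 2.0, 1.5) - logistic_c1(s, -1.0, 2.0, 1.5)) < 1e-12
```

The two agree because the log likelihood ratio of two equal-variance Gaussians is linear in the stimulus.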

This is interesting to me primarily because of something I read about random utility econometric models a few months ago. It turns out that a logistic function describing choice probabilities is very closely tied to “utility maximization,” which is, unless I’m mistaken, and with some differences in terminology, equivalent to the optimality I described above. That is, it requires specific assumptions about representation and algorithm – a mathematical definition of random utility and a method for generating utility maximizing responses. In the logit choice model, utility is distributed as an extreme value random variable, and, in any given choice situation, the option with the highest utility is (deterministically) chosen. (If you’re interested, you can read a very thorough description of the logit choice model in Kenneth Train’s book Discrete Choice Methods with Simulation, which is available in its entirety online on his website.)
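The link between extreme value noise and logistic choice probabilities can be seen in a small simulation. This is a minimal sketch of the two-option case, not Train’s code. The function names, the seed, and the utility values are hypothetical. Each option gets a fixed utility plus independent standard Gumbel (type I extreme value) noise, and the option with the higher total is always chosen.

```python
import math
import random


def simulate_choice_rate(v1, v2, n_trials=100_000, seed=0):
    """Fraction of trials on which option 1 has the higher noisy utility."""
    rng = random.Random(seed)

    def gumbel():
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        return -math.log(-math.log(u))

    wins = sum(1 for _ in range(n_trials) if v1 + gumbel() > v2 + gumbel())
    return wins / n_trials


def logit_probability(v1, v2):
    """Closed-form logit choice probability for option 1."""
    return 1 / (1 + math.exp(-(v1 - v2)))
```

Comparing `simulate_choice_rate(1.0, 0.0)` with `logit_probability(1.0, 0.0)` should show close agreement up to sampling error. The responses are deterministic given the noisy utilities, and the logistic form comes entirely from the assumed noise distribution.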

A thorough explication of the mapping between the Bayesian perceptual magnet model and the logit choice random utility model is well beyond the scope of this blog post. My point in bringing it up at all is to further illustrate that if we want to make claims about optimality, we need to say what it means to be optimal and how a particular model meets whatever criteria are necessary for establishing this optimality. Maybe it’s just a lack of imagination on my part, but I can’t see a way to make substantive claims about optimality without also making substantive claims about representation and algorithm.

With respect to the optimality of target identification (i.e., with respect to footnote 1), these claims aren’t entirely clear. On the other hand, there are clear substantive algorithmic claims in the structure added to the model in order to account for discrimination data, but these aren’t accompanied by any claims about optimality (or proof thereof).

So where does this leave us? The Feldman, et al., model is interesting (I wouldn’t have spent this much time thinking and writing about it if I didn’t think so), but I don’t think it’s computational (in Marr’s sense), and whether or not it’s optimal is a bit unclear. If the model is, as it seems to be, mathematically equivalent to (a special case of) the logit random utility model, then it’s optimal with respect to noise that is extreme value distributed (not normally distributed) and with respect to categorization responses (not target identifications). This might be consistent with optimality defined with respect to normally distributed noise and target identifications, as well, but repeated assertions of optimality don’t establish this one way or the other. To be fair, Feldman, et al., aren’t the only people who make claims of optimality without backing them up. To pick another example that I’m familiar with, Smits’ 2001 paper on hierarchical categorization also contains repeated claims of optimality with no rigorous definition or exhibition that the model meets the (absent) definition (and, it’s worth saying explicitly, I think Smits’ model is *also* interesting and worth thinking hard about).

I don’t think I’m going to post on this model anymore, but I wanted to address one more issue that I said I planned to address in my first post on this topic, namely the issue of how robust the perceptual magnet effect is. Feldman, et al., cite a fairly large number of papers across a variety of psychological domains that apparently report the perceptual magnet effect, so it seems likely to be pretty robust. But there’s at least one example of some work in speech perception that suggests that the effect isn’t as robust as we might like it to be. If a model necessarily produces a magnet effect, can it account for data in which the effect isn’t there? If not, that’s a problem.

As a brief final aside, it’s perplexing to me that, on the one hand, Feldman, et al., could repeatedly describe Bayesian statistical inference as optimal while, on the other, using non-Bayesian statistics. *None* of the empirical evaluations of the model presented in the paper use Bayesian methods. So, even if I grant for the sake of argument that the perceptual model is optimal, why should I believe that this is a good model of actual human perceptual behavior if all of the statistical evidence marshaled in its favor is non-Bayesian and (presumably) therefore non-optimal?

This is a really interesting set of issues. On the different ‘levels’, and how Bayesianism fits into them: I think the idea is supposed to be that constraints and goals are specified at the computational level (i.e., the first paragraph of Feldman quoted above, plus the footnote specifying the cost function). (Importantly, for reasons that don’t fit well in the Marr approach, the goals and constraints can’t include memory or time limitations.) Nothing else is computational, exactly. The “computational” claim, then, that I take to be common in Bayesian approaches goes like this:

1) Over the space of all possible algorithms (for some inference task), the one that maximizes the satisfaction of the goals subject to the constraints employs Bayesian inference. More precisely, the set of maximizing algorithms includes at least one Bayesian one, since different algorithms may fare equally well.

2) People are pretty good at [task x], so the actual algorithm they use must be pretty close to some Bayesian one.

In this context, what’s odd is that Feldman et al. seem to argue not that some Bayesian model is optimal, but that that particular one is. Of course, you could wrap the second and third paragraphs of the article quoted above into the specification of the problem, rather than considering them as representations in the solution. In that case, you really can call it computational, at the possible cost of evaluating a problem that isn’t the real problem in phonetic categorization.

In any event, I don’t think the “computational level” claim is that Bayesian algorithms aren’t algorithms (but somehow computations in a void). You could probably better label it a “meta-algorithmic” claim–Bayesian optimality is a claim about the space of possible algorithms, not about some particular algorithm–hence it’s at least kinda computational.

You can pick on this sort of claim in lots of ways. My favorite right now is to call 1 vacuous, because the space of possible algorithms coincides with the space of Bayesian algorithms. That is not to say that all algorithms are Bayesian, but rather that for any algorithm, there is a Bayesian algorithm that produces the same behaviors (and hence is equally “optimal”). Similarly, for any suboptimal algorithm, there is a Bayesian model exactly that sucky.

This doesn’t say anything about whether Feldman’s model is a good model, as such, of phonetic categorization or the perceptual magnet phenomenon; rather that, more generally, no matter what the computational-level constraints (leaving aside constraints limiting space and time), every algorithm has a Bayesian analogue, so claiming optimality for a model or model class in virtue of a Bayesian algorithm is false.