Optimal Bayesian perceptual magnets, Part 2

I declared my intention to write an unspecified number of follow-ups to my post on the model described in Feldman et al. (2009), The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. This here is follow-up number one. In this post, I will discuss exactly how the model accounts for the perceptual magnet effect.

Recall that the perceptual magnet effect is a warping of perceptual space, the primary indicator of which is a reduced ability to discriminate pairs of stimuli near a category center relative to more peripheral pairs. Recall, too, the proposed (optimal, computational) model that “formalizes the idea of a perceptual magnet,” a brief summary of which is as follows.

Given category c, an intended phonetic target T is distributed as a normal random variable with mean \mu_c and variance \sigma_c^2:

(1)   \begin{equation*}T|c \sim \mathcal{N}\left(\mu_c,\sigma_c^2\right)\end{equation*}

Given intended target T, a stimulus S is distributed as a normal random variable with mean T and variance \sigma_s^2:

(2)   \begin{equation*}S|T \sim \mathcal{N}\left(T,\sigma_s^2\right)\end{equation*}
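To make the generative story concrete, here is a minimal R sketch of Equations 1 and 2, using some arbitrary parameter values of my own (not the paper's):

# Sketch of the generative model in Equations 1 and 2 (arbitrary parameter values)
set.seed(1)
mu_c <- 3; vc <- 1.5; vs <- 1                 # category mean, sigma_c^2, sigma_s^2
targ <- rnorm(1, mean = mu_c, sd = sqrt(vc))  # intended target T, Equation 1
stim <- rnorm(1, mean = targ, sd = sqrt(vs))  # observed stimulus S, Equation 2
c(T = targ, S = stim)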

By assumption, listeners are “trying to infer the target T given stimulus S and category c, so they must calculate p(T|S,c),” which gives another normal distribution, this time with mean \mu_{T|S,c}:

(3)   \begin{equation*} \mu_{T|S,c} = \frac{\sigma_c^2 S + \sigma_s^2 \mu_c}{\sigma_c^2 + \sigma_s^2}\end{equation*}

and variance \sigma_{T|S,c}^2:

(4)   \begin{equation*}\sigma_{T|S,c}^2 = \frac{\sigma_c^2\sigma_s^2}{\sigma_c^2 + \sigma_s^2}\end{equation*}

or, as stated in the paper and in my previous post:

(5)   \begin{equation*}T|S,c \sim \mathcal{N}\left( \frac{\sigma_c^2 S + \sigma_s^2 \mu_c}{\sigma_c^2 + \sigma_s^2}, \frac{\sigma_c^2\sigma_s^2}{\sigma_c^2 + \sigma_s^2} \right)\end{equation*}
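As a quick numerical illustration of Equations 3 and 4 (again, my own arbitrary values, not the paper's), the posterior mean sits between the stimulus and the category mean, and the posterior variance is smaller than either component variance:

# Posterior mean and variance of T|S,c (Equations 3 and 4), arbitrary values
mu_c <- 3; vc <- 1.5; vs <- 1              # category mean, sigma_c^2, sigma_s^2
S0 <- 5                                    # an observed stimulus
post_mean <- (vc*S0 + vs*mu_c)/(vc + vs)   # 4.2: pulled from S0 = 5 toward mu_c = 3
post_var  <- (vc*vs)/(vc + vs)             # 0.6: smaller than both vc and vs
c(post_mean, post_var)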

The perceptual magnet is formalized by virtue of the fact that “[t]he term \mu_c pulls the perception of stimuli toward the category center, effectively shrinking perceptual space around the category.” It turns out that this is wrong in an interesting way, which is evident in the paper itself and which I will explain below. But, in order to explain this, we need to review the extension to multiple categories, which the model handles with Bayes’ rule, giving us p(c|S):

(6)   \begin{equation*} p(c|S) = \frac{p(S|c)p(c)}{\sum_c p(S|c)p(c)}\end{equation*}

Here, p(S|c) is obtained by marginalizing over all possible intended target productions:

(7)   \begin{equation*}S|c \sim \mathcal{N}\left(\mu_c,\sigma_c^2 + \sigma_s^2\right)\end{equation*}

And p(c) is the prior probability of encountering category c. Taking a weighted sum over categories, we then get an expression for p(T|S):

(8)   \begin{equation*}p(T|S) = \sum_c p(T|S,c)p(c|S) \end{equation*}

Finally, we have an expression for the expected value of T|S:

(9)   \begin{align*} E[T|S] &= \sum_c \left(\frac{\sigma_c^2 S + \sigma_s^2 \mu_c}{\sigma_c^2 + \sigma_s^2}\right)p(c|S)\\ &= \frac{\sigma_c^2}{\sigma_c^2 + \sigma_s^2}S + \frac{\sigma_s^2}{\sigma_c^2 + \sigma_s^2}\sum_c \mu_c p(c|S)\\ &= \frac{\sigma_c^2}{\sigma_c^2 + \sigma_s^2}S + \frac{\sigma_s^2}{\sigma_c^2 + \sigma_s^2}\mu_1 p(c_1|S) + \frac{\sigma_s^2}{\sigma_c^2 + \sigma_s^2}\mu_2 p(c_2|S)\end{align*}

Here, the first factor in each summand is the expected value of T|S,c (i.e., that which is perceived, given a stimulus and category) and the second factor is the posterior probability of category c given stimulus S. Note that the move from the first line to the second makes use of the assumption that the category variance \sigma_c^2 is the same across all categories, and the move from the second line to the third relies on the assumption that there are only two categories.
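If you want to double-check that algebra, here is a small R sanity check of my own (not from the paper) that the first and last lines of Equation 9 agree when both categories share the same \sigma_c^2:

# Numerical check that line one of Equation 9 equals line three (two categories,
# equal category variances); all values here are arbitrary.
vc <- 1.5; vs <- 1                     # sigma_c^2 and sigma_s^2
mu <- c(3, 7); pcS <- c(0.8, 0.2)      # category means and an arbitrary p(c|S)
S0 <- 4.2
line1 <- sum(((vc*S0 + vs*mu)/(vc + vs)) * pcS)
line3 <- (vc/(vc + vs))*S0 + (vs/(vc + vs))*(mu[1]*pcS[1] + mu[2]*pcS[2])
all.equal(line1, line3)                # TRUE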

Note, too, that with just category 1, there is no \mu_2 and p(c_1|S) is constant (and equal to 1), which is to say that, in the one-category case, E[T|S] is a linear function of S. Mutatis mutandis, the same holds if you’ve only got category 2.
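Concretely, with only category 1 in play, p(c_1|S) = 1 and Equation 9 collapses to

\begin{equation*}E[T|S] = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_s^2}S + \frac{\sigma_s^2}{\sigma_c^2 + \sigma_s^2}\mu_1\end{equation*}

which is linear in S with slope \sigma_c^2/(\sigma_c^2 + \sigma_s^2) < 1 and an intercept proportional to \mu_1.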

Separate illustrations of various parts of the model will make it clear how this accounts for the perceptual magnet effect. More specifically, I will show how the various parts of Equations 6 and 9 contribute to produce variation in perceptual effects across a range of input stimulus values. Using R (and RStudio, which is fantastic), I picked the following arbitrary parameter values: \mu_1 = 3, \mu_2 = 7, \sigma_c^2 = 1.5, \sigma_s^2 = 1, and p(c_1) = p(c_2) = 0.5.

I then calculated p(S|c) for values of S ranging from 0 to 10 (all the code for the following figures is given at the end of the post), with the solid blue line corresponding to category 1 and the broken red line to category 2:

[Figure: p(S|c) as a function of S]

Technically, it should be f(S|c), since it is a (Gaussian) density function, not a probability mass function, but I’ll follow the notation in the paper, since there’s no real ambiguity to confuse matters. In any case, the likelihood of a given stimulus is maximized at the mean of the normal density, dropping off to either side in that familiar bell-curve way.

Multiplying by p(c) and dividing by the sum of the two p(S|c)p(c) terms for each value of S gives us p(c|S) from Equation 6 above (blue for category 1, red for 2):

[Figure: p(c|S) as a function of S]

A stimulus way out to the left is overwhelmingly likely to be from category 1, a stimulus way out to the right is overwhelmingly likely to be from category 2, and stimuli between \mu_1 and \mu_2 are, more or less, equally likely to have come from either category. As mentioned above, p(c|S) (for each category) is a non-linear function of S only because there is more than one category to worry about.
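As a quick spot check of Equation 6 (assuming the parameter values above), here is a snippet of my own confirming that the two posteriors cross at the midpoint between the means:

# p(c|S) at the midpoint S = 5, given the parameter values above
S0   <- 5
lik  <- dnorm(S0, mean = c(3, 7), sd = sqrt(1.5 + 1))  # p(S|c), Equation 7
post <- lik*c(0.5, 0.5)/sum(lik*c(0.5, 0.5))           # p(c|S), Equation 6
post                                                    # 0.5 0.5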

Okay, so plug the parameter values and the p(c|S) functions given above into Equation 9, and we get the expected perceptual effect corresponding to each possible stimulus:

[Figure: E(T|S) as a function of S]

The solid blue line indicates the case in which only category 1 exists, the broken red line indicates the case in which only category 2 exists, and the dash-dot purple line the case with both category 1 and category 2. In case you haven’t read the original paper, this is a replication of that paper’s Figure 5 (i.e., the authors are aware that E[T|S] is linear in the one-category cases).

To repeat, yet again, a point made (repeatedly) above, the perceptual effects are linear in S when there’s only one category and non-linear in S only when there are two (or more) categories. To elaborate a bit, the non-linearity is most extreme in between the two means (i.e., where the two p(c|S) terms are most similar in value), with the E[T|S] function for two categories sitting very close to the one-category cases near (and beyond) the category means.

All of which is to say that, contrary to the statement in the paper, my Equation 3 (and their Equation 7) does not “formalize the idea of a perceptual magnet,” and no \mu_c term “pulls the perception of stimuli toward the category center.”

All the (non-linear) work here is being done by the p(c_1|S) and p(c_2|S) terms in Equation 9. Perceptual space is most warped in the region between each category mean and the crossover of the p(c|S) functions, as illustrated in the middle panel of the original paper’s Figure 6, which shows E[T|S] - S as a function of S. (Note, too, that in the bottom panel of this figure, the relative perceptual distances are constant in the one-category case – there is no warping of perceptual space.) Given all this, I’m not sure where the top panel of their Figure 2 comes from, but whatever.
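If you want to see the displacement directly, a couple of extra lines tacked onto the code at the end of this post will sketch something like the middle panel of their Figure 6 (this is my own addition, not the paper’s code or figure):

# Displacement E[T|S] - S for the two-category case; assumes S and ETgS
# from the code at the end of this post have already been computed.
plot(S, ETgS[3,] - S, type = "l", xlab = "S", ylab = "E(T|S) - S")
abline(h = 0, lty = 2)   # zero-displacement reference line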

So where does this leave us, given that I haven’t said anything particularly novel? The model does, in fact, produce a non-linear mapping from stimulus space onto perceptual space (assuming T space is the relevant perceptual space). I wanted to work through the model carefully to make sure I understand how it predicts the perceptual magnet effect. In doing so, I found that the presentation in the paper is suboptimal in that the claim is made, erroneously, that the perceptual magnet effect is predicted in the presence of only one category. The extensive elaboration of the model later in the paper makes it clear that this isn’t the case (and that the authors knew this, at least on some level).

There’s also the fact that all of the above requires you to accept that T is the proper percept. As I said in the earlier post, it’s not clear to me that T (and not c) is all that a listener is trying to infer from the presentation of a stimulus S.

In the next post or two, I’ll focus on this issue, how reasonable the distributional assumptions about T are, whether or not this model is optimal, whether or not it’s truly computational (and not algorithmic), and how it relates to some other, mathematically related models.

R Code for the figures above:

# Plotting parameters: line widths, line types (solid, dashed, dot-dash), colors
lw = 3
lt = c(1,2,4)
clr = c("blue2","red3","purple2")

# Model parameters: stimulus grid, sigma_s^2 (vs), sigma_c^2 (vc), category means
S = seq(0,10,.01)
nS = length(S)
vs = 1
vc = 1.5
vsum = vs+vc
muc = c(3,7)

# p(S|c) for each category (Equation 7): Normal(mu_c, sigma_c^2 + sigma_s^2)
pSgc = rbind(dnorm(S,muc[1],sqrt(vsum)),dnorm(S,muc[2],sqrt(vsum)))

# p(c|S) via Bayes' rule (Equation 6), with equal priors p(c) = 0.5
pc = c(.5,.5)
pcgS = matrix(nrow=2,ncol=nS)
pcgS[1,] = pSgc[1,]*pc[1]
pcgS[2,] = pSgc[2,]*pc[2]
for(i in 1:nS){
  den = sum(pcgS[,i])
  pcgS[1,i] = pcgS[1,i]/den
  pcgS[2,i] = pcgS[2,i]/den
}

# E[T|S] (Equation 9): rows 1 and 2 are the one-category cases, row 3 uses both
ETgS = matrix(nrow=3,ncol=nS)
for(i in 1:nS){
  ETgS[1,i] = (vc*S[i])/vsum + (vs/vsum)*muc[1]
  ETgS[2,i] = (vc*S[i])/vsum + (vs/vsum)*muc[2]
  ETgS[3,i] = (vc*S[i])/vsum + (vs/vsum)*(pcgS[1,i]*muc[1] + pcgS[2,i]*muc[2])
}

png(file="pSgc.png",width=750,height=750)
plot(S,pSgc[1,],type="l",axes=F,xlab="",ylab="",lwd=lw,col=clr[1],lty=lt[1])
lines(S,pSgc[2,],lty=lt[2],lwd=lw,col=clr[2])
mtext("S",side=1,line=2,cex=2.5)
mtext("p(S|c)",side=2,line=1.75,cex=2.5)
axis(1,at=c(0,muc,10),tick=T,
     labels=c("",expression(mu[1]),expression(mu[2]),""),cex.axis=2.5,padj=1)
dev.off()

png(file="pcgS.png",width=750,height=750)
plot(S,pcgS[1,],type="l",axes=F,xlab="",ylab="",lwd=lw,ylim=c(0,1),lty=lt[1],col=clr[1])
lines(S,pcgS[2,],lty=lt[2],lwd=lw,col=clr[2])
mtext("S",side=1,line=2,cex=2.5)
mtext("p(c|S)",side=2,line=2,cex=2.5)
axis(1,at=c(0,muc,10),tick=T,
     labels=c("",expression(mu[1]),expression(mu[2]),""),cex.axis=2.5,padj=1)
axis(2,at=c(0,1),tick=T,cex.axis=2.5)
dev.off()

png(file="ETgS.png",width=750,height=750)
plot(S,ETgS[1,],type="l",lty=lt[1],axes=F,xlim=c(1,9),
     ylim=c(2,8),lwd=lw,xlab="",ylab="",col=clr[1])
lines(S,ETgS[2,],lty=lt[2],lwd=lw,col=clr[2])
lines(S,ETgS[3,],lty=lt[3],lwd=lw,col=clr[3])
mtext("S",side=1,line=1.75,cex=2.5)
mtext("E(T|S)",side=2,line=2,cex=2.5)
axis(1,at=c(1,muc,9),tick=T,
     labels=c("",expression(mu[1]),expression(mu[2]),""),cex.axis=2.5,padj=1)
axis(2,at=c(2,4,6,8),tick=T,labels=F,cex.axis=2.5)
dev.off()