For my Current Topics in Communication Sciences course, we read So & Best’s 2010 paper on non-native tone perception. I won’t go much into the paper’s implications for language or speech, though these are interesting and worth thinking about. Rather, I want to focus (as is my wont) on data analysis and illustrating some models and data analysis tools that I wish were used more often in this kind of research.
The data analysis in So & Best is not so good. The data consist of confusion matrices for Cantonese, Japanese, and English native listeners’ identification of Mandarin tones. The first part of the analysis focuses on ‘tone sensitivity’ and uses A’, a non-parametric measure of perceptual sensitivity, which I assume is (something like) the A’ defined by Grier [pdf] (So & Best cite a textbook rather than a paper, so I don’t know for sure how they calculated A’).
It’s probably good that they use A’ rather than d’, given that, for each tone, they lump all three incorrect responses together, which violates pretty much all of the assumptions underlying Gaussian signal detection theory (though, not surprisingly, A’ has downsides of its own). But even if using A’ sidesteps one set of assumptions, A’ values violate pretty much all of the assumptions underlying ANOVA, as do the confusion counts they (also) crank through the ANOVA machine.
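For reference, since So & Best don’t say exactly which formula they used, here’s what Grier’s A’ looks like if that’s what they computed (a guess on my part, not something the paper confirms):

```python
def a_prime(hit_rate, fa_rate):
    """Grier's (1971) non-parametric sensitivity measure A'.

    Uses the standard formula when hits >= false alarms, and the
    symmetric below-chance variant otherwise.
    """
    h, f = hit_rate, fa_rate
    if h >= f:
        return 0.5 + (h - f) * (1 + h - f) / (4 * h * (1 - f))
    return 0.5 - (f - h) * (1 + f - h) / (4 * f * (1 - h))

print(a_prime(0.8, 0.2))  # well above chance
print(a_prime(0.5, 0.5))  # chance performance gives exactly 0.5
```

Note that A’ is bounded between 0 and 1, with 0.5 indicating chance performance, which is part of why cranking it through ANOVA is dubious.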
Which brings me to my point, namely that there are better statistical tools for analyzing confusion matrices. Some such tools are pretty standard, plug-and-play models like log-linear analysis (a.k.a. multiway frequency analysis). Others are less standard and perhaps less easy to use, but they are far superior with respect to providing insight and understanding of the patterns in the data.
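To make the plug-and-play option concrete, here’s a minimal sketch of what a log-linear analysis involves: fit a model of the expected cell counts (here, the simplest one, stimulus and response main effects only) by iterative proportional fitting, and test it against the observed counts with a G² statistic. The counts below are made up for illustration; they are not So & Best’s data, and a real analysis would fit richer models (e.g., with listener-group terms).

```python
import numpy as np

def fit_independence(counts, n_iter=50):
    """Fit the stimulus x response independence log-linear model by
    iterative proportional fitting (IPF). For a two-way table IPF
    converges in one cycle; more iterations matter for higher-way tables."""
    fitted = np.ones_like(counts, dtype=float)
    for _ in range(n_iter):
        # match row margins, then column margins
        fitted *= (counts.sum(axis=1) / fitted.sum(axis=1))[:, None]
        fitted *= counts.sum(axis=0) / fitted.sum(axis=0)
    return fitted

def g_squared(observed, expected):
    """Likelihood-ratio statistic G^2 = 2 * sum(obs * log(obs / exp))."""
    mask = observed > 0
    return 2 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

# made-up confusion counts (rows = stimuli, cols = responses)
counts = np.array([[40, 5, 3, 2],
                   [4, 35, 8, 3],
                   [2, 10, 30, 8],
                   [6, 2, 5, 37]], dtype=float)
expected = fit_independence(counts)
print(g_squared(counts, expected))  # compare to chi-square with (4-1)*(4-1) = 9 df
```

A big G² here just says stimulus and response are associated, which for a confusion matrix is a given; the interesting comparisons come from testing more structured models against one another.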
I’ve written about the Similarity Choice Model (SCM) before; as I wrote in the linked post:
In the SCM, the probability of giving response $r$ to stimulus $s$ is

$$p(r \mid s) = \frac{\beta_r\, \eta_{sr}}{\sum_{k=1}^{N} \beta_k\, \eta_{sk}}$$

where $\beta_r$ is the bias to give response $r$, $\eta_{sr}$ is the similarity between stimulus $s$ and stimulus $r$, and $N$ is the number of stimuli.
You might want to use this model because it has convenient, closed-form solutions for the similarity and bias parameters:

$$\hat{\eta}_{sr} = \sqrt{\frac{p_{sr}\, p_{rs}}{p_{ss}\, p_{rr}}}, \qquad \hat{\beta}_r \propto \left[ \sum_{k=1}^{N} \sqrt{\frac{p_{rk}\, p_{kk}}{p_{kr}\, p_{rr}}} \right]^{-1}$$

where $p_{sr}$ is the observed proportion of $r$ responses to stimulus $s$, and the $\hat{\beta}_r$ are normalized to sum to one.
To illustrate the ease and utility of using the SCM, I estimated parameters for the confusion matrices reported by So & Best (in R and Python – note that the Python code is in a .txt file, not a .py file, since my website host seems to think I’m up to no good if I try to use the latter).
Here’s the data file assumed by the Python script (the matrices are hard-coded in the R script), if you want to play along at home. The first four rows contain the confusion matrix for Cantonese listeners, the next four for Japanese listeners, and the last four for English listeners.
Both the R and Python scripts do essentially the same thing. I’ve been using Python more than R lately for various reasons, so that’s what I’ll focus on here.
I wrote a function that adjusts for any zeros in the confusion matrices and then calculates response bias, similarities, and distances ($d_{sr} = \sqrt{-\ln \hat{\eta}_{sr}}$) for an input matrix:
import numpy as np

def scm_parameters(Mt):
    """Closed-form SCM estimates from a confusion matrix of counts.

    Returns similarities St, distances Dt, and response biases Bt.
    """
    n = Mt.shape[0]
    # zeros are bad: add a small constant, then convert counts to row proportions
    Mt = Mt + .01
    for ri in range(n):
        Mt[ri, :] = Mt[ri, :] / np.sum(Mt[ri, :])
    # initialize similarity matrix
    St = np.zeros((n, n))
    # calculate similarities
    for ri in range(n):
        for ci in range(n):
            St[ri, ci] = np.sqrt(Mt[ri, ci] * Mt[ci, ri] / (Mt[ri, ri] * Mt[ci, ci]))
    # convert similarities to distances (d = sqrt(-ln(s)))
    Dt = np.abs(np.sqrt(-np.log(St)))
    # calculate response biases, normalized to sum to one
    Bt = np.zeros(n)
    for ri in range(n):
        Bk = np.zeros(n)
        for ki in range(n):
            Bk[ki] = np.sqrt(Mt[ri, ki] * Mt[ki, ki] / (Mt[ki, ri] * Mt[ri, ri]))
        Bt[ri] = 1 / np.sum(Bk)
    Bt = Bt / np.sum(Bt)
    return St, Dt, Bt
I also wrote a function for calculating predicted confusion probabilities for a given set of parameters so that I could see how well the model fits the data:
def scm_predict(s, b):
    """Predicted confusion probabilities for similarity matrix s and bias vector b."""
    nr = len(b)
    Mp = np.zeros((nr, nr))
    for ri in range(nr):
        for ci in range(nr):
            Mp[ri, ci] = b[ci] * s[ri, ci]
        # normalize each row so the response probabilities sum to one
        Mp[ri, :] = Mp[ri, :] / np.sum(Mp[ri, :])
    return Mp
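Neither script does this, but a nice sanity check on the closed-form estimators is a round trip: generate confusion probabilities from known (made-up) parameters, then confirm the estimates recover them exactly. A compact, vectorized version of the same formulas:

```python
import numpy as np

# hypothetical "true" SCM parameters for a 4-alternative task
eta = np.array([[1.0, 0.2, 0.1, 0.6],
                [0.2, 1.0, 0.7, 0.1],
                [0.1, 0.7, 1.0, 0.2],
                [0.6, 0.1, 0.2, 1.0]])
beta = np.array([0.3, 0.3, 0.2, 0.2])

# forward model: p(r|s) proportional to beta_r * eta_sr, rows normalized
P = beta * eta
P = P / P.sum(axis=1, keepdims=True)

# closed-form similarity estimates: sqrt(p_sr * p_rs / (p_ss * p_rr))
eta_hat = np.sqrt(P * P.T / np.outer(np.diag(P), np.diag(P)))

# closed-form bias estimates, normalized to sum to one
ratio = np.sqrt(P * np.diag(P) / (P.T * np.diag(P)[:, None]))
beta_hat = 1.0 / ratio.sum(axis=1)
beta_hat = beta_hat / beta_hat.sum()

print(np.allclose(eta_hat, eta), np.allclose(beta_hat, beta))  # True True
```

The biases divide out of the similarity estimator and the similarities divide out of the bias estimator, which is why the recovery is exact when the data are noise-free.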
You can look at either script to see how to use the estimated similarities/distances to fit hierarchical clustering or MDS models.
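For the impatient, here’s the gist of both steps in miniature: scipy’s average-linkage clustering on the estimated distances, plus a hand-rolled classical (Torgerson) MDS. The distance matrix below is invented, and the actual scripts may do this somewhat differently:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# made-up symmetric distance matrix for four tones
D = np.array([[0.0, 1.2, 1.1, 0.4],
              [1.2, 0.0, 0.5, 1.3],
              [1.1, 0.5, 0.0, 1.2],
              [0.4, 1.3, 1.2, 0.0]])

# hierarchical clustering on the condensed (upper-triangle) distances
Z = linkage(squareform(D), method='average')

# classical MDS: double-center the squared distances, then take the
# top two eigenvectors scaled by the square roots of their eigenvalues
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]  # largest eigenvalues first
coords = vecs[:, order[:2]] * np.sqrt(np.maximum(vals[order[:2]], 0))
```

`Z` can be handed straight to `scipy.cluster.hierarchy.dendrogram`, and `coords` gives the 2D MDS configuration to scatter-plot.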
Here’s a plot showing the observed and predicted confusion probabilities (closer to the diagonal = better fit):
Overall, the fit seems pretty good, though not perfect; there aren’t any really huge discrepancies between the observed and predicted probabilities. For whatever reason, the model fits the Cantonese listeners’ data best, with the largest discrepancies coming from a couple of data points from the Japanese listeners.
As a side note, the plots generated by the R script look a bit better than the plots generated by the Python script, but I’ve been fiddling with R plots for a few years now, while I’m still figuring out how to do this kind of thing with Python. I’m pretty happy with Python so far, though I’d really like to be able to remove the box and just have x- and y-axes. But I digress…
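(For anyone with the same itch: in matplotlib, hiding the top and right spines gets you bare x- and y-axes. A minimal sketch, assuming a standard pyplot setup:)

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, just for this sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 1, 3])

# remove the box: keep only the left and bottom spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
fig.savefig('no_box.png')
```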
Here are dendrograms for the hierarchical cluster models fit to the estimated distances. In each of these, the y-axis indicates (estimated) distance, with the relative heights of the clusters indicating the relative dissimilarity of the tones (indicated by the numbers at the bottom):
There are two obvious things to note about these plots. The most obvious similarity is that tones 1 and 4, on the one hand, and tones 2 and 3, on the other, form the bottom two clusters for each language group: 1 and 4 are more similar (less distant) to one another than either is to tone 2 or 3, and likewise for 2 and 3. The most obvious difference is that tones 1 and 4 are more similar than tones 2 and 3 for the English listeners, whereas tones 2 and 3 are more similar than tones 1 and 4 for the Japanese and Cantonese listeners. (This would be easier to see if I knew how to make the labels and colors consistent across these plots, so that 1 and 4 were always on the right in the red cluster, with 2 and 3 on the left in the green cluster, but, again, I’m still figuring all this out in Python, so this will have to do for now.) It’s also pretty clear that the 1-4 and 2-3 clusters are less similar to each other for the Cantonese listeners than for either other group.
Here’s an MDS plot with all three groups’ data presented together (the letters and colors indicate the language group, and the numbers indicate the tones):
As with the cluster analysis, it’s clear that tones 1 and 4 pattern together, as do tones 2 and 3. The differences in 1-4 vs 2-3 similarity across groups are also evident here.
Finally, here’s a plot of each group’s bias parameters (colors are as in the MDS plot; tones are, again, indicated by the numbers on the x-axis):
I was a bit surprised by how similar the bias parameters are for all three groups, though there are some potentially interesting differences. The English listeners mostly just didn’t want to label anything “4”, while the Japanese listeners seemed to be more biased toward “1” and, to a greater degree, “2” responses than either “3” or “4” responses. The Cantonese listeners exhibit a similar pattern, though with a weaker bias toward “2” responses.
Okay, so where does all this leave us? None of what I’ve done here actually provides statistical tests of any patterns in the data, though the SCM (and related models) can be elaborated and rigorous tests can be carried out by constraining similarities and biases in various ways and comparing fitted models. And, as mentioned above, log-linear analysis is a better out-of-the-box method for analyzing this kind of data than is ANOVA.
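To make that concrete, here’s a sketch of what one such test could look like (this is entirely hypothetical, with invented counts, not So & Best’s data or analysis): fit the SCM by maximum likelihood with and without an equal-bias constraint, and compare the two fits with a likelihood-ratio test.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# made-up confusion counts (rows = stimuli, cols = responses)
counts = np.array([[40.0, 5, 3, 2],
                   [4, 35, 8, 3],
                   [2, 10, 30, 8],
                   [6, 2, 5, 37]])
n = counts.shape[0]
iu = np.triu_indices(n, k=1)   # indices of the free (off-diagonal) similarities
k_sim = len(iu[0])

def neg_loglik(theta, equal_bias):
    # log-parameterized similarities stay positive; diagonal fixed at 1
    eta = np.eye(n)
    eta[iu] = np.exp(theta[:k_sim])
    eta = eta + eta.T - np.eye(n)
    if equal_bias:
        beta = np.full(n, 1.0 / n)
    else:
        b = np.exp(np.append(theta[k_sim:], 0.0))  # last bias anchored at 1
        beta = b / b.sum()
    P = beta * eta
    P = P / P.sum(axis=1, keepdims=True)
    return -np.sum(counts * np.log(P))

# fit the constrained (equal-bias) model, then warm-start the full model from it
constrained = minimize(neg_loglik, np.zeros(k_sim), args=(True,))
full = minimize(neg_loglik, np.append(constrained.x, np.zeros(n - 1)),
                args=(False,))

# likelihood-ratio test with df = n - 1 (the number of freed bias parameters)
lr = 2.0 * (constrained.fun - full.fun)
p_value = chi2.sf(lr, df=n - 1)
```

A small p-value would indicate that response biases differ reliably across the four tones; analogous constraints on the similarities, or on parameters across listener groups, give tests of the other patterns discussed above.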
Statistical tests aside, though, I would argue that the SCM and clustering/scaling methods are far better than So & Best’s presentation and visualization of the data. The SCM allows us to look separately at pairwise stimulus similarity and response bias, and it allows us to readily generate easy-to-interpret figures that illustrate patterns that are not at all obvious in the raw data (or in other figures; e.g., I find So & Best’s Figure 3 rather difficult to interpret or draw any kind of generalization from).
As much as I’d like tools like these to be more widely used, I’m not terribly hopeful that they will be any time soon. But I’ll keep promoting them anyway.