Big win for linguistics

Okay, so it’s not really a “win”, and it’s not “for linguistics” in any relevant way. It’s more a moderately interesting statistical finding that happens to make a language-based data set stand out from a bunch of possibly similar data sets.

I was reading this post on Three-Toed Sloth, and followed the “we’ve explained elsewhere” link to an arXiv paper on, and titled, “Power-law distributions in empirical data”. Intrigued, I printed and read the paper. It is a very interesting and very clear presentation of a method for estimating power-law parameters, testing how well a power-law distribution accounts for data, and comparing a power-law fit to analogous fits from other, empirically similar distributions. Good stuff, and it looks to be very useful for people who want to be careful before asserting that their data follow a power-law distribution.
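For the curious, here is a rough idea of what the “estimating power-law parameters” step can look like. This is a minimal sketch of my own, not the authors’ code: it uses the standard maximum-likelihood estimate for the exponent of a continuous power law, alpha = 1 + n / sum(ln(x_i / xmin)), and it assumes the cutoff xmin is already known. The function names `sample_power_law` and `fit_alpha` are made up for illustration, and the data are synthetic.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= xmin."""
    # Inverse-transform sampling: x = xmin * (1 - u)^(-1 / (alpha - 1)), u in [0, 1).
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(xs, xmin):
    """Maximum-likelihood estimate of alpha from the values at or above xmin."""
    tail = [x for x in xs if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all observations equal xmin; alpha is not identifiable")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    xs = sample_power_law(50_000, alpha=2.5, xmin=1.0, rng=rng)
    print(fit_alpha(xs, xmin=1.0))  # should land close to the true value of 2.5
```

Note what this leaves out: it does not choose xmin, it does not test goodness of fit, and it does not compare against alternatives such as the log-normal, which are the other parts of the method described above. Word counts are also discrete, so the continuous formula is only an approximation for data like word frequencies.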

But I’m not going to discuss the merits of the statistical work presented. Rather, I’m going to focus on a trivial fact about the authors’ application of their method to a mess of actual data sets. On page 22, they describe 24 allegedly power-law distributed data sets from a wide variety of scientific fields. Data set #1 is “the frequency of occurrence of unique words in the novel Moby Dick by Herman Melville.”

Well, guess what? This is the only data set that actually seems to follow a power law! On page 26, the authors write that “[t]here is only one case—the distribution of the frequencies of occurrence of words in English text—in which the power law appears to be truly convincing, in the sense that it is an excellent fit to the data and none of the alternatives carries any weight.”

Take that, intensities of earthquakes occurring in California between 1910 and 1992, measured as the maximum amplitude of motion during the quake. Stick that in your pipe and smoke it, degrees of metabolites in the metabolic network of the bacterium Escherichia coli. So sorry you might be log-normally distributed, number of species alive today, but also including some recently extinct species, where “recent” in this context means the last few tens of thousands of years, per genus of mammals.
