I’ve encountered two potentially problematic uses of sensitivity and specificity recently. One is simply an egregious error. The other is a combination of, on the one hand, an oversimplification of the relationship between these and the area under a receiver operating characteristic curve and, on the other, an illustration of one of their important limitations.
Just in case you don’t want to read the wikipedia entry linked above, here are some quick definitions. Suppose you have a test for diagnosing a disease. The sensitivity of the test is the proportion of people with the disease that the test correctly identifies as having the disease (i.e., the hit rate, henceforth H). The specificity of the test is the proportion of people without the disease that the test correctly identifies as not having the disease (i.e., the correct rejection rate, henceforth CR).
H and CR are useful measures, to be sure, but they obscure some important properties of diagnostic tests (and of binary classifiers in probabilistic decision making in general). Rather than H and CR, we can (and should) think in terms of d’ – the “distance” between the “signal” and “noise” classes – and c – the response criterion. Here’s a figure from an old post to illustrate:
In this illustration, the x-axis is the strength of evidence for disease according to our test. The red curve illustrates the distribution of evidence values for healthy people, and the blue curve illustrates the distribution of evidence values for people with the disease. The vertical dashed/dotted lines are possible response criteria. So, in this case, d’ would be , where is some measure of the variability in the two distributions. It is useful to define c as the signed distance of the response criterion with respect to the crossover point of the two distributions. I’ll note in passing that I’m eliding a number of important details here for the sake of simplicity (e.g., the assumption of equal variances in the two distributions, the assumption of normality in same), which I’ll come back to below.
H and CR are determined by d’ and c. H is defined as the integral of the blue curve to the right of the criterion, and CR as the integral of the red curve to the left of the criterion.
So, one important property of a binary classifier that H and CR obscure but that d’ and c illuminate is the fact that, for a given d’, as you shift c, H and CR trade off with one another. Shift c leftward, and H increases while CR decreases. Shift c rightward, and H decreases while CR increases. In the figure above, you can see how the areas under the red and blue curves differ for the dashed and dotted vertical lines – H is lower and CR higher for the dotted line than for the dashed line.
Another important property of a binary classifier is that, as you increase d’, either H increases, CR increases, or both increase, depending on where the response criterion is. In the above figure, if we increased the distance between and (without changing the variances of the distributions) by shifting to the left by some amount and by shifting to the right by , both H and CR would increase.
The egregious error I encountered is in the “Clinical Decision Analysis Regarding Test Selection” section of the ASHA technical report on (Central) Auditory Processing Disorder, or (C)APD (I’ll quote the report as it currently stands – I plan to email someone at ASHA to point this out, after which, I hope it will be fixed):
The sensitivity of a test is the ratio of the number of individuals with (C)APD detected by the test compared to the total number of subjects with (C)APD within the sample studied (i.e., true positives or hit rate). Specificity refers to the ability to identify correctly those individuals who do not have the dysfunction. The specificity of the test is the ratio of normal individuals (who do not have the disorder) who give negative responses compared to the total number of normal individuals in the sample studied, whether they give negative or positive responses to the test (i.e., 1 – sensitivity rate). Although the specificity of a test typically decreases as the sensitivity of a test increases, tests can be constructed that offer high sensitivity adequate for clinical use without sacrificing a needed degree of specificity.
The egregious error is in stating that CR is equal to 1-H. As illustrated above, it’s not.
The oversimplification-and-limitation-illustration was in Part II of a recent Slate Star Codex (SSC) post. Here’s the oversimplification:
AUC is a combination of two statistics called sensitivity and specificity. It’s a little complicated, but if we assume it means sensitivity and specificity are both 92% we won’t be far off.
AUC here means “area under the curve,” or, as I called it above, area under the receiver operating characteristic curve (or AUROC). Here’s a good Stack Exchange answer describing how AUROC relates to H and CR, and here’s a figure from that answer:
The x axis in this figure is 1-CR, and the y axis is H. The basic idea here is that the ROC is the curve produced by sweeping across all possible values of c for a test with a given d’. If we set c as far to the right as we can, we get H = 0 and CR = 1, so 1-CR = 0 (i.e., we’re in the bottom left of the ROC figure). As we shift c leftward, H and 1-CR increase. Eventually, when c is as far to the left as we can go, H = 1 and 1-CR = 1 (i.e., we’re at the top right of the ROC figure).
The AUROC can give you a measure of something like d’ without requiring, e.g., assumptions of equal variance Gaussian distributions for your two classes. Generally speaking, as d’ increases, so does AUROC.
So, the oversimplification in the quote above consists in the fact that the AU(RO)C does not correspond to single values of H and CR.
Which brings us to the illustration of the limitations of H and CR. To get right to the point, H and CR don’t take the base rate of the disease into account. Let’s forget about the conflation of AUROC and H and CR and just assume we have H = CR = 0.92. Per the example in the SSC post, if you have a 0.075 proportion of positive cases, H and CR are problematic: you have 92% accuracy, but less than half of the people identified by the test as diseased actually have the disease!
The appropriate response here is to shift the criterion to take the base rate (and the costs and benefits of each combination of true disease state and test outcome) into account. Given how often Scott Alexander (i.e., the SSC guy) argues for Bayesian reasoning and utility maximization, I am a bit surprised (and chagrined) he didn’t go into this, but the basic idea is to respond “disease!” if the following inequality holds:
Here, is the probability of a particular strength of evidence of disease given that the disease is present, is the probability given that the disease is not present, and are the prior probabilities of the disease being present or not, respectively, and is the utility of test outcome and reality (e.g., is the cost of the test indicating “disease” while in reality the disease is not present).
The basic idea here is that the relative likelihood of a given piece of evidence when the disease is present vs when it’s absent needs to exceed the ratio on the right. By “piece of evidence” I mean something like the raw, pre-classification score on our test, which corresponds to the position on the x axis in the first figure above.
The ratio on the right takes costs and benefits into account and weights them with the prior probability of the presence or absence of the disease. We can illustrate a simple case by setting and and just focusing on prior probabilities. In this case, the inequality is just:
In the SSC case, and , so the criterion for giving a “disease!” diagnosis should be . That is, we should only diagnose someone as having the disease if the strength of evidence given the presence of the disease is 12+ times the strength of evidence given the absence of the disease.