Friday, June 5, 2015

Precision != Precision, or Measurement, Categorical Variables, and Polysemy

Two words in data science are unfortunately pronounced and spelled exactly the same. They are 'precision' and 'precision'.

Both are very technical in meaning. Their informal meanings, though not exactly wrong and metaphorically in the ballpark, give little clue as to the exact definitions.

The first meaning is relevant to measurement. It means 'how many digits are used in a numerical measure' or, very similarly, how small the variance of a set of measurements is. This is in contrast to 'accuracy', which means 'how correct on average'. A number is 'precise' if it has lots of digits to the right of the most significant digit (and a set is precise if it has very small variance). A set of numbers is 'accurate' if the set's average is very close to the true value (note the grammar: precision can apply to a single number, but accuracy is for a set). Here is a classic picture of the difference between 'precision' and 'accuracy' (from Wikipedia):


Another view is high and low precision and accuracy (from NOAA)
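The same distinction can be checked numerically. Here is a minimal sketch, assuming a made-up true value of 10.0 and two invented sets of measurements (none of these numbers come from the pictures); `bias` and `spread` are names chosen just for this illustration:

```python
from statistics import mean, pvariance

TRUE_VALUE = 10.0  # hypothetical "real" quantity being measured

# Invented data: tightly clustered, but clustered around the wrong value.
precise_but_inaccurate = [12.01, 12.02, 11.99, 12.00]
# Invented data: scattered widely, but centered on the true value.
accurate_but_imprecise = [8.0, 12.0, 9.0, 11.0]


def bias(measurements):
    """Distance of the set's average from the true value (low = accurate)."""
    return abs(mean(measurements) - TRUE_VALUE)


def spread(measurements):
    """Population variance of the set (low = precise)."""
    return pvariance(measurements)


for name, data in [
    ("precise but inaccurate", precise_but_inaccurate),
    ("accurate but imprecise", accurate_but_imprecise),
]:
    print(f"{name}: bias={bias(data):.3f}, variance={spread(data):.5f}")
```

The first set has a tiny variance but an average about 2 away from the true value; the second has an average right on the true value but a much larger variance.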



The second definition, relevant to 2x2 contingency tables, is technically 'TP/(TP+FP)': the ratio of True Positives to all positive test results (the sum of True Positives and False Positives). What it means (for how well a test measures reality) is how often a positive test result corresponds to the phenomenon really being present. An exact technical synonym is Positive Predictive Value, or PPV. That name is almost as metaphorically meaningful, but really that doesn't matter: the meaning is stipulated to be the ratio. The generic picture of a 2x2 contingency table is (from alpine.atlassian):
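To make the ratio concrete, here is a small sketch that computes it from the four cells of a 2x2 table. The counts are invented for illustration, and the function name `precision` is just a convenient label, not any particular library's API:

```python
def precision(tp, fp):
    """Precision / PPV: true positives over all positive test results."""
    positive_results = tp + fp
    if positive_results == 0:
        raise ValueError("precision is undefined when the test is never positive")
    return tp / positive_results


# Hypothetical 2x2 table counts (not taken from any real test).
tp, fp = 40, 10  # test positive: phenomenon present / absent
fn, tn = 5, 45   # test negative: phenomenon present / absent

ppv = precision(tp, fp)
print(f"precision (PPV) = {tp}/({tp}+{fp}) = {ppv:.2f}")
```

Note that FN and TN do not enter the calculation at all: precision only looks at the row (or column, depending on how the table is drawn) where the test came back positive.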



But wait! you say. You see two-by-two tables in each case, and both are about how well a test matches reality. Isn't that the same? Yes, they involve some similar principles, but they appear in different circumstances. One is about the significant digits (or variance) of real values vs the average, computations on continuous values where variance and average are very different things. The other compares two binary (yes/no) values, the test result and the real state, on the same true-vs-false dimension.


In addition, note that 'accuracy' is also two words spelled the same way (one for the average being close to the true value, and, for 2x2 tables, the ratio of TP plus TN to the total, where TP and TN are the diagonal in the image above). The contingency table 'accuracy' is not as popular a concept/term though.
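A quick sketch shows that the contingency-table 'accuracy' and 'precision' can disagree sharply on the same table. The counts below are invented to make the point (a rare phenomenon and a test that over-reports it), and the helper names are only for illustration:

```python
def table_accuracy(tp, fp, fn, tn):
    """Contingency-table accuracy: the diagonal (TP + TN) over the total."""
    return (tp + tn) / (tp + fp + fn + tn)


def table_precision(tp, fp):
    """Contingency-table precision (PPV): TP over all positive test results."""
    return tp / (tp + fp)


# Hypothetical counts: only 1 real case in 100, and 10 positive test results.
tp, fp, fn, tn = 1, 9, 0, 90

acc = table_accuracy(tp, fp, fn, tn)
prec = table_precision(tp, fp)
print(f"accuracy = {acc:.2f}, precision = {prec:.2f}")
```

Here the test is right for 91 of the 100 cases, yet only 1 in 10 of its positive results is real, so the two numbers are far apart.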

The lessons to learn, then, are:

- these two concepts, spelled the same way, are very different, even though metaphorically they both have something to do with how good a set of numbers is.

- some technical words have more than one meaning. Really, really different meanings. But usually context will tell you which is which. If you're talking about just the quality of a metric by itself, then 'precision' is about the variance (or the number of digits). If you're talking about contingency tables, then it's the same as PPV (positive predictive value).

See also:
http://en.wikipedia.org/wiki/Accuracy_and_precision
which has a section on both.
