Tuesday, May 31, 2016

Kinds of Data: there are more than just the basic four

The science of statistics specifies that data points come from four basic types:

  • nominal - these are incomparable labels like truth (yes, no), color (red, blue, green), country (UK, France, Germany, Italy). There is no relation among these elements other than that they are in the same set. All you know about them is their names and whether one name is the same as or different from another.
  • ordinal - only a rank is known (1st, 2nd, 3rd...) and nothing else (we don't know how far ahead 1st is from 2nd), just the order, like finishing order in a race.
  • interval - we know the distance between any two elements, A - B, like the difference between two temperatures in Celsius or between two calendar dates.
  • ratio - we also know the ratio of two numbers, where for example A can be twice B, like the half-life of an element.
Notice how I describe these both mathematically and conceptually, because a set drawn from a mathematical domain, like the reals, can often be interpreted as any one of these types. The reals, for example, obviously have a ratio by division and a distance by difference, can be ordered by 'less than', and can be made categorical by using cutoffs, say >= 0 for yes and < 0 for no.
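To make that concrete, here is a minimal sketch that reads the same handful of real numbers at each of the four levels. The values and variable names are made up for illustration; only the cutoff at 0 comes from the paragraph above.

```python
# Hypothetical real-valued measurements (illustrative values only).
a, b, c = 8.0, 4.0, -2.0

ratio = a / b                      # ratio reading: a is twice b
distance = a - b                   # interval reading: a is 4 units from b
order = sorted([a, b, c])          # ordinal reading: only 'less than' is used
labels = ["yes" if x >= 0 else "no" for x in (a, b, c)]  # nominal reading via the cutoff at 0

print(ratio, distance, order, labels)
```

Each line throws away a little more structure than the one before it: the labels no longer know that a is bigger than b, let alone by how much.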

Of course, as with most systematizations, this list came after years of using methods that were created to work with whatever data was at hand. When the data just didn't work with those methods, new analogous methods were created, or entirely new methods were borrowed from quite different purposes.

Statistical procedures seem geared to work with one of these types. Chi-squared tests on contingency tables are good for categorical data, Wilcoxon signed ranks for ordinals, t-tests for interval data, Poisson models for count data. But mostly there are just two kinds, discrete and continuous, which fall to nominal/categorical statistics and pretty much all the rest of statistics respectively.
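As an example of a procedure built for one data type, here is a rough sketch of the Pearson chi-squared statistic on a contingency table, written out by hand. The function name and the counts are hypothetical, and no continuity correction is applied.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column categories were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up 2x2 table of counts: rows are two groups, columns are yes/no.
observed = [[30, 10], [20, 40]]
stat = chi_squared_statistic(observed)
print(round(stat, 3))
```

Notice that the only thing the statistic uses is how many items landed in each cell. The category names could be shuffled or relabeled and nothing would change, which is exactly what nominal data allows.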

Existing science isn't as deliberate as a current systematization, or as Monday-morning quarterbacking/textbook-writing may make it seem. It's more incremental, filling in gaps as needed rather than laying out the system ahead of time. You have a problem and you use a tool that works well enough right now, then you develop that tool incrementally until it metastasizes well beyond its initial conception. Contingency tables are great ways of summarizing tabular data, but you may want to do a significance test like all the t-test guys.

---

Any kind of systematization is an oversimplification, forgetting possibly irrelevant details to make different things look alike, and placing a particular item into that systematization is also forgetting possibly irrelevant details to make it look like one of a few categories. But sometimes those details are not so irrelevant.

Binary data is a subset of nominal data, with just two categories. Two-by-two contingency tables and logistic regression are especially designed to deal with it. Some multinomial categories will have some minimal relationship, say geographic location for countries, or wavelength for colors (colors are very complex because the brain processes them by multiple systems involving the wavelength, opponent-process pairs, and beyond). Rank data is ordinal by definition, but when encoded as numbers it can be processed as interval or even ratio data (depending on the interpretation desired).

These four data types work very well for statistics. But they seem underspecified. We're used to measuring quantities or counting objects, which is why all those categorical and interval methods apply so well. But there's so much more structure to the way things can be measured. Not humanities-style vague, wordy, qualitative description. Perfectly exact, just not necessarily a number.

There is an existing method for describing data. A very rich description method. It's mathematical notation. If data should be treated continuously, use R. If a vector over integers, Z^n. If an ordinal set, then that's a total order. If categorical, then you have a simple set. If the elements are related to each other pairwise but in a complex, restricted manner, then maybe a graph is the way to notate things. If the elements allow certain operations but not others, then maybe the data comes from a particular algebra, a Hilbert algebra, or instead a Banach algebra.
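A programming language's type system can carry some of the same information. What follows is only a sketch under my own assumptions: the names COUNTRIES, Finish and BORDERS are invented for illustration, and the border graph is restricted to the four countries used earlier.

```python
from enum import Enum
from functools import total_ordering

# Categorical: a simple set; only membership and equality mean anything.
COUNTRIES = frozenset({"UK", "France", "Germany", "Italy"})


@total_ordering
class Finish(Enum):
    """Ordinal: a total order with no arithmetic defined on it."""
    FIRST = 1
    SECOND = 2
    THIRD = 3

    def __lt__(self, other):
        if not isinstance(other, Finish):
            return NotImplemented
        # A smaller value means an earlier finish.
        return self.value < other.value


# Graph: pairwise relations between elements (here, shared land borders
# among these four countries only), stored as an adjacency mapping.
BORDERS = {
    "UK": set(),
    "France": {"Germany", "Italy"},
    "Germany": {"France"},
    "Italy": {"France"},
}

print(Finish.FIRST < Finish.THIRD)    # comparison is allowed
print("Italy" in BORDERS["France"])   # adjacency is a yes/no question
```

The point of the Finish class is what it refuses to do: you can compare two finishing places, but adding them raises a TypeError, because the type was never given that operation.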

Measurement is not always in the elementary numbers we count or measure or weigh with. There can be quite a bit more structure in the measurements than just a number.
