Least Uninteresting Number: What statisticians and ML'ers really think of each other

Labels aren't the thing; they just name the thing. The same thing can have different names, and many different things can share the same name. But people often take the label to be the thing.

'Statistics' and 'Machine Learning' are labels for two different things: not identical, but covering a lot of the same ground.

Statistics is concerned with averages and deviations, probability distributions, design of experiments, and regression, all in the service of extracting knowledge from tables of numerical data. The usual single-sentence summaries are hardly distinguishable from those of many other things with data in their title, like databases or IT (Information Technology).

Machine Learning is a subset of Artificial Intelligence (itself considered a subset of Computer Science, but practiced and motivated by other engineering departments and by psychology-related fields including linguistics, philosophy, and neuroscience). It tries to extract patterns out of numerical data too, but has a different provenance. The two overlap somewhat, but each has its own separate culture and methods.

And more to the point, they’re really trying to do mostly the same things, and the math for both is often identical.
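To make the "identical math" point concrete, here is a minimal sketch of one familiar case: fitting a straight line. The data values and the function names (`ols_closed_form`, `gradient_descent`) are made up for illustration, and the learning rate and step count are assumptions tuned to this toy data. One function is written the way a statistics textbook would put it, the other the way an ML tutorial would, and both minimize the same squared error.

```python
# Hypothetical toy data: y is roughly 2*x + 1 with small made-up noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]


def ols_closed_form(xs, ys):
    """Statistics view: least-squares estimates from means and deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """ML view: minimize mean squared error by plain gradient descent.

    lr and steps are assumptions chosen so this toy problem converges.
    """
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


print("closed form:     ", ols_closed_form(xs, ys))
print("gradient descent:", gradient_descent(xs, ys))
```

Up to floating-point error, both routes should land on the same slope and intercept. What differs is the vocabulary and the habits around the fit, not the objective being minimized.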

But what do they really think of each other?

From the point of view of the statisticians (people who give themselves that label or are employed by institutions with that label), ML is a handful of ad hoc 'predictive analytics' done by a bunch of computer scientists, engineers, or amateurs (or worse!) pulling it out of their ass. Their methods are immature (they don't know anything!) and don’t take into account the decades of principles that the more mature statisticians have established for quality of results. That is, ML may do new, interesting things, but they usually aren’t that new, and the ML people have never thought of all the methodological pitfalls that statistical principles already manage so well (think of the data!). The statisticians may begrudgingly acknowledge that some of the ML methods are externally successful, but really, with such complicated models, how do you know if it is any good outside of your toy domain when you haven’t done a proper analysis of your distributional assumptions? You ML people don't actually know anything!

People who say that they do ML probably do not give themselves the label statistician or work in a statistics group, but rather ‘are’ computer scientists or engineers. Their point of view is that statisticians are studying pointless details about ancient, brittle methods that aren’t particularly interesting, don’t really apply to all the new data sources, and just aren’t as good as this shiny new toy. Also, Bayes says p-values are dumb! The ML people may begrudgingly acknowledge that some of the statistical methods produce quality results, but really, who cares about the normal curve, and what about Bayes? You statisticians are so old and ossified!

From my point of view, it would be better for everybody if ML were considered a subset of statistics (but one successfully studied in other departments); ML methods could use a lot of analysis by statisticians. And a job that is labeled as data scientist should be easily fillable by a statistician or an ML person. Both sides need more exposure to the methods of the other.

See also Statistics and Machine Learning, Fight! (it's funding and conference culture), Statistical Modeling: The Two Cultures (by Breiman) (data vs. algorithmic modeling), and The Two Cultures: Statistics vs. Machine Learning for more opinions on the difference.

Thursday, September 10, 2015
