
Wednesday, March 13, 2019

Replace AI and ML in headlines with STATISTICS

Whatever the culture, machine learning methods are statistical. Even if people, both academic and pedestrian, distinguish ML and stats in a practical sense, most ML methods are statistical and in fact created by statisticians. Vladimir Vapnik, the inventor of SVM, has the label 'statistics' (although in Russian) somewhere in his CV. Leo Breiman, the inventor of Random Forests (and a lot of other things), was both an industry consultant and professor of statistics.

Sure, neural networks (including their metastasized descendant, the Deep Learning deep neural network) were invented in the control/systems/cybernetics/computer area, but a label doesn't confer a monopoly on ideas. And the idea itself is essentially a cascade of logistic regressions, which is pretty easy to label as statistical.
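As a minimal illustration of that 'cascade of logistic regressions' view (a sketch only, with made-up random weights, using NumPy):

    # Each hidden unit applies a logistic (sigmoid) function to a weighted sum of
    # the previous layer's outputs -- i.e., each unit is a logistic regression on
    # the layer below. Weights here are random, purely for illustration.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                                   # one input with 4 features

    W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)     # layer 1: 3 "logistic regressions"
    W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)     # layer 2: 1 "logistic regression"

    h = sigmoid(W1 @ x + b1)      # each component is a logistic regression on x
    y = sigmoid(W2 @ h + b2)      # a logistic regression on the hidden outputs
    print(h, y)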

All this is to say that all those AI techniques that are so big in the news... you could replace those headlines with ...


All these could be rewritten as:


----

What's the point of all this?
Absolutely nothing. But AI and ML tend to be words thrown around as though they're magic. They're not magic. So using 'statistics' instead will bring some sobriety to the conversation. The things that are coming out nowadays are really cool and revolutionary and are real progress in science... but it's not some magical genius in silicon, it's just little math tricks that have built up over time. It's not some science fiction faster-than-light warp drive, it's old tech that has been optimized little by little and it only just popped over the threshold into the mainstream.
---
Of course, not all cool new things in AI and ML are statistical. All the ones you hear about in the news lately are. Except the poker playing machine Libratus. There is a portion of it that involves learning from many games, but the major new process is not anywhere near what is traditionally called 'statistics'.

Wednesday, December 7, 2016

Testing Software and Machine Learning

Testing is a major part of any non-trivial software development project. All parts of a system require testing to verify that the engineered artifact does what it claims to do. A function has inputs and expected outputs. A user interface has expectations of both operation and ease of interaction. A large system has expectations of inter-module operability. A network has expectations of latency and availability.

Machine learning produces modules that are functional, but with the twist that the system is statistical. ML models are functional in that there is a one-to-one correspondence between inputs and outputs, but with the expectation that not every such output is ... desired.

Let's start though with a plain old functional model. To test that it is correct, you could check the output against all possible inputs. That's usually not feasible, and anyway is in the more esoteric domain of proving program correctness. What is preferred is instance checking: checking that a particular finite set of inputs gives exactly the corresponding correct outputs. This is formally called 'unit testing'. It usually involves simply a list of inputs and the corresponding expected outputs. The quality of such a set of unit tests relies on coverage of the 'space': edge cases (extreme values), corner cases (more than one variable at an extreme value), important instances, generic instances, random instances, etc. Also, any time a bug is found and corrected, the faulty instance can be added to the list to ensure it doesn't happen again.
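A minimal sketch of what such a unit test looks like in practice (the conversion function and the chosen cases are hypothetical, just for illustration):

    # Instance checking for an ordinary function: a finite list of inputs and
    # their expected outputs, every one of which must match exactly.
    import unittest

    def celsius_to_fahrenheit(c):        # hypothetical function under test
        return c * 9.0 / 5.0 + 32.0

    class TestConversion(unittest.TestCase):
        def test_known_instances(self):
            cases = [(0, 32.0), (100, 212.0), (-40, -40.0)]   # edge and generic cases
            for c, expected in cases:
                self.assertAlmostEqual(celsius_to_fahrenheit(c), expected)

    if __name__ == "__main__":
        unittest.main()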

An ML function is, by its name, also functional, but the expectations are slightly different. An ML model (the more technical term for an ML function) can be wrong sometimes. You create a model using directly corresponding inputs and outputs, but then when you test on some new items, most items will have correct outputs, but some may be incorrect. A good ML model will have very few incorrect, but there is no expectation that it will be perfect. So when testing, a model's quality isn't a yes or no, that absolutely every unit test has passed, but rather that a number of unit tests beyond a threshold have passed. So in QA, if one instance doesn't pass, here that's OK. It's not great, but it is not a deal breaker. If all tests pass, that certainly is great (or it might be too good to be true!). But if most tests pass, for some definition of 'most', then the model is usable.
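A sketch of how that looser contract might be expressed as a test (the model.predict interface and the 0.95 threshold are assumptions, not anyone's standard):

    # Instead of requiring every instance to pass, require that the pass rate
    # clears a threshold agreed on in advance.
    def test_model_accuracy(model, test_inputs, expected_outputs, threshold=0.95):
        predictions = model.predict(test_inputs)
        correct = sum(p == e for p, e in zip(predictions, expected_outputs))
        accuracy = correct / len(expected_outputs)
        assert accuracy >= threshold, f"accuracy {accuracy:.3f} below threshold {threshold}"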

There are three major parts to creating a machine learning model that need to be tested (where something can go wrong and where things can be changed): the method or model itself, the individual features supplied to the model, and the selection of data. The method or model itself is the domain of the ML engineer, analogous to regular coding.

I can almost go so far as to say that testing is already largely integrated into ML methods. Testing uses systematic data to check the accuracy of code; ML methods use systematic data in the creation of a model (which is executable code). And so if an independent team, QA or testing, is to be involved, they need to be aware of the statistical methods used, how they work, and all the test-like parts of the model.

Let's take logistic regression as an example. The method itself fits a threshold function (the logistic function) to a set of points (many binary features, one continuous output feature between 0 and 1). Immediately from the regression fitting procedure you get correlation coefficients, closeness of fit, AUC, and other measures of goodness. There are some ways to improve the regression results without changing the input data, namely regularization (constraints on the model) and cross validation. For the features (mostly independent of the method), there are the number of features, how correlated they are, and how predictive each feature is individually; each feature could be analyzed for quality on its own. And last, for the selection of data (also independent of the method), there's selection bias and the separation into training, validation, and test sets.
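A hedged sketch of those pieces with scikit-learn (assumed available): a regularized logistic regression, cross-validated AUC, and the fitted coefficients, on synthetic data purely for illustration.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    model = LogisticRegression(penalty="l2", C=1.0)      # C controls regularization strength
    auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print("AUC per fold:", auc_scores, "mean:", auc_scores.mean())

    model.fit(X, y)
    print("coefficients:", model.coef_)                  # one weight per feature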


Where QA can be involved directly or indirectly:

- ensuring metric thresholds are met - analogous to overseeing unit-test coverage of code
- questioning stats methods - like being involved in architecture design
- cross-validation - both makes the model better (less overfitting) and returns quality metric
- calibration of stats methods - quality of prediction
- test and training data selection - to help mitigate selection bias
- testing instances - for ensuring output for specific instances (must-have unit tests)
- feedback of error instances - helps improve model
- quality of test/training data - ensuring few missing values/typos/inappropriate outliers
- UX interaction of humans with inexact system - final say - does the model work in the real world via the application, has interaction with people shown any hidden variables, any unmeasured items, any immeasurables, any gaming of the system by users.

The latter seems to be the most attackable by a dedicated QA group that is functionally separate from an ML group, and all the previous ones seem to sit on the other side, solely the domain of ML to the exclusion of a QA group. But hopefully the discussion above shows that they're all the domain of both. There should be a lot of overlap in what the ML implementers are expected to do and what QA is expected to do. Sure, you don't want to relieve engineers of their moral duty to uphold quality. The fact that QA may be looking over quality issues doesn't mean the data scientist shouldn't care. Just as software engineers writing regular code should be including unit tests as part of the compilation step, the data scientist should be checking metrics as a matter of course.

Tuesday, May 31, 2016

Deep Learning: Not as good, not as bad as you think.

Deep Learning is a new (let's say 1990, but common only since 2005) ML method for identification (categorization, function creation) used mostly in vision and NLP.

Deep Learning is a label given to traditional neural nets that have many more internal nodes than ever before, usually designed in layers to feed one set of learned 'features' into the next.

There's a lot of hype:

Deep Learning is a great new method that is very successful.

but

Deep Learning has been overhyped.

and even worse:

Deep Learning has Deep Flaws

but

(Deep Learning's deep flaws)'s deep flaws

Let's look at details.

Here's the topology of a vision deep learning net:

(from Eindhoven)

Yann LeCun
What's missing from deep learning?
1. Theory
2. Reasoning, structured prediction
3. Memory, short-term/working/episodic memory
4. Unsupervised learning that actually works

From all that, what is it? Is DL a unicorn that will solve all our ML needs? Or is DL an overhyped fraud?

With all such questions, the truth is somewhere between the two extremes, we just have to figure out which way it leans.

Yes, there is a lot of hype. It feels like whatever real-world problem there is, world hunger or global warming, DL will solve it. That's just not the case. DLs are predictive model machines, very good at learning a function (given lots of training data). The function may be a yes or no, or even a continuous function, but it still takes an input and gives an output that's likely to be right or close to right. Not all real-world problems fit that (parts of them surely do, but that's not 'solving' the real-world problem).

Also, DLs take a lot of tweaking and babysitting. There are lots of parameters (number of nodes, topology of layers, learning methods, special gimmicks like autoencoding, convolution, LSTM, etc., each with lots of parameters of their own). And there are lots of engineering advances that have made DLs successful, but these aren't specific to DL: more and better data, better software environments, super fast computing environments, and so on.

However, there are few methods nowadays that are as successful across broad applications as DL. They really are very successful at what they do and I expect lots of applications to be improved considerably with a DL.

Also, for all the tweaking and engineering that needs to be done (as opposed to the comparatively out-of-the-box implementations of regression, SVMs, and random forests), there are all sorts of tools publicly available to make that tweaking much easier: Caffe, Theano (and libraries on top of it like Keras or Lasagne), Torch, Nervana's Neon, CGT, or Mocha in Julia.
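For a sense of how little code those libraries ask for, here is a rough sketch using the Keras API (via TensorFlow, assumed installed); the tiny layered network and the random data are purely illustrative, not a recipe.

    import numpy as np
    from tensorflow import keras

    # made-up data: 1000 examples, 20 features, a simple binary label
    X = np.random.rand(1000, 20)
    y = (X.sum(axis=1) > 10).astype(int)

    # a small stack of layers, each feeding its learned features into the next
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)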


So there are lots of problems with DLs. But they're the best we have right now and do stunningly well.

Tuesday, March 29, 2016

Can Deep Learning be applied to Automated Deduction?

Deep learning is just a deeply layered neural network, and by 'deep' they mean more than one internal layer. It has recently gained much attention because of its successes with images and Go and speech and all sorts of things plain old NNs weren't doing so well at.

But what about automated deduction? That's not rhetorical; this post is entirely speculative and answer-free. I am wondering if DL (or any ML technique) could be thrown at AD, or ATP (Automated Theorem Proving), whatever you'd like to call it.

First the optimism: wow, wouldn't that be cool, a method that would prove really hard mathematical theorems (the ATP community has to work hard and do a lot by hand to do things like Flyspeck), or even... But desiring the outcome doesn't say anything about the implementation.

Next the pessimism. By analogy, images and speech/text are iffy optimization problems: lots of perturbations get you the same thing. But logic (and similar combinatorial problems) is all or nothing; it is either true/proven or false/wrong. How could an approximate optimization method translate to logical combinatorial problems? And where do you get the scads of supervised instances required by DL? There are millions of tagged images available for vision, but TPTP has literally only thousands of problems.

So, I don't know. But just because I don't know how to do it doesn't mean it can't be done.

Monday, October 26, 2015

Vapnik says "Deep Learning is the Devil"...maybe

Zach Lipton gave a summary of Vapnik's talk at Second Yandex School of Data Analysis conference (October 5-8, 2015, Berlin). Lipton wrote:
Vapnik posited that ideas and intuitions come either from God or from the devil. The difference, he suggested, is that God is clever, while the devil is not.
and

Vapnik suggested that the devil appeared always in the form of brute force.
and

[Vapnik] suggested that the study of machine learning is like trying to build a Stradivarius, while engineering solutions for practical problems was more like being a violinist

My interpretation of all this is that it is about the difference between science and engineering, or general vs. specific. Coming up with a good general algorithm (I'm guessing Vapnik is thinking of SVMs or the idea of neural networks) is the study, the science, of ML, but most successes of Deep Learning (really just particular, and particularly large, neural networks) come from the specific design of a given DL network.

As to clever vs. brute force, somehow the statement that can be extracted is that DL is not clever but devilishly brute force. I'm not sure how to make sense of this (I don't see how DL is more brute force than SVM or logistic regression or random forests). Unless the point is that all the work that must be done in engineering a good DL is in creating the topology of nodes; this is not automatic at all and needs a lot of cleverness to make a successful learner. But the DL part enables that cleverness (which would otherwise be impossible).

Cleverness is not easily scalable; you can't just throw a whole bunch of extra nodes and arbitrary connections into a DL and hope it learns well, you have to organize the layers well. Those details, the need to be clever, are what slow down the scaling, and I am guessing that is what is 'devilish' about DL.

This is all second hand, a rewording of suggestions filtered through someone's hearsay, connecting dots that are barely mentioned and far apart. I'm totally putting words in his mouth, but this is what I expect Vapnik really means (or what I think Lipton thinks that Vapnik thinks, all telegraphically expressed). But really, how much of anything is not like that?

Friday, October 9, 2015

What's the point of a hold-out set?

The purpose of predictive modeling is to collect some sample data and calculate some function that helps predict future unknown values, hopefully with low error (or high accuracy).

The classic statistical procedure takes the sample, a small subset of past data, called the data or, for later purposes, the training set; does some rocket science on that set (say, linear regression); produces the model (some coefficients, a small machine that says yes or no or outputs a guess on a single new data point); and maybe also produces some extra measures of how good or bad the fit is expected to be (correlation coefficient, F-test). And we're done. So many papers and studies over the years have followed this pattern.
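A minimal sketch of that classic procedure, using scipy (assumed available) on made-up numbers: fit once on the whole sample and read off the coefficients and fit measures.

    import numpy as np
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

    fit = stats.linregress(x, y)                       # the "rocket science" step
    print("slope:", fit.slope, "intercept:", fit.intercept)
    print("correlation r:", fit.rvalue, "p-value:", fit.pvalue)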

But... what is a hold-out set? The modern way (not that modern) is to split the sample randomly into two parts: the training set (on which to do the classic part) and the test set or hold-out set on which to check. Run the model on all of the items in the test set and see how bad the fit is. The test set is kept distinct from the training set because we want to validate on unseen data; we don't want to assume the very thing we're trying to prove.
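A minimal sketch of the hold-out idea with scikit-learn (assumed available), on synthetic data: fit on the training part, score on the unseen test part.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    # hold out 20% of the sample as a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on unseen test set:", model.score(X_test, y_test))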

Why do this? It seems like such a waste. Why in a sense throw away perfectly good sample data on a test when you could use it in making a more accurate model? Why in a sense test again when you can use that test data to train? More data is better, right?

Well, you're not really throwing it away, but it does seem like a secondary, minor desire. After all, don't most statistical procedures compute some sort of quality measure on the entire set first? This desire not to 'waste' hard won sample data is very understandable; most of the labor in an experiment is not the statistics but in gathering the actual data.

Of course one could weakly justify this test set by saying it gives more reliable quality statistics.

The real reason for a hold-out set is to combat overfitting. There are two sides to modeling: real-life data is not perfect, so the model tries to get close to the rule behind the data, but it may go too far and get close to the data itself instead of the rule. The classic step gets us the first part; the modern step avoids going too far. A hint to the purpose is another name for the hold-out set, the validation set, which gives a better idea of what it's for. You create a model with the training set, and validate it with the validation set. You're validating your model, making sure that it does well what you claim it does well. The first step in predictive modeling is to not underfit, to get close to the reality that the data hopefully represents. The test or validation step is to make sure you don't overfit, getting too close to the data at the expense of reality.

So I've weakly justified the desire for some kind of hold-out/test set. But how does one actually choose this set? Obviously a random subset, but what size? The primary issue is a balance between the model and the goodness of fit: with a smaller training set, more variance in the model; with a smaller test set, more variance in the stats. There's no hard and fast rule (80/20 is considered reasonable). There are a number of strategies to deal with this.


  • number, not proportion - just make sure you have enough data points in each, and after that the proportion doesn't matter as much
  • resample - do the test a few times on random subsamples. This is the very general bootstrap/jackknife procedure.
  • partition the data and validate each piece as a test set against the rest - cross validation (a sketch follows this list). The idea is to split the entire dataset into many pieces and do the test/training on each piece vs. the rest. That is, all data is used as part of a training set and all as part of a test set at some point. There are many strategies here: leave-one-out (LOOCV), where all but one item is the training set and the single remaining item is the test set, repeated for every single item in your data set. Under some models (like general linear regression models) you don't have to repeat the process n times because the math cancels out a lot (linearity is great!). Another method is k-fold CV, where you split your data into k pieces (in practice often 5 or 10), create a model on the n - n/k items outside a piece and validate on that piece, doing this for each of the k pieces. It takes more time (k times as much). LOOCV is essentially n-fold CV, so it is not efficient time-wise when model creation takes a while (as for an SVM).
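Here is the promised sketch of k-fold and leave-one-out cross validation with scikit-learn (assumed available), again on synthetic data purely for illustration.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

    X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
    model = LinearRegression()

    # 5-fold CV: each fifth of the data takes a turn as the test set
    kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print("5-fold R^2 scores:", kfold_scores)

    # LOOCV: one item at a time is the test set (R^2 is undefined on a single point,
    # so score with mean squared error instead)
    loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                                 scoring="neg_mean_squared_error")
    print("LOOCV mean squared error:", -loo_scores.mean())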

A lot of this ignores the issue of what to do if your validation set has bad performance. What is the statistically 'right thing to do' then? Do you rejigger things knowingly? How adaptive can you be and avoid p-hacking? I'll save that for later.

Saturday, September 26, 2015

More comparisons between Statistics and ML

This is a continuation of a post I made about differences between statistics and ML.

I'm not intentionally trying to piss people off ("How dare you imply that we are not as good as those other guys") but I suppose some things might be provocative and arguable. All generalizations are false but a dog with three legs is still a dog ("Are you calling me a dog? How dare you!"). Isn't the point here really that stats and ML have quite a bit in common? Also, I use 'data' as a mass noun "the data is consistent with an increase in effect". Like 'water', I use it grammatically as singular. So there. 

Knowledge doesn't come to us in a package; it is discovered piece by piece, following the path of least resistance, with no overarching systematic plan to fill out. Afterwards, the stories are made coherent and clean. and oversimplified for the textbooks. Also, different people in different academic cultures may explore the same things but with different basic tools. Some people call themselves X, some call themselves Y, they both do Z. But X and Y never communicate, not because they are competitors but because their motivations, their culture, the building they are housed in on campus, are so very different, they just aren't even aware of the other's existence.

Statistics started in the 1800s with government and economic numbers, then sociology (Quetelet), and then at the beginning of the 1900s with agronomy (Fisher), before exploding into every natural science (medicine, psychology, econometrics, etc.). Though it started from applications, the mathematics behind it (I blame Pearson?) came from mathematical analysis (all those normal curves and beta distributions are special functions of analysis). Everyday statistics is making hypotheses, doing a t-test, p-values, maximum likelihood estimators, Gamma distributions. The point of statistics is to take a lot of data and say one or two small things about it (x is better than y).

ML (machine learning), very distinctly, came out of the cybernetics/AI community, a mix of electrical engineers and computer scientists, each with its own subculture, but closer to each other than to statistics. The mathematics behind ML came out of numerical analysis and industrial engineering: decision trees, linear algebra, linear programming. Everyday ML is neural networks and SVMs. The point of ML is to engineer automatic methods that take lots of data (like the pixels in a picture or a sound pattern) and convert it to a label (what the picture is) or a text sequence.

The cultural overlap is basic data munging, data visualization, and logistic regression.

I think the primary social difference (which leads to a few technical differences) is the following. Stats is much older and has tried to solve a few problems very, very well. Statisticians try to take as little data as possible (because they were historically constrained computationally) and determine knowledge from it. A lot of statistical consulting is judging the study design, determining what can be known with what probability, and what assumptions (like prior distributions) restrict what can be known with what reliability. ML is much newer and expects lots of computational power. It often overlooks lessons learned by stats.

But then stats is a bit held back by its insistence on blind rigor. ML is creating techniques that are very successful without worrying about the foundations, about what a p-value is a probability of, or whether it is a probability at all.

What they actually do

Statistics is the science of the analysis of data: mean and standard deviation (descriptives, what the data looks like), distributions (e.g. normal, Chi-squared, Gamma, Poisson), p-values, hypothesis testing, type I/II errors, t-tests and ANOVA, regression and general linear models. Its foundations are probability theory, which is applied measure theory, which is applied analysis (distributions turn out to be mostly special functions). Concerns: significance, p-values, confidence intervals, power analysis, correct interpretation of data and inferences. There are principles.
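As a tiny, hedged illustration of that everyday statistics, a two-sample t-test with scipy (assumed available), on made-up measurements:

    from scipy import stats

    group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
    group_b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print("t =", t_stat, "p =", p_value)   # a small p-value suggests the means differ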

Machine Learning is almost entirely methods for solving prediction problems. Instead of a human looking through a set of data and eye-balling what the pattern is, let the algorithm look at way more instances than is humanly possible to get the pattern. Most of the methods are ad hoc: neural networks, naive Bayes, SVM, decision trees, random forests. There are no principles. Sorry, there is not the depth of principles that statistics has, except when it borrows those principles.

Misnomers

Both labels are misnomers. Statistics sure is used to study states and governments, but is overwhelmingly the province of (a very weird subset of) mathematics.

Machine Learning does include some learning techniques (in the Active Learning area where real time data feeds supply and modify the model), but is primarily a relabeling of Pattern Recognition (which is a more accurate name, somewhat closer to the prediction methods of complex models, the pattern in general being a very specific kind of model).


View from the outside

From the outside, statisticians are consultants for the research community: agronomy, econometrics, medicine, psychology, any academic or applied science that takes a lot of data. (Interestingly, it is the softer sciences like psychology and sociology that send their grad students to the statistics departments for instruction; the physicists and chemists, even though they may individually do a regression or two, don't usually depend on a statistician. Maybe they think they know enough to do it themselves?) Either way, ML people make more money, and I don't know why.


In industry (applied)

Statisticians are employed for quality control; this is their primary act as working statisticians: taking samples of products, calculating error rates. ML people are more directly part of creating machines that do things in a fancy way, building things that work, like an assembly-line robot for cars or a zip-code reader for handwritten mail.


In academia

Statistics is concentrated in an academic statistics department (often attached to a mathematics department or ag school) or in a group of consultants for agronomists or medical research.

ML is concentrated in the AI section of a CS department or sprinkled throughout the engineering departments (robotics in MechE, everything in EE). Or, in real life, in lots of industries: speech recognition, text analytics, vision.

Of course, there are some individuals who probably consider themselves in both camps (Breiman, Tibshirani, Hastie; what about Vapnik?).

Controversies

So far this has mostly been a controversy about what the differences are, because of the tension between assuming stats and ML are the same and showing where the cultures make them different. This part, instead, is about the controversies within each.

In statistics, there has been a great internal controversy between frequentism and Bayesianism. Frequentism is, for lack of a better way of saying it, the traditional p-value analysis. Bayesianism avoids some of that with the added, controversial notion of allowing an assumed prior distribution set by the experimenter.

Less controversial though is the tension between descriptive statistics (or data exploration) and hypothesis testing.

ML is mostly Bayesian by default, since rarely are assumptions made about the distribution (or any investigation done at all into the effects of the distribution), and MCMC (Markov chain Monte Carlo) methods are common. The biggest controversy is between rule-based learning and stochastic learning. The success of neural networks in the mid 80s (and the success of 'Google' methods in the 2000s) has largely killed rule learning, except maybe for decision tree learning and association rules.


Notation

Usually stats is the old fogie and ML the uncultured upstart, but with mathematical notation it's the other way around. ML, coming out of engineering, uses more traditional mathematical notation. Though nominally more closely connected to mathematical practice, statistics uses a bizarre overloading of notation that no one else in math uses, for probabilities, distributions, vectors, and matrices. Every element has multiple meanings, and context barely tells you which reading is right.


Random notes

  • ML is almost entirely about prediction; in stats there's quite a bit else besides.
  • ML is almost entirely Bayesian (implicitly). Explicit Bayesianism comes out of stats. Frequentism, traditional statistics, is what most applied statistics uses.
  • Stats is split into descriptive and inferential, meaning either simplify the entirety of some data into a few representative numbers, or judge whether some statement is true. The descriptive side creates patterns/hypotheses, and then the inferential side judges how good those patterns/hypotheses are.
  • Predictions vs. comparisons: ML is almost entirely predictive. Stats spends a lot of time on comparisons (is one set different from another, is the mean (central tendency) of one set significantly different from that of another?).
  • Leo Breiman also explained a distinction between algorithmic and data modeling which I think maps mostly to ML and stats respectively

How they're the same

I consider ML to be an intellectual subset of stats: taking a lot of data and getting a rule out of it, no matter what the application. Whatever things get labeled ML, they really should have a statistical analysis (to be good), and statisticians should be willing to call these methods statistical. So what if they're in different departments.

Friday, September 18, 2015

Confidence in association rules is identical to conditional probability

There's something that has bothered me for a while. In presentations of association rule learning (an unsupervised learning / data mining method), the basic principles are:

  • the store - the set of all possible items, e.g. {milk, bread, eggs, beer, diapers}; the number of items is d
  • transactions - a list of subsets of the possible items (a transaction = one market basket), e.g. {milk, bread, eggs, beer}; each could be represented by a 0-1 vector of length d. The number of distinct transactions is n <= 2^d.
  • itemsets - a subset of items in a transaction, e.g. {milk, bread, eggs} or {bread, beer}; a k-itemset has k items.
  • support - the support count is the frequency of occurrence of an itemset, e.g. \sigma({bread}) = 2; the support is the proportion of transactions containing the itemset, s({bread}) = \sigma({bread})/n = 1
  • frequent itemset - an itemset with support s >= a given threshold
  • association rule - X -> Y, with X, Y itemsets. The intention is that X implies Y, or: if X appears in a transaction, Y is likely to appear also.
  • support of a rule - s(X -> Y) = \sigma(X \cup Y)/n, the fraction of transactions including both X and Y
  • confidence - c(X -> Y) = \sigma(X \cup Y)/\sigma(X), how often Y appears in transactions that contain X

And the various algorithms (brute force, Apriori, Eclat, FP-growth) work on the list of transactions to discover association rules with high confidence. Confidence is the primary quantity to be optimized.

So what is the difficulty? That last definition of confidence. All that buildup, with all that new vocabulary, all so straightforward and sensible, but all so new. There's something about... confidence... that seems so familiar, but the notation... of implies and support... it's just...

Of course this has been done elsewhere already.

Confidence is simply the conditional probability of Y given X. That's it. In notation:

Pr(Y | X) = Pr( Y and X) / Pr( X )

which is the probability of Y occurring when restricted to cases where X is already known to have occurred (not temporally). What might be misleading here is 'and' versus 'union'. In the confidence formula we count transactions containing the union of the itemsets, while in Pr we want the probability of the conjunction of the events. There is just a little step of translating between subsets and events: a transaction contains the itemset X \cup Y exactly when the events 'contains X' and 'contains Y' both occur, so the union of itemsets corresponds to the conjunction (the 'and') of events. Containment runs in dual directions: the larger the itemset, the more restrictive the corresponding event.
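A small numerical check of the claim, on a made-up list of transactions: compute the confidence from support counts and the conditional probability directly, and they coincide.

    # toy transactions, purely for illustration
    transactions = [
        {"milk", "bread", "eggs"},
        {"bread", "beer"},
        {"milk", "bread", "beer", "diapers"},
        {"bread", "eggs"},
        {"milk", "diapers"},
    ]
    n = len(transactions)

    def support_count(itemset):
        # sigma(itemset): number of transactions containing the whole itemset
        return sum(1 for t in transactions if itemset <= t)

    X, Y = {"bread"}, {"beer"}
    confidence = support_count(X | Y) / support_count(X)                 # c(X -> Y)
    pr_y_given_x = (support_count(X | Y) / n) / (support_count(X) / n)   # Pr(Y | X)
    print(confidence, pr_y_given_x)   # identical: both 0.5 here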

Just a little rejiggering of notation and a whole set of concepts opens up to help think about the space of association rules.
(from Pier Luca Lanzi, DMTM 2015 - 05 Association Rules)

Thursday, September 10, 2015

What statisticians and ML'ers really think of each other

Labels aren't the thing, they just name the thing, and the same thing can have different names, and many different things have the same name. But people often take the label to be the thing.

'Statistics' and 'Machine Learning' are labels for two different things that have some overlap, not identical but cover a lot of the same things.

Statistics is concerned with averages and deviations, probability distributions, design of experiments, and regression, trying to extract knowledge out of tables of numerical data. The usual single sentence summaries are hardly distinguishable from many other things with data in their title, like databases or IT (Information Technology).

Machine Learning is a subset of Artificial Intelligence (itself considered a subset of Computer Science but practiced and motivated by other engineering departments and psychology related fields including linguistics, philosophy and neuroscience). It tries to extract patterns out of numerical data too, but has a different provenance. The two overlap some but each have their own separate culture and methods.

And more to the point, they’re really trying to do mostly the same things and the math for them both is often identical.

But what do they really think of each other?

From the point of view of the statisticians (people who call themselves by that label or are employed by institutions with that label), ML is a handful of ad hoc 'predictive analytics' done by a bunch of computer scientists, engineers, or amateurs (or worse!) pulling it out of their ass; their methods are immature (they don't know anything!) and don't take into account the decades of principles established by the more mature statisticians for the quality of results. That is, ML may do new, interesting things, but they usually aren't that new, and the ML people have never thought of all the methodological pitfalls that have been managed so well already by statistical principles (think of the data!). The statisticians may begrudgingly acknowledge that some of the ML methods are externally successful, but really, with such complicated models how do you know if it is any good outside of your toy domain when you haven't done a proper analysis of your distributional assumptions? You ML people don't actually know anything!

People who say that they do ML probably do not give themselves the label statistician or work in a statistics group, but rather ‘are’ a computer scientist or engineer. Their point of view is that statisticians are studying pointless details about ancient brittle methods that aren’t particularly interesting, don’t really apply to all the new data sources, and just aren’t as good as this shiny new toy. Also, Bayes says p-values are dumb! The ML people may begrudgingly acknowledge that some of the statistical methods produce quality results, but really who cares about the normal curve and what about Bayes? You statisticians are so old and ossified!

From my point of view, it would be better for everybody if ML were considered a subset of statistics (but successfully studied in other departments) and ML methods could use a lot of analysis by statisticians. And a job that is labeled as data scientist should be easily fillable by a statistician or an ML person. Both sides need more exposure to the methods of the other.

See also Statistics and Machine Learning, Fight! (it's funding and conference culture), Statistical Modeling: The Two Cultures by Breiman (data vs. algorithmic modeling), and The Two Cultures: Statistics vs. Machine Learning for more opinions on the difference.

Thursday, September 3, 2015

Spoilers in Clustering Methods

In voting schemes, when there are more than two candidates, there is the possibility of a 'spoiler'. That is, if a third candidate is introduced, votes might be taken away only from the formerly winning candidate, 'spoiling' that candidate's chance of victory and letting a candidate not preferred by the majority win, because the majority is split between two similar candidates.

Something similar can happen in a clustering algorithm: with the number of clusters set to three, the algorithm may split what is really one group into two smaller clusters which, if only two clusters had been asked for, would together be bigger than the third.


(from Pier Luca Lanzi)

In the example figure, the parameter to the k-means clustering is 3. The upper-right set is split into two separate clusters. But if the parameter were 2, those two clusters might combine to make a cluster larger than the lower-left one.
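A hedged sketch of that effect with scikit-learn's KMeans (assumed available), on synthetic blobs arranged like the figure: two groups close together in one corner, one far away.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # two nearby blobs in the upper right, one far away in the lower left
    centers = [(8, 8), (10, 9), (0, 0)]
    X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

    for k in (3, 2):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        sizes = np.bincount(labels)
        print(f"k={k}: cluster sizes {sorted(sizes, reverse=True)}")
    # With k=3 the upper-right group is split into two "candidates"; with k=2 they
    # merge into a single cluster larger than the lower-left one.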

Of course, that doesn't mean the lower-left cluster would be preserved exactly; some elements may move back and forth. This shows that clustering can have anomalies like voting schemes do, even though clustering doesn't account for all possible orderings (permutations) of 'candidates' and the correspondence of cluster with candidate is not perfect.