The classic statistical procedure takes the sample, a small subset of past data (call it the data, or for later purposes the training set), does some rocket science on that set (say, linear regression), and produces the model: some coefficients, a small machine that says yes or no or outputs a guess for a single new data point. Maybe it also produces some extra measures that say how good or bad the fit is expected to be (correlation coefficient, F-test). And we're done. Countless papers and studies over the years have followed this pattern.
But... what is a hold-out set? The modern way (not that modern) is to split the sample randomly into two parts: the training set (on which to do the classic part) and the test set, or hold-out set, on which to check. Run the model on all of the items in the test set and see how bad the fit is. The test set is kept distinct from the training set because we want to validate on unseen data; we don't want to assume the very thing we're trying to prove.
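Here is a minimal sketch of that split-and-check step, assuming scikit-learn is available; the synthetic data, the 80/20 proportion, and the choice of linear regression are placeholders, not a prescription.

```python
# A minimal sketch of the split-and-check idea using scikit-learn.
# The dataset (X, y) and the linear model are stand-ins for your own.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # fake sample: 200 points, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Hold out 20% of the sample; the model never sees it during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)               # the "classic" step
print("fit on training set:", model.score(X_train, y_train))   # R^2 on seen data
print("fit on hold-out set:", model.score(X_test, y_test))     # R^2 on unseen data
```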
Why do this? It seems like such a waste. Why, in a sense, throw away perfectly good sample data on a test when you could use it to make a more accurate model? Why hold data back for a test when you could use it to train? More data is better, right?
Well, you're not really throwing it away, but testing does seem like a secondary, minor desire. After all, don't most statistical procedures compute some sort of quality measure on the entire set first? This desire not to 'waste' hard-won sample data is very understandable; most of the labor in an experiment is not the statistics but gathering the actual data.
Of course one could weakly justify this test set by saying it gives more reliable quality statistics.
The real reason for a hold-out set is to combat overfitting. There are two sides to modeling: real-life data is not perfect, and the model is trying to get close to the rule behind the data, but it may go too far and get close to the data itself instead of the rule. The classic step gets us the first part; the modern step keeps us from going too far. A hint to the purpose is another name for the hold-out set, the validation set, which gives a better idea of what it's for. You create a model with the training set and validate it with the validation set: you're making sure the model does well what you claim it does well. The first step in predictive modeling is to not underfit, to get close to the reality that the data hopefully represents. The test or validation step is to make sure you don't overfit, getting too close to the data at the expense of reality.
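To make that concrete, here's a small illustrative sketch (synthetic data and numpy's polynomial fitting, nothing from above): a high-degree polynomial hugs the training points, but typically does worse on the held-out points, which is exactly what the validation set is there to catch.

```python
# Illustrative only: fit a noisy quadratic with a low- and a high-degree polynomial.
# The high-degree fit chases the training points (low training error) but tends to
# do worse on the held-out points -- the overfitting the validation set exposes.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = 2 * x**2 - x + rng.normal(scale=0.2, size=60)

x_train, y_train = x[:40], y[:40]   # training set
x_val, y_val = x[40:], y[40:]       # validation (hold-out) set

for degree in (2, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, validation MSE {val_err:.3f}")
```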
So I've weakly justified the desire for some kind of hold-out/test set. But how does one actually choose this set? Obviously a random subset, but of what size? The primary issue is a trade-off: the smaller the training set, the more variance in the model; the smaller the test set, the more variance in the quality statistics. There's no hard and fast rule (80/20 is considered reasonable). There are a number of strategies to deal with this.
- Number, not proportion: just make sure you have enough data points in each set; beyond that, the exact proportion doesn't matter as much.
- Resample: do the split-and-test a few times on random subsamples. This is the idea behind very general procedures like the bootstrap and jackknife (there's a sketch after this list).
- Partition the data and validate each piece as a test set against the rest: cross-validation. The idea is to split the entire dataset into several pieces and do the train/test step on each piece versus the rest, so all the data is used as part of a training set and as part of a test set at some point. There are many strategies here. In leave-one-out (LOOCV), all but one item is the training set and the single remaining item is the test set, and you do this for every single item in your dataset. Under some models (like general linear regression models) you don't have to repeat the process n times because the math cancels out a lot (linearity is great!). Another method is k-fold CV, where you split your data into k pieces (in practice often 5 or 10), create a model on the n - n/k items outside a piece, validate on that piece, and do that for each of the k pieces. It takes more time (k times as many fits). LOOCV is essentially n-fold CV, so it is not efficient time-wise when model creation takes a while (as for an SVM).
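Here's a sketch of the resampling and cross-validation strategies above, assuming scikit-learn's model-selection helpers; again the synthetic data and the linear model are stand-ins for whatever you're actually fitting.

```python
# A sketch of the three strategies above with scikit-learn's splitters.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut, ShuffleSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression()

# Resample: 20 independent random 80/20 splits, then average the scores.
shuffle = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
print("shuffle-split R^2:", cross_val_score(model, X, y, cv=shuffle).mean())

# 5-fold CV: each point is in exactly one test fold and four training folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print("5-fold R^2:       ", cross_val_score(model, X, y, cv=kfold).mean())

# LOOCV = n-fold CV: n model fits, so it can be slow for expensive models.
# (R^2 is undefined on a single-point test set, so use an error metric instead.)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
print("LOOCV MSE:        ", -loo_scores.mean())
```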
A lot of this ignores the issue of what to do if your validation set shows bad performance. What is the statistically 'right' thing to do then? Do you knowingly rejigger things? How adaptive can you be while still avoiding p-hacking? I'll save that for later.