Wednesday, December 7, 2016

Testing Software and Machine Learning

Testing is a major part of any non-trivial software development project. All parts of a system require testing to verify that the engineered artifact does what it claims to do. A function has inputs and expected outputs. A user interface has expectations of both operation and ease of interaction. A large system has expectations of inter-module operability. A network has expectations of latency and availability.

Machine learning produces modules that are functional, but with the twist that the system is statistical. ML models are functional in that there is a one-to-one correspondence between inputs and outputs, but with the expectation that not every such output is... desired.

Let's start, though, with a plain old functional module. To test that it is correct, you could check the output against every possible input. That's usually not feasible, and anyway it belongs to the more esoteric domain of proving program correctness. What is preferred is instance checking: checking that a particular finite set of inputs gives exactly the corresponding correct outputs. This is formally called 'unit testing'. It usually involves simply a list of inputs and the corresponding expected outputs. The quality of such a set of unit tests relies on coverage of the 'space': edge cases (extreme values), corner cases (more than one variable at an extreme value), important instances, generic instances, random instances, etc. Also, any time a bug is found and corrected, the faulty instance can be added to the list to ensure it doesn't happen again.
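To make that concrete, here is a minimal sketch of instance checking in Python; the clamp function and its cases are purely hypothetical stand-ins for whatever function is actually under test.

    # A hypothetical function under test: clamp a value into the range [lo, hi].
    def clamp(x, lo, hi):
        return max(lo, min(hi, x))

    # Instance checking: a finite list of (inputs, expected output) pairs,
    # mixing generic cases, edge cases (extreme values), and a corner case.
    def test_clamp():
        cases = [
            ((5, 0, 10), 5),      # generic instance
            ((-1, 0, 10), 0),     # edge case: below the range
            ((11, 0, 10), 10),    # edge case: above the range
            ((0, 0, 0), 0),       # corner case: every argument at an extreme
        ]
        for args, expected in cases:
            assert clamp(*args) == expected

    test_clamp()

Every case must pass, every time; a single failure means the function is wrong.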

An ML function is, well, by its name, also functional, but the expectations are slightly different. An ML model (the more technical term for an ML function) can be wrong sometimes. You create a model using directly corresponding inputs and outputs, but then when you test it on some new items, most items will have correct outputs, but some may be incorrect. A good ML model will have very few incorrect, but there is no expectation that it will be perfect. So when testing, a model's quality isn't a yes or no, that absolutely every unit test has passed, but rather that a number of unit tests beyond a threshold have passed. So in QA, if one instance doesn't pass, that's OK here. It's not great, but it is not a deal breaker. If all tests pass, that certainly is great (or it might be too good to be true!). But if most tests pass, for some definition of 'most', then the method is usable.
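As a sketch of what that threshold-based check might look like, the following assumes a trained binary classifier with a scikit-learn-style predict method and a held-out labelled set; the 90% threshold is an arbitrary stand-in for whatever 'most' means for the project.

    import numpy as np

    def test_model_accuracy(model, X_test, y_test, threshold=0.90):
        """Pass if the model gets at least `threshold` of the held-out
        instances right -- not necessarily all of them."""
        predictions = model.predict(X_test)
        accuracy = np.mean(predictions == y_test)
        # Unlike a plain unit test, a few wrong answers are acceptable;
        # the test only fails if quality drops below the agreed threshold.
        assert accuracy >= threshold, f"accuracy {accuracy:.3f} below {threshold}"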

There are three major parts to creating a machine learning model that need to be tested (where something can go wrong and where things can be changed): the method or model itself, the individual features supplied to the model, and the selection of data. The method or model itself is the domain of the ML engineer, analogous to regular coding.

I can almost go so far as to say that testing is already integrated, to a large part, within ML methods. Testing uses systematic data to check the accuracy of code; ML methods use systematic data in the creation of a model (which is executable code). And so if an independent team, QA or testing, is to be involved, they need to be aware of the statistical methods used, how they work, and all the test-like parts of the model.

Let's take logistic regression as an example. The method itself fits a threshold function (the logistic function) to a set of points (many input features, one continuous output between 0 and 1). Immediately from the regression fitting procedure you get coefficients, closeness of fit, AUC, and other measures of goodness. There are some ways to improve the regression results without changing the input data, namely regularization (constraints on the model) and cross-validation. For the features (mostly independent of the method), there are the number of features, how correlated they are, and how predictive each feature is individually; each feature could be analyzed for quality itself. And last, for the selection of data (also independent of the method), there's selection bias, and there's the separation into training, validation, and test sets.
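Here is a sketch of those pieces using scikit-learn (the data is synthetic and the settings illustrative): the method with L2 regularization and cross-validated AUC, the features inspected via their fitted coefficients, and the data split into training and test sets.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score

    # Illustrative synthetic data standing in for real labelled instances.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Selection of data: hold out a test set (a validation set could be split
    # off the same way) to get an honest estimate of quality.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # The method: logistic regression with L2 regularization
    # (C controls the strength of the constraint on the model).
    model = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)

    # Cross-validation on the training data returns a quality metric
    # (here AUC) before the test set is ever touched.
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print("cross-validated AUC:", cv_auc.mean())

    # Fit and inspect the features: coefficients show how much each contributes.
    model.fit(X_train, y_train)
    for i, coef in enumerate(model.coef_[0]):
        print(f"feature {i}: coefficient {coef:+.3f}")

    # Final held-out check.
    print("test accuracy:", model.score(X_test, y_test))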


Where QA can be involved directly or indirectly:

- ensuring metric thresholds are met (analogous to overseeing unit-test coverage of code)
- questioning stats methods (like being involved in architecture design)
- cross-validation - both makes the model better (less overfitting) and returns a quality metric
- calibration of stats methods - quality of prediction
- test and training data selection - to help mitigate selection bias
- testing instances - for ensuring output for specific instances (must-have unit tests; see the sketch after this list)
- feedback of error instances - helps improve model
- quality of test/training data - ensuring few missing values/typos/inappropriate outliers
- UX interaction of humans with an inexact system - final say - does the model work in the real world via the application? Has interaction with people shown any hidden variables, any unmeasured items, any immeasurables, any gaming of the system by users?
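As a minimal sketch of how a QA gate could combine two of the items above, a metric threshold and must-have test instances: the model interface is assumed to be scikit-learn-style predict, and the cases and threshold are hypothetical.

    import numpy as np

    # Hypothetical must-have instances QA insists the model never gets wrong,
    # e.g. high-profile cases or previously reported errors fed back in.
    MUST_HAVE_CASES = [
        (np.array([1.0, 0.0, 0.0]), 1),
        (np.array([0.0, 1.0, 0.0]), 0),
    ]

    def qa_gate(model, X_test, y_test, min_accuracy=0.90):
        """Return True only if the overall metric threshold is met AND every
        must-have instance is predicted correctly."""
        accuracy = np.mean(model.predict(X_test) == y_test)
        if accuracy < min_accuracy:
            return False
        for x, expected in MUST_HAVE_CASES:
            if model.predict(x.reshape(1, -1))[0] != expected:
                return False
        return True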

The last item seems to be the most attackable by a dedicated QA group that is functionally separate from an ML group, and all the previous ones seem to sit quite on the other side, solely the domain of ML to the exclusion of a QA group. But hopefully the discussion above shows that they're all the domain of both. There should be a lot of overlap in what the ML implementers are expected to do and what QA is expected to do. Sure, you don't want to relieve engineers of their moral duty to uphold quality. The fact that QA may be looking over quality issues doesn't mean the data scientist shouldn't care. Just as software engineers doing regular code should be including unit tests as part of the build, the data scientist should be checking metrics as a matter of course.
