We want both. We want something to be as practical and as nice-looking as possible. We want the shiny doorknob to look elegant, but we also want it to turn smoothly without having to jiggle it, and not to need cleaning all the time. Sometimes one has to be given up for the other: an industrial assembly line may be a little grimy all the time, but it gets the job done. Of course, cleanliness may itself be a desired property of the object, and practicality and esthetics often share a common cause.
Sometimes we want something that is esthetically pleasing and superficially practical but not necessarily perfectly practical, like a watch with no numbers on its face. There, the esthetics are the desired intention.
That's all a philosophical diatribe to justify complaining about something that bugs the crap out of me.
Basically wordles are the worst. And periodic tables (except for The Periodic Table which is the best). And usually subway diagrams (except for actual city subways). Here's one for semantic web technologies:
This! This is the worstest.
If you know what each of the entities is, and you make all sorts of qualifications, maybe this makes a little sense. It makes only the slightest bit more sense if 'A above B' means A is 'built on' B. But then there are all sorts of 'If you did that, then why did you do that?' questions (why are encryption and signature off to the side for only some layers, why are logic and proof separate, is Unicode really such a hugely important base technology, etc. etc.). Wait, isn't a namespace a particular kind of URI? There are many variations on the 'Semantic Web Stack', but each in its own way has all these "I don't get why they did that" problems. This is all about esthetics (nice color combo!) and has little to do with imparting coherent information. No, you will not learn anything from this. Wait... what the hell is 'signature'?
Friday, December 16, 2016
Trust vs Depend
"I trust that guy as far as I can throw him"
"I can't trust him to complete the project on time"
'Trust' is used in two different ways. One is the usual opposite of falsehood. If you can't trust them, they are a liar. This presumes intent and is almost demonizing.
The other way is dependability. Trust of the outcome. If you can't trust someone this way, it's not a reflection of their evil intent but about ability to execute. This is very different from falsehood. You can actively do something about this.
With the first way, all you can do is seek other sources of information.
So instead of 'trust', use 'depend on'. 'Trust' makes it sound like you think they're lying. 'Depend' just means there is doubt, without judging.
"I can't trust him to complete the project on time"
'Trust' is used in two different ways. One is the usual opposite of falsehood. If you can't trust them, they are a liar. This presumes intent and is almost demonizing.
The other way is dependability. Trust of the outcome. If you can't trust someone this way, it's not a reflection of their evil intent but about ability to execute. This is very different from falsehood. You can actively do something about this.
The other way you can only seek other sources of information.
So instead of 'trust' 'use 'depend on'. 'Trust' makes it sound like you think they're lying. 'Depend' just means there is doubt without judging.
Wednesday, December 7, 2016
Testing Software and Machine Learning
Testing is a major part of any non-trivial software development project. All parts of a system require testing to verify that the engineered artifact does what it claims to do. A function has inputs and expected outputs. A user interface has expectations of both operation and ease of interaction. A large system has expectations of inter-module operability. A network has expectations of latency and availability.
Machine learning produces modules that are functional, but with the twist that the system is statistical. ML models are functional in that there is a one-to-one correspondence between inputs and outputs, but with the expectation that not every such output is ... desired.
Let's start, though, with a plain old functional model. To test that it is correct, you could check the output against every possible input. That's usually not feasible, and anyway it lies in the more esoteric domain of proving program correctness. What is preferred is instance checking: checking that a particular finite set of inputs gives exactly the corresponding correct outputs. This is usually called 'unit testing'. It usually involves simply a list of inputs and the corresponding expected outputs. The quality of such a set of unit tests relies on coverage of the 'space': edge cases (one variable at an extreme value), corner cases (more than one variable at an extreme value), important instances, generic instances, random instances, etc. Also, any time a bug is found and corrected, the faulty instance can be added to the list to ensure it doesn't happen again.
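As a minimal sketch of what such a list looks like (the function and the particular cases here are made up purely for illustration):

```python
# A minimal sketch of instance checking for a plain functional unit.
# The function and the test cases are hypothetical, chosen only to
# illustrate edge cases, corner cases, and generic instances.

def clamp(value, low, high):
    """Return value limited to the range [low, high]."""
    return max(low, min(high, value))

# Each pair is (inputs, expected output).
cases = [
    ((5, 0, 10), 5),    # generic instance
    ((-3, 0, 10), 0),   # edge case: below the lower bound
    ((42, 0, 10), 10),  # edge case: above the upper bound
    ((10, 0, 10), 10),  # edge case: exactly on the boundary
    ((0, 0, 0), 0),     # corner case: all values at an extreme together
]

for args, expected in cases:
    assert clamp(*args) == expected, f"clamp{args} != {expected}"
print("all instance checks passed")
```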
An ML function is, as its name suggests, also functional, but the expectations are slightly different. An ML model (the more technical term for an ML function) can be wrong sometimes. You create a model using directly corresponding inputs and outputs, but then when you test on some new items, most will have correct outputs and some may be incorrect. A good ML model will have very few incorrect, but there is no expectation that it will be perfect. So when testing, a model's quality isn't a yes or no, that absolutely every unit test has passed, but rather that a number of unit tests beyond a threshold have passed. So in QA, if one instance doesn't pass, here that's OK. It's not great, but it is not a deal breaker. If all tests pass, that certainly is great (or it might be too good to be true!). But if most tests pass, for some agreed sense of 'most', then the model is usable.
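The acceptance criterion might be sketched like this (the model interface, the labelled cases, and the 95% threshold are all assumptions for illustration, not a prescription):

```python
# A sketch of threshold-based acceptance for an ML model.
# `model`, the labelled cases, and the 0.95 threshold are assumptions;
# any classifier exposing a predict() method would do here.

def evaluate(model, labelled_cases, threshold=0.95):
    """Return (passed, accuracy) for a list of (features, expected_label) pairs."""
    correct = sum(
        1 for features, expected in labelled_cases
        if model.predict(features) == expected
    )
    accuracy = correct / len(labelled_cases)
    # Unlike a classic unit test, one wrong answer is not a deal breaker;
    # the suite passes if accuracy clears the agreed threshold.
    return accuracy >= threshold, accuracy
```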
There are three major parts to creating a machine learning model that need to be tested (where something can go wrong and where things can be changed): the method or model itself, the individual features supplied to the model, and the selection of data. The method or model itself is the domain of the ML engineer, analogous to regular coding.
I can almost go so far as to say that testing is already integrated, to a large part, within ML methods. Testing uses systematic data to check the accuracy of code; ML methods use systematic data in the creation of a model (which is executable code). And so if an independent team, QA or testing, is to be involved, they need to be aware of the statistical methods used, how they work, and all the test-like parts of the model.
Let's take logistic regression as an example. The method itself fits a threshold function (the logistic function) to a set of points (many binary features, one continuous output between 0 and 1). Immediately from the fitting procedure you get coefficients, closeness of fit, AUC, and other measures of goodness. There are some ways to improve the regression results without changing the input data, namely regularization (constraints on the model) and cross-validation. For the features (mostly independent of the method), there are the number of features, how correlated they are, and how predictive each feature is individually; each feature could also be analyzed for quality itself. And last, for the selection of data (also independent of the method), there's selection bias and there's the separation into training, validation, and test sets.
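As a sketch of how those pieces show up in practice (using scikit-learn on synthetic data; the parameter choices are placeholders, not recommendations):

```python
# A sketch of the logistic-regression pieces mentioned above:
# regularization, a train/test split, cross-validation, and AUC.
# The synthetic data and parameter values are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# C is the inverse regularization strength: smaller C = stronger constraint.
model = LogisticRegression(C=1.0, max_iter=1000)

# Cross-validation: a quality estimate that also guards against overfitting.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("cross-validated AUC:", cv_auc.mean())

# Fit on the training set, then measure AUC on the held-out test set.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("held-out AUC:", test_auc)
```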
Where QA can be involved directly or indirectly:
- ensuring metric thresholds are met (analogous to overseeing unit-test coverage of code)
- questioning stats methods (like being involved in architecture design)
- cross-validation - both makes the model better (less overfitting) and returns a quality metric
- calibration of stats methods - quality of prediction
- test and training data selection - to help mitigate selection bias
- testing instances - for ensuring output for specific instances (must-have unit tests)
- feedback of error instances - helps improve model
- quality of test/training data - ensuring few missing values/typos/inappropriate outliers (a sketch of such a check follows this list)
- UX interaction of humans with an inexact system - the final say - does the model work in the real world via the application? Has interaction with people shown any hidden variables, any unmeasured items, any immeasurables, any gaming of the system by users?
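As one concrete example, the data-quality item above might be checked with something like the following sketch (using pandas; the missing-value tolerance and the crude outlier rule are assumptions for illustration):

```python
# A sketch of a training-data quality check, as mentioned in the list above.
# The missing-value tolerance and the 5-sigma outlier rule are assumptions.
import pandas as pd

def check_training_data(df: pd.DataFrame, max_missing_frac=0.01):
    """Return a list of human-readable data-quality problems."""
    problems = []

    # Few missing values per column.
    missing = df.isna().mean()
    for col, frac in missing.items():
        if frac > max_missing_frac:
            problems.append(f"{col}: {frac:.1%} missing values")

    # Crude outlier flag: values more than 5 standard deviations from the mean.
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()).abs() / numeric.std()
    for col in numeric.columns:
        n_outliers = int((z[col] > 5).sum())
        if n_outliers:
            problems.append(f"{col}: {n_outliers} extreme outliers")

    return problems
```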
The last item seems the most approachable by a dedicated QA group that is functionally separate from an ML group, and all the previous ones might seem to sit on the other side, the domain of ML to the exclusion of a QA group. But hopefully the discussion above shows that they're all the domain of both. There should be a lot of overlap between what the ML implementers are expected to do and what QA is expected to do. Sure, you don't want to relieve engineers of their moral duty to uphold quality; the fact that QA may be looking over quality issues doesn't mean the data scientist shouldn't care. Just as software engineers writing regular code should be running unit tests as part of the build step, the data scientist should be checking metrics as a matter of course.