Sunday, June 21, 2015

"You don't see people casually become neurosurgeons in their spare time"

From James Hague, Organizational Skills Beat Algorithmic Wizardry via John Cook, The most important skill in software

This is the reason there are so many accidental programmers. You don't see people casually become neurosurgeons in their spare time--the necessary training is specific and intense--but lots of people pick up enough coding skills to build things on their own.

Yeah, why is it that so many professional programmers are self-taught? A brain surgeon is at the top of the charts, but even a family practice doctor, the front line of medicine, still has the education and training of a rocket scientist.

So that's a bit tantalizing, that 'This is the reason...'. But Hague doesn't really give a reason. OK, he sorta does. Earlier he says:

To a great extent the act of coding is one of organization [not algorithmic wizardry]. Refactoring. Simplifying. Figuring out how to remove extraneous manipulations here and there.
Except programmers don't get hired by showing how well they can reorganize things; they get hired by how they write an algorithm (classic interview questions). Sure, I bet there are some refactoring questions in some places (and I think those are just as useful a measure as the algorithm questions), but what gets you hired is your knowledge of the syntax of a programming language and maybe its libraries. But that's interviews and hiring.

It seems that Hague is saying that 'refactoring well' is the key to successful programming, not algorithmic wizardry. Historically, programming was algorithmic wizardry (a la CLRS) because that's what computers were good for: extremely difficult mathematical or combinatorial computations. What is the shortest path that visits all nodes in this special graph? (greedy just doesn't work) How do you solve for a vector in a matrix equation using Gaussian elimination without using extra space? (space was at a premium) Those tasks needed wizardry, the kind of thought that goes into solving a Rubik's Cube: a good memory for a few examples, seeing patterns in them, seeing patterns in the patterns, always with an eye for optimization.


How do you get that one yellow-orange edge piece in place without messing everything else up? Figuring that out takes a lot of 3D imagination, a lot of memory, and a lot of trial and error, but eventually you come up with a set pattern to follow like "U' L' U L U F U' F'" (ha ha, that's technical, and not doable without practice, but that's what algorithms are). But once you have the pattern, it's straightforward to compose it with other patterns (which you have to figure out separately; they may be totally different from the pattern above or maybe somewhat similar).

But nowadays it seems lots of those difficult algorithms have been solved. They're in a library. You don't have to reimplement red-black trees to create a database index; you use the library. Programming today is more like 'Rush Hour', where you're moving a small set of pieces around, some blocking others, and maybe the first are blocking the latter, so you have to plan ahead.
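For instance, here's a minimal sketch in Python using the standard-library sqlite3 module (the 'users' table and its columns are made up for illustration): creating an index is one call against the library, not an exercise in balanced-tree bookkeeping.

```python
import sqlite3

# In-memory database; the table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",)])

# The library maintains the underlying search tree; you never see it.
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Lookups use the index automatically.
print(conn.execute("SELECT id FROM users WHERE email = ?",
                   ("a@example.com",)).fetchone())
```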


You're rearranging a bunch of pieces to make the whole thing come out. For a webapp, you need some forms on the front end, communicating via HTTP calls to a backend with a database, and your webapp needs to run on a smartphone and a tablet, too. All straightforward large pieces, but to make it easier for yourself to modify things later, you want to separate the pieces well.
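To make that concrete, here's a minimal sketch of the backend piece, assuming Python with Flask (the route and field names are invented for illustration): one JSON endpoint that any front end, phone or tablet, can call over HTTP, keeping the storage piece behind a clean boundary.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# A hypothetical backend piece: the front-end form POSTs here over HTTP.
# Swapping out the UI or the storage later doesn't touch this boundary.
@app.route("/api/contact", methods=["POST"])
def save_contact():
    data = request.get_json()
    # In a real app this would be handed off to the database piece.
    return jsonify({"saved": True, "name": data.get("name")})

if __name__ == "__main__":
    app.run()
```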

What makes a good programmer, and allows so many without academic education in it to perform well, is that this kind of large-piece thinking is usually amenable to anybody with a technical background: engineering, math, science. Most such fields promote detail-oriented, symbolic thinking and memory (all of which is good for grasping the computational nature of programming), but also the kind of planning and organization that most programming work actually calls for.

So the answer to the original question? Not everybody is an algorithms wizard who can figure out a Rubik's Cube; there's lots of special talent needed for that. But most technically oriented people, without specific experience, can program nowadays; they can solve Rush Hour problems. It's not easy (especially for the more difficult layouts), but most technically oriented people can plow through it. You don't need to know linear algebra to do a website, you just need to read a few docs in order to move a handful of pieces around.

Thursday, June 18, 2015

The term 'Big Data' now means 'More Data Than You Thought Of Before'

What is 'Big Data'? For the past few years (let's say roughly 5, before which it was not nearly as popular as after) everybody has known what it means. It's a bunch of data and there's a lot of it. Also machine learning and FitBit. The term has come to mean many things. It was originally coined... well... it's sorta vague. And its current usage (how it is used by people) is not the same as its intended definition.

First, what is it supposed to be now? Wikipedia gives the following:

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate

Conveniently, a source search for the provenance of the term has been done: who first used it in a way similar to what it means now.

And Forbes gave a list of many possible definitions, some more substantive than others. These are all definitions I can agree with. There's data, it's structured in weird new ways, and it comes from new sources that produce gobs and gobs of it. The term 'Big Data' is used to contrast with traditional data, its storage and techniques. Traditional data is not 'Big' and therefore is 'small' and its techniques are 'old'. I'd say any kind of relational database (SQL related) is traditional.

Big data is things like mouse clicks or even mouse movements on a screen, or FitBit's near-continuous polling of heart rate and blood O2 (does it really do that? I don't know!). Or road sensors checking car presence on a road or at a stop light. Every possible transaction, every position.

The thing is that a lot of what people are calling 'Big Data' is really pretty traditional. For example, what about credit card transactions: is that big? They're constantly being checked for out-of-the-ordinary purchases so that they can cut off my service right when I need it at an airport. Or tax returns? There must be millions, and for even the simplest return, hundreds of entries and calculations! Or weather sensors with near-continuous temp/pressure/precip/windspeed? These are old data sources; no one would call them Big Data; they have been around long enough that they -are- traditional. Yet they are all at the forefront of large and complex systems design.

And then there are the false positives, those things that are called big data but don't really fit that definition. For example, electronic health records collect a lot of information on a patient. But... it's about as traditional as traditional gets. Straightforward collection of forms or transcripts or lists of entries.

I am purposefully leaving out things like EKGs or radiological images, not because they deny my point but because they are the exception that proves the rule. They both take up gobs and gobs of space (for radiology, a single CT scan of the body can take up X Gig). But in the best sense, the data 'size' of a study is 1; or rather, it is the size of the radiologist's text report describing the results found in the images ('Insignificant thyroidal calcification. No other findings') that counts, and that hardly counts. The scale is pretty small.

And what about things that are not called Big Data and actually aren't Big Data (by the official definition)? There's the database of clients that one company manages: all the contact information, the account transactions. Classic relational database. There's the auto parts store with its database of suppliers and buyers, all the various parts for sale with their characteristics and which make/model/year they work with, the list of sales and inventory. Classic relational database.

Also, the term 'Big Data' has a particularly ... millennial 'kids-these-days' feel to it. It is a bit inarticulate in that it's not saying exactly what it is, but everyone has a good idea of what they think it should mean. 'Big' is about as meaningful as 'wow'.

Anyway, 'Big Data' ... I'll use it for what everybody else uses it for (I don't know what that is yet!), both big new things, and also smaller traditional things that most people never really thought about before.

Footnote: this is entirely patterned after the term 'supercomputer', which has a similar history. In the late 70's the Cray-1 was a supercomputer, and today (mid 2010's) a smart phone is just a little ... thing, despite the fact that the smart phone computes more flops than the Cray. A supercomputer is pretty much just the best possible computer in existence -right now-.

Tuesday, June 16, 2015

SWOT Analysis of SWOT Analysis

Where is this company going? Who are our competitors? How do we increase market penetration?

SWOT analysis, a diagram listing Strengths, Weaknesses, Opportunities, and Threats, is a great high level management strategy tool in business. It doesn't tell you what to do when there's a problem, it just helps you describe a situation.

It forces you to make explicit both good and bad, and the internal and external features of your situation. As a graphic model, it puts like things in the same place and gives a well-defined, limited structured place to put them. You can use it for the highest level (for an entire company) or for a very low level (one particular product), or even for the most technical of things (should I learn a particular technical skill?). It's not for everything and it doesn't do everything for the things it is for.

Here's a template of a SWOT analysis chart with the kinds of questions that go in each:




(from http://www.conceptdraw.com/How-To-Guide/swot-analysis-matrix-template)

There are two axes: internal vs external, and quality, either good or bad. It's probably bad form to mention the negative of an item from the positive side; that's sort of non-creative and redundant. Threats can be competitors, but it may also be useful to do an entire SWOT on a competitor for true comparison's sake.

There are many examples and explanations of SWOT analysis. Here's a simple example for a bank:


(from http://www.theasianbanker.com/benchmarking/our-tools)

Unfortunately, I couldn't find any examples of real SWOTs (from the real world), so a made-up academic example will have to do.

So now the point of this is to judge SWOT by some judgement method. Since SWOT analysis is exactly one of those methods, let's turn it on itself. I (and other sites) have given a wordy analysis, so I'll just give the SWOT analysis as is without any additional commentary.

Internalities

Strengths
- well-defined, simple structure, limits issues to judge
- simple to understand
- quick to complete, low cost
- makes explicit all the ideas of things to do (externalities) and how well you can expect to do them (internalities)
- doesn't need too much knowledge of the company to create details

Weaknesses
- no ranking of importance of items
- too superficial, oversimplification of issues, shallow
- many details left out, lots of context left out
- not operational, doesn't help with figuring out what to do
- only binary, hard to include important but multifaceted issues
- balance implied but not necessary


Externalities

Opportunities
- many people don't know of this method
- lo-tech, doesn't need (computer) tools to fill out
- can be used for non-business situations

Threats
- there are many other forms of business analysis (Gantt charts, weighted averages, competency charts)
- easily overlooked because so shallow
- results often overlooked, swamped by other methods

So there's not much that can be said reasonably about opportunities or threats here; SWOT is an analysis method, not a startup. So I guess that's another weakness of SWOT: it's not applicable in all contexts. Externalities are mostly relevant in gaming situations where there are true competitors vying for resources (like in a business situation). Opportunities and threats seem geared almost directly to sales or niche competition (greenfield vs brownfield, red ocean vs blue): how to use one's strengths to pursue an opportunity.

One thing missing from this analysis of analysis, because it doesn't fit within the restrictions of the method itself, is the range of competitors. It's a detail mentioned but not made explicit, and it would be a great lack if those details weren't given here, even though they don't fit within the constraints of the academic exercise.

The competitors, or rather alternatives to SWOT analysis are:
- TOWS - builds on top of SWOT, pairs of SWOT quadrants
- SOAR - the positive version of SWOT, Strengths, Opportunities, Aspirations and Results
- Growth Share Matrix - scatter plot of businesses by market share vs growth rate
- Gap Analysis - comparison of actual with potential performance

Surely there are other more substantive alternatives to SWOT?


Tuesday, June 9, 2015

Why are there no moon base plans?

The current space obsession is a manned mission to Mars. In the past couple of years there have been all sorts of stories and books on how to do it, what the purpose of such a mission is, the difficulties, the variations.

Every president since Bush Sr (wait, did Obama mention it?) has promised to put a man on Mars (wait, did -Clinton- do it?).

It seems like these big media plans are almost as common as plans to create a high speed rail line between NYC and Washington (or San Francisco and LA, or Chicago and St. Louis). Every new governor seems to announce one.

I'm all gung ho for every sci-fi inspired space plan: mining asteroids for precious resources, terraforming Ganymede for farming, solar sails to travel among the planets.

But... this should be sci-fi inspired engineering, not science fantasy. Wouldn't it be more cost effective, more profitable, and leave more room for learning about engineering in off-earth environments if we went incrementally? There is a space station, a bit smallish, with worldwide support. Shouldn't there be some intermediary step, like a moon base?

First, an efficient transport mechanism to a low orbit space station, via rockets or space elevator or what have you.

Then maybe an intermediate high orbit one.

Then a minimal lunar base.

Then lunar L1 and L2 satellite stations.

Then an expanded lunar base.

...and a whole bunch of intermediary supply chain steps, not just to support a permanent connection (realistically, we don't know if we'll be able to support that in the long term), but to support exploitation of those intermediate steps as ends in themselves.

Then, once all that's done, a visit to Mars (because all those previous items will make the trip that much easier). Don't blow a shitload of money on a one-off to Mars. Make it realistically attainable.

Also, in parallel (and maybe with more money than carved out for a manned mission), that much more robotic exploration. Let the machines die first. It's less expensive and less upsetting and demoralizing.

Oh. I'm sorry. There -are- plans for a moon base... But I have no idea if this is part of a grand plan.

Also, what's the business plan other than 'Holy shit this will be cool'? (I'm all for that business plan, but my funding is in science fiction dollars)

Ioannidis' "Why Most Published Research Findings Are False" - two kinds of 'false'

Ioannidis put out "Why Most Published Research Findings Are False" 10 years ago.

The title is provocative. False? Really? Oh my god, are bridges falling? (no). But are people having adverse reactions to this medication that doesn't really do anything? Very probably!

But really, what did Ioannidis mean by 'false'? Most people think of 'false' as the entire opposite of 'true' (well, there's very little denying that!). But what I mean is rather that 'false' means the entire opposite of a statement is true. And that subtlety makes a difference.

"This apple is red"... that is either true or false. It may be entirely green, in which case that statement is perfectly false. Or it may be mostly red with green spots. Or it may be partly green and partly red about the same area. Or maybe it is about to turn from green to red and is paradoxically half way between. You can judge for yourself how true or false each of those might be. But there is undoubtedly play or vagueness in those words, in 'red' and 'false, maybe even 'is'.

When it comes to experiments, the situation now involves a number of things at once. 'All apples are red'. That is certainly not the case, because some are green. If there is at least one apple that is not red, then that statement is false (not only by everyday common sense, but by the stipulated mathematical/logical usage of quantifiers like 'all'). But scientifically 'all apples are red' can be statistically justified (and it is accepted usage) if only a reasonable few aren't red. That is, 'All apples are red, except for a few which don't really count'.

But that is my semantic analysis. It is totally relevant to modern research, and gives a reasonable interpretation to the title of the paper. Most experimental research is trying to say something like "All X are Y, for the most part". Ioannidis in his paper is actually pursuing another definition of false. Sorry, not another definition, but a perfectly good, oops the best, most correct definition of false... in a particular context. Well, in the interests of full disclosure, he uses it two ways, in the traditional definit... oops context, and then also in a computational context.

How he uses 'false' is not terribly complex, and it is very logical and supportable, and I agree with it, and it is a useful way of using 'false'... but it's not what you expect. He takes the usual 2x2 statistical hypothesis testing paradigm with its type I and type II errors ("Apples are mostly red" vs "Apples are not mostly red (a non-negligible amount are not red)" as competing hypotheses, then tests against reality).

What Ioannidis does is take the parameters for hypothesis testing, alpha the probability of a false positive, beta the probability of a false negative (beta is also the parameter behind power calculations, which determine the number of instances required to reliably detect a significant result if there is one), and R the pre-study odds that the hypothesis is true, and computes the PPV (positive predictive value), using very elementary and straightforward arithmetic (see Table 1).


The PPV is TP/(TP+FP) = (1-beta)R/(R+alpha-beta*R). OK that's not 2+2, but it's not at all rocket science. He simplifies this considerably by setting alpha to be the usual cutoff for significance acceptability:

Since usually the vast majority of investigators depend on a = 0.05, this means that a research finding is more likely true than false if (1 - β)R > 0.05.
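To make the arithmetic concrete, here's a minimal sketch in Python (the parameter values are illustrative, not taken from the paper's tables):

```python
def ppv(alpha: float, beta: float, R: float) -> float:
    """Ioannidis' positive predictive value:
    PPV = (1 - beta) * R / (R + alpha - beta * R)
    alpha: Type I error rate (false positive)
    beta:  Type II error rate (false negative); power = 1 - beta
    R:     pre-study odds that the hypothesis is true
    """
    return (1 - beta) * R / (R + alpha - beta * R)

# Illustrative values (not from the paper): a well-powered study
# (power 0.80) at alpha = 0.05 with even pre-study odds, R = 1.
# Note (1 - beta) * R = 0.80 > 0.05, so by his rule: more likely true.
print(ppv(alpha=0.05, beta=0.20, R=1.0))   # ~0.94

# A long-shot exploratory hypothesis, R = 0.01:
# (1 - beta) * R = 0.008 < 0.05, so: more likely false.
print(ppv(alpha=0.05, beta=0.20, R=0.01))  # ~0.14
```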
He then goes on to show, in a later Table 4, that for given beta and R (and bias u), and for a few types of studies (studies in each type having roughly the same parameters), each kind of study has a probability of being ... true (statistically/roughly/acceptably).





Presumably, he considers only those top two kinds to be generally true, and all the rest presumably false (the latter even in the loosey-goosey / benefit-of-the-doubt / non-categorical / for-the-most-part sense of 'true').

The press for this article often claims that Ioannidis says that '75% of studies are false'. Again, presumably, that figure comes from some weighted average over all studies (in some unspecified context) using the above table. I have not done that computation, nor the setup work of judging a large set of studies (medical?) and deciding which category each lies in.


Friday, June 5, 2015

Precision != Precision, or Measurement, Categorical Variables, and Polysemy

Two words in data science are unfortunately pronounced and spelled exactly the same. They are 'precision' and 'precision'.

Both are very technical in meaning. Their informal meaning, though not wrong exactly and metaphorically in the ballpark, does not give much clue as to the exact meanings.

The first meaning is relevant to measurement. It means 'how many digits in a numerical measure are used', or very similarly the variance of a set of measures. This is in contrast to 'accuracy', which means 'on average how correct'. A number is 'precise' if it has lots of digits to the right of the most significant digit (or the set has very small variance). A set of numbers is 'accurate' if the set's average is very close to the true value (note the grammar: precision can apply to a single number but accuracy is for a set). Here is a classic picture of the difference between 'precision' and 'accuracy' (from wikipedia):


Another view is high and low precision and accuracy (from NOAA):
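In code terms, here's a minimal sketch using Python with numpy (the 'true value' and the measurements are made up for illustration): this first 'precision' is the spread of repeated measurements, and 'accuracy' is how close their average lands to the truth.

```python
import numpy as np

true_value = 10.0  # hypothetical quantity being measured

# Made-up repeated measurements from two instruments.
tight_but_off = np.array([10.31, 10.29, 10.30, 10.32])  # precise, not accurate
loose_but_centered = np.array([9.5, 10.6, 9.4, 10.5])   # accurate, not precise

for name, m in [("tight_but_off", tight_but_off),
                ("loose_but_centered", loose_but_centered)]:
    spread = m.std()                    # small std = high precision
    bias = abs(m.mean() - true_value)   # small bias = high accuracy
    print(f"{name}: std={spread:.3f}, |mean - true|={bias:.3f}")
```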



The second definition, relevant to 2x2 contingency tables, technically means 'TP/(TP+FP)', the ratio of True Positives to Total Positives (the latter being the sum of True Positives and False Positives). What it means (for how good a test is as a measure of reality) is how well the test, when positive, captures the phenomenon. A technical synonym (which means it is an exact synonym, which means they are identical) is Positive Predictive Value or PPV. It is almost as metaphorically meaningful, but really that doesn't matter; the meaning is stipulated to be the ratio. The generic picture of a 2x2 contingency table is (from alpine.atlassian):



But wait! you say. You see two-by-two tables in each case, and both are about how good a test agrees with reality. Isn't that the same? Yes, they involve some similar principles, but they appear in different circumstances. One is about the significant digits of real values vs the average (a computation on continuous values), where variance and average are very different computations. The other is about the comparison of two different binary (yes/no) values along an identical dimension of true vs false.


In addition, note that 'accuracy' also does duty for two words spelled the same way (one for an average being close to the truth, and for 2x2 tables the ratio of TP plus TN to the total, the diagonal in the image above). The contingency table 'accuracy' is not as popular a concept/term, though.
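Side by side, here's a minimal sketch in Python (the cell counts are invented for illustration) of the second 'precision' and the contingency-table 'accuracy' just mentioned; both are simple ratios over the four cells.

```python
# Invented cell counts for a 2x2 contingency table.
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)                   # a.k.a. PPV: 40/50 = 0.80
accuracy = (TP + TN) / (TP + FP + FN + TN)   # diagonal over total: 85/100 = 0.85

print(f"precision (PPV) = {precision:.2f}")
print(f"accuracy        = {accuracy:.2f}")
```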

The lessons to learn, then, are:

- these two concepts, spelled the same way, are very different, even though metaphorically they both have something to do with how good a set of numbers is.

- some technical words have more than one meaning. Really really different meanings. But usually context will tell you which is which. If you're talking about just the quality of a metric by itself, then 'precision' is the variance. If contingency tables, then it's the same as PPV (positive predictive value).

See also:
http://en.wikipedia.org/wiki/Accuracy_and_precision
which has a section on both.