Friday, December 16, 2016
Practicality vs Esthetics in DataViz
Sometimes we want something that is esthetically pleasing and superficially practical but not necessarily perfectly practical, like a watch with no numbers on its face. There, the esthetics are the point.
That's all a philosophical diatribe to justify complaining about something that bugs the crap out of me.
Basically, wordles are the worst. And periodic tables (except for The Periodic Table, which is the best). And usually subway diagrams (except for actual city subways). Here's one for semantic web technologies:
This! This is the worstest.
If you know what each of the entities is, and you make all sorts of qualifications, maybe this makes sense a little. It makes only slightly more sense if 'A above B' means 'A is built on B'. But then there are all sorts of 'If you did that, then why did you do that?' questions (why are encryption and signature off to the side for only some layers, why are logic and proof separate, is Unicode really such a huge important base technology, etc. etc.). Wait, isn't a namespace a particular kind of URI? There are many variations on the 'Semantic Web Stack', but each in its own way has all these "I don't get why they did that" problems. This is all about esthetics (nice color combo!) and has little to do with imparting coherent information. No, you will not learn anything from this. Wait... what the hell is 'signature'?
Trust vs Depend
"I can't trust him to complete the project on time"
'Trust' is used in two different ways. One is about truthfulness, the usual opposite of falsehood. If you can't trust them, they are a liar. This presumes intent and is almost demonizing.
The other way is dependability. Trust of the outcome. If you can't trust someone this way, it's not a reflection of their evil intent but about ability to execute. This is very different from falsehood. You can actively do something about this.
With the first kind, all you can do is seek other sources of information.
So instead of 'trust', use 'depend on'. 'Trust' makes it sound like you think they're lying. 'Depend' just means there is doubt, without judging.
Friday, November 18, 2016
Effect size versus statistical significance
This difference is often presented laconically ("Using Effect Size—or Why the P Value Is Not Enough") as:
Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a treatment affect people, but how much does it affect them. (Kline RB)
This makes it sound like you have two things that can be presented, and one is much more important than the other. But it's a false dichotomy. You want both. The magnitude is descriptive stats - how big it is. In an experiment on n individuals, fish oil tablets increased memory performance by 10%. If you don't know the effect size, what exactly beyond 'better' do you know about the phenomenon? Statistical significance is trust - how (mathematically) representative the sample is of the population. You can claim something is better but can you really trust the claim?
It's very easy to see how to manufacture a high statistical significance but low effect size - increase the number of instances. In fact, as you increase n, almost all statistical tests asymptotically approach statistical significance (for real world phenomena). Chi-squared is the worst!
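Here's a sketch of that manufacture (simulated data; the 0.02-standard-deviation 'true' effect and all the numbers are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A tiny but real effect: two groups whose means differ by 0.02 SD.
for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.02, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    d = (b.mean() - a.mean()) / np.sqrt((a.var() + b.var()) / 2)  # Cohen's d
    print(f"n={n:>9,}   effect size d={d:+.3f}   p={p:.2g}")
```

The effect size hovers around 0.02 at every n, but the p-value collapses into 'significant' territory once n gets big enough. The significance is real; the effect is still trivial.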
A consistently high effect size (over samples) obviously leads to high statistical significance.
But it is also possible to have a high effect size and low significance: a big difference seen in a sample too small to rule out luck.
So in the end, it is not one or the other. Both should be presented. The effect size tells you how big the phenomenon looks in the sample, and the p-value tells you how much you can trust the sample that showed it.
Wednesday, November 16, 2016
Annoying Sciency Tropes: Big effing number
The yearly output of carbon dioxide gas into the atmosphere is 50 bajillion tons. Wow, that must be bad, because a bajillion is a lot. (Also, 'tons'. You can have a ton of air? Of course you can, that's physics, but it is counterintuitive enough to simply leave the reader with the incoherent feeling of 'wow'.)
The number of deaths due to the Iraq War of 2003 was approximated at 600,000. Of course that is terrible (any such death is terrible). But is it reliable? Is the scale right? How was the number arrived at? What groups are in that number? Is it overcounted? Undercounted? Adding a zero hardly changes the impact of the story but is still wildly inaccurate.
Million, billion, trillion are hard to distinguish. They're mostly 'really a lot', 'really really a lot', 'that sounds like a lot'.
I realize I'm giving these without context, but the point is that often news stories lack all context too.
There's a little bit of technical obscurantism going on (is a nanometer bigger or smaller than a picometer?), which presumes education; that is, it is questionable whose fault this is, the one using the technical term or the one reading it. If the reader is educated, this is the best, most accurate communication, what technical language nuances were created for. If the reader is not educated in these nuances (which are not nuances to the initiated), then what?
Part of the annoyance is that this is usually combined with a Base Rate Fallacy: usually no comparison data is given, no comparison with the total or with comparable items, no context. For example, the debt of the US government is given in news stories as $14 trillion (the latest number). Obviously this is a big unfathomable number, but there is also nothing to compare it with, neither the historical trend (what the debt has done over the past few years) nor what the debt of other countries is like.
What's the solution? For the reader, look outside the article for the base rate or trend. For the writer, supply that! Give something to compare with.
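For the debt example, one or two divisions is all it takes (the population and GDP figures here are my rough approximations, just to show the move):

```python
debt = 14e12            # the $14 trillion from the news story
population = 320e6      # rough US population, approximate
gdp = 18e12             # rough US GDP, approximate

print(f"per person: ${debt / population:,.0f}")        # about $43,750 each
print(f"vs GDP: {debt / gdp:.0%} of a year's output")  # about 78%
```

Two lines of arithmetic turn 'unfathomably big' into something a reader can actually weigh.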
Tuesday, September 27, 2016
"Probabilistic programming languages" aren't
As someone who likes a little consistency in language use, for words to have meanings you can mostly rely on, I am bothered by this usage (just as I'm bothered by the similarly mystically enticing marketing term Deep Learning). Here is a very representative description of PPLs:
It's from a blog article on PPL, "Probabilistic Programming (PP)", which also tries to introduce new but uninformative terminology, MPML:
There’s a revolution in Computer Science called Probabilistic programming (PP) where programming languages are now built to compute with uncertainity in addition to computing with logic. This means that existing programming languages can now support random variables, constraints on variables and inference packages. Using a PP language, you can now describe a model of your problem in a compact form with a few lines of code. Then an inference engine is called to automatically generate inference routines (and even source code) to solve that problem. Some notable examples of PP languages include Infer.Net, Stan, BUGS, church, Figarro and PyMC. In this blog post, we will access Stan algorithms through the R interface.
I expect words to mean things, and despite liking metaphorical usage in literature and expository writing, not calling a technical thing what it is sounds too much like slimy obscurantist marketing practice. If it is misleading in any way, it is suspect. Suspect maybe not in venal terms, but more likely suspect in intellectual depth.
For the record, the difficulties in the passage above are:
- There's no revolution, not in computer science, not in programming languages, not in AI. Maybe there's some recognition that there is some progress in usage, but it is incremental.
- No new programming languages are being built. No existing programming languages are being modified to accommodate new probabilistic data types. This is the biggest clunker. There's no new programming-language thing at all. What is new is packages, libraries, and functions in existing programming languages. PyMC is a library written in Python and used in Python as native Python (see the sketch after this list). Stan is written in C++, but it is not a new syntax/semantics, just a library accessible from existing languages (R, Python, Matlab, Julia, etc.).
- The idea of operating on distributions as a type is not actually new. Mathematica and Maple have had object-oriented implementations of distributions, allowing functional operations on them. What these PPL packages add is approximation algorithms that compute values for Bayesian inference using Markov Chain Monte Carlo (MCMC), which is fancy talk for calculating a number approximately. Pretty much analogous to computing a p-value.
- All these PPLs are just library add-ons to existing languages. So in that sense don't worry that you have to learn a new syntax. You surely will have to learn how to use the library.
- It's not about probability in the large. Most languages have probabilities already (restrict floats to the range 0..1). Some people are creating packages that make it easier to use probability distributions (which some languages already had libraries for), to manipulate those distributions, and to make statistical inferences from them. But, no, it's not a revolutionary new alternative to languages with logic. It might be a revolutionary library of functions that makes manipulating and computing with distributions and models easier, but it's not a new language.
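To make that concrete, here is a minimal sketch of the kind of thing these packages do, using PyMC3 (the model and numbers are invented; the point is that every line is ordinary Python):

```python
import numpy as np
import pymc3 as pm            # a library import, not a new language

data = np.random.binomial(1, 0.7, size=100)   # made-up coin flips

with pm.Model():              # an ordinary Python context manager
    theta = pm.Beta('theta', alpha=1, beta=1)      # prior: a Python object
    pm.Bernoulli('obs', p=theta, observed=data)    # likelihood: another object
    trace = pm.sample(2000)   # MCMC: approximate the posterior numerically

print(trace['theta'].mean())  # the 'inference' is, in the end, just a number
```

No new syntax, no new semantics; just function calls. Which is exactly the point.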
Thursday, December 31, 2015
What's with dictionary definitions for metaphorical usage?
I have noticed some online dictionaries giving metaphorical definitions. By this I mean that for a word, they give a meaning entry that is metaphorical, not its literal meaning.
For example, 'to devour'. Without checking, this means to eat ravenously. But it's easy to see that, say, a paper shredder could be said to devour some documents.
You may well note that many words, rather most words, really almost all words have multiple meanings (except for highly stipulated technical terms, and even then things can get loose). Our perception is usually that a word has one meaning and that's that. But then we notice that, well, that same spelling can be used for more than one distinct concept, usually nearby.
You may then well note that for many words, there really is a primary meaning: its meaning out of context that everyone thinks of first, and then secondary meanings, ones that appear in different contexts, that are slight extensions of the primary meaning, or used in analogous situations, not literally.
Here is the example for 'devour' from google:
de·vour
- eat (food or prey) hungrily or quickly.
- (of fire, disease, or other forces) consume (someone or something) destructively.
- read (something) quickly and eagerly.
The first is the primary definition, the second a metaphorical one, the third... huh? That is definitely not what 'devour' means. Sure, one can easily use it in 'I devoured the sequel', meaning that I read the sequel quickly and eagerly. But that's not the meaning of 'devour'. That's not what 'mean' means. It's too specific. Does the omission mean you can't watch a movie voraciously? How come 'reading' is more devour-like than other metaphorical uses? This isn't right! If you include read, you should include every other possible metaphorical usage. But of course that is too laborious to imagine.
The difficulty I'm having is the demarcation line. When does a reasonable metaphorical usage of a word become dictionary-entry-worthy?
Taking the title word 'incensed' (which was not deliberate): its primary and only definition is around 'angry', with no mention of the ostensible literal meaning, which might have been 'burned like incense'. It already is a metaphor; the only definition is non-literal. So putting in metaphorical usages is sometimes necessary. At what point of semantic drift, at what point of leaving the original, does a dying metaphor become dead, and at what point does the altered meaning move from a quantitative difference to qualitatively requiring a new entry?
A close analogy is with suffixes. You can take any word in the dictionary and find some suffix that applies that will create a perfectly good word. 'Neologistically' is my favorite. 'Neologism' to 'neologistical' to 'neologistically'. Probably not in any dictionary, but perfectly understandable, sounds like a word, and is (arguably) undeniable as a word. Does it need to be in a dictionary? At what point do lexicographers decide not to include a possible variant?
There are a number of possibilities. Checking multiple dictionaries, most don't have the strange 'read' entry, only Google and Macmillan. What I suspect is that when entries are edited by humans, there is a tendency to require definitive alternate usage before an additional entry is made, and that Google and/or Macmillan introduce metaphorical entries mechanically, where it's easier to be lenient. The latter two dictionaries certainly still need human oversight; that is, the 'read' entry isn't a mistake, just a lower threshold.
This will require looking into the editing policies of the various dictionaries.
Wednesday, September 23, 2015
Where is the universal electronic health record?

(image: http://www.theplaidzebra.com/first-manned-mission-to-mars/)
Friday, September 4, 2015
The Turing Test - like magic!
Clarke's Third Law: Any sufficiently advanced technology is indistinguishable from magic
It is great fast thinking on Turing's part: go quickly to a workable solution, cut out lots of junk rationalizations, don't concern yourself with the infinite hypotheses about the underlying processes, just go for the jugular of what you have, the surface behavior and its believability.
But frankly it is no different from bald anthropomorphism: if the animal acts like a human, it must be human-like more deeply. And the lesson is that reasoning this way is usually not very successful. (Contrarily, a subject for another time: I think many vertebrates share many cognitive abilities with humans; and also contrarily, some behavior that is usually considered special human intelligence may have very low-complexity biological mechanisms underlying it.)
Not only is the Test the basis of countless scifi plots, but also countless dumbed-down explanations of artificial intelligence machines.
If it acts like a human then it -is- a human.
Basing success on limited explicit experience rather than looking behind the curtain and seeing the design? That is just plain idiotic. It is a denial of common sense. The true test of whether something is artificial or human is to look behind the curtain, to look inside the black box, to see how it is designed. The design is the thing that should be judged, not the paltry examples.
Finite behavior doesn't define essence. The essence defines essence. Sure, there's a lot more to it: the rules create the instances, and the anecdotes are telling, but it's all the possibilities that are relevant, not just a small handful of instances.
A counterargument might be that telling essence is not the point, and that knowing essence is not available: experience is finite and is sometimes all that can be known (you can't always look inside the black box).
Here are two analogies that express my point: generating genre texts with ngram probabilities (using Markov models or deep learning), and generating biological objects using fractals. Here's an example of generated text (from Paul Masurell):
Pride and Prejudice, Jane Austen.
I do not. It is very much at Pemberley. The idea of their all walking out. I must acknowledge to you. When I do not marry Mr Collins had promised herself. But you have the carriage might be copied. It would look odd to be proud. You are perfectly good. Elizabeth was at the sight of Miss Darcy was delighted. You have no objection to my charge. I know not. Lydia was urgent with the keenest of all. Mr Collins, you puzzle me exceedingly. But, my proposals will not go. To the rest. But, to much conversation, no traces of them.
The results look vaguely like the real thing and could totally pass for reality (as long as they're not inspected too closely). Also, some humans, using all their own skill, can only reach this level of coherence. So is this a terrible example? Turn up some dials and it gets less and less 'wandering' and more coherent.
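For the curious, the trick behind such text is almost embarrassingly simple. A minimal bigram sketch (I don't know Masurell's exact setup; the corpus file name here is a placeholder):

```python
import random
from collections import defaultdict

def build_model(text):
    # Map each word to the list of words observed to follow it.
    model = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, start, length=60):
    # Walk the chain: pick each next word at random from the recorded followers.
    out = [start]
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return ' '.join(out)

text = open('pride_and_prejudice.txt').read()   # placeholder corpus file
print(generate(build_model(text), 'Elizabeth'))
```

Longer ngrams (or a neural model) are the 'dials': turn them up and the output wanders less.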
Here's another example: fractal trees. Take a visual object like a line. Tack on smaller versions of that line to itself. Repeat for each smaller branch ad infinitum. You get a fractal tree like:
Depending on the rule, the 'tree' can look fluffier or sparser, and more regular or irregular. And it looks so much like a real tree:
(from Sarah Campbell)
And one could go the other direction and say that nature is implementing a recursive algorithm to grow its trees. But this is obviously crap. A tree certainly looks like a fractal, and I'm sure there are biological processes that can be modeled by some limited nesting (see the Chomsky/Everett disagreement over Pirahã). But we know the fractal trees are made not by biology but by an algorithm, and similarly a broccoli-shaped tree, with its trunk and branches and branches of those, has to stop at some depth to give leaves.
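Here is the rule as code, with Python's turtle module (the angles and shrink factor are arbitrary choices of mine; note that the recursion has to stop at a fixed depth, which is exactly the point above):

```python
import turtle

def tree(t, length, depth):
    # Draw a branch, then tack two smaller copies of it onto the end.
    if depth == 0:          # the algorithm must stop somewhere; biology grows leaves
        return
    t.forward(length)
    t.left(25)
    tree(t, length * 0.7, depth - 1)
    t.right(50)
    tree(t, length * 0.7, depth - 1)
    t.left(25)
    t.backward(length)      # walk back to the branch point

t = turtle.Turtle()
t.speed(0)
t.left(90)                  # point the turtle up
tree(t, 80, 7)
turtle.done()
```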
It's like magic tricks: they work on the toy problem (having a card you're thinking of pulled out of a just-cut-up lemon) but don't generalize at all to anything beyond.
So you can make an elephant disappear on stage? Make it really disappear. It all looks right that one time, but is not repeatable because the reality isn't there.
Here's another example: IBM's Deep Blue chess-playing program. So what if it wins against a human (or plays at all)? It's not magic. It's simply following game paths. Many game paths.
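'Following game paths' in miniature (real chess engines add pruning and evaluation heuristics on top, but the skeleton is just exhaustive search; the toy game Nim stands in for chess here: take 1 to 3 stones, last stone wins):

```python
def minimax(pile, my_turn):
    # Score a Nim position by following every game path to the end.
    if pile == 0:
        return -1 if my_turn else +1   # the previous player took the last stone and won
    outcomes = [minimax(pile - take, not my_turn)
                for take in (1, 2, 3) if take <= pile]
    return max(outcomes) if my_turn else min(outcomes)

print(minimax(10, True))   # +1: the player to move can force a win
```

No understanding, no magic; a tree of possibilities and a score at the leaves.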
The Turing Test works in very limited contexts but is superficial.
Any sufficiently advanced technology is indistinguishable from a rigged demo. James Klass
Friday, August 28, 2015
Just Stop It. Website complaints
Stop it. Please just stop it.
Website designers, stop adding crazy stuff and stop changing my defaults 'for' me:
- Stop changing things that I can set up locally. Allow me to set the font size rather than fixing it to what you think is best. Don't change the scroll speed on me; I already set it the way that's easiest for me. I want a swipe to move down a paragraph, but then you make it skip a page or two.
- Stop it with all the moving images. Just a few are OK. Well, no, not really. One is already almost too much.
- Stop it with the audio. I'm listening to something else. Also, with multiple tabs that I move around, your audio randomly starts when I'm not on your tab. Then there's a frantic search for your goddam tab, to kill it with a vengeance and remember never to visit anything of yours again.
Tech rationalization: All these things also take up lots of memory and processing time on the local computer running them. Also, they waste my time.
So stop it.
PS: IMDB, you're the worst. I love wasting my time on your site. But I don't want to waste away my time-wasting time waiting for your candy crap ads to load. I want to see them immediately, or move on to finding out what the movie was with the thing, with that actor (who was in that thing with the actress from that TV show (no, not that one, the comedy; no, the other, more serious comedy)) who had that thing happen to him. It was a couple years ago. I think it was a remake?
Wednesday, August 19, 2015
Even docs replaced by robots? Only for boring operations
A new automated anesthesiology device has recently made the news: automated anesthesiology for colonoscopies. There's the obvious fear of high-priced docs losing their jobs: "How dare they assume a machine could replace a physician with years of education and knowledge?"
But for the moment, what's the situation? Colonoscopies for polyp screening and removal are very routine procedures. For the colonoscopy part, only 5% of patients have a polyp removed. So most of the time the GI doc is doing boring work, looking for polyps that mostly aren't there.
And similarly for the anesthesiologist, except more so. Even if the GI doc finds polyps that are removable, that doesn't change the sedation. If something is found that needs more than just the colonoscopy tool, then hey, we ain't doing that here, we're backing out anyway; no need for more anesthesia. All they are doing is conscious sedation, over and over and over again.
Every patient needs oversight. Things go wrong: "I didn't know the patient would have a seizure, have an allergic reaction, be used to the sedation drugs." These things need tweaking. But for the most part, the everyday stuff and these few weird things are extremely well known (there's been a high-tech assembly line of patients getting colonoscopies forever!). So this is the perfect place for automation to reduce cost and time and effort. And the machines are going to have extra-sensitive alarms, a good buffer to stay away from the bad situations.
There'll still be a need for lots and lots of physicians; don't worry about it, freshly graduated MD. Hopefully family practice, where the real medicine happens, will become more respectable (= more highly paid), because it is already in high demand but nobody is going into it because it won't pay off med school tuition loans.
---
The whole point to science is to make things repeatable.
The trend, then, is that if you do something enough times, and what variation there is can be parametrized, then it can be automated and packaged.
We do it for medications: an expert gives very simple instructions on use, and then you do it yourself. Simple first-aid for even life threatening situations doesn't need to be handled by a full physician. Anyone who can read directions and gets a couple hours training can do CPR and use a defibrillator.
Medicine is constantly progressing in this direction. Radiology is miniaturizing image-taking to the point where soon you really could have a Star Trek tricorder to wave over someone to see and judge any internal problems.
Look, there's already the DaVinci robotic surgeon. Of course it doesn't do everything and needs to be operated by a full surgeon.
(from Medical Devices)
But soon enough you'll be able to go to your local drugstore, go down the pain-relief aisle, turn at the cough and cold section, and come to the Surgeon-in-a-Box aisle:
- Wart-Removal-In-A-Box - wait, don't they have these already, some freezing solution?
- Stitches-In-A-Box - for non-serious cuts that are too deep to heal themselves; place the box opening over the wound and the sensors will see where to close up. Applies flesh-knitting goop, reducing scarring (Dermabond, based on superglue; it's real).
- Colonoscopy-In-A-Box - you'll still need to take the prep; robots can't see through poop either. Send any removed polyps to the lab in the enclosed vial.
- Lasik-In-A-Box - just place against the affected eye for ten seconds and hold your breath.
Saturday, August 15, 2015
There -are- realistic moon base plans
I lamented the lack of moon base plans recently, but that was an error of not looking around enough.
Recently the European Space Agency got a new director, Johann-Dietrich Woerner, who started July 1.
But even before he started, he had stated his plans for the next step on the way to other space goals:
"the moon station can be an important stepping stone for any further exploration in deep space,"
He states this in the context of ESA's targets after the ISS project finishes.
"In any case, the space community should rapidly discuss post-ISS proposals inside and with the general public, to be prepared,"
I can't tell yet how these plans relate to NASA's stated plans for manned mission to Mars.
Friday, August 7, 2015
What is wrong, terribly wrong, with wordles
I can't stand wordles. They're so mindless and dumbing-down. Any good text will have a variety of vocabulary; frequency alone is misleading, and texts are not just dumb bags of words.
The example wordles that prompted this are extremely tendentious. I believe them both. But what I'll explain is what is problematic with them as data visualization.
What's a wordle? Also known as a tag cloud or word cloud, it's a graphic design method that takes a document, determines the frequency of each unique word in that document, and mooshes the words' text into an image, some vertical, each word's text size in proportion to its frequency in the document. So from some document we get the dry list of individual word frequencies:
Wordle 127
word 35
words 30
cloud 28
students 22
clouds 22
Day 18
lessons 12
fused 12
adjectives 6
historical 5
classroom 5
even 4
see 4
...
This can be converted into a barchart, which is the Zipf curve of the document.
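Producing that list and barchart takes only a few lines (the input file name is a placeholder):

```python
import re
from collections import Counter
import matplotlib.pyplot as plt

text = open('document.txt').read().lower()      # placeholder document
counts = Counter(re.findall(r"[a-z']+", text)).most_common(15)

words, freqs = zip(*counts)
plt.bar(range(len(words)), freqs)
plt.xticks(range(len(words)), words, rotation=45, ha='right')
plt.ylabel('count')
plt.title('Top word frequencies: the head of the Zipf curve')
plt.tight_layout()
plt.show()
```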
Now comes the cool graphic, the wordle. Instead of boring bars, make the word itself and its size tell you how important it is, mushing them all together and letting the natural instinct of readability draw your eye to what's important:
It is certainly esthetically pleasing, a bit Mondrian, with a jazzy visual rhythm. The algorithm to lay out the words is clever in simplicity, and the resulting image allows some simple inference about a text.
But what is the point of a wordle, and how successful is it at whatever purposes it might have?
If the point is that it is a piece of art, then I've made a case for it already. A new wordle for each new document is a bit derivative though, with too many barely distinguishable varieties. One here or there is great, but a number of them is numbing.
How is it as a data visualization? How well does it relate the data?
The ostensible purpose of a wordle is to show you the relative frequency of words in a document. What is actually done is to show you the obvious top two or three most frequent words. All other words are essentially ignored.
That may very well be the best part of the wordle, that it presents essential information (the two or three most frequent) in an esthetically pleasing manner. The size of a word pulls your eye towards it because it is easier to read, and if it is readable, there's no unreading it (it forces its meaning on you).
- The eye is encouraged to dance around. This may account for the esthetics, but it is an annoyance for comparison.
- Vertical presentation of a word almost guarantees that you can't read it.
- Comparison of size is even more difficult than in a pie chart. Two words that are not exactly next to each other are difficult to compare (the word length itself is not the frequency, but it accounts for relative noticeability).
So really the information that can be pulled out of a wordle is: the most frequent word (which usually does outweigh all others), the second and third most frequent (though you're not sure which is which), and maybe one or two more in the top ten (but maybe you missed some).
Under this analysis, this is a Type V error in Fung's Visualization Trifecta Checkup, where the data and questions are well defined, but the visualization (the V) just isn't right.
So instead of complaining, what would be a better method, one that would actually address the stated purpose of showing relative frequencies?
The simplest (and least graphically pleasing) method is the source list of stats: a text list, one word per line, followed by its count in the document. Because raw numbers in a list are hard to judge at a glance (but lengths are easy), maybe use a barchart sorted by frequency, cut off at about the top 10 or so. The screen space taken up by the frequency list is about the same as the wordle image itself and allows extraction of a lot more information. All the information is in the list, all of it is readable, and all comparisons can be made very easily. Surely there are frequency questions that are not easily answered by the list, but what might be slightly difficult for the list is impossible for the wordle.
What this says is that wordles are really good at showing you the top couple of words in an esthetically pleasing manner; what a wordle puts in your head is mostly 'X is the most common, and Y is maybe a little less common', and that's the extent of its specificity.
But if you want even minimally less vague comparisons, or more than two words, a wordle does not do it that well.
Or to put it more bluntly, a wordle is popular because it is beautiful, not true.
TL;DR: A wordle is esthetically pleasing but is not even as good as a pie chart for transmitting information.
Monday, July 20, 2015
Deep Learning is not Magic Learning
Any sufficiently advanced technology is indistinguishable from magic. Arthur C Clarke
"Deep Learning is Teaching Computers New Tricks"
"Andrew Ng: Why 'Deep Learning' Is a Mandate for Humans"
Holy crap! Use Deep Learning to create new ideas? You may be thinking that I'm being too harsh; of course article and title writers stretch things to be more provocative, details left to the gross middle of the article that no one reads. Well, then, yes, I'm being too harsh: not because the details are left out, but because the implications of the details are ignored.
Deep learning is not Magic Learning. Deep Learning isn't what its name says. It is 'just' a more complex (= many more layers than traditional) neural network, which is itself not exactly what its name says: it is 'just' a set (OK, I'll grant, a network) of little regression models, where some depend on others. It's not magic. It's not human-like learning or deep cogitation on concepts. It is just a mathematical model. It can distinguish two almost identical things. It can identify one thing out of many. But that's all the technique itself does (just the best in a long line of similar techniques). Like many other techniques (logistic regression, decision trees, random forests (ooh, they're magical! Their names are so exotic!)), it needs to be put in a larger framework (like a process that determines the outlines of faces in a set of cat pictures, or one that splits words in a speech-to-text analyzer). By itself, there's nothing magical.
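To make 'regressions that depend on each other' concrete, here is a complete two-layer network in plain numpy (the XOR task, the sizes, and the iteration count are arbitrary illustrative choices; an unlucky random start may need a re-run):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])               # XOR

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)        # layer 1: eight small regressions
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)        # layer 2: one regression on layer 1

for _ in range(20_000):
    h = sigmoid(X @ W1 + b1)                         # each hidden unit: a little regression
    out = sigmoid(h @ W2 + b2)                       # the output: a regression on those
    d_out = (out - y) * out * (1 - out)              # gradient of squared error
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out;  b2 -= d_out.sum(axis=0)      # plain gradient descent
    W1 -= X.T @ d_h;    b1 -= d_h.sum(axis=0)

print(out.round(2))   # should approach [[0], [1], [1], [0]]
```

'Deep' just means many more such layers, plus better tricks for training them.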
This is not to say that there's something wrong with Deep Learning. On the contrary, it is a great recent development, with lots of successes (which is exactly what happened to its simpler self in the late 80's). But in the end it is 'just' a regression model, either saying yes or no to some inputs or calculating a complex function. But that's it. It is not 'an' artificial intelligence, responding to and implementing our requests like a valet. It's just (one of the more) recent advances in discrimination methods. It is an important part of the field of artificial intelligence, but not the entire thing.
Is extracting 100's of initial petroleum products (fuel, plastics, lubricants, medications, etc.) magic? Not to mention the 1000's of downstream products created by manipulating these?
Frankly, Siri is closer to magic, because at least 40 years of electrical engineers and phoneticians have worked on converting the sound waves produced by a human's oral and nasal cavities, modulated by teeth and tongue, into readable letters.
Deep Learning is not magic. It is a great development in neural networks (an incremental development (a very big incremental development)), but it's not magic and it won't make your toast for you in the morning.
(this morphed from the inarticulate unfinished beginning of a rant I had planned about ML (Machine Learning). And NLP (Natural Language Processing (not Neuro-Linguistic Programming which actually is horseshit))).
Tuesday, June 9, 2015
Why are there no moon base plans?
Every president since Bush Sr (wait, did Obama mention it?) has promised to put a man on Mars (wait, did Clinton do it?).
It seems like these big media plans are almost as common as plans to create a high-speed rail line between NYC and Washington (or San Francisco and LA, or Chicago and St. Louis), the kind every new governor announces.
I'm all gung ho for every sci-fi inspired space plan: mining asteroids for precious resources, terraforming Ganymede for farming, solar sails to travel among the planets.
But... this should be sci-fi-inspired engineering, not science fantasy. Wouldn't it be more cost-effective and profitable, with more room for learning about engineering in off-earth environments, if we went incrementally? There is a space station, a bit smallish, with worldwide support. Shouldn't there be some intermediary step, like a moon base?
First, an efficient transport mechanism to a low orbit space station, via rockets or space elevator or what have you.
Then maybe an intermediate high orbit one.
Then a minimal lunar base.
Then lunar L1 and L2 satellite stations.
Then an expanded lunar base.
...and a whole bunch of intermediary supply-chain steps, not just to support a permanent connection (realistically, we don't know if we'll be able to support that in the long term), but to support exploitation of those intermediate steps as ends in themselves.
Then, once all that's done, a visit to Mars (because all those previous items will make the trip that much easier). Don't blow a shitload of money on a one-off to Mars. Make it realistically attainable.
Also, in parallel (and maybe with more money than carved out for a manned mission), that much more robotic exploration. Let the machines die first. It's less expensive and less upsetting and demoralizing.
Oh. I'm sorry. There -are- plans for a moon base. But I have no idea if this is part of a grand plan.
Also, what's the business plan other than 'Holy shit this will be cool'? (I'm all for that business plan, but my funding is in science fiction dollars)
Sunday, August 28, 2011
Math error in news: divorce rates
The statement in question was worded something like this:
"The South has one of the highest rates of divorce in the country. One reason is that it has more marriages than elsewhere."
Sounds plausible, right? Only if you redefine the concepts of what you are hearing. This is an egregious type mismatch of a rate to a number. A rate is the ratio of a subset to the whole (whatever the whole is), and a number is... well... it's just the count, with no division going on. The rate here is presumably the number of divorces per capita (the entire population of the region).
The statement, as is, is inferring a number (more marriages) from a rate (higher divorce rate).
So maybe you have a large number of divorces, and that can be because there is a large number of marriages (which may or may not be because of a large number of people). That is a reasonable inference to make.
Or you might have a high marriage rate leading to a large number of marriages in the region, and (assuming people tend to get married within a region) this could lead to a large number of divorces in the region, and so immediately a high divorce rate.
But note this is all relative. A region could have a high divorce -rate- but a small -number- of marriages or divorces (or, conversely, a large -number- of divorces and a low divorce -rate-). Much too unspoken are the relevant contexts for comparing ratios and numbers.
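A toy illustration, with invented numbers, of a rate and a number pointing in opposite directions:

```python
# Invented numbers, purely to illustrate rate vs. count.
regions = {
    'South': {'population': 1_000_000, 'divorces':  6_000},
    'North': {'population': 5_000_000, 'divorces': 15_000},
}
for name, r in regions.items():
    rate = r['divorces'] / r['population']
    print(f"{name}: divorce rate {rate:.2%}, divorce count {r['divorces']:,}")
# The 'South' here has the higher rate (0.60% vs 0.30%) yet the smaller number.
```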
I don't think this is shoddy math exactly just shoddy use of language (which arguably -is- shoddy mathematics).
(Disclaimer: this is a paraphrase from memory, and I cannot find a transcript to corroborate my hearing.)
Wednesday, October 28, 2009
Something about the real world: Bagels at Finagle-a-Bagel suck
And I mean it to sting.
To make this much more than the simple complaint it is (that Finagle-a-Bagel bagels suck, and to get that out on the web), let me continue. I now see the arbitrary authoritarian desire for an appellation committee that decides what is what. To mix many philosophies: word meanings are totally a social construction (to be useful, people have to 'agree' and act like they agree) but with a necessary private language (internal theory). Humpty Dumpty can't go around saying 'those things you get at Finagle-a-Bagel with the hole in them that taste sorta muffin-like'... well, actually, yes he can, but it just won't catch on, not because of semantics but because people aren't time-wasting idiots. If everybody calls them bagels, then that's what you'll call them, even if that label doesn't evoke the properties (in your head) that you normally associate with things you call by that label.
Like how 'white chocolate' might be liked by many people, but... it ain't chocolate.
In a completely different way, I don't get bagels at Dunkin' Donuts. I don't expect them to have good ones. I don't go to FaB for muffins ...
Which is all to say... Finagle-a-Bagel bagels suck.
Now if only I could direct all this energy to the positive....
Friday, June 13, 2008
The invisible character bug
e.g. a file like this:

```
dfasdfasdfaasdf-
sregaregeagrerg-
242342423-
ytuyutuy
qqweqweqweq-
sdadsasdasdasd-
zxczcxzcx-
```

I want to get it like this:

```
dfasdfasdfaasdfsregaregeagrerg242342423ytuyutuy
qqweqweqweqsdadsasdasdasdzxczcxzcx
```

Fine. So I can't just remove all newlines (the line break after 'ytuyutuy' has to stay), so a simple sed oneliner won't work. But a little looking on the web gets me a summary of quick sed oneliners, which has exactly what I'm looking for but would never in a million years have figured out on my own:

```
# if a line ends with a backslash, append the next line to it
sed -e :a -e '/-$/N; s/-\n//; ta'
```

It looks for the dash followed by the end of line (in sed fashion, the newline character is not part of a line), and if found appends the next line -and- an actual newline character to the search space, which is then searched for by the next 's/.../' and removed (and then a little 'goto'-ing, which I never knew existed in sed before).
Great. Except it doesn't work. Why not? Because... well, before the explanation, I have to complain about the hours and hours (well, 3) that I spent doing 'debugging by permutation', trying all the possibilities of small changes: maybe it's for a different shell, or a slightly different sed version, or whatever. OK, that's enough... on with the solution...
Like in all the Sherlock Holmes stories, there's always a tiny bit of information that the author doesn't tell you until the very end, which of course would have solved the problem immediately if anybody had known it: the file I received was in -MSDOS- format, meaning simply that new lines are denoted by -2- characters, carriage return -and- line feed (\r \n, or \x0d \x0a).
So the sed was correctly finding '-' at the end of a line, and appending the next line, but it couldn't find '-\n' and remove it because it really needed to look for '-\r\n'.
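So the fix is to account for the carriage return. Two variants (the second relies on GNU sed understanding \r in a regex, which I believe it does):

```
# strip the carriage returns first; then the original one-liner works unchanged
tr -d '\r' < input.txt | sed -e :a -e '/-$/N; s/-\n//; ta'

# or match the invisible character directly (GNU sed)
sed -e :a -e '/-\r$/N; s/-\r\n//; ta' input.txt
```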
That is, an invisible character. You can't see it but you have to know about it to correctly solve the problem. In my very dim memory of the far past, it seems like this used to be a 'joke' bug, a possibility to blame something unknowable on (because you can't -see- it), when the bug is probably really a thinko.
Anyway, hours wasted on trivialities.
That is all.