Least Uninteresting Number: visualization


Friday, December 16, 2016

Practicality vs Esthetics in DataViz

We want both. We want something to be as practical and as nice looking as possible. We want the shiny doorknob to look elegant, but we also want it to work smoothly without having to jiggle it, and not to need cleaning all the time. Sometimes one has to be given up for the other: an industrial assembly line may be a little grimy all the time, but it gets the job done. Of course, cleanliness may itself be a desired property of the object, in which case practicality and esthetics share a common cause.

Sometimes we want something that is esthetically pleasing and superficially practical but not necessarily perfectly practical, like a watch with no face numbers. The esthetics are the intent.

That's all philosophical diatribe to justify something that bugs the crap out of me.

Basically wordles are the worst. And periodic tables (except for The Periodic Table which is the best). And usually subway diagrams (except for actual city subways). Here's one for semantic web technologies:

This! This is the worstest.

If you know what each of the entities is, and you make all sorts of qualifications, maybe this makes a little sense. It makes only slightly more sense if 'A above B' means 'A is built on B'. But then there are all sorts of 'If you did that, then why did you do that?' questions (why are encryption and signature off to the side only for some, why are logic and proof separate, is Unicode really such a hugely important base technology, etc.). Wait, isn't a namespace a particular kind of URI? There are many variations on the 'Semantic Web Stack', but each in its own way has all these "I don't get why they did that" problems. This is all about esthetics (Nice color combo!) and has little to do with imparting coherent information. No, you will not learn anything from this. Wait...what the hell is 'signature'?

Tuesday, August 2, 2016

SQL JOIN Venn diagrams are only sort of Venn diagrams

SQL is a standard for querying databases. Despite questionable pronouncements that SQL is Turing complete, I hesitate to call it a language because its power is in using boolean logic in dealing with tables of data whose columns point to each other.

And often Venn diagrams, the go-to visualization for set operations, are used to help explain the process of table JOINs.

The interesting thing is that set operations and table joins are not really the same thing. They're related, just not the same. Set operations, which are pretty much the same as boolean/logical operations, are simple to visualize. The picture is the universe of elements, a circle surrounds a group (a set) of elements with a property, and a set operation does something to one or more sets to make a new set.

(from Modern Dilettante)

SQL also has set operations that combine tables as though they were sets: UNION, INTERSECT, and EXCEPT (the set difference, spelled MINUS in some dialects). They do the same thing as the set operations: two tables with compatible columns have their rows combined into a single new table (UNION keeps all rows from both, INTERSECT keeps only the rows whose entries match in value in both, and so on).
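As a minimal sketch of these table-as-set operations, here is a small example using Python's built-in sqlite3 module. The tables `a` and `b`, their single `name` column, and the values are all made up for illustration. Note that in SQLite, as in most dialects, intersection and difference are spelled INTERSECT and EXCEPT.

```python
import sqlite3

# Hypothetical tables a and b with one identical column; the names and
# values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (name TEXT);
    CREATE TABLE b (name TEXT);
    INSERT INTO a VALUES ('ann'), ('bob'), ('cat');
    INSERT INTO b VALUES ('bob'), ('cat'), ('dan');
""")

def rows(sql):
    """Run a query and return its first column, sorted for stable comparison."""
    return sorted(r[0] for r in conn.execute(sql))

union = rows("SELECT name FROM a UNION SELECT name FROM b")
intersect = rows("SELECT name FROM a INTERSECT SELECT name FROM b")
except_ = rows("SELECT name FROM a EXCEPT SELECT name FROM b")

print(union)      # rows in either table
print(intersect)  # rows in both tables
print(except_)    # rows in a but not in b
```

Here the universe really is just the rows of the two tables, which is why an ordinary Venn diagram fits these operations.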

But this is not how Venn diagrams are usually presented to explain SQL. UNION, INTERSECT, etc. are not the most useful of operations (the WHERE clause of a SELECT is where the booleans are most commonly used). Venn diagrams are most often used to explain JOINs. A SQL JOIN first matches on a field from one table and a field from another (presumably a field of the same type or kind).

(source Codeproject)

These Venn diagrams explain the difference between inner, outer, left and right joins perfectly...except they are from a different world than the traditional set operations. A JOIN is intended to merge the information appropriately in the n by m relation (where the size of A is n and the size of B is m). The universe isn't the set of rows of both A and B together. The universe is the product of the rows in both. And the difference between inner, outer, etc. is purely in how the JOIN deals with NULL/missing elements in A or B.

An INNER JOIN keeps rows of AxB only where both the A and B rows exist. A LEFT JOIN keeps a row whenever the A part exists (the B part may or may not), and similarly for a RIGHT JOIN with the roles swapped. A FULL OUTER JOIN keeps a row if either the A part or the B part exists. So the boolean idea does apply, but in a strange way: only with respect to the NULL condition of the matching field. If the value of the field from A has no matching value in the field from B, then the B part of the row is NULL (missing), and vice versa.
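To make that concrete, here is a rough sketch in plain Python (not how any real database implements it). The tables `A` and `B`, the `join` helper, and its `how` parameter are hypothetical names for illustration. It builds the cross product first, keeps the matching pairs, and then pads unmatched rows with None standing in for NULL.

```python
from itertools import product

# Hypothetical tables: each row is (key, payload). Names and values are
# invented for illustration only.
A = [(1, "a1"), (2, "a2"), (3, "a3")]
B = [(2, "b2"), (3, "b3"), (4, "b4")]

def join(a_rows, b_rows, how="inner"):
    """Sketch of JOIN semantics: filter the cross product, then pad with None."""
    # The 'universe' is every (A row, B row) pair; keep pairs whose keys match.
    matched = [(a, b) for a, b in product(a_rows, b_rows) if a[0] == b[0]]
    result = list(matched)
    if how in ("left", "full"):
        # A rows with no partner survive with a missing (NULL) B part.
        hit = {a for a, _ in matched}
        result += [(a, None) for a in a_rows if a not in hit]
    if how in ("right", "full"):
        # B rows with no partner survive with a missing (NULL) A part.
        hit = {b for _, b in matched}
        result += [(None, b) for b in b_rows if b not in hit]
    return result

for how in ("inner", "left", "right", "full"):
    print(how, join(A, B, how))
```

The four join types differ only in which of the None-padded rows they keep, which is the 'boolean on NULL' reading described above.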

So I can't really say the Venn diagrams for SQL JOINs are true Venn diagrams; they don't show the state of a consistent property over all elements of the universe. Or rather, the universe is a bit more complicated (it depends on A and B, through their cross product) and the property being booleanized is whether the element from one table is NULL. You can't just take an arbitrary universe of elements (with properties). With JOINs, you have to create the universe, the product, first, before examining the elements (and whether the A part or B part of the new row is NULL or not).

Tuesday, November 17, 2015

Abused dataviz: periodic table and subway map

In addition to wordles, here are two more often abused data visualizations, the periodic table and the subway map.

Both diagram methods are intended to show that a set of entities contains many subsets that are for the most part mutually exclusive but have informative intersections. Think of the Venn diagram as the canonical diagram of subset relations. A subway map should have very few intersections (only a handful of entities are in more than one subset, the interchanges or transfer stations). The periodic table has a lot more structure: as a special case of a two-dimensional table, the full set can be split into mutually exclusive subsets in two distinct ways.

Take for example the original periodic table.

(from Science Notes)

What a great invention. Mendeleev compiled a bunch of disparate facts, similarities of elements, into a single visualization. The dataviz wasn't perfect, because there were gaps. But the picture was almost a theory, an extrapolation from data, that by 'testing' (further exploration) was confirmed by elements that fit nicely in those gaps. There have been attempts at organizing that chemical information in different ways but Mendeleev's holds primacy.

Nowadays, a periodic table is used for organizing a large set of items that have some similarities. Except the similarities have only tenuous systematic patterns. The point of the chemical table is that the elements fit nicely into rows and columns according to the number of shells and the number of electrons in the outer shell (which predicts chemical properties nicely). The modern uses of these periodic tables seem not to care what patterns there are in reality, just that there are pretty colors and a list. Often the items in a column are not really related, and often they don't go from simple to weighty.

(from Expand via pinterest http://www.xpand.com.au/ )

In this example, the table is simply chart junk. The colors specify the mutually exclusive subsets but the rows and columns say absolutely nothing about the entities.

The periodic table of dataviz makes some attempt at using the structure appropriately in the far left and far right columns, but in between it's a mess. The site is great for examples; I'm only criticizing the use of the periodic table as the viz method. Note that almost all periodic table vizzes copy in lockstep the funny unbalanced shape of the chemical table rather than fit the shape to the data (the properties of the entities). Instead the entities are shoehorned in, usually without any reason at all.

The point of a periodic table is that everything in a row should somehow be similar, and everything in a column should also somehow be similar. Also there should be some kind of progression from simple to complex down a column.

When is it appropriate to have a periodic table? When your set of items has two clear dimensions. There can be lots of gaps, or more items in one position than another. But the two dimensions need to be clear. Also use those labels! Make sure everything in a column is related. The rows don't necessarily have to be exactly related, but they should at least be of roughly the same complexity.
---

Subway maps are diagrams of connectivity of train systems. As a dataviz, they show that certain sets have a handful of points of intersection. Within a subset (shown by a line or track in the system) all the items are related. So when two lines intersect, that item must be a member of both subsets. Unfortunately, many 'subway' maps don't even bother with convention. They'll group items on a line that are only tenuously related, and then a 'transfer point' (an entity on two or more lines) ends up having little to do with either.

What makes a 'subway' map good is when the subsets have very few common entities. That will translate to only a few interchanges, making the diagram easier to create and less busy. It's a plus if you can order the entities along a line in a meaningful fashion (there is some inherent ordering).

Most uses of the 'subway map' dataviz, just like with the periodic table viz, either take a literal subway map (usually London's, in obvious homage to its design) and shoehorn entities in, or make up their own but don't bother to make the lines and interchanges act like sets and intersections.

(from Becoming a data scientist)

Applying some domain knowledge, I'd say the items on each colored line aren't very coherent subsets, and the interchanges aren't really common to the two intersecting lines. There is quite a bit of overlap among these entities, lots of subset relations and intersections, but unfortunately the map doesn't even bother with them.

A subway map viz is appropriate for a set of entities if those entities separate nicely into mutually exclusive subsets, with a handful of single entity intersections. If there are many intersections, then there are many constraints on how the lines meet each other.

Consider each entity as having a list of features. If all the entities have a single feature that partitions the set (these are the subway lines), with very few entities on more than one line (the transfers), then the subway map is appropriate. If all the entities have two features, each partitioning the set in a distinct way, then a periodic table is appropriate.
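A rough sketch of that feature test, assuming made-up data: the line names, entity names, feature values, and the helpers `transfers` and `grid` are all hypothetical. The first helper lists the entities that would become interchanges; the second drops each entity into the cell named by its two features.

```python
from collections import Counter, defaultdict

# Hypothetical data, invented for illustration: line name -> entities on it.
lines = {
    "red": ["a", "b", "c", "d"],
    "blue": ["d", "e", "f"],
    "green": ["g", "h", "b"],
}

def transfers(lines):
    """Entities that sit on more than one line, i.e. the interchanges."""
    counts = Counter(e for members in lines.values() for e in set(members))
    return sorted(e for e, n in counts.items() if n > 1)

# Hypothetical data: entity -> (row feature, column feature).
two_features = {
    "p": ("simple", "left"),
    "q": ("simple", "right"),
    "r": ("complex", "left"),
}

def grid(two_features):
    """Place each entity in the cell named by its two feature values."""
    cells = defaultdict(list)
    for entity, (row, col) in two_features.items():
        cells[(row, col)].append(entity)
    return dict(cells)

print(transfers(lines))     # a short list suggests a subway map could work
print(grid(two_features))   # cells that never appear are gaps in the table
```

If `transfers` returns most of the entities, or the two features can't be filled in for most entities, neither template fits the data.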

These dataviz strategies may well be meaningless chart junk used simply to display a list with some structure. They are certainly esthetically pleasing (just like wordles!), but for the most part they are used irrelevantly. Most lists of entities are easily separated into sublists, with either little extra structure or quite a lot of complicated structure. The subway viz is good if there are only a few common properties. The periodic table is good if there are two mostly coherent discrete dimensions; they don't have to be numbers.

The primary complaint is that the template is ostensibly knowledge based (scientific looking, 'sciency') but the data poured into it just doesn't have that structure; the structure is a red herring. The dataviz should add something, should give you knowledge about the entities. If the items are on the same subway line, they should have some commonality. An entity in a periodic table should be similar somehow to the other entities in the same column and also to those in the same row.

The alternative, when there is not enough structure, is a simple set of lists. If there is too much structure (lots of common features with little discernible pattern), the alternative is a Venn diagram, which captures all the possible intersections.

Or maybe I'm just complaining about incoherent sets and it's not even at the level of the top level structure being a red herring. It's a red herring that it's a red herring.

Friday, August 7, 2015

What is wrong, terribly wrong, with wordles

I love wordles! They're so cool, like making artwork out of a big long text that I don't want to bother reading! I can see what's really important in a text by what's most common! And I get that in a flash!

I can't stand wordles. They're so mindless and dumbing down. Any good text will have a variety of vocabulary. Frequency is misleading; texts are not just dumb bags of words.

These are extremely tendentious. I believe them both. But what I'll explain is what is problematic with them as data visualization.

What's a wordle? Also known as a tag cloud or word cloud, it's a graphic design method that takes a document, determines the frequencies of the unique words in that document, and mooshes the text of the words into an image, some vertical, the size of the word text in proportion to its frequency in the document. So from some document we get the dry list of individual word frequencies:

Wordle 127
word 35
words 30
cloud 28
students 22
clouds 22
Day 18
lessons 12
fused 12
adjectives 6
historical 5
classroom 5
even 4
see 4
...

This can be converted into a barchart:

which is the Zipf curve of the document.
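The counting step behind such a list can be sketched in a few lines. This is only an assumption about how a word cloud tool might count: the `word_frequencies` helper and the sample text are made up, and real tools typically also drop common stop words.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count each distinct word; case is kept, so 'Day' and 'day' stay distinct."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)

# Hypothetical sample text, not the document counted in this post.
sample = "word cloud word clouds word cloud"
for word, count in word_frequencies(sample).most_common():
    print(word, count)
```

Sorting by count, as `most_common` does, gives the ranked list that the bar chart plots.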

Now comes the cool graphic, the wordle. Instead of boring bars, make the word itself, by its size, tell you how important it is. Mush them all together and let the natural instinct for readability draw your eye to what's important:

It is certainly esthetically pleasing, a bit Mondrian, with a jazzy visual rhythm. The algorithm to lay out the words is clever in its simplicity, and the resulting image allows some simple inference about a text.

But what is the point of a wordle, and how successful is it at whatever point it might have?

If the point is that it is a piece of art, then I've made a case for it already. A new wordle for each new document is a bit derivative though, with too many barely distinguishable varieties. One here or there is great, but a number of them is numbing.

How is it as a data visualization? How well does it relate the data?

The ostensible purpose of a wordle is to show you the relative frequency of words in a document. What is actually done is to show you the obvious top two or three most frequent words. All other words are essentially ignored.

That may very well be the best part of the wordle, that it presents essential information (the two or three most frequent) in an esthetically pleasing manner. The size of a word pulls your eye towards it because it is easier to read, and if it is readable, there's no unreading it (it forces its meaning on you).

But as a way to compare frequencies it has problems:

- The eye is encouraged to dance around. This may account for the esthetics, but it is an annoyance for comparison.
- Vertical presentation of a word almost guarantees that you can't read it.
- Comparison of size is even more difficult than in a pie chart. Two words that are not exactly next to each other are difficult to compare (the word length itself is not the frequency, but it adds to the relative noticeability).

So really the information that can be pulled out of a wordle is: the most frequent word (which does usually outweigh all others in most documents), the second and third most frequent, but you're not sure which is which, and maybe one or two in the top ten but maybe you missed some.

Under this analysis, this is a Type V error in Fung's Visualization Trifecta Checkup, where the data and questions are well defined, but the visualization (the V) just isn't right.

So instead of complaining, what would be a better method, one that would actually address the stated purpose of showing relative frequencies?

The simplest (and least graphically pleasing) is the source list of stats: a text list, one word per line followed by its count in the document. Because numbers themselves are hard to judge in a list (but lengths are easy), a better choice may be a barchart sorted by frequency, cut off at about 10 words or so. The screen space taken up by the frequency list is about the same as the wordle image itself, and it allows extraction of a lot more information. All the information is in this list, it is all readable, and all comparisons can be made very easily. Surely there are frequency questions that are not easily answered by the list, but what might be slightly difficult for the list is impossible for the wordle.
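As a minimal sketch of that alternative, the counts from the frequency list earlier in this post can be drawn as a sorted text bar chart cut off at ten words. The `bar_chart` helper and its `top` and `width` parameters are hypothetical choices for illustration.

```python
# Counts copied from the frequency list earlier in this post.
counts = {
    "Wordle": 127, "word": 35, "words": 30, "cloud": 28, "students": 22,
    "clouds": 22, "Day": 18, "lessons": 12, "fused": 12, "adjectives": 6,
    "historical": 5, "classroom": 5, "even": 4, "see": 4,
}

def bar_chart(counts, top=10, width=40):
    """Return text lines: one word per line, bar length proportional to count."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]
    if not ranked:
        return []
    biggest = ranked[0][1]
    label = max(len(w) for w, _ in ranked)
    return [
        f"{w:<{label}} {'#' * max(1, round(n * width / biggest))} {n}"
        for w, n in ranked
    ]

print("\n".join(bar_chart(counts)))
```

Every word in the top ten is horizontal, readable, and directly comparable to its neighbors, and the exact count sits at the end of each bar.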

What this says is that wordles are really good at showing you the top couple of words in an esthetically pleasing manner; what a wordle puts in your head is mostly 'X is the most common, and Y is maybe a little less common', and that's the extent of its specificity.

But if you want even minimally less vague comparisons, or comparisons among more than two words, a wordle does not do that well.

Or to put it more bluntly, a wordle is popular because it is beautiful, not true.

TL;DR: A wordle is esthetically pleasing but is not even as good as a pie chart for transmitting information.
