Tuesday, August 2, 2016

SQL JOIN Venn diagrams are only sort of Venn diagrams

SQL is a standard for querying databases. Despite questionable pronouncements that SQL is Turing complete, I hesitate to call it a language because its power is in using boolean logic in dealing with tables of data whose columns point to each other.

And often Venn diagrams, the go-to visualization for set operations, are used to help explain the process of table JOINs.

The interesting things is that set operations and table joins are not really the same thing. They're related but just not the same. Set operations, which are pretty much the same as boolean/logical operations, are simple to visualize. The picture is the universe of elements, a circle surrounds a group (a set) of elements with a property, and a set operation does something to one or more sets to make a new set.

(from Modern Dilettante)

SQL also has set operations that combine tables as though they were sets: UNION, INTERSECTION, DIFFERENCE. They simply do the same as the set operations; two tables with identical column labels have their rows combined into a single new table (UNION means all rows in both, INTERSECTION where the column/row entries match in value, etc).

But this is not how Venn diagrams are usually presented to explain SQL. UNION, INTERSECTION, etc, are not the most useful of operations (the WHERE clause of a SELECT is where the booleans are most commonly used). Venn diagrams are most often used to explain JOINs. A SQL JOIN first matches on a field from one table and a field from another (presumably a field of the same type or kind).


(source Codeproject)

These Venn diagrams explain the difference between inner, outer, left and right joins perfectly...except they are just from a different world than the traditional set operations.  A JOIN is intended to merge the information appropriately in the n by m relation (where the size of A is n and size of B is m). The universe isn't the set of rows of both A and B together. The universe is the product of rows in both. And the difference between inner, outer, etc, is purely with how the JOIN deals with NULL/missing elements in A or B.

An INNER JOIN keeps rows of AxB only where both A and B rows exist. A LEFT JOIN is only when the A part exists (B may or may not), similarly for RIGHT JOIN. An OUTER JOIN doesn't care if either a corresponding A or B exists. So the boolean idea does apply but in a strange way, only with respect to the NULL condition of the matching field. If the value of the field from A has no matching value in the field for B, then B is NULL or missing then (and vice versa).

So the Venn diagrams for SQL operations, I can't really say they are true Venn diagrams; they don't show the state of a consistent property over all elements of the universe. Or rather the universe is a bit more complicated (depends on A and B, their cross product) and the property being booleanized is whether element of one table is NULL. You can't just take an arbitrary universe of elements (with properties. With JOINs, you have to create the universe, the product, first before examining the elements (and whether the A part or B part of the new row is null or not.

No comments: