Tuesday, May 31, 2016

Deep Learning: Not as good, not as bad as you think.

Deep Learning is a relatively new (roots around 1990, but common only since about 2005) ML method for identification (categorization, function approximation), used mostly in vision and NLP.

Deep Learning is a label given to traditional neural nets that have many more internal nodes than ever before, usually designed in layers to feed one set of learned 'features' into the next.

There's a lot of hype:

Deep Learning is a great new method that is very successful.

but

Deep Learning has been overhyped.

and even worse:

Deep Learning has Deep Flaws

but

(Deep Learning's deep flaws)'s deep flaws

Let's look at details.

Here's the topology of a vision deep learning net:

[Figure: layered topology of a vision deep learning net (from Eindhoven)]

Yann LeCun, "What's missing from deep learning?":
  1. Theory
  2. Reasoning, structured prediction
  3. Memory, short-term/working/episodic memory
  4. Unsupervised learning that actually works

From all that, what is it? Is DL a unicorn that will solve all our ML needs? Or is DL an overhyped fraud?

With all such questions, the truth is somewhere between the two extremes; we just have to figure out which way it leans.

Yes, there is a lot of hype. It feels like whatever real-world problem there is, world hunger, global warming, DL will solve it. That's just not the case. DLs are predictive model machines, very good at learning a function (given lots of training data). The function may be yes or no, or even a continuous function, but it still takes an input and gives an output that's likely to be right or close to right. Not all real-world problems fit that (parts of them surely do, but that's not 'solving' the real-world problem).

Also, DLs take a lot of tweaking and babysitting. There are lots of parameters (number of nodes, topology of layers, learning methods, special gimmicks like autoencoding, convolution, LSTM, etc., each with lots of parameters of its own). And there are lots of engineering methods that have made DLs successful, but these methods aren't specific to DL: more and better data, better software environments, super fast computing environments, and so on.

However, there are few methods nowadays that are as successful across broad applications as DL. They really are very successful at what they do and I expect lots of applications to be improved considerably with a DL.

Also, for all the tweaking and engineering that needs to be done (as opposed to the comparatively out-of-the-box implementations of regression, SVMs, and random forests), there are all sorts of tools publicly available to make that tweaking much easier: Caffe, Theano libraries like Keras or Lasagne, Torch, Nervana's Neon, CGT, or Mocha in Julia.
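To give a flavor of what using one of those tools looks like, here's a minimal sketch of a small deep classifier in Keras (assuming a Keras 2-style API; the layer sizes, activations, optimizer, and the random training data are placeholder choices, not recommendations):

```python
# Minimal sketch of a small 'deep' classifier in Keras (Keras 2-style API).
# The layer sizes, activations, and optimizer are placeholders; these are
# exactly the knobs that need tweaking in practice.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Fake data: 1000 examples, 20 features, 3 classes (one-hot encoded).
X = np.random.rand(1000, 20)
y = np.eye(3)[np.random.randint(0, 3, size=1000)]

model = Sequential([
    Dense(64, activation='relu', input_dim=20),  # first hidden layer
    Dense(64, activation='relu'),                # second hidden layer ('deep')
    Dense(3, activation='softmax'),              # output: class probabilities
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```

Those placeholder choices (how many layers, how wide, which activation and optimizer) are exactly the knobs that take the tweaking and babysitting described above.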


So there are lots of problems with DLs. But they're the best we have right now and do stunningly well.

Kinds of Data: there are more than just the basic four

The science of statistics classifies data points into four basic types (levels of measurement):

  • nominal - these are incomparable labels like truth (yes, no), color (red, blue, green), country (UK, France, Germany, Italy). There is no relation among these elements other than that they are in the same set. All you know about them is their names and that a name is different from or the same as another.
  • ordinal - only a rank is known (1st, 2nd, 3rd...) and nothing else (we don't know how far ahead 1st is from 2nd), just the order, like finishing order in a race.
  • interval - we know the distance between any two elements, A - B, like height.
  • ratio - we also know the ratio of two numbers, where for example A can be twice B, like the half-life of an element.
Notice how I describe these both mathematically and conceptually, because often a set selected from a mathematical domain, like the reals, can be interpreted as any one of these. For example, values from the reals obviously have a ratio by division, and a distance by difference, can be ordered by 'less than', and can be made categorical by using cutoffs, say >= 0 for yes and < 0 for no.
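As a toy illustration of that last point, here is the same pair of real values interpreted at each of the four levels (made-up numbers, plain Python):

```python
# The same real-valued measurements interpreted at each of the four levels.
a, b = 98.6, 49.3

ratio    = a / b                  # ratio: 'a is about twice b' is meaningful
interval = a - b                  # interval: the distance between a and b
ordinal  = a > b                  # ordinal: only the ordering is used
nominal  = ('yes' if a >= 0 else 'no',   # nominal: a cutoff turns reals
            'yes' if b >= 0 else 'no')   # into incomparable labels

print(ratio, interval, ordinal, nominal)
```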

Of course, as with most systematizations, this list came after years of using methods that were created to work with whatever data was at hand; when the data just didn't work with those methods, new analogous methods were created, or entirely new methods were created for quite different purposes.

Statistical procedures seem geared to work with one of these types. Chi-squared on contingency tables is good for categorical data, Wilcoxon signed ranks for ordinals, t-tests for interval data, Poisson for count data. But mostly there are just two kinds, discrete and continuous, which fall to nominal/categorical statistics and pretty much all the rest of statistics, respectively.
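A sketch of those pairings in code, using scipy.stats (the data here are made up purely for illustration):

```python
# Each test matched to the data type it's geared for (made-up data).
import numpy as np
from scipy import stats

# Nominal/categorical: chi-squared test on a 2x2 contingency table.
table = np.array([[20, 30],
                  [35, 15]])
chi2, p_cat, dof, expected = stats.chi2_contingency(table)

# Ordinal: Wilcoxon signed-rank test on paired rankings.
before = [3, 1, 4, 2, 5, 2, 4]
after  = [2, 2, 5, 3, 4, 1, 3]
w_stat, p_ord = stats.wilcoxon(before, after)

# Interval: t-test comparing the means of two samples.
group_a = np.random.normal(loc=10.0, scale=2.0, size=30)
group_b = np.random.normal(loc=11.0, scale=2.0, size=30)
t_stat, p_int = stats.ttest_ind(group_a, group_b)

# Count data: a Poisson model, e.g. the probability of seeing
# exactly 3 events when the mean rate is 2.5 per interval.
p_three = stats.poisson.pmf(3, mu=2.5)
```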

Existing science isn't as deliberate as a current systematization, as Monday-morning quarterbacking/textbook-writing may make it seem. It's more incremental, filling in gaps as needed rather than laying out the system ahead of time. You have a problem and you use a tool that works well enough right now, and you develop that tool incrementally until it metastasizes well beyond its initial conception. Contingency tables are great ways of summarizing tabular data, but you may want to do a significance test like all the t-test guys.

---

Any kind of systematization is an oversimplification, forgetting possibly irrelevant details to make different things look alike, and placing a particular item into that systematization is also forgetting possibly irrelevant details to make it look like one of a few categories. But sometimes those details are not so irrelevant.

Binary data is a subset of nominal data, with just two categories. Two-by-two contingency tables and logistic regression are especially designed to deal with them. Some multinomial categories will have some minimal relationship, say geographic location for countries, or wavelength for colors (colors are very complex because the brain processes them through multiple systems involving wavelength, opponent-process pairs, and beyond). Rank data is ordinal by definition, but when encoded as numbers, can be processed as interval or even ratio data (depending on the interpretation desired).

These four data types work very well for statistics. But it seems underspecified. We're used to measuring quantities or counting objects so all those categorical and interval methods apply so well. But there's so much more structure to the way things can be measured. Not humanities-style vague, wordy, qualitative description. Perfectly exact, just not necessarily a number.

There is an existing method for description of data. A very rich description method. It's mathematical notation. If data should be treated continuously, use R. If a vector over integers, Z^n. If an ordinal set, then that's a total order. If categorical, then you have a simple set. If the elements are related to each other one to one but in a complex, restricted manner, then maybe a graph is the way to notate things. If the elements allow certain operations but not others, then maybe it's from a particular algebra, a Hilbert algebra, or instead a Banach algebra.

Measurement is not always in the elementary numbers we count or measure or weigh with. There can be quite a bit more structure in the measurements than just a number.

There are no exact synonyms

There are no exact synonyms.

That may sound a little extreme, especially given that thesauruses exist.

There is no pair of words where one can replace the other in all circumstances.

'Bail out the canoe with a bucket'

Can you replace 'bucket' with 'pail'? Of course. But can you say 'kick the pail'? No, of course not, that would be wrong. You can't always replace a word with its purported synonym.

Well, OK, there are some circumstances where there are exact synonyms. In technical circles, especially the sciences and math, there is a special way of attaching a word to a definition. In technical areas one 'stipulates' a definition of a word. That is, you give a word a definition that is simply shorthand: the word is an exact replacement for its definition. You're stating authoritatively that the word must be treated as that replacement. Often these technical terms are supposed to be evocative or metaphorical, supposed to give you a good idea of the intended meaning. You can keep whatever mental connotations help you remember the true meaning, but the true meaning is what has been stipulated; the connotations don't matter. A = B, and that's all there is, no more, no less.

But with non-technical words, there is no stipulation. A word is just a trigger for some associations. And if it sounds different, then there is no way it can be identical in all situations. Different stimuli can give different responses.

I will even go so far as to say that even a given word is often not its own synonym, because all words have multiple meanings. I'm not even talking about homophones (words that are spelled differently but sound the same, like 'horse' and 'hoarse') or, on the other side, homographs (words that are spelled the same but can have different pronunciations and meanings, like 'bow', a knot in a ribbon or tie, and 'bow', the front of a ship). I mean a word that is spelled and pronounced the same but has a different but related meaning. For example, 'run' is a verb meaning to move fast on your legs, but is also a noun for a long rip in a stocking or a small stream.

The point is that if you desire a synonym, you can get one from a thesaurus, but it may not slot in perfectly as a replacement. And even a single word may have so many associations and alternate meanings that it is not a good fit in that slot itself.

Thursday, May 5, 2016

Free will?: Science assumes determinism

Whatever the philosophical decisions made about free will versus determinism, science (or factual knowledge) attempts to discover everything that is deterministic, and to that end almost assumes determinism.

The only thing counter to this presumed determinism is its literal negation, non-determinism, which is modeled using probability. And probability is just shorthand for what we don't know or can't control yet. This applies all the way from physics to sociology.

Wednesday, May 4, 2016

What's the point in these theorems?

Sometimes math is weird. Often you know exactly why a particular math thing is interesting. Like it's so obvious that algebraic geometry is there to help figure out where really weird polynomials intersect. But other times, even for simple things for which there's lots of research and historical precedent for concern, I just don't get it. Here's a list of things I just don't get. I don't understand the point of pursuing them. I understand the mathematical process, I just don't get the point:

  • Craig's Interpolation Theorem in logic, if a implies c, then there exists b such that a implies b and b implies c and b only involves the intersection of vars from a and c
  • Curry's paradox and Löb's theorem - I have trouble following the elementary proofs of these. They seem to say you can prove anything: 'if this sentence is true then Santa Claus exists' proves that Santa Claus exists, or something like that
  • Herbrand's theorem - proves universals using examples?
  • the Deduction theorem - it just seems so obvious. It's just Modus Ponens, right?
  • quadratic reciprocity - tells you when square roots exist in modular arithmetic (stated below). Why you would want to do that, I don't know
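For reference, here's the statement of that last one, in the standard Legendre-symbol form:

```latex
% Law of quadratic reciprocity, for distinct odd primes p and q:
\left(\frac{p}{q}\right)\left(\frac{q}{p}\right)
  = (-1)^{\frac{p-1}{2}\cdot\frac{q-1}{2}}
% where (a/p) is the Legendre symbol: +1 if a is a nonzero square mod p,
% -1 if it is not.
```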


I want to understand these things, and I can (usually) follow step by step manipulations, but I just don't get what they are for and what the point is.


Two-pass forward-backward approximation systems

A number of approximation methods work in a two-stage cycle: a forward pass to compute the system's current output on a test input, and then a backward pass to update the system with the error of that output with respect to the supervised true answer.

In the backward pass, an update function moves in the opposite direction of the edges, updating weights as it goes along.

Neural Networks / Deep Learning - a directed (usually acyclic) graph with weighted edges used to compute a function. In the forward pass, starting from the input nodes (no in-edges) and proceeding in topological sort order, the value at any node is computed as the dot product of the values at the source nodes and the edge weights, passed through a simple threshold function. This computes the values at the output nodes (no out-edges). Traditional neural nets have a layer of input nodes, a single layer of hidden (interior) nodes, and a layer of output nodes, with no edges directly from input to output. Deep neural nets have more layers. Arbitrary undesigned graphs are usually not very successful.
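Here's a bare-bones sketch of that forward pass in NumPy (made-up weights, a sigmoid standing in for the 'simple threshold function', and a 3-4-2 layer topology chosen arbitrarily):

```python
# Forward pass of a tiny layered network: the value at each node is the
# dot product of source-node values and edge weights, pushed through a
# threshold-like (sigmoid) function, layer by layer in topological order.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """x: input vector; weights: one weight matrix per layer."""
    activation = x
    for W in weights:
        activation = sigmoid(W.dot(activation))  # dot product, then threshold
    return activation                            # values at the output nodes

# Made-up example: 3 inputs -> 4 hidden nodes -> 2 outputs.
np.random.seed(0)
weights = [np.random.randn(4, 3), np.random.randn(2, 4)]
print(forward(np.array([0.5, -1.0, 2.0]), weights))
```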

Expectation-Maximization - approximation of the parameters of a statistical model. First the expected value of the (log-)likelihood function is calculated under the current parameter estimates (the E step), then the model parameters are recalculated to maximize that expectation (the M step).
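A sketch of that E/M cycle on a toy case, a two-component 1-D Gaussian mixture with the variances fixed at 1 to keep it short (the data and initial guesses are made up):

```python
# EM for a 1-D mixture of two Gaussians with variance fixed at 1:
# the E step computes expected component memberships (responsibilities),
# the M step re-estimates the means and mixing weight to maximize the
# expected log-likelihood given those memberships.
import numpy as np
from scipy.stats import norm

np.random.seed(0)
data = np.concatenate([np.random.normal(-2, 1, 200),
                       np.random.normal(3, 1, 200)])

mu1, mu2, pi = -1.0, 1.0, 0.5               # crude initial guesses
for _ in range(50):
    # E step: responsibility of component 1 for each data point.
    p1 = pi * norm.pdf(data, mu1, 1.0)
    p2 = (1 - pi) * norm.pdf(data, mu2, 1.0)
    r1 = p1 / (p1 + p2)
    # M step: update parameters using the responsibilities as soft counts.
    mu1 = np.sum(r1 * data) / np.sum(r1)
    mu2 = np.sum((1 - r1) * data) / np.sum(1 - r1)
    pi = np.mean(r1)

print(mu1, mu2, pi)   # should land near -2, 3, and 0.5
```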

Primal-dual linear programming - maximizing an objective function restricted by a set of linear constraints. The difficulty is dealing with the possibly large set of constraints and large set of dimensions.

Kalman filter - successive measurement refinement. This is usually applied to position measurement with slightly fallible sensors. From an initial (fuzzy) position and direction of an object at time t, the position/direction at time t+1 is predicted by combining the prediction of movement from time t with a fuzzy sensing at time t+1. Combined, the variance is lessened.
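And a 1-D sketch of that predict/update cycle (position only, constant assumed velocity, invented noise levels and sensor readings):

```python
# One-dimensional Kalman filter: predict the position forward in time,
# then blend the prediction with a noisy measurement; the blended estimate
# has lower variance than either the prediction or the measurement alone.
def kalman_step(x, P, velocity, measurement,
                process_var=0.1, measurement_var=1.0):
    # Predict: move the estimate forward; uncertainty grows.
    x_pred = x + velocity
    P_pred = P + process_var
    # Update: weight prediction vs. measurement by their variances.
    K = P_pred / (P_pred + measurement_var)   # Kalman gain
    x_new = x_pred + K * (measurement - x_pred)
    P_new = (1 - K) * P_pred                  # variance shrinks
    return x_new, P_new

# Made-up example: the object really advances about 1.0 per step.
x, P = 0.0, 1.0
for z in [1.2, 1.9, 3.1, 4.05, 4.9]:          # noisy sensor readings
    x, P = kalman_step(x, P, velocity=1.0, measurement=z)
print(x, P)
```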

Are any of these even more alike than just the forward-backward pattern? Are there any other algorithms that are superficially similar, that have a two-step iterative process?

Update (5/24/2016): this blog post seems to give pointers to how backprop, primal-dual LP, and Kalman filters are interderivable

Sunday, May 1, 2016

My pet language peeves/non-peeves


It really bugs me when other people say (my inner prescriptivist):
  • 'often' pronounced 'off ten' instead of the correct 'off en'
  • 'Between you and I' instead of the correct 'between you and me'
  • pronouncing 'forward' as 'foh ward' (no first 'r')
  • 'comparable' pronounced 'com `pair able' not ' `com pruh ble'
  • Dwarfs roofs baθs instead of dwarves, rooves, baðz
  • pronouncing 'processes' as 'prah cess eez' instead of the correct 'prah cess ehz'
  • 'my bad' and 'back in the day'
  • whilst/amongst
  • Nutella as New-tella. I prefer  Nuh-tella

Conflicted
  • 'irregardless' - obviously a sign of not caring about words, but it still takes me a second to register it as 'wrong'

It really doesn't bother me at all to say:
  • 'Hopefully' to modify a sentence
  • Feb you ary

I only recently learned how to pronounce correctly:

  • awry. I used to say 'AW-ree' instead of 'uh-WRY'


Wow, is that it? I was sure these lists would be a lot longer.