Monday, August 14, 2017

The Great English Muffin Shift

The Americans and British are separated by a common language. This has been attributed to Churchill, Shaw, and Wilde, all of whom stole from the best, but has never been attributed to Mencken, who should have said it but said other things instead.

There are the differences in pronunciation (Americans pronounce all 'r's, and Brits take a royal 'bahth') and grammar (Americans go to the hospital, and Brits go to hospital), and there are all sorts of vocabulary differences: lorries and lifts and petrol.

But one primary difference is in the vocabulary of food. Zucchini/courgette, eggplant/aubergine, let's call the whole thing off. A number of baked bread products have different names in the two varieties. What's so special is that they form a chain, as though some higher force pushed in a word at one end of the sausage machine, forcing all the little sausages to move one sausage over, a Great English Muffin Shift. It goes like this:

A cookie in the US is a biscuit in the UK and
biscuit...scone and
muffin...scone, a slightly different kind of scone, and
muffin...fairy cake, a slightly different kind of muffin, and
English muffin...crumpet, because in the UK you're there already, so you don't need to specify English.

What 'cookie' means to Brits, and 'crumpet' to Americans, I don't know. Yes, the sausage machine seems to go in reverse there and then start forward again; sometimes the machinery gets stuck.

There's also the Great Fried Potato Migration: what are called 'fries' in the US are called 'chips' in the UK, and 'chips' in the US are called 'crisps' in the UK.

As far as I can tell, 'crisps' means nothing to an American beyond 'you must be talking about something crispy, but why would you call it that directly?' And 'fries' to a Brit must elicit a 'Pardon me, but fried what?'


Friday, August 11, 2017

Taxonomy of Chatbots

Chatbots are a recent trend in user interfaces. In contrast with a two-dimensional visual UI, a chatbot is a linear, time-based interface: the user performs an action, the system responds, the user may act further, and so on. The term 'chatbot' comes from a typed 'chat' system that acts like a Turing test robot in an online question-and-response sequence. Some of the things that are called 'chatbots' don't superficially seem like this (they don't all attempt to be linguistic systems), but they are all a linear action-response loop, which seems to be the defining characteristic.

By recent trend, I mean, as usual with technology, they've been around forever (Weizenbaum's ELIZA Rogerian psychotherapist from the mid 60s, phone menus or IVR from the 70s). But as of 2017, there is an explosion of available chatbot technology and, orthogonally, chatbot marketing.

The point here is to give a superficial systematization of the different things labeled 'chatbot', with examples.

There are two distinguishing characteristics of chatbots, both only leniently considered defining: sequential response and natural language input (either by text or speech). Combined, these might more formally be called a Linguistic User Interface (LUI), in contrast with a graphical user interface (GUI). The natural language part underlying many of these is some kind of speech-to-text (S2T) mechanism to get words from speech, plus some NLP to match the words to the expected dialog. The leniency is needed because 'sequential' may come down to a single step (the shortest of sequences, possibly not even considered a sequence at all) and 'language' may come down to very little (a label for a button is language, right?). With those caveats, on to the taxonomy.

  • linguistic interfaces
    • Siri/Alexa/OK Google - intent/entity/action/dialog. Stateless, giving commands to evoke an action. Development of the system involves specifying: an 'intent', something that you want to happen; the entities involved (contacts, apps, dates, messages); and actions (the code that really executes based on all that information). Oh, and the more obvious thing: a list of all the likely varieties of sentences that a person could utter for this. The limitation is that there is no memory of context from one action request to the next.
    • chatroom bots - listeners in a chatroom (mostly populated by people writing text). 
      • helper commands - This kind of chatbot simply listens to text and, if a particular string matches, executes an action. This doesn't need S2T, and usually needs no NLP. It relies on text pattern matching (usually regexes) to extract strings of interest. Usually it turns out the implementation is even simpler and just uses a special character to signal that a command for a CLI (command line interface) follows.
      • conversational bot
        • Like ELIZA, finds keywords or more complicated structures in a sentence and tries to respond to it in a human-like fashion (good grammar, makes sense). The latest ML and machine translation techniques (RNN, LSTM, NER) seem to apply best here.
        • 'AGI' - artificial general intelligence - these exist only in TV/movies.
  • menu trees - structured tree-like set of possibilities, 'Choose Your Own Adventure'. These are very much like (or exactly) finite state automata, where the internal state of the machine, presumably but not necessarily mirroring the mental state of the user, is changed by a simple action of the user. The user is following a path through the system.
    • phone menus - Historically, these are menus, a set of choices, spoken to you, expecting a response of a touch-tone number (Dual-Tone Multi-Frequency, or DTMF). A recording lists a number of options and the phone user is expected to press the number associated with one of them. Then another set of options is provided, and so on until an 'end' option is chosen or you're transferred to a human operator. Interactive Voice Response, or IVR, is this same interface, also allowing responses by voice. A next level of feature augmentation is to allow the user to speak a sentence to go to the desired subtree quickly, skipping over some steps. This shows how the strict computery menu as implemented on a phone is slowly evolving towards a conversation.
    • app workflow - some desktop/phone apps offer an interface that leads you through data entry sequentially. The user is provided with a set of labeled buttons, and the choice of button determines the next question. Instead of buttons, one might enter some short text, but again this can lead to different new questions from the interface. The text is not intended to be a full sentence, but simply a vocabulary item, allowing a more open-ended set of possibilities than a strict set of buttons without the necessity of parsing. This is the least chatty of chatbots, but like the phone menus it may be considered a sequential but non-linguistic UI that is a precursor to a more language-based one.
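
To make the finite-state reading of menu trees a little more concrete, here is a minimal sketch in Python. The menu contents, the state names, and the `run_menu` helper are all hypothetical, invented for illustration; a real phone system is certainly more involved. The only point is that each key press moves the machine from one state to the next, and the user traces a path through the tree.

```python
# Hypothetical phone-menu tree modeled as a finite state automaton.
# Every state name and option here is made up for illustration.
MENU = {
    "main": {"1": "billing", "2": "support", "0": "operator"},
    "billing": {"1": "balance", "9": "main"},
    "support": {"1": "outage_report", "9": "main"},
    # Terminal states have no outgoing options.
    "balance": {},
    "outage_report": {},
    "operator": {},
}


def run_menu(keys, start="main"):
    """Feed a sequence of key presses to the menu; return the states visited."""
    state = start
    path = [state]
    for key in keys:
        options = MENU[state]
        if not options:
            # Terminal state reached: ignore any further key presses.
            break
        # An unrecognized key leaves the state unchanged (the menu repeats).
        state = options.get(key, state)
        path.append(state)
    return path


if __name__ == "__main__":
    print(" -> ".join(run_menu(["1", "9", "0"])))
```

Letting the user speak a sentence to jump straight to a subtree would amount to adding transitions from "main" directly to deeper states, which is one way to see the menu drifting towards a conversation.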

It seems strange to call all these bots. I find it natural to call only the conversational bots by the label 'chatbots'. It turns out that marketers have used the term 'chatbot' for all of these. They surely all share some aspects of a chatbot, but it doesn't feel like the name until you're actually chatting.

Wednesday, August 9, 2017

Butterfly in all the languages of the world

Etymologically, some words are universal. The word 'mother' seems to have some version of an 'm' word in every language (despite the counterintuitive experience that 'm' is not usually the first linguistic sound an infant learns to make).

Some words will stay mostly the same within a historical group: pronouns and numbers tend to maintain meaning through centuries of phonetic changes.

Some words are unique to one language when other languages in the family keep the generic. 'Dog' in English is unique to English, but 'hound', from the Indo-European 'hund' (GE)/'canis' (LA)/'sag' (PE) remains elsewhere.

But are there words, or rather concepts, that are unique in every language? That is, is there a concept such that in every language, the word for the concept is unique to that language and not shared by others?

(If the idea that concept and word are not the same bothers you because, well, a word says what its concept is, then the following should convince you otherwise. Wait...instead just consider that a language foreign to you has mostly different words from yours for the same concepts. Therefore words and concepts are not the same. Anyway, on to the main topic...)

Consider the word 'butterfly'. Sorry, consider the insect that in English is referred to as 'butterfly'. In English it is called ... yes, yes, I just said it. It's the usual English word made of two words. 'Butter' and 'fly'. There are all sorts of etymological theories:

  • the insect is a fly the color of butter (some very particular species I presume)
  • they hang out near butter
  • they literally 'flutter by' and people are goofy and pulled a spoonerism
  • the word was borrowed from the Dutch, who called it 'boterschijte' or, translated back, 'butter shit', because the insect's shit looks like butter (again presumably for some particular species whose shit I have not seen).
All somewhat sounding a little too convenient, like folk etymologies rather than scholarly exegeses. Except that Dutch one. Where did that come from?

But that's just English. The fun thing is that most languages have their own strange fancy word for 'butterfly', seemingly not borrowed from any other nearby language.
  • Romance
    • Latin: papilio
    • Italian: farfalla
    • French: papillon
    • Spanish: mariposa
    • Catalan: papallona, parpalhòla
    • Portuguese: borboleta
    • Romanian: fluture
  • Germanic
    • German: Schmetterling
    • Dutch: vlinder (note not boterschijte)
    • Danish/Norwegian: sommerfugl
    • Swedish: fjäril
    • Icelandic: fiðrildi
  • Slavic
    • Bulgarian: peperuda
    • Serbian/Croatian/Bosnian: leptir
    • Czech/Slovak/Polish: motýl
    • Belarusian: matyliok
    • Ukrainian: metelyk
    • Russian: babochka
  • Celtic
    • Irish: féileacán
    • Scots-Gaelic: dealan-dè
    • Welsh: glöyn byw
For every one of these mostly distinct entries (yes, yes, Slavic has a couple of derivatives of 'motil', and Romance of 'papilionem') there is an obscure etymology, mostly made up, just like the English one. The German 'Schmetterling' seems to come from 'schmettern' meaning 'make a loud noise' or 'strike' (butterflies tend to be quiet) but 'schmetter' is from an older Saxon dialect word usage, having to do with milk products, following the old folk belief that witches fly about in the form of butterflies, in order to steal milk and cream. A bit fanciful and sounds like my great aunt made it up. But then 'schmetten' is a dialect word for cream, deriving from the Czech “smetana”. So it's obvious! Cream, butter, butterfly! Which is to say nothing is obvious and it all sounds made up.

The Irish 'féileacán' also has multiple explanations. Maybe it is from 'feileach' which means 'festive' (butterflies certainly are festive) or it could come from 'eitleach' for flying. A possible sound change but not borne out elsewhere in Irish.

So, what's the point? Take any language other than your own. Almost the definition of it being another language is that there's a different word for everything. But for 'nearby' languages, really most of the words are cognate, just changed slightly, and it is only a handful of words that stand out as being different (e.g. English vs Scots English). The point is that the animal called 'butterfly' in English seems to have few cognates even in nearby languages. What is the explanation? What makes those insects so special? And even if they are special (they are!), aren't there other animals that are as special? A bear is pretty special, especially if it's running after you.

The direction this is going in is that, of all the words in the world, 'butterfly' has no cognates among any languages. Looking at the list, that is obviously not true: motyl/matyliok, papilio/papillon/papallona, and others. But it does show that the word seems to vary quite a lot, as though a butterfly really brings out creative neologisms in everyone.

Linguistic note: I stopped at the European of Indo-European only because of familiarity and ease in checking. It would be instructive descriptive (that is non-theoretical) linguistics to investigate:
  • other close families like the many close languages of India, Indic or separately Dravidian, or Chinese
  • very close varieties (mutually intelligible dialects) to see if 'butterfly' is so volatile even in very close languages
  • compare other concepts in a structured manner, e.g. one-for-one against mother, five, dog, fly to see if butterfly really is special (or is it a pattern that's not really a pattern, and lots of other middlingly common words have a similar situation)
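
As a rough idea of what a structured comparison might look like, here is a small Python sketch. It is only a sketch under loud assumptions: it uses spelling distance (Levenshtein edit distance, normalized by word length) as a crude stand-in for cognacy, which a real historical linguist would not accept, and the function names are my own invention. The word list is the Germanic group from above; lists for 'mother', 'five', 'dog' and so on would have to be supplied and checked separately.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mean_pairwise_distance(words):
    """Average normalized edit distance over all pairs (0 = identical)."""
    pairs = list(combinations(words, 2))
    total = sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs)
    return total / len(pairs)


# 'Butterfly' in the Germanic languages listed above, lowercased.
BUTTERFLY_GERMANIC = ["schmetterling", "vlinder", "sommerfugl",
                      "fjäril", "fiðrildi"]

if __name__ == "__main__":
    score = mean_pairwise_distance(BUTTERFLY_GERMANIC)
    print(f"butterfly (Germanic): {score:.2f}")
```

A score near 1 means the words share almost nothing in spelling; running the same function over a list for a concept like 'mother' in the same languages would give a number to set beside it, for whatever a spelling-based number is worth.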

(OK, I lied at the beginning. 'Mother' is not considered a language universal by any linguist. It is certainly maintained as the main 'mom' word within Indo-European. But any 'm-' words in other languages are considered by linguists to be coincidences. There do seem to be some lexical universals over all human languages, but currently there is only considered to be one, 'huh?'...so far.)

Tuesday, August 1, 2017

Statistical Rumsfeld: Now We Know!

No, not a poor punk band name. Statistical Rumsfeld, popularized by his usage but not created by him, is a way of talking about what you know about your own knowledge:

...there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.

- things we know that we know them: this is data. We've looked, and seen, and we are aware that we've looked and seen and verified and removed doubt. Is it 'yes' or 'no'? Look at the thermometer.

- things we know that we don't know. We know we don't know what's behind the curtain. We know we don't know what the capital of Chad is. We know we don't know what somebody is thinking before they tell us (even sometimes afterwards). We know the boundaries of this darkness. We know the range of possibilities. This is like a probability density; we don't know the particular value of a coin flip, but we know that half will be one side and half the other. That's something.

- things we don't know that we don't know. We have no idea. We don't know how to look for the value, we don't know the distribution, we don't know what the range is, we don't even know if it's a number. Totally unexpected. A black swan.

Something is left out. You have things that you know and things that you don't know, and you can either know that or not. Two things, with two possibilities for each, four in total. The one that is missing is itself: unknown knowns.

- things you didn't realize you knew. You didn't know you knew that, did you? Unconscious knowledge. A hidden talent you weren't even aware of. The pattern in the data that was always there.

Or better, in a handy chart:

Do you know about them? | Things: Known | Things: Unknown
Known | Known Knowns: Facts, data | Known Unknowns: Parameters, Distributions, Probabilities
Unknown | Unknown Knowns: Unconscious knowledge | Unknown Unknowns: Hidden Variables, Black Swans