Friday, June 13, 2008

The invisible character bug

Computers suck. In this case: I have a file that ostensibly has really long lines. Or rather, the original data has really long lines, but in the copy I have, each line is split into pieces, with a '-' marking the split, followed by a newline.

e.g. a file like this:

dfasdfasdfaasdfsregaregeagrerg242342423ytuyutuy
qqweqweqweqsdadsasdasdasdzxczcxzcx

I get it like this:

dfasdfasdfaasdf-
sregaregeagrerg-
242342423-
ytuyutuy
qqweqweqweq-
sdadsasdasdasd-
zxczcxzcx

Fine. I can't just remove every newline, so the simplest sed one-liner won't work. But a little looking on the web turns up a collection of handy sed one-liners, which has exactly what I'm looking for and which I would never in a million years have figured out on my own:
# if a line ends with a dash, join the next line to it
 sed -e :a -e '/-$/N; s/-\n//; ta'
It looks for a dash at the end of the line (in sed fashion, the newline character is not part of the line). If it finds one, it appends an actual newline character -and- the next line to the pattern space; the following 's/...' then finds the '-\n' and removes it. A little 'goto'-ing, which I never knew existed in sed, then loops back to check the joined line again.
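
On input with plain Unix line endings, the one-liner behaves just as described. Here is the same script spelled out one command per line, as a sketch; the four-line sample input is made up for illustration, and the comments are my reading of each command.

```shell
# Made-up sample with Unix (LF only) line endings: "ab-", "cd-", "ef", "gh"
joined=$(printf 'ab-\ncd-\nef\ngh\n' | sed -e '
# label to loop back to
:a
# if the pattern space ends with a dash, append a newline plus the next input line
/-$/N
# delete the dash and the embedded newline, joining the two pieces
s/-\n//
# if that substitution changed something, jump back to :a and check again
ta
')

printf '%s\n' "$joined"
```

The first three input lines come out as one joined line, and "gh" passes through untouched.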

Great. Except it doesn't work. Why not? Because... well, before the explanation, I have to complain about the hours and hours (well, 3) that I spent 'debugging by permutation': trying every small change I could think of, in case it was a different shell, or a slightly different sed version, or whatever. OK, that's enough... on with the solution...

As in all the Sherlock Holmes stories, there's a tiny bit of information that the author doesn't tell you until the very end, and which, had anybody known it, would already have solved the problem... The file I received was in -MSDOS- format, meaning simply that each newline is denoted by -2- characters: carriage return -and- line feed (\r\n, or \x0d \x0a).
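
For what it's worth, the carriage returns can be made visible. A minimal sketch, assuming a made-up two-line sample built with printf (the name dos_sample is hypothetical): od -c dumps the data character by character, and grep counts the lines that contain a carriage return.

```shell
# Hypothetical two-line sample with DOS (CRLF) line endings
dos_sample() { printf 'abc-\r\ndef\r\n'; }

# od -c prints every character; each carriage return appears as \r before \n
dos_sample | od -c

# Count the lines that contain a carriage return.
# Command substitution strips only trailing newlines, so the \r survives in $cr.
cr=$(printf '\r')
crlf_lines=$(dos_sample | grep -c "$cr")
echo "lines with CR: $crlf_lines"
```

A count above zero on a file that is supposed to be plain Unix text is a strong hint about what went wrong.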

So the sed never got anywhere: with the carriage return sitting between the '-' and the line feed, the '-' was not actually the last character on the line, and there was no '-\n' to find and remove. It really needed to look for '-\r\n'.
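
One way around it, sketched here on the assumption that the carriage returns are not wanted in the output at all: strip them with tr first, then run the one-liner unchanged. The function names and the printf-built sample are hypothetical, made up to mirror the example above.

```shell
# Hypothetical sample: the split file from above, with DOS (CRLF) line endings
dos_sample() {
    printf 'dfasdfasdfaasdf-\r\nsregaregeagrerg-\r\n242342423-\r\nytuyutuy\r\n'
    printf 'qqweqweqweq-\r\nsdadsasdasdasd-\r\nzxczcxzcx\r\n'
}

# Hypothetical helper: join lines that were split with a trailing '-'
join_split_lines() {
    # Drop every carriage return so the '-' really is the last character
    # on the line, then apply the original one-liner as-is.
    tr -d '\r' | sed -e :a -e '/-$/N; s/-\n//; ta'
}

joined=$(dos_sample | join_split_lines)
printf '%s\n' "$joined"
```

This gives back the two long lines, with Unix line endings. If the output has to stay in DOS format, the carriage returns would need to be put back afterwards.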

That is, an invisible character. You can't see it, but you have to know about it to solve the problem correctly. In my very dim memory of the far past, it seems like this used to be a 'joke' bug, something unknowable to pin the blame on (because you can't -see- it) when the bug is probably really a thinko.

Anyway, hours wasted on trivialities.

That is all.
