Dec 19 2013

Tell the Story

I love stories. I love telling stories, and I love listening to stories. I learn from stories. I believe that we, as a programming community, don’t tell enough personal stories around the campfire.

Some may think it heresy, some may think “well duh”, but I believe that the programming profession is still really primitive. At the level of stone knives and bear skins. I air quote around the “engineer” part of Software Engineer. We employ voodoo (How many times have you restarted Xcode to fix a Weird Mysterious Problem? And it worked!).

Fundamentally we’re still dealing with punched cards. I can easily imagine taking my Objective-C code and xib file XML, putting them on cards, and feeding them linearly into a room-filling mainframe for compiling and linking. Sure, our card punches are much better than they were before (IDEs with refactoring), but it’s still just productivity increases within an order of magnitude or two. A great deal of software development’s difficulty comes from mapping this linear sequence of the lines of code in our card decks to the complex multidimensional pathways of the machine’s behavior at runtime. Do I have any ideas for making software development better that aren’t just a slightly better card punch? Unfortunately, no. Just grousing.

As an industry we keep reinventing stuff thinking it’s new. The mouse was introduced in 1968. Code Contracts looks remarkably similar to Design By Contract from the Eiffel language, which I read about in the late ’80s from the first edition of Bertrand Meyer’s book Object-Oriented Software Construction. There’s always a new silver bullet. Maybe structured programming, object-oriented programming, 4GLs, functional programming will save us. Functional programming, one of the newest development darlings, can trace its lineage back to the late ’50s. What’s old is new again.

I Love To Hear The Story

So, back to the stories. The main way in pre-literate societies to disseminate things like knowledge, culture and codes of behavior is through an oral tradition. I’ve learned a lot of my interesting esoterica listening to the war stories of other programmers. Everybody has a story about That Bug that took a month to track down and was fixed in a dozen lines of code. My favorite story that I tell with my “Thoughts on Debugging” talk is two programmers fixing a show-stopper bug over 24 hours, with the fix being a two-character change in the source code that resulted in a two-bit change in the compiled program. I’ll tell y’all this story next year. I’m amazed and dismayed that in the current state of software development a two-bit change in a multi-megabyte program could have major repercussions.

My favorite interview question is “Hey, what’s your favorite, or at least most interesting, bug?” These kinds of stories are fascinating. You can often get an insight into the candidate’s personality and thought processes (and communication skills, too). If nothing else, it’s fun to see an otherwise-introverted nerd totally light up, telling tales of the latest hunt. Outside of an interview situation, these kinds of stories are ways of passing small, but powerful, units of knowledge between members of the same tribe. “Here is a specific thing that cost me some of my life, and here’s what I learned from it.”

I was sitting at a table with Mike Ash once, a long time ago at a conference or an NSCoderNight, and I asked him if he had any good bug stories. He regaled me with a tale involving a pointer assignment to BOOL slicing off the bottom byte and turning a true value into falsehood. That story started my mini-crusade to let the world know about that dark corner of Objective-C. To me, it’s wasteful to burn time on a problem like that. Visualize your most respected programmers. Now imagine what they could accomplish if they didn’t have to spend significant portions of their lives chasing down bugs like “This value is false only if this string pointer is page-aligned.” We can start by informing others, and perhaps eventually the problem will get fixed by those with the power. (Which happened in 64-bit iOS. I don’t presume to take any credit for it, but yay!)
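For anyone who hasn’t been bitten by this one, here is a minimal sketch of the mechanism in plain C. It assumes the classic definition of BOOL as a signed char; the names `LegacyBOOL`, `has_value_buggy`, and `has_value_fixed` are made up for illustration and are not from Mike’s story. The addresses are passed as integers so the example does not depend on where anything actually lives in memory.

```c
#include <stdint.h>

/* Assumption: models the classic Objective-C BOOL, which was a signed char. */
typedef signed char LegacyBOOL;

/* Buggy: squeezing an address into a one-byte type keeps only the low 8 bits.
 * (Strictly, an out-of-range conversion to a signed type is
 * implementation-defined; common compilers simply truncate.) */
LegacyBOOL has_value_buggy(uintptr_t address) {
    return (LegacyBOOL)address;
}

/* Fixed: compare against zero explicitly, so any non-null address is true. */
LegacyBOOL has_value_fixed(uintptr_t address) {
    return address != 0;
}
```

An address like 0x1004 survives the buggy version because its low byte is nonzero, but 0x1000 comes out as 0, which is false. Any address whose low byte is zero trips it, and that includes every page-aligned one. A real C99 `bool` does not have this problem, because converting to it compares against zero rather than truncating.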

I have a friend, Jeff Szuhay. He was a regular CocoaHead from the group’s inception in Pittsburgh until he had to move out of state for a new job. He decorated his cube with a strand of Christmas blinking lights, but he would only plug them in if he solved a really tough bug or implemented a significant feature. You knew there was an interesting story to be heard if you saw the lights on in Jeff’s cube. I learned a lot about debugging technique and software design by hearing Jeff’s latest victories. Because most everybody eventually stopped by to hear the story, he was able to communicate “these are issues with our product that we can fix” kinds of ideas which might have otherwise been ignored in blanket email form.

This transfer of little chunklets of knowledge is why I volunteer as a debugger and make myself available whenever I visit a Big Nerd Ranch location. I’ve had the pleasure of going on some pretty wicked bug hunts. When you’re disassembling chunks of Core Data, or having to turn on slow animations in the simulator to even begin tracking down the problem, you know you’re going to learn something when it’s all over.

This is also why I’ve been writing the occasional blog post. A couple of recent posts are debugging war stories. I’ve got a lot of stories in my head, and it feels good to get them out of there to make room for more.

I totally encourage you to tell your story, either here in the comments, on your blog, or at the table at the next conference or NSCoderNight you go to. What is your Epic Bug? What did you learn from it, and what can the rest of us learn from it too?

4 Comments

  1. SysVr4

    My favorite bug story is, unfortunately, not one where I am the protagonist.

    I believe it was the summer of 1997. My best friend and I, both computer science students, were running a small, regional ISP. Thanks to the Telecommunications Act of 1996, we also started a nascent CLEC to feed lines to the ISP. But apparently to run a phone company, you actually need a phone switch. Or at least, that was the case at the time.

    Not having a cash hoard sufficient for the likes of Nortel and Lucent telephone switches (think mid-seven figures), we opted for a very small outfit out of Jackson, TN named DTI to provide our switch. They had a small (as in, hundreds of trunks, not thousands) Class 4 switch which could do SS7, feature group D, PRI, etc. It was perfect for our needs and fit the budget so we took a leap of faith that between us and their engineers we could actually get it to work. A still-very-expensive leap of faith, but one that would only enslave us for 1 or 2 lifetimes if it didn’t work.

    Months later, the switch is installed and everything goes surprisingly smoothly. All the lines came up, SS7 dialogue progressed swimmingly between our baby switch and the monster DMS-250 and Lucent 5e’s in the Bell NOCs, etc. So without further ado, we throw the proverbial switch and go live, sending customer phone calls through our new DTI switch to the modem banks. As I recall, that was a positively fine time for prayer.

    In the subsequent days, we received many compliments from customers on the stability of their connections, improved connection speeds, etc. Everything seemed to be going absolutely according to plan, which should have been the first indication that, er, “stuff” was about to get real. It was about that time that we noticed multiple T1 trunks each dumping every call they carried (23 or 24 each, depending on configuration) simultaneously. Awesome. This is the stuff that just makes a customer’s day, let me tell you.

    Wait, it’s OK, the line comes back up, and look here… calls are coming back already, no problem. Then some short, seemingly random amount of time later, boom, calls dead again. We had the wherewithal to determine this was only on the trunks connected to a local Lucent 5e switch. But that’s about where our debugging “expertise” stopped… time to call DTI.

    Enter Mr. Liu.

    Mr. Liu was an engineer of DTI who was deeply involved in all the code in that switch. He spoke considerably more assembly than he did English, but if you had to pick one person to come to your location to debug something, you’d pick Mr. Liu. I hope every day that Mr. Liu is now a citizen of the US so that we don’t have to worry about guys with his level of talent hacking us from mainland China. But I digress.

    Now, I should tell you at this point that there are three likely reasons why this switch was within our price range:

    1. It was very small
    2. DTI was a small and relatively unknown company
    3. The interface was complete crap

    In terms of routing, programming, or debugging, virtually everything you did was one in a series of textual commands, mostly in hex, through a 9600 8N1 serial interface. Debugging? You had one option. You simply turned on debugging for one trunk or another and you got the live hexadecimal output of that trunk (or trunks). Think, The Matrix, only, the bytes were flowing by much, much faster.

    So Mr. Liu is sitting at the terminal for our new switch, with the debugger on, watching the hex scroll by. Mostly we’re just waiting for the “event” to take place again and drop the calls so that he can break out the whoop ass. He’s calmly reading the bit leaves flowing by on his screen, and I’d swear he could tell you what each of them meant in real-time if we had asked. Then suddenly, it happened – the calls dropped.

    He scrolled back up in the terminal program, read over a few lines of gibberish-looking, 0xOMGWTF-style code, nodded, and pointed to a particular byte (or two?) on the screen. He turned to us and said in broken English (paraphrasing): “There… bug in 5e SS7. We fix for you.”

    Apparently there was a minor bug in the SS7 implementation of the Lucent 5e (which DTI later proved to us was contrary to the official protocol), which caused a byte or two in a dense signaling stream to go “wonky”, to use a technical term. That byte or two (and I’d give anything if I could remember exactly which they were) caused the DTI switch to think the line was down and all current calls were dropped.

    Knowing it would be impossible to get behemoths like Lucent and Bell to fix their switches in a timely fashion, DTI coded a workaround for the Lucent bug which had us up and running mere days later. I don’t remember exactly how long it took, but it was fast enough that Mr. Liu probably called in the machine code himself on the drive home.

    Mr. Liu, wherever you are, I hope you’re doing well. Oh, and what is DEADBEEF in Mandarin anyway?

  2. Another cool story – Description of reverse engineering a phototypesetter in the Bell Labs unix days – http://www.youtube.com/watch?v=CVxeuwlvf8w
