Heisenbug

We all know what a heisenbug is.

Well, computer programmers know all about them. C/C++ programmers especially, and doubly especially users of Microsoft's Visual Studio. You write a bunch of code; you test it and check it and test it and test it some more. And then you create a production version, and BOOM! – it doesn't work.

Although named for Werner Heisenberg, who gave us his eponymous Uncertainty Principle, a heisenbug is more properly an example of the "observer effect". In many cases, you can't measure something in a system without affecting that system. For example, stick a big thermometer in a small sample of fluid, and the fluid will quickly come to the temperature of the larger thermometer. Which is probably not what you had in mind.

Likewise in computer programming. In Visual Studio, for example, compiling in Debug mode means all compiler optimizations are turned off. Switch to Release mode, and they're turned on – and the code starts taking new twists and turns. Sometimes that leads to different behavior. Go back to Debug mode in order to try to find the problem...and the problem goes away.

Here's what that's like:

Smacking one's face on the keyboard is not likely to improve the situation. But when you are in it, you just don't care.

I had one once that was actually a lot of fun, and taught me a lot.

This was back in the 1970s again. The same Honcho from the Tale of the Line Printer came to me and said he had a crazy bug that he just couldn't find. My task: "Find it."

The problem was a minicomputer assembly-language program. It was running like gangbusters, except for the part where you needed to input some data from the console Teletype. So, I got myself set up. He gave me the fanfold printout of the most recent compilation; I double-checked the contents of a few locations in memory (using, by the way, that same Teletype) against the printout, just to make sure the binary matched the printout.

All was cool.

When I ran Bruce's code, it behaved as advertised: It did what it was supposed to do, except for the part where it seemed to be just flat out ignoring the Teletype. My first thought was to make sure the program was actually reaching the "input from Teletype" routine. So I started patching the code so that I could insert some monitoring statements.

Assembling the code was a bit of a pain. You needed to create an edit deck on a keypunch, find the right reel of tape with the source code, mount it on our administrative machine, do the edit, reassemble, relink, dump the binary onto another tape, rewind and remove the tape, mount it on the target machine, and load it. So we didn't do that. Instead, we patched the code.

Patching meant finding the problem, coming up with the right code (I would write that code onto the fanfold paper with red ink, making it easier to find when, sometimes weeks later, we'd create the edit decks and reassemble), assembling the code by hand (most of us knew by heart most of the octal codes for most of the machine language instructions), writing the patch and its opcodes onto the fanfold paper (this time in black ink), and then using the monitor program on the Teletype to put those octal instructions into the right locations.

If, as was usual, the fix was too large to fit into the space where the original code lived, you put in a jump to an unused area, put the new code in there, and then jumped back to working original code. Patching: The name of the game when edit-and-assemble meant an hour or so of administrative work.
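For anyone who never had to do this by hand, here is a minimal sketch of the jump-out-and-jump-back trick in Python. Everything in it is made up for illustration: the tuple instructions, the addresses, the MARK "did I get here?" instruction, and the assumption that every instruction fits in one word. It shows the shape of a patch, not the real octal.

```python
# Toy model of patching a program in memory by hand.
# Instruction names, addresses and the one-word-per-instruction layout
# are hypothetical; this is not real minicomputer machine code.

def apply_patch(memory, site, patch_area, new_code):
    """Divert execution at `site` through `new_code` placed at `patch_area`."""
    displaced = memory[site]                      # instruction about to be overwritten
    body = list(new_code) + [displaced, ("JMP", site + 1)]
    memory[patch_area:patch_area + len(body)] = body
    memory[site] = ("JMP", patch_area)            # one-word detour into the patch
    return memory


def trace(memory, start, steps):
    """Follow the program counter for `steps` instructions, obeying JMPs only."""
    pc, visited = start, []
    for _ in range(steps):
        op = memory[pc]
        visited.append(op)
        pc = op[1] if op[0] == "JMP" else pc + 1
    return visited


memory = [("NOP",)] * 16
memory[0:3] = [("LDA", 5), ("SENSE", 1), ("STA", 6)]

# Insert a marker just before the SENSE, which has to move to the patch area.
apply_patch(memory, site=1, patch_area=8, new_code=[("MARK",)])
path = trace(memory, 0, 6)
```

Note that the displaced instruction ends up running from a different address than the one it was assembled for, which matters a great deal in what follows.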

So, I put in something or other to make sure I had arrived at the Teletype SENSE instruction, which meant moving that SENSE instruction to the patch area. And when I ran it, well, not only did I arrive at my "did I get here?" code...the program started working properly.

Huh?

I put the original code back in. Stopped working. Put the patch back in. Works. Removed my debugging code from the patch. Still works.

Double-checked it. With the SENSE instruction in its original location: Doesn't work. Jump to patch area, do same instruction, then jump back to original location: Works.

I then started stripping things down. I ignored all of Bruce's code, and created a simple, simple little snippet of code:

AGAIN: SENSE 001,ECHO    ; IF TELETYPE INPUT IS READY JUMP TO ECHO
       JMP   AGAIN       ; OTHERWISE KEEP POLLING
ECHO:  INA   001         ; GET CHARACTER FROM TELETYPE
       OUTA  001         ; SEND IT TO TELETYPE
       JMP   AGAIN       ; GO BACK AND WAIT FOR THE NEXT ONE

In general, it worked just fine. But if I put it in memory so that the SENSE instruction was exactly where Bruce's assembly had accidentally put it, it didn't work. Move it even one location away, and it worked fine.

Now that, my friends, is a Heisenbug!

The incredibly important lesson I was learning was that sometimes computers lie! Just because the code ought to work doesn't mean it is going to work. Sometimes the computer cheats. Consider playing chess with a homicidal maniac: you move your pawn to King 4 and he pulls out a gun and shoots you. Checkmate.

Said another way: I was now looking for a hardware problem.

The marvelous thing about being, nominally, a computer programmer, but with an engineering degree, is that you don't necessarily get to punt on the fourth down. (How many computer programmers does it take to screw in a light bulb? None: It's a hardware problem.) I doubt it would have occurred to me to call Varian support anyway; it was just getting fun. So I rummaged around and found the schematics for the Varian 620/L minicomputer. I found a Varian extender card, which lets you plug the boards in outside the rack so that you can get at the components. And I dragged over an oscilloscope.

First thing was to plug the main CPU board into the extender card and make sure the program still failed – and worked! – when I expected it to. (Different behavior when a card is eight or ten inches away from the backplane is a very common and to-be-expected form of the observer effect!) A big sigh of relief when it kept behaving as it had before. And so I started poking around.

Here's how a computer works: An address in the Program Counter is presented to a bunch of drivers so that the address value goes out on the Address Bus. Those addresses get decoded and, in a minicomputer with core memory, do a lot of different things. In particular, a peculiarity of magnetic core memory is that reading a core actually erases the core, and how's that for an observer effect! Unseen by the programmer are the machinations that cause the value that was just read – and erased – to be written back to the core memory.
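The read-then-restore cycle can be sketched in a few lines of Python. This is a toy model, not the 620/L's memory controller: the class name, the method names and the word size are all assumptions made for the illustration.

```python
class CoreMemory:
    """Toy core memory: sensing a word clears it, so the controller restores it."""

    def __init__(self, size):
        self.cores = [0] * size

    def _destructive_read(self, addr):
        value = self.cores[addr]
        self.cores[addr] = 0          # sensing drives every core in the word to 0
        return value

    def read(self, addr):
        value = self._destructive_read(addr)
        self.cores[addr] = value      # restore step, invisible to the program
        return value

    def write(self, addr, value):
        self.cores[addr] = value


mem = CoreMemory(8)
mem.write(3, 0o1234)
first = mem.read(3)                   # normal read: value survives
second = mem.read(3)                  # so a second read sees the same thing
raw = mem._destructive_read(3)        # skip the restore step...
after_raw = mem.cores[3]              # ...and the word is gone
```

The point of the sketch is that every read is really two operations, and both of them take time.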

Meanwhile, the instruction just read from memory itself gets decoded by a bunch of logic that looks at the bits of the opcode and determines what happens next. An ADD instruction does one thing; a JMP does another. And a SENSE instruction includes the bit address of the sense line to be examined; that data goes to a set of multiplexors that select the line and present its HI/LO value to the logic that takes a jump if the bit is HI or falls through to the next instruction if the bit is LO.
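As a rough picture of that decode-and-dispatch step, here is a toy Python interpreter running the little echo loop. The tuple encoding, the one-device Teletype model, the step limit and the sense_stuck_low switch are all hypothetical; the real machine decodes bits in hardware, not strings in software.

```python
# Toy interpreter for the SENSE / INA / OUTA echo loop.
# Encoding, device model and step limit are assumptions for illustration.

def run(program, typed, sense_stuck_low=False, max_steps=100):
    pending = list(typed)      # characters waiting at the Teletype
    printed = []               # characters echoed back
    acc = None                 # the A register
    pc = 0
    for _ in range(max_steps):
        op, *args = program[pc]
        if op == "SENSE":                      # jump if the sense line is HI
            _line, target = args
            ready = bool(pending) and not sense_stuck_low
            pc = target if ready else pc + 1   # LO falls through
        elif op == "JMP":
            pc = args[0]
        elif op == "INA":                      # read the device into A
            acc = pending.pop(0)
            pc += 1
        elif op == "OUTA":                     # write A to the device
            printed.append(acc)
            pc += 1
    return "".join(printed)


ECHO_LOOP = [
    ("SENSE", 1, 2),   # 0  AGAIN: SENSE 001,ECHO
    ("JMP", 0),        # 1         JMP   AGAIN
    ("INA", 1),        # 2  ECHO:  INA   001
    ("OUTA", 1),       # 3         OUTA  001
    ("JMP", 0),        # 4         JMP   AGAIN
]

echoed = run(ECHO_LOOP, "HI")
ignored = run(ECHO_LOOP, "HI", sense_stuck_low=True)
```

With sense_stuck_low set, the loop polls forever and never sees the input, which is what "flat out ignoring the Teletype" looks like from the outside.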

What I eventually found is that the particular hunk of memory containing the SENSE instruction was running somewhat slower than the rest of memory. Because the instruction was a little late coming out of that bank of core, the main decode was a little late figuring out it was a SENSE instruction, which meant the sense line decode was a little late getting to the multiplexors, which meant by the time the SENSE hardware was deciding on HI versus LO, the actual value of the Teletype INPUT READY bit wasn't coming through the multiplexors.

It ultimately turned out that the machine's design wasn't up to specification. The memory had a certain specified access time, but my analysis of the schematics showed that if the memory actually performed at the specification's worst case, no SENSE instruction would ever work. The only reason it worked was that all real memory ran significantly faster than the spec! This particular machine had one bank of memory with some elements just slow enough to fail.

So, after sorting it all out, I figured out that I could buy back some time in the SENSE logic decode by replacing some ordinary 7408 AND gates with some significantly faster 74S08 Schottky gates. So I unsoldered a couple of chips; soldered in the replacements...
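The arithmetic behind that fix is a simple timing budget, sketched below in Python. Every number is invented for illustration (nanoseconds, three gates in the path, an 800 ns sample point); none of them come from the Varian schematics or from any datasheet. Only the shape of the calculation is the point.

```python
# Timing budget for the SENSE path. All figures are illustrative, in nanoseconds.

def sense_margin(memory_access, decode_gates, gate_delay, mux_delay, sample_time):
    """Slack between the sense line becoming valid and the moment it is sampled."""
    arrival = memory_access + decode_gates * gate_delay + mux_delay
    return sample_time - arrival      # negative means the bit arrives too late


COMMON = dict(decode_gates=3, mux_delay=30, sample_time=800)

fast_bank = sense_margin(memory_access=700, gate_delay=20, **COMMON)        # typical core
slow_bank = sense_margin(memory_access=730, gate_delay=20, **COMMON)        # the bad bank
at_spec = sense_margin(memory_access=750, gate_delay=20, **COMMON)          # worst-case spec
slow_bank_fixed = sense_margin(memory_access=730, gate_delay=5, **COMMON)   # faster gates
```

With these made-up figures the typical bank squeaks by with 10 ns to spare, the slow bank misses by 20 ns, memory at the spec limit would miss by 40 ns, and swapping in faster gates turns the slow bank's deficit into 25 ns of slack.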

Problem solved.

Looking back on it, this was pretty heady stuff for somebody not long out of college. The machine cost about as much as I was going to make in six or so years. If I had goofed up, I could have done some significant warranty-voiding damage. But the problem was there, and it needed to be solved, so...

Last Modified: 2019-02-27
