Monday, November 23, 2015

Don't Fix That Bug!

When writing a program, everyone encounters bugs. Usually the first thing you will try to do is fix the bugs. Don't do that. Fixing bugs should be the last thing you do. Yes, you will need to fix them, but it should, quite literally, be the last thing you do.

Fixing a bug is like looking for your lost car keys. You always find them in the last place you look. Why? Well, of course, because you stop looking once you find them. Unless you then look in several other places just so you can say they weren't in the last place you looked.

The problem with fixing bugs is that bugs aren't simply flaws in your code; they're opportunities. Especially the interesting bugs. The bugs that take all day or longer to track down are great bugs. Make good use of them.

Use your bugs to make your code more robust. When you have a bug, especially an interesting bug in a large complicated program, the odds are good that you will have other similar bugs, either hidden in the code already, or waiting to be introduced by patches in the future.

Fixing a bug is like relying on a firewall to provide your security. Sure, a firewall is important, but anyone who pays attention to security knows that it's just one aspect of security. Good security has many layers of protection, because we all know that no one layer of security will be perfect.

So face reality. When you have an interesting bug, it's likely impossible to fix it. Sure, you can probably track it down and eliminate the one instance of it that has shown up, but like a hydra, it will raise another head and bite you again, no matter how many you cut off.


You can never fix the bug. So what do you do? You deploy a multi-layered strategy for dealing with the bug.

Document the errors.

When you have an error, grab as much useful information about what may have caused it as you can. Put that in an error log or somewhere that you'll have access to it. And keep in mind that you might not remember what's going on when you look at the error several years later, so be verbose with saving the error data. A few minutes invested now adding a line or two of text and some color-coded formatting can save hours down the road.
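As a rough illustration of being verbose with error data, here is a minimal sketch in Python. The logger name `app.errors`, the helper `log_error`, and the example fields are all made up for this post; the point is only that each record carries its own context, so it still makes sense years later.

```python
import json
import logging
import time

# Hypothetical logger name; use whatever your program already has.
logger = logging.getLogger("app.errors")


def log_error(message, **context):
    """Write one self-describing error record and return the logged text."""
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "message": message,
        # Anything that might explain the failure later goes here.
        "context": context,
    }
    # default=repr keeps odd objects from breaking the log call itself.
    text = json.dumps(record, sort_keys=True, default=repr)
    logger.error(text)
    return text


# Example call with invented values.
log_error("save failed", block=17, device="disk0", retries=2)
```

One record per line in a structured format also makes the log easy to search and filter later.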

In enterprise-class products, error handling is done at many levels, and the more interesting the bug is, the more attention should be paid at each level.

If the error happens in the field, make sure you provide enough information to the customer that the customer will know what corrective action is appropriate. Keep in mind that the first thing many people do with an error message is paste it into Google to see how others have dealt with it. With this in mind, make sure all error messages have unique strings that facilitate searching. You can create web pages for each error message so that anyone hitting the bug will find your documentation. (You can then use the web server logs as an indication of how often different errors are showing up in the field.)
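One way to get unique, searchable strings is to give every error a stable code and keep the customer-facing advice next to it. The sketch below is only an illustration: the code `SAVE-0042`, the catalog contents, and the function `format_error` are invented for this example.

```python
# Hypothetical catalog mapping each unique error code to corrective advice.
ERROR_CATALOG = {
    "SAVE-0042": (
        "Could not write the saved state to disk. "
        "Check that the disk is healthy and has free space, then retry."
    ),
}


def format_error(code, detail=""):
    """Build a message that starts with a unique, searchable code."""
    advice = ERROR_CATALOG.get(code, "No documented corrective action yet.")
    message = f"[{code}] {advice}"
    if detail:
        message += f" (detail: {detail})"
    return message


print(format_error("SAVE-0042", "block 17"))
```

Because the code never changes between releases, a search for it lands on the same documentation page every time.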

If your product can connect home for support, make sure that you provide documentation for the product support engineers, so that they know how to handle the error. In many cases, you'll want them not only to know how to fix the particular case, but also to collect enough information for you to track down and fix that instance of the bug.

Add debugging tools.

When this error happens on a live system, how difficult is it to get the information that you need to tell what is going on? Can you spend some time now to make it easier to get that information? There's nothing more frustrating than having an instance of a difficult-to-recreate bug show up, and not being able to get the data you need to find out what caused it.

Again, you need to have a good mechanism for saving as much information about the state of the program at the time of the error as possible. A stack trace is great for many errors. Knowing values of key variables is always useful.
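Here is a small Python sketch of capturing a stack trace together with the values of key variables at the moment of failure. The helper names `snapshot`, `average`, and `safe_average` are invented for illustration; a real program would decide for itself which values count as key.

```python
import traceback


def snapshot(exc, **key_values):
    """Capture the stack trace and the named values for a caught exception."""
    return {
        "error": repr(exc),
        "stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        # repr() so the values are recorded even if they are unusual objects.
        "values": {name: repr(value) for name, value in key_values.items()},
    }


def average(values):
    return sum(values) / len(values)


def safe_average(values):
    """Return (result, None) on success or (None, report) on failure."""
    try:
        return average(values), None
    except ZeroDivisionError as exc:
        return None, snapshot(exc, values=values)


result, report = safe_average([])
```

The report can then go into the same error log as everything else, so the trace and the data that caused it stay together.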

Render the bug harmless.

Make your code more robust so that the entire category of bug is no longer a critical show-stopper. When this bug trips, have the program detect that something went wrong and take corrective action.

I encountered a recent example of this. I was developing some code roughly analogous to the code that an operating system uses to suspend to disk when a laptop is put to sleep. The product needed to handle hardware failures of the storage medium during the save. I had a memory corruption bug that was causing writes to occasionally fail. Before tracking down the bug, I used the random corruption as motivation to implement the fault handling. By the time I tracked down the actual memory corruption bug, the code was working correctly despite the bug. It was very noisy, with lots of errors being reported, but nothing that stopped the end result from working correctly.
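This is not the code from that product, just a minimal Python sketch of the same idea: verify each write, report the failure loudly, and retry instead of giving up. The function `write_with_recovery` and its `write`, `read_back`, and `on_error` parameters are assumptions made up for this example.

```python
def write_with_recovery(write, read_back, data, attempts=3, on_error=print):
    """Write data, verify it by reading it back, and retry on failure.

    Returns the number of the attempt that succeeded. Every failed attempt
    is reported through on_error, so the bug stays noisy but not fatal.
    """
    for attempt in range(1, attempts + 1):
        try:
            write(data)
            if read_back() == data:
                return attempt
            on_error(f"attempt {attempt}: verification mismatch")
        except OSError as exc:
            on_error(f"attempt {attempt}: {exc!r}")
    # Only give up once every corrective action has been exhausted.
    raise RuntimeError(f"write failed after {attempts} attempts")
```

The read-back check matters as much as the retry: a corrupted write that reports success would otherwise go unnoticed.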

A similar example dates back to the 1970s. IBM was putting out their first mainframe system. They needed it to be the most stable computer system ever built. It wasn't. It crashed all the time. Their solution was to create a fault-handling system that could catch most of the crashes and recover from them. The result was perhaps the most successful system ever launched.

If it's the last thing you do, fix the bug.

Once you've done everything to deal with the bug outside of fixing it, fix it. But don't be surprised if months or years later, you get reports of the same bug. You only fixed the one instance that triggered. But if you did things right, it won't be a critical customer issue. No need to rush a patch out to the field. Just a bit of noise to correct in the next release.
