When writing a
program, everyone encounters bugs. Usually the first thing you will
try to do is fix the bugs. Don't do that. Fixing bugs should be the
last thing you do. Yes, you will need to fix them, but it should,
quite literally, be the last thing you do.
Fixing a bug is like
looking for your lost car keys. You always find them in the last
place you look. Why? Well, of course, because you stop looking once
you find them. Unless you then look in several other places just so
you can say it wasn't in the last place you looked.
The problem with
fixing bugs is that bugs aren't simply flaws in your code; they're
opportunities. Especially the interesting bugs. The bugs that take
all day or longer to track down are great bugs. Make good use of
them.
Use your bugs to
make your code more robust. When you have a bug, especially an
interesting bug in a large complicated program, the odds are good
that you will have other similar bugs, either hidden in the code
already, or waiting to be introduced by patches in the future.
Fixing a bug is like
relying on a firewall to provide your security. Sure, a firewall is
important, but anyone who pays attention to security knows that it's
just one aspect of security. Good security has many layers of
protection, because we all know that no one layer of security will be
perfect.
So face reality.
When you have an interesting bug, it's likely impossible to fix it.
Sure, you can probably track it down and eliminate the one instance of
it that has shown up, but like a hydra, it will raise another head
and bite you again, no matter how many you cut off.
You can never fix
the bug. So what do you do? You deploy a multi-layered strategy for
dealing with the bug.
Document the errors.
When you have an
error, grab as much useful information about what may have caused it
as you can. Put that in an error log or somewhere else that you'll be
able to get at later. And keep in mind that you might not remember what's
going on when you look at the error several years later, so be
verbose with saving the error data. A few minutes invested now
adding a line or two of text and some color-coded formatting can save
hours down the road.
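As a minimal sketch of what "verbose" can look like, the Python below bundles the exception, a stack trace, and whatever context the caller passes in into a single log entry. The logger name, the field names, and the `log_error` helper are all made up for illustration; use whatever logging setup your project already has.

```python
import json
import logging
import platform
import time
import traceback

logger = logging.getLogger("app.errors")  # hypothetical logger name


def build_error_record(exc, context):
    """Collect what a future reader might need into one dictionary."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "error_type": type(exc).__name__,
        "message": str(exc),
        "context": context,  # caller-supplied details: inputs, ids, state
        "python": platform.python_version(),
        "traceback": traceback.format_exception(
            type(exc), exc, exc.__traceback__
        ),
    }


def log_error(exc, **context):
    """Write the record as one JSON line and return it to the caller."""
    record = build_error_record(exc, context)
    # default=str keeps the log call itself from failing on odd values
    logger.error(json.dumps(record, default=str))
    return record


try:
    int("not a number")
except ValueError as exc:
    record = log_error(exc, operation="parse_quantity", raw_input="not a number")
```

The point of the extra keyword arguments is that the person reading the log years later gets the inputs and the operation name alongside the trace, not just "ValueError".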
In
enterprise-class products, error handling is done at many levels, and
the more interesting the bug is, the more attention should be paid at
each level.
If the error happens
in the field, make sure you provide enough information to the
customer that the customer will know what corrective action is
appropriate. Keep in mind that the first thing many people do with
an error message is paste it into Google to see how others have
dealt with it. With this in mind, make sure all error messages have
unique strings that facilitate searching. You can create web pages
for each error message so that anyone hitting the bug will find
your documentation. (You can then use the web server logs as an
indication of how often different errors are showing up in the field.)
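One way to get unique, searchable strings is to keep every message in a single catalog and give each one its own code. The sketch below assumes that approach; the `ACME-...` codes, the catalog entries, and the helper names are placeholders I invented, not anything from a real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    code: str     # unique token a customer can paste into a search engine
    summary: str  # what went wrong
    action: str   # what the customer should do about it


# Hypothetical catalog; codes and wording are placeholders.
CATALOG = {
    "disk_full": ErrorInfo(
        "ACME-STOR-0042",
        "The backup volume is full.",
        "Free space on the backup volume and retry.",
    ),
    "license_expired": ErrorInfo(
        "ACME-LIC-0007",
        "The product license has expired.",
        "Contact support to renew the license.",
    ),
}


def check_unique_codes(catalog):
    """Fail early if two catalog entries share the same code."""
    codes = [info.code for info in catalog.values()]
    duplicates = {code for code in codes if codes.count(code) > 1}
    if duplicates:
        raise ValueError(f"duplicate error codes: {sorted(duplicates)}")


def format_error(key, detail=""):
    """Build the customer-facing message, always leading with the code."""
    info = CATALOG[key]
    text = f"[{info.code}] {info.summary} {info.action}"
    return f"{text} (detail: {detail})" if detail else text


check_unique_codes(CATALOG)
message = format_error("disk_full", detail="volume=/backup")
```

Running the uniqueness check at startup or in a unit test keeps a copy-and-paste mistake from quietly giving two different errors the same search string.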
If your product can
connect home for support, make sure that you provide documentation
for the product support engineers, so that they know how to handle
the error. In many cases, you'll want them not only to know how to
fix the particular case, but also to collect enough
information for you to track down and fix that instance of the bug.
Add debugging tools.
When this error
happens on a live system, how difficult is it to get the information
that you need to tell what is going on? Can you spend some time now
to make it easier to get that information? There's nothing more
frustrating than having an instance of a difficult-to-recreate bug
show up, and not being able to get the data you need to find out what
caused it.
Again, you need to
have a good mechanism for saving as much information about the state
of the program at the time of the error as possible. A stack trace
is great for many errors. Knowing values of key variables is always
useful.
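In Python, for instance, the standard `traceback` module can record the local variables of every frame along with the stack trace. Here is a small sketch; `describe_failure` and `apply_discount` are names I made up for the example.

```python
import traceback


def describe_failure(exc):
    """Return the stack trace as text, including each frame's locals."""
    tb = traceback.TracebackException.from_exception(exc, capture_locals=True)
    return "".join(tb.format())


def apply_discount(price, percent):
    factor = 100 / percent  # raises ZeroDivisionError when percent is 0
    return price / factor


def run():
    try:
        return apply_discount(price=250, percent=0)
    except ZeroDivisionError as exc:
        # Capture the report while the failing frames are still available.
        return describe_failure(exc)


report = run()
```

The report lists the variables next to the frame that failed, so you can see that `percent` was zero without having to reproduce the problem. Keep in mind that local variables may hold passwords or customer data, so think about where a report like this ends up.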
Render the bug harmless.
Make your code more
robust so that the entire category of bug is no longer a critical
show-stopper. When this bug trips, have the program detect that
something went wrong and take corrective action.
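To make "detect and correct" concrete, here is a rough sketch of one such pattern: write the data, read it back, compare checksums, and fall back to a spare target when verification fails. Everything in it is an assumption for illustration, including the `write`/`read`/`name` interface on the targets and the in-memory `FlakyStore` that stands in for unreliable storage.

```python
import hashlib
import logging

logger = logging.getLogger("app.save")  # hypothetical logger name


class SaveFailed(Exception):
    """Raised only after every target has been tried without success."""


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def save_with_verification(data, targets, attempts_per_target=2):
    """Write data, read it back, and move to the next target on failure.

    Each target is assumed to offer write(data), read(), and a name.
    """
    expected = checksum(data)
    for target in targets:
        for attempt in range(1, attempts_per_target + 1):
            try:
                target.write(data)
                if checksum(target.read()) == expected:
                    return target
                logger.warning("verify failed on %s (attempt %d)",
                               target.name, attempt)
            except OSError as exc:
                logger.warning("write failed on %s (attempt %d): %s",
                               target.name, attempt, exc)
    raise SaveFailed("no target accepted the data")


class FlakyStore:
    """In-memory stand-in for storage that corrupts its first N writes."""

    def __init__(self, name, bad_writes):
        self.name = name
        self.bad_writes = bad_writes
        self.blob = b""

    def write(self, data):
        if self.bad_writes > 0:
            self.bad_writes -= 1
            self.blob = data[:-1] + b"\x00"  # simulate a corrupted write
        else:
            self.blob = data

    def read(self):
        return self.blob


stores = [FlakyStore("primary", bad_writes=5), FlakyStore("spare", bad_writes=1)]
used = save_with_verification(b"suspend image", stores)
```

Every failed attempt is logged, so the underlying bug stays visible and can still be tracked down, but the caller gets a verified save as long as any target works.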
I encountered a
recent example of this. I was developing some code roughly analogous
to the code that an operating system uses to suspend to disk when a
laptop is put to sleep. The product needed to handle hardware
failures of the storage medium during the save. I had a memory
corruption bug that was causing writes to occasionally fail. Before
tracking down the bug, I used the random corruption as motivation to
implement the fault handling. By the time I tracked down the actual
memory corruption bug, the code was working correctly despite the
bug. It was very noisy, with lots of errors being reported, but
nothing that stopped the end result from working correctly.
A similar example
dates back to the 1970s. IBM was putting out their first mainframe
system. They needed it to be the most stable computer system ever
built. It wasn't. It crashed all the time. Their solution was to
create a fault-handling system that could catch most of the crashes
and recover from them. The result was perhaps the most successful system
ever launched.
If it's the last thing you do, fix the bug.
Once you've done
everything to deal with the bug outside of fixing it, fix it. But
don't be surprised if months or years later, you get reports of the
same bug. You only fixed the one instance that triggered. But if
you did things right, it won't be a critical customer issue. No need
to rush a patch out to the field. Just a bit of noise to correct in
the next release.