Mad Man with a Compiler: software engineering


Sunday, February 4, 2024

Lessons from Ancient File Systems

In my previous post, I mentioned that I found a number of oddities when digging through the details of various Atari 8-bit file systems. I read through the specifications I could find online, and ran the actual code in emulators to verify and discover details when the specifications were unclear or incorrect. There were some surprising finds.

I looked at:

Atari DOS 1.0
Atari DOS 2.0s
Atari DOS 2.0d
Atari DOS 2.5
MyDOS 4.5
Atari DOS 3
Atari DOS 4 (prototype, never released)
DOS XE
LiteDOS
SpartaDOS

Atari DOS 1.0 started it all off back in 1979. It supported single-sided, single-density disks that consisted of 720 sectors holding 128 bytes each. It had some bugs and implementation limitations, so DOS 2.0s ('s' for single density) soon replaced it, making some key changes. Soon other DOS versions started appearing, some trying to maintain backwards compatibility, and others trying completely new approaches. I won't go into the technical details, but I do want to highlight what I think are the most interesting design decisions.

Keep in mind that DOS 1 was designed when they were planning on selling the base Atari 800 with only 8K of RAM. (The 400 would have been only 4K, but marketed with a cassette drive instead.) So the design minimized the need for additional sector buffers. Hence the strategy for files was to use the last few bytes of each sector as metadata, including the link to the next sector in the file.

Atari used 6 bits at the end of each sector to hold a file number. This was the index of the file in the directory. (The other two bits and another byte together pointed at the next sector of the file.) This was presented in documentation as being there to detect file corruption, which seemed like a waste, as that sort of file corruption almost never happened. But I'm now pretty sure that wasn't the reason the file number was there. Instead, it was to support "note" and "point" commands. Atari believed people would use files as a database, and with the sector chaining, it was impossible to jump around in a file without reading it linearly. So a program could "note" its position in a file (sector and offset) and then return there later with the "point" command. This is where the file number came in: DOS could verify that the file number stored in the user-supplied sector matched the directory entry used when opening the file. Without that number, there would be no verification that a "point" would end up in the right file.
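To make the link layout concrete, here is a minimal sketch of decoding it and of the check a "point" needs. The exact positions are assumptions taken from the commonly described DOS 2 format rather than from anything stated above: the file number is taken to be the upper 6 bits of byte 125, with the low 2 bits of that byte plus byte 126 forming a 10-bit next-sector number. The function names are hypothetical.

```python
SECTOR_SIZE = 128


def parse_link(sector: bytes) -> tuple[int, int]:
    """Return (file_number, next_sector) from the metadata at the end of a sector.

    Assumed layout: byte 125 = ffffffnn, byte 126 = low 8 bits of next sector.
    """
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected a 128-byte sector")
    link = sector[125]
    file_number = link >> 2                            # upper 6 bits
    next_sector = ((link & 0x03) << 8) | sector[126]   # 2 bits + 1 byte
    return file_number, next_sector


def point_is_valid(sector: bytes, open_file_number: int) -> bool:
    """A "point" target is only acceptable if it belongs to the open file."""
    file_number, _ = parse_link(sector)
    return file_number == open_file_number


# Build a sector owned by directory entry 5 that links to sector 361.
sector = bytearray(SECTOR_SIZE)
sector[125] = (5 << 2) | (361 >> 8)
sector[126] = 361 & 0xFF

print(parse_link(bytes(sector)))          # (5, 361)
print(point_is_valid(bytes(sector), 5))   # True
print(point_is_valid(bytes(sector), 6))   # False
```

Six bits of file number cover 64 directory entries, and ten bits of sector number are enough to address all 720 sectors.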

In DOS 1, the last byte of each sector was a sector sequence number with the high bit cleared. This was a nice idea, and would have been useful if someone were writing a tool to undelete files. If the high bit were set, it was the last sector of the file, and the low bits were the number of data bytes in the sector. In DOS 2, the last byte was instead always the number of data bytes in the sector. This would seem to have made the code simpler, but it actually did the opposite, as DOS 2 also included compatibility code for reading DOS 1 files (as did DOS 2.5). The real reason is that it allowed partially filled sectors in the middle of files, which happened when a file was opened for append, as it would start with an empty write buffer instead of filling it with the data from the last sector. (DOS 1 didn't have an append option at all.) I would have thought it would have been simpler to implement append without partial sectors and avoid needing compatibility code, but apparently not.
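The two interpretations of the last byte can be sketched as follows. This is only an illustration of the rules described above, not the original DOS code; the 125-byte payload (128 bytes minus three metadata bytes) is an assumption, and the function names are hypothetical.

```python
DATA_BYTES_PER_SECTOR = 125  # assumed: 128-byte sector minus 3 metadata bytes


def dos1_data_length(last_byte: int) -> tuple[int, bool]:
    """DOS 1 rule: return (data bytes in this sector, is_last_sector)."""
    if last_byte & 0x80:
        # High bit set: final sector, low 7 bits hold the byte count.
        return last_byte & 0x7F, True
    # High bit clear: the byte is a sequence number, so the sector is full.
    return DATA_BYTES_PER_SECTOR, False


def dos2_data_length(last_byte: int) -> int:
    """DOS 2 rule: the last byte is always the number of data bytes."""
    return last_byte


print(dos1_data_length(0x03))  # (125, False)
print(dos1_data_length(0x85))  # (5, True)
print(dos2_data_length(40))    # 40
```

Under the DOS 1 rule, a sector in the middle of a file has no way to say it holds fewer than 125 bytes, which is why append with partial sectors needed the DOS 2 form.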

Several less-popular versions of DOS didn't support zero-length files. If you opened a new file for writing and closed it without first writing anything, the file was deleted. That seems crazy today, but at the time, that was not one of the quirks people complained about; mostly those DOS versions were unpopular because of incompatibility and issues like wasted space from internal fragmentation.

Only two versions of DOS supported time stamps on files: DOS XE and SpartaDOS. Nobody wanted to type in the date and time on every boot, and a battery-backed real-time clock wasn't a standard feature of any of the 8-bit computers that I'm aware of (certainly not the Atari).

Several versions of DOS supported subdirectories. This was mostly for versions that supported larger disks, like MyDOS and SpartaDOS.

One quirk of Atari DOS (1, 2, and 4) is putting the directory and free sector bitmap table on a track in the middle of the disk. The theory was that this would minimize seek time between reading the directory and reading a file. Not a horrible idea, though it was horribly mangled by an off-by-one error because sector numbers on the drive were 1-based instead of 0-based as the original coders believed, so there was a seek between the sector bitmaps and the directory. They doubled down on this in DOS 4, where the location of the directory changes depending on the disk format, so larger disks have the directory at higher sector numbers. There was a lot wrong with DOS 4; it probably would have been quite unpopular if it had made it past being a prototype without significant changes.
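The off-by-one is easy to see with a little arithmetic. The numbers here are not from the text above; they are the commonly cited DOS 2 single-density layout (18 sectors per track, bitmap in sector 360, directory starting at sector 361), so treat them as assumptions.

```python
SECTORS_PER_TRACK = 18     # assumed single-density geometry (40 tracks x 18 = 720)
BITMAP_SECTOR = 360        # assumed location of the free sector bitmap
DIRECTORY_FIRST = 361      # assumed first directory sector


def track_if_zero_based(sector: int) -> int:
    """Track number if sectors were numbered from 0, as the coders believed."""
    return sector // SECTORS_PER_TRACK


def track_actual(sector: int) -> int:
    """Track number with the drive's real 1-based sector numbering."""
    return (sector - 1) // SECTORS_PER_TRACK


print(track_if_zero_based(BITMAP_SECTOR), track_if_zero_based(DIRECTORY_FIRST))  # 20 20
print(track_actual(BITMAP_SECTOR), track_actual(DIRECTORY_FIRST))                # 19 20
```

With 0-based numbering both structures would share track 20. With the real 1-based numbering, sector 360 is the last sector of track 19, so the bitmap and the directory end up one seek apart.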

The biggest problem with the official Atari versions of DOS is that they never anticipated future needs. I can forgive DOS 1, as it was developed on a short time frame for an 8K computer, and that locked in DOS 2 for compatibility reasons. But DOS 3 was a disaster. Atari had a new drive that could format 130K disks as well as the old 90K disks, so they needed a new DOS that would take advantage of the new space. But they ignored other drives with 256-byte double-density sectors (180K) and the possibility of double-sided disks (360K) that were already on the market for other computers. Better engineers would have designed something to grow and meet future needs, but instead they just implemented the minimum requirements. So when Atari was later designing the 1450XLD computer, which did have 360K drives, they had to have a new DOS again. While that computer was never released, they did develop a DOS for it, which, again, handled only the specific formats that its drive supported, with no thought for the future.

On the other hand, the most successful DOS versions for the retro computing Atari community now are precisely the ones that could scale to any drive, namely MyDOS and SpartaDOS. MyDOS retained DOS 2 compatibility, but extended it to support up to 16MB drives and add subdirectories. SpartaDOS dropped compatibility, but was able to support 512-byte sectors, allowing for 32MB drive support. (The Atari's I/O protocols limited the system to a 16-bit field for sector numbers.)
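The capacity figures follow directly from sector count times sector size. A quick check, treating the 16-bit sector field as allowing roughly 2^16 sectors (an approximation, since not every sector number is necessarily usable):

```python
def capacity_bytes(sector_count: int, sector_size: int) -> int:
    """Raw capacity of a disk, ignoring space used by the file system itself."""
    return sector_count * sector_size


MAX_SECTORS = 2 ** 16  # upper bound implied by a 16-bit sector number

print(capacity_bytes(720, 128) // 1024)                  # 90  (KB, original format)
print(capacity_bytes(MAX_SECTORS, 256) // (1024 * 1024)) # 16  (MB, 256-byte sectors)
print(capacity_bytes(MAX_SECTORS, 512) // (1024 * 1024)) # 32  (MB, 512-byte sectors)
```

Once the sector number is capped at 16 bits, the only way to address a bigger drive is a bigger sector, which is why the 512-byte sector format doubles the limit.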

So the big lesson here is to always plan for the future. Listen to the requirements for the current product, but then design with the assumption that you'll be asked to expand the requirements in the future. If you don't, users may be cursing you when the code is released. But who knows? If you do it right, people may still be using your code in 40 years.

Tuesday, July 30, 2019

Merge Conflicts for Non-Coders

I was discussing my day at work with my family, and I had spent much of the day resolving conflicts after doing a merge of two divergent branches. Of course, neither my wife nor my son understood what I was talking about, so I came up with a very simple analogy to explain what I was doing.

Imagine a teacher wrote out a plot outline for a book, and then assigned each student to write one chapter. The student writing chapter 5 decides that the heroes need to use a shovel, so he makes a small change in the draft of chapter 3 so that they find a shovel at the beach. Meanwhile the student writing chapter 3 decides she doesn't like the beach scene, so she changes it so that they go for a hike in the woods instead.

Eventually the students have to put all their work together. The conflicting changes demonstrate two types of conflicts:

First, in adding in the shovel in chapter 3, the first student made changes to a paragraph that the other student also modified in changing the scene from the beach to the woods. When using an automated merge tool, this sort of conflict is automatically detected, and the person doing the merge is shown the conflict.

Second, in resolving the direct conflict, it's possible that the shovel is removed from chapter 3. The merge has been completed, but no automated tool will tell you that the shovel used in chapter 5 now comes out of nowhere. If this were computer code, this would either show up at compile time ("Error: Reference to undefined object "shovel" in chapter 5, paragraph 12."), or worse, would introduce a bug that only shows up when the code runs under certain circumstances. If the book merge left the shovel out, hopefully it would be found by the editor (testing by the quality assurance team for a software project), not left for readers to be confused by (a bug in the field).
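For readers who do code, the two kinds of conflict can be sketched in a few lines. This is a toy three-way merge of a single paragraph, not how any particular merge tool is implemented; the function name and the sample text are made up for illustration.

```python
def merge_paragraph(base, ours, theirs):
    """Toy three-way merge of one paragraph: returns (text, conflict)."""
    if ours == theirs:
        return ours, False      # both made the same change (or none)
    if ours == base:
        return theirs, False    # only the other side changed it
    if theirs == base:
        return ours, False      # only our side changed it
    return None, True           # both changed it differently: direct conflict


base = "The heroes walk along the beach."
shovel_edit = "The heroes walk along the beach and find a shovel."
woods_edit = "The heroes go for a hike in the woods."

# First kind: the tool notices both students edited the same paragraph.
text, conflict = merge_paragraph(base, shovel_edit, woods_edit)
print(conflict)  # True

# Second kind: a person resolves it by keeping the woods scene. The merge is
# now "clean", but chapter 5 still depends on something chapter 3 dropped.
chapter3 = woods_edit
chapter5 = "The heroes dig up the chest with the shovel."
dangling = "shovel" in chapter5 and "shovel" not in chapter3
print(dangling)  # True
```

The merge function never looks at chapter 5, so only a separate check (a compiler, a test, or an editor) can catch the missing shovel.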

The right analogy can often make it much easier to explain things. If you like this one, or have questions about other aspects of software engineering that you would like me to find similar analogies for, please comment.

Monday, November 23, 2015

Don't Fix That Bug!

When writing a program, everyone encounters bugs. Usually the first thing you will try to do is fix the bugs. Don't do that. Fixing bugs should be the last thing you do. Yes, you will need to fix them, but it should, quite literally, be the last thing you do.

Fixing a bug is like looking for your lost car keys. You always find them in the last place you look. Why? Well, of course, because you stop looking once you find them. Unless you then look in several other places just so you can say it wasn't in the last place you looked.

The problem with fixing bugs is that bugs aren't simply a flaw in your code, they're an opportunity. Especially the interesting bugs. The bugs that take all day or longer to track down are great bugs. Make good use of them.

Use your bugs to make your code more robust. When you have a bug, especially an interesting bug in a large complicated program, the odds are good that you will have other similar bugs, either hidden in the code already, or waiting to be introduced by patches in the future.

Fixing a bug is like relying on a firewall to provide your security. Sure, a firewall is important, but anyone who pays attention to security knows that it's just one aspect of security. Good security has many layers of protection, because we all know that no one layer of security will be perfect.

So face reality. When you have an interesting bug, it's likely impossible to fix it. Sure you can probably track it down and eliminate the one instance of it that has shown up, but like a hydra, it will raise another head and bite you again, no matter how many you cut off.

You can never fix the bug. So what do you do? You deploy a multi-layered strategy for dealing with the bug.

Document the errors.

When you have an error, grab as much useful information about what may have caused it as you can. Put that in an error log or somewhere that you'll have access to it. And keep in mind that you might not remember what's going on when you look at the error several years later, so be verbose with saving the error data. A few minutes invested now adding a line or two of text and some color-coded formatting can save hours down the road.

Working with enterprise-class products, error handling is done at many levels, and the more interesting the bug is, the more attention should be paid at each level.

If the error happens in the field, make sure you provide enough information to the customer that the customer will know what corrective action is appropriate. Keep in mind that the first thing many people do with an error message is paste it into Google to see how others have dealt with it. With this in mind, make sure all error messages have unique strings that facilitate searching. You can create web pages for each error message so that anyone hitting the bug will find your documentation. (You can then use the web server logs as an indication of how often different errors are showing up in the field.)
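One way to get unique, searchable strings is to give every error a stable code and build the message from a catalog. The sketch below is only an illustration; the error codes, the messages, and the documentation URL are all hypothetical.

```python
# Hypothetical base URL for per-error documentation pages.
DOCS_BASE_URL = "https://support.example.com/errors/"

# Hypothetical catalog: one stable, unique code per error message.
ERROR_CATALOG = {
    "SAVE-0042": "Could not write the snapshot to the storage device.",
    "SAVE-0043": "Snapshot verification failed after writing.",
}


def format_error(code: str, detail: str = "") -> str:
    """Build a customer-facing message with a unique code and a docs link."""
    summary = ERROR_CATALOG[code]  # KeyError here means an undocumented code
    parts = [f"[{code}] {summary}"]
    if detail:
        parts.append(detail)
    parts.append(f"See {DOCS_BASE_URL}{code}")
    return " ".join(parts)


print(format_error("SAVE-0042", "Device: disk1."))
```

Because the code appears verbatim in both the message and the page address, pasting the message into a search engine leads straight to the matching page, and hits on that page show up in the web server logs.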

If your product can connect home for support, make sure that you provide documentation for the product support engineers, so that they know how to handle the error. In many cases, you'll want them not only to know how to fix the particular case, but also have them collect enough information for you to track down and fix that instance of the bug.

Add debugging tools.

When this error happens on a live system, how difficult is it to get the information that you need to tell what is going on? Can you spend some time now to make it easier to get that information? There's nothing more frustrating than having an instance of a difficult-to-recreate bug show up, and not being able to get the data you need to find out what caused it.

Again, you need to have a good mechanism for saving as much information about the state of the program at the time of the error as possible. A stack trace is great for many errors. Knowing values of key variables is always useful.
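As a minimal sketch of saving that state, the helper below bundles the stack trace and a few key values into one record that can be written to an error log. The helper name and the record fields are assumptions chosen for illustration; only standard library calls are used.

```python
import json
import time
import traceback


def build_error_report(exc: BaseException, **key_values) -> dict:
    """Collect the stack trace and key variable values for an error log."""
    return {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "type": type(exc).__name__,
        "message": str(exc),
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
        # repr() keeps the values printable even if they are not JSON types.
        "values": {name: repr(value) for name, value in key_values.items()},
    }


try:
    sector, offset = 361, 0
    result = 125 // offset          # stand-in for the real failure
except ZeroDivisionError as exc:
    report = build_error_report(exc, sector=sector, offset=offset)
    print(json.dumps(report, indent=2))
```

Being verbose here costs little, and the recorded values are exactly what you will wish you had when the bug shows up again years later.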

Render the bug harmless.

Make your code more robust so that the entire category of bug is no longer a critical show-stopper. When this bug trips, have the program detect that something went wrong and take corrective action.

I encountered a recent example of this. I was developing some code roughly analogous to the code that an operating system uses to suspend to disk when a laptop is put to sleep. The product needed to handle hardware failures of the storage medium during the save. I had a memory corruption bug that was causing writes to occasionally fail. Before tracking down the bug, I used the random corruption as motivation to implement the fault handling. By the time I tracked down the actual memory corruption bug, the code was working correctly despite the bug. It was very noisy, with lots of errors being reported, but nothing that stopped the end result from working correctly.
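The general shape of that kind of fault handling is to verify each write and retry on failure, reporting noisily along the way. The sketch below is not the code from that project; `FlakyStore` is a hypothetical stand-in for a storage device that corrupts some writes, and the checksum-and-retry policy is an assumption.

```python
import zlib


class FlakyStore:
    """Hypothetical storage device that corrupts its first few writes."""

    def __init__(self, bad_writes: int):
        self.blocks = {}
        self.bad_writes = bad_writes

    def write(self, index: int, data: bytes) -> None:
        if self.bad_writes > 0:
            self.bad_writes -= 1
            data = b"\x00" * len(data)   # simulate a corrupted write
        self.blocks[index] = data

    def read(self, index: int) -> bytes:
        return self.blocks[index]


def robust_write(store, index: int, data: bytes, attempts: int = 3, log=print) -> int:
    """Write, read back, and retry on mismatch. Returns the attempt that worked."""
    expected = zlib.crc32(data)
    for attempt in range(1, attempts + 1):
        store.write(index, data)
        if zlib.crc32(store.read(index)) == expected:
            return attempt
        log(f"verify failed for block {index} on attempt {attempt}")
    raise OSError(f"block {index} could not be written after {attempts} attempts")


store = FlakyStore(bad_writes=2)
print(robust_write(store, 0, b"saved state"))  # two failures are logged, then 3
```

The save still succeeds while the underlying bug is present; the log lines are the noise that tells you there is still something to track down.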

A similar example dates back to the 1970s. IBM was putting out their first mainframe system. They needed it to be the most stable computer system ever built. It wasn't. It crashed all the time. Their solution was to create a fault-handling system that could catch most of the crashes and recover from them. The result was perhaps the most successful system ever launched.

If it's the last thing you do, fix the bug.

Once you've done everything to deal with the bug outside of fixing it, fix it. But don't be surprised if months or years later, you get reports of the same bug. You only fixed the one instance that triggered. But if you did things right, it won't be a critical customer issue. No need to rush a patch out to the field. Just a bit of noise to correct in the next release.