The Hunt For Blue Screen

This story began about 15 months ago, in November 2006. That was the time when Microsoft was getting very close to releasing Windows Vista, and it was the time for me to start getting serious about making sure my applications were compatible with it.

At that time I was using two computers for the development and testing, one with two single-core Intel processors, and another one with one single-core AMD x64 processor. Both were set up for development and testing of my programs: I was using the first one to test the 32-bit versions, and the second one to test the 64-bit editions of my programs. Since many people reported that Vista was more hardware hungry than XP, I thought it was a good occasion for me to also get a more powerful computer that would run Vista reasonably well. So I bought a new Core 2 Duo (dual core) processor, a motherboard to support it (P5L-MX from ASUS), a new video card to support the Aero user interface of Vista, put them together in a spare computer case I had, loaded up Vista Release Candidate on it, and started working on porting my applications to Vista.

It all went well for a while, except every couple of days or so my new powerful computer would all of a sudden “blue screen” and reboot.

After it happened a few times, I fired up WinDbg and loaded the few latest minidumps into it. They indicated that the crashes were happening in the FASTFAT.SYS driver, and the bug check they all had in common was IRQL_NOT_LESS_OR_EQUAL, which typically points to a sloppily written device driver. It seemed like a bug in the FASTFAT.SYS driver shipped with the pre-release version of Vista. I decided there was not much I could do but hope that the bug would be fixed in the final (RTM) Vista release.

A couple of months later the RTM release of Vista became available, so I reformatted the hard drive to get rid of the release candidate, installed a fresh copy of Vista RTM on it, and started using it.

In a day or two the same crashes started to happen again.

Figuring Microsoft would not release a new version of Vista with a buggy version of such an important driver as FASTFAT.SYS, I started looking for another reason. What made it difficult was that the blue screens did not appear very often: sometimes a week would go by and I would start to hope I had finally found the reason, but inevitably it would crash again, no matter what I tried. And I tried plenty:

I vacuumed the inside of the case and reseated the processor and the RAM modules.

The blue screens kept happening.

I ran the memtest program for a few hours to check the RAM for errors. It did not find any problems.

The blue screens kept happening.

I installed the SpeedFan program to monitor the temperature of the hardware components. Although it did not show any overheating, I added another fan to the case.

The blue screens kept happening.

I replaced the video card with another one.

The blue screens kept happening.

I bought a new SATA hard drive (previously I was using an old IDE drive), and moved the Vista installation to it.

The blue screens kept happening.

I thought that maybe I got a faulty motherboard, so I bought a new one, this time P5LD2, again from ASUS. I also picked up another Core 2 Duo processor and a new set of RAM modules to go with it. I reinstalled Vista RTM from scratch, and set up my development environment, and started working as usual.

The blue screens kept happening.

As you can see, at that point I already had two computers which were giving me the blue screens every couple of days or so. I ran out of new theories about the reason for the crashes, and I returned to the one I started with: the bug was probably in the FASTFAT driver of Vista after all, and maybe I should have waited till Vista SP1 was out before switching to Vista as my main development platform. I started thinking about switching back to Windows XP. It so happened that at that time (in September 2007) I was locked out of both of my Vista computers by the buggy Genuine Advantage code of Windows Vista (I plan to share that experience in a separate post later on, stay tuned). That made the decision to switch back to XP real easy.

I had been using Windows XP for several years and never had a problem like that before, so imagine my surprise when, after I reinstalled Windows XP on each of my new computers, the blue screens started to happen almost from day one. As before, they were occurring in the FASTFAT.SYS driver. That made it clear to me that I had been blaming Vista in vain: it did not introduce a new bug, or, at least, if the bug was there, it was not Vista-specific.

I started analyzing the similarities between the two new computers, hoping that would give me a clue. They had different motherboards (although from the same manufacturer), slightly different processors, different RAM modules, different video cards, different hard drives (one was using a WD SATA drive, another one a Maxtor IDE drive). I came up with the idea that maybe I got very unlucky and I got two faulty motherboards. Luckily, at that time the built-in network adapter on one of the motherboards died, and I took this opportunity to RMA the motherboard back to ASUS. I got the replacement back in a few days, and installed it.

The blue screens kept happening.

Thinking that getting three faulty motherboards in a row was very unlikely, I started to try other things. Even though my two old computers were plugged into the same UPS device as the new ones and were working just fine, I thought maybe the new computers were more sensitive to the quality of the power they were getting.

I replaced a cheapo generic power supply in one of the new computers with a considerably more expensive and supposedly better one from Antec.

The blue screens kept happening.

I bought a new, more powerful UPS, specifically for use by the new computers.

The blue screens kept happening.

Out of desperation, I started all over and repeated every troubleshooting step I did before, with each of the crashing systems: reseated the modules, replaced the cables, ran the memtest.

The blue screens kept happening.

At that point, about a month ago, I ran out of ideas. I was ready to surrender and just live with it, or maybe throw out both of the new computers I had built and buy a completely new one. I was seriously contemplating that when, on January 15, it hit me: what was that FASTFAT.SYS driver doing there anyway? All of my hard drives had been formatted with the NTFS file system, and I didn’t remember formatting a drive with the FAT or FAT32 system recently. Why would Windows load the FASTFAT driver?

I reviewed the properties of the drives listed in My Computer, and sure enough, one of them was formatted with the FAT32 system. It was a virtual encrypted drive I had created a while back with the TrueCrypt software. I used the drive as a backup place for sensitive files of mine. Periodically, I would burn the image to a DVD-R disc, to make a backup of it. And yes, there was a copy of this image on each of the new computers experiencing the crashes.
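For anyone who wants to run the same check without clicking through My Computer, here is a minimal sketch in Python. It is only an illustration, not something I used at the time. It assumes a Windows machine and calls the Win32 functions GetLogicalDrives and GetVolumeInformationW through ctypes; the names list_volume_filesystems and fat_volumes are made up for this example.

```python
import ctypes
import string


def list_volume_filesystems():
    """Return {drive_root: file_system_name} for mounted drive letters.

    Windows only: ctypes.windll does not exist on other platforms.
    """
    kernel32 = ctypes.windll.kernel32
    bitmask = kernel32.GetLogicalDrives()  # bit 0 = A:, bit 1 = B:, ...
    volumes = {}
    for index, letter in enumerate(string.ascii_uppercase):
        if not bitmask & (1 << index):
            continue
        root = letter + ":\\"
        fs_name = ctypes.create_unicode_buffer(261)
        ok = kernel32.GetVolumeInformationW(
            ctypes.c_wchar_p(root),
            None, 0,            # volume label is not needed
            None, None, None,   # serial number, max component length, flags
            fs_name, len(fs_name),
        )
        if ok:  # the call fails for e.g. an empty DVD drive; skip those
            volumes[root] = fs_name.value
    return volumes


def fat_volumes(volumes):
    """Pick out the FAT-family volumes, i.e. the ones served by FASTFAT.SYS."""
    fat_names = {"FAT", "FAT12", "FAT16", "FAT32"}
    return sorted(root for root, fs in volumes.items() if fs.upper() in fat_names)


if __name__ == "__main__" and hasattr(ctypes, "windll"):
    for root in fat_volumes(list_volume_filesystems()):
        print(root, "is FAT-formatted")
```

A mounted TrueCrypt volume shows up under its own drive letter, so it should be listed like any other drive. The filtering is kept in a separate function so it can be tried on a hand-written dictionary on any machine.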

I reformatted the encrypted volumes with the NTFS file system.

The blue screens stopped.

It’s been almost a month since I made the last change, and I have not had a single blue screen. Previously, they were happening every couple of days. I’m very confident now that I’ve found the culprit that caused me so much grief. I believe the following list describes the common conditions for the blue screens to occur:

  1. The computer should have a multi-core processor, such as Intel Core 2 Duo.
  2. The computer should have TrueCrypt 4.3a installed, and there should be an encrypted FAT32 volume mounted.

Why do I think the first condition is important? Because previously I had been using TrueCrypt with FAT32 virtual drives for several years on computers that had single-core processors, and never experienced such crashes with them. Only when I switched to the Core 2 Duo processors did the crashes start to occur.

I’ve looked through the source code of TrueCrypt 4.3a and noticed that its driver was compiled with the NT_UP switch in its Makefile. This is definitely wrong. It means that the driver was targeted at the uni-processor systems. Since the multi-core processors are essentially multi-processors, defining NT_UP means asking for trouble.
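If you want to check a driver's build files for the same flag, a small script can do the looking for you. The sketch below is a hedged illustration in Python: find_nt_up is a made-up helper, and the sample build text in the usage part is invented for the example, since I am not reproducing the actual TrueCrypt Makefile here. It simply reports lines where NT_UP appears as a whole token, with or without a -D or /D prefix, outside of #-style comments.

```python
import re

# NT_UP as a whole token, optionally prefixed by -D or /D and
# optionally followed by "=value". NT_UPGRADE or WINNT_UP do not match.
NT_UP_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])(?:[-/]D\s*)?NT_UP(?![A-Za-z0-9_])(?:\s*=\s*(\w+))?"
)


def find_nt_up(build_text):
    """Return (line_number, value_or_None, stripped_line) for each NT_UP hit."""
    hits = []
    for line_number, line in enumerate(build_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore Makefile-style comments
        match = NT_UP_PATTERN.search(code)
        if match:
            hits.append((line_number, match.group(1), line.strip()))
    return hits


if __name__ == "__main__":
    # Invented sample, NOT the real TrueCrypt Makefile.
    sample = "C_DEFINES = -DUNICODE\nCFLAGS += -DNT_UP=1\n"
    for number, value, text in find_nt_up(sample):
        print("line", number, "defines NT_UP =", value, "->", text)
```

A hit is only a hint to go and read the build file by hand, since the flag may be set conditionally or overridden elsewhere.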

Why did the crashes stop after I reformatted the encrypted drives with the NTFS file system? I don’t know. Apparently the NTFS file system driver is more robust and can tolerate imperfect drivers such as the ones compiled with the NT_UP switch. Why didn’t I get crashes with my old two-processor computer? Again, I don’t know. Maybe the old computer was not fast enough for the error conditions to occur so frequently, and when it did crash once in a blue moon, I just dismissed that as something insignificant and did not pay attention to it.

Now, I noticed that a few days ago a new version of TrueCrypt, 5.0, was released. It uses a new driver build procedure that does not seem to have the NT_UP flag anymore. This is good. However, looking through their support forums, it seems like the new version introduced quite a few new bugs. I guess I’ll postpone upgrading to it until version 5.1 comes out. I want to get some rest from the blue screens for a while :-)

Update: April 15, 2008

A few days ago I decided to try the latest release of TrueCrypt, 5.1a. I reformatted the NTFS encrypted volume back to FAT32 and the blue screens started to occur almost immediately. After two days of bluescreening, I reformatted the volume back to NTFS, and the blue screens stopped. It looks like TrueCrypt 5.1a still causes this problem. HTH.

Update: May 13, 2008

A week ago I started another experiment: I connected a spare hard drive about the same size as the TrueCrypt volume I use, formatted the hard drive with the FAT32 system (just like the TrueCrypt volume that was giving me the blue screens), and copied everything from the encrypted volume to that hard drive. Then I dismounted the TrueCrypt volume and assigned its drive letter to the hard drive I had just attached. I restarted the computer and kept using it as before; the only difference was that instead of a FAT32-formatted encrypted volume I was now using a regular FAT32-formatted unencrypted hard drive. A week passed by, no blue screens. Today I copied everything back from the hard drive to the FAT32-formatted TrueCrypt volume, and disconnected the extra hard drive. About an hour later a blue screen occurred. I think that proves conclusively that TrueCrypt is the real culprit behind these blue screens. HTH.

Update: July 16, 2008

A week ago I installed a new version of TrueCrypt, 6.0a. One of the new things in it was an updated device driver with improved support for multi-core processors. That gave me hope that this version might have finally fixed this bug. For a week it ran smoothly, with no BSoDs, even though I had switched to using a FAT-formatted encrypted volume. I was thinking about reporting success here, but today: boom, a blue screen with IRQL_NOT_LESS_OR_EQUAL status in fastfat.sys. So I’m switching back to the NTFS volume and reporting for now that version 6.0a of TrueCrypt still has not fixed this problem. HTH.

51 thoughts on “The Hunt For Blue Screen”

  1. RAF

    Thanks for your advice – ran into the same problem with TrueCrypt 7.0a, and by reformatting the drive to NTFS the BSOD disappeared on the Toshiba running 64-bit Win 7. Your advice saved me immeasurable heartache.
