Friday, September 08, 2006

AppleCare not up to Sun Service

One of the big reasons I was so excited to move my desktop environment to Mac OS X was its underlying UNIX operating environment. Being a UNIX guy, I'm well aware of how well instrumented it is, and how surgically it can (often) be debugged. This, of course, is in contrast to the Windows world, where it has become not only common but almost accepted that troubleshooting step #1 is to reboot and sacrifice a chicken.

Over time my G5 started to crash with increasing frequency. At first it was once in a rare while, although I was still surprised that it happened at all. Recently it accelerated to the point where it crashed almost once per day. Given how much I paid to have rock-solid hardware, and AppleCare behind it, this was not acceptable to me. So, I finally grew tired of my Mac's grey text-box of death and called my friendly AppleCare representative.

I began by telling my story, and adding a detail I felt was critical. Each time the system crashed I generated a crash dump report, and the stack trace always pointed back to the USB driver. I grabbed the text below from another site as an example, but it's very similar to what I was seeing.

Unresolved kernel trap(cpu 0): 0x300 - Data access DAR=0xdeadbeef PC=0x0e692550
Latest crash info for cpu 0:
Exception state (sv=0x0EB5DA00)
PC=0x0E692550; MSR=0x00009030; DAR=0xDEADBEEF; DSISR=0x42000000; LR=0x0E692530; R1=0x081DBC20; XCP=0x0000000C (0x300 - Data access)
Backtrace:
0x0E6924A8 0x00213A88 0x00213884 0x002141D4 0x00214830 0x00204CB0 0x00204C74
Kernel loadable modules in backtrace (with dependencies):
dependency:
Proceeding back via exception chain:
Exception state (sv=0x0EB5DA00)
previously dumped as "Latest" state. skipping...
Exception state (sv=0x0EB64A00)
PC=0x00000000; MSR=0x0000D030; DAR=0x00000000; DSISR=0x00000000; LR=0x00000000; R1=0x00000000; XCP=0x00000000 (Unknown)

Note that USB line? I was seeing it in EVERY crash. That tends to be something worth investigating. In my case I have a USB card reader attached to the USB ports in the bottom of my Cinema Display, and a Palm Pilot USB cable plugged into the front of my case. Another observation I made was that each time my system crashed it was essentially idle. It usually happened at night, or while I was at work. I would return to the sound of a jet engine coming from that silver box.
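
For the curious, here is roughly how the "same module in every crash" check could be automated. This is only a minimal sketch in Python, not anything Apple ships: it assumes each module line in the "Kernel loadable modules in backtrace" section looks like bundle.id(version)@0xaddress, and the function names (modules_in_backtrace, recurring_modules) are made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of a module line: "com.example.driver.Foo(1.2.3)@0x0e68c000".
MODULE_RE = re.compile(r"([A-Za-z][\w.]*)\(([^)]*)\)@0x[0-9A-Fa-f]+")


def modules_in_backtrace(panic_text):
    """Return the set of module names listed directly in the backtrace section."""
    found = set()
    in_section = False
    for line in panic_text.splitlines():
        stripped = line.strip()
        if "Kernel loadable modules in backtrace" in stripped:
            in_section = True
            continue
        if not in_section:
            continue
        if stripped.startswith("dependency:"):
            continue  # skip dependencies; only count modules named directly
        match = MODULE_RE.search(stripped)
        if match:
            found.add(match.group(1))
        else:
            in_section = False  # first non-module line ends the section
    return found


def recurring_modules(panic_texts):
    """Count, per module, how many of the given panic logs name it."""
    counts = Counter()
    for text in panic_texts:
        counts.update(modules_in_backtrace(text))
    return counts
```

Feed recurring_modules the text of each saved panic report; a module whose count equals the number of reports showed up in every single crash, which is exactly the pattern I was describing to AppleCare.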

My first suggestion was that we look at the panic logs and try to identify the faulty component, but this didn't get much traction. AppleCare is set up so that if the basic rubber-stamp checks (slightly better than a reboot, but not by much) fail, they redirect you to a local store. In my case this wasn't appealing. It's a 40-minute drive from here, and the issue is intermittent. I could end up being without my Mac for more than a week even if things went well.

So, we went through and erased all the caches and preferences, then reset the NVRAM. My system was brought back to factory specs, although I had really done almost nothing abnormal to it. I don't use funky extensions or other hacks; I use mainstream, well-supported stuff.

Much to my surprise, the system has been stable ever since. I'll be the first to eat my words, but I'm not used to voodoo troubleshooting. This was like chemotherapy, where we just bombard the system in hopes of getting all the cancerous code. I'm used to working in a surgical environment where we see CPU 0 corrupting data on an interval that indicates it needs replacing.

As much as I'm a hopeless fan of Solaris, I have to say that I don't think it's a huge quality difference between Sun and Apple that gives me this uneasy feeling about the experience; I think it's the quality of Sun Service. They are used to dealing with mission-critical servers more than art-critical desktops. No offense to the Mac world - I'm one of you... But it's a very different world.
