Thursday, October 19, 2006

To err is human

I am writing this post as a catharsis and purification. A centering of my spiritual engineering energy that may otherwise be out of balance. Three days ago I made a typo which eliminated the /etc directory on a fairly important server. It was amazing how long that server continued to plug away after being lobotomized. Let me take you through the story as I relive the moment, and ensure that I learn from it.

Like many of the tasks I juggle, this was to be a short time-slice effort. I needed a distraction from a longer-term project, and wanted to bite off a small piece of something that didn't require significant thought. Part of our Jumpstart environment deploys a tar archive to the client which is later unpacked and massaged by a custom script. My task was to eliminate the usr/local/etc directory from that archive and then recreate it. As my fingers systematically hit the keys, one extraneous finger made its imprint on the keyboard.

"r" "m" "-" "r" "." "/" "e" "t" "c".

The world slowed down as my finger hit enter, and I felt my heart stop beating. I believe I actually flat-lined that morning. Could it be? Had I really deleted /etc? Yes. I had. The command I entered was: "rm -r . /etc". I removed the current working directory and the server's /etc directory.
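For anyone who wants to see the difference without risking a file system, here is a harmless sketch. The show_args function is a hypothetical helper of my own, not a standard utility; it only prints the arguments the shell would hand to a command, one per line, and never touches a file.

```shell
# Hypothetical helper: print each argument exactly as the shell passes it,
# one per line, wrapped in brackets. Nothing is deleted.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# What I presumably meant to type: one relative path.
show_args -r ./etc

# What I actually typed: the stray space splits it into two paths,
# the current directory and the absolute path /etc.
show_args -r . /etc
```

The first call prints two arguments and the second prints three. That single space is the whole difference between a relative path under the working directory and the server's real /etc.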

Why was I using elevated privileges for mundane work? The tarball had root-owned files in it. This is a downfall of our approach at the moment. With the pkgadd format, anyone can own the source files, because ownership and permissions are assigned at installation time. This makes day-to-day maintenance much safer. Ironically, I was editing the archive because I had just created a package to replace the files I was deleting. It was almost as if the prior bad practice were vomiting on me as I exorcised it from the server.

Fortunately we had an excellent SA on hand to boot from CD and restore the missing file system, and it was back in business a relatively short while later. Engineering and operations are segregated in duties at my current site, so I was unable to clean up my own mess. A very humbling experience indeed, and this is what it taught me:

(1) Mirrored operating system disks are a good thing, but they don't protect you from human error, which propagates the mistake across both disks. While I've been a bit critical of maintaining a third contingency disk, there are other similar solutions which I have a heightened respect for.

(2) Whenever executing commands using RBAC, sudo, or the root account, count to three before hitting enter. No matter how much longer it takes to get your work done, no matter how good you are with UNIX, and no matter how long it has been since you made a mistake, counting to three will always be quicker than restoring a file system from tape.
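
If counting to three feels too easy to skip, the pause can be scripted. What follows is only a minimal sketch, assuming a POSIX shell; the confirm function is a hypothetical name of my own, not part of RBAC or sudo. It shows the argument list exactly as the shell split it and refuses to run anything until "yes" is typed.

```shell
# Hypothetical helper: display the exact argument list, then require
# a literal "yes" on standard input before running the command.
confirm() {
    # Prompts go to stderr so the command's own output stays clean.
    printf 'About to run:' >&2
    printf ' [%s]' "$@" >&2
    printf '\nType "yes" to proceed: ' >&2
    read -r answer
    if [ "$answer" = "yes" ]; then
        "$@"
    else
        echo "Aborted." >&2
        return 1
    fi
}
```

Prefixing a risky command with it, as in "confirm rm -r . /etc", would have shown [.] and [/etc] as separate arguments before anything ran. It is no substitute for paying attention, but it forces the three-second look.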
