Wednesday, January 16, 2008

Got dependencies?

As anyone who follows my blog has learned, I'm a packaging junkie. I write packages for everything I deploy, and happily plunk them into my JumpStart server where they perform their duty in a predictable and maintainable way. Does it get any better than this? Well, actually... It got pretty rough this week as my troubleshooting skills enjoyed a solid workout.

We are in the process of moving our standard deployment from Solaris 9 with SRM project containers to Solaris 10 Zones. Zones are surprisingly simple given the benefits they provide, and it takes very little time to set up a test box with multiple zones in a basic Solaris environment. But what if you don't have a basic environment?

We have packaged everything. Most of our /etc files that aren't stock are edited by class action scripts or postinstall routines. We've deployed hundreds of servers with this configuration, and it worked flawlessly until we started deploying Zones. Then, suddenly, our zones couldn't talk to the Directory, were missing resolv.conf, and suffered all kinds of other cascading problems. Ugh. Where to start?

The first thing I did was check the error messages. Pretty clever, no?

The file /export/zones/testzone01/root/var/sadm/system/logs/install_log contains a log of the zone installation.

Ok, let's see what's in the log...

WARNING: setting mode of /var/spool/cron/crontabs/root to default mode (644)
ERROR: attribute verification of /export/zones/testzone01/root/var/spool/cron/crontabs/root failed
pathname does not exist

Installation of CGHtest on zone testzone01 partially failed.

Well that's strange. Let's see what the registry says about root's crontab:

$ grep "/var/spool/cron/crontabs/root" /var/sadm/install/contents
/var/spool/cron/crontabs/root e cron 0600 root sys 48 3760 1200431631 SUNWcsr:cronroot

The root crontab is owned by SUNWcsr. So it turns out that when the global zone created the sparse root zone, it tried to install CGHtest before SUNWcsr was installed, so the file did not exist yet. The root of my problem was letting slip a habit that I'm normally quite rigorous about enforcing.
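The registry lookup above can be wrapped in a small helper. This is a minimal sketch, not part of the original workflow: the function name owners_of is hypothetical, the column positions follow the contents(4) layout for regular files and directories only, and links and devices are skipped. On Solaris you may need nawk or /usr/xpg4/bin/awk in place of awk.

```shell
#!/bin/sh
# owners_of PATH [CONTENTS_FILE]   (hypothetical helper name)
# Print the package(s) that own PATH according to a contents(4)-style registry.
owners_of() {
    path=$1
    contents=${2:-/var/sadm/install/contents}
    awk -v p="$path" '
        $1 == p {
            # The package list starts at a column that depends on the file type.
            if ($2 == "f" || $2 == "e" || $2 == "v") first = 10
            else if ($2 == "d" || $2 == "x" || $2 == "p") first = 7
            else next    # links and devices are ignored in this sketch
            for (i = first; i <= NF; i++) {
                pkg = $i
                sub(/:.*/, "", pkg)    # drop a ":class" suffix such as ":cronroot"
                print pkg
            }
        }
    ' "$contents"
}
```

Run against the registry line shown above, owners_of /var/spool/cron/crontabs/root prints SUNWcsr.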

Any time I use question marks in my prototype file to inherit attributes from an already-installed package, I always, always, always stop immediately to look up that inherited object in the registry and add its parent package to a depend file. Here's an example prototype file displaying inheritance:

$ cat prototype
i pkginfo
i depend
i copyright
i i.cron
i r.cron
d none var ? ? ?
d none var/spool ? ? ?
d none var/spool/cron ? ? ?
d none var/spool/cron/crontabs ? ? ?
e cron var/spool/cron/crontabs/root ? ? ?
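To make that habit harder to skip, the inherited entries can be pulled out of the prototype mechanically. The following is a rough sketch under a few assumptions: the function name inherited_paths is made up for illustration, relative paths are assumed to install under a base directory of /, and entries with a leading part number are not handled.

```shell
#!/bin/sh
# inherited_paths PROTOTYPE   (hypothetical helper name)
# Print the absolute path of every prototype entry that inherits its mode,
# owner, or group ("?") from an object that is expected to exist already.
inherited_paths() {
    awk '
        $1 == "i" || /^#/ || /^!/ { next }    # skip info files, comments, commands
        $4 == "?" || $5 == "?" || $6 == "?" {
            path = $3
            sub(/=.*/, "", path)              # keep only the left side of path=source
            if (path !~ /^\//) path = "/" path    # assumes a base directory of /
            print path
        }
    ' "$1"
}
```

Each path it prints is a candidate for the same grep against /var/sadm/install/contents shown earlier, and each owning package found that way belongs in the depend file.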

This was never a problem in our traditional JumpStart environment because the JET custom packages are installed after SUNWcsr. Fortunately, it's easy to control the order (dependency) of package installations using the depend(4) file.

$ cat depend
P SUNWcsr Core Solaris, (Root)
$ grep depend ./prototype
i depend
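A quick sanity check before building the package is to confirm that every package you expect to depend on really has a prerequisite line. This is only a sketch: missing_deps is a hypothetical name, and it looks only at "P" (prerequisite) entries in a depend(4)-style file.

```shell
#!/bin/sh
# missing_deps DEPEND_FILE PKG...   (hypothetical helper name)
# Print each PKG that has no "P" (prerequisite) line in DEPEND_FILE.
missing_deps() {
    depend=$1
    shift
    for pkg in "$@"; do
        # awk exits 0 when a matching prerequisite line exists, 1 otherwise.
        awk -v p="$pkg" '
            $1 == "P" && $2 == p { found = 1 }
            END { exit !found }
        ' "$depend" || echo "$pkg"
    done
}
```

With the depend file above, missing_deps depend SUNWcsr prints nothing, while a package that was forgotten is echoed back.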

Problem solved! Of course, the problem never would have happened if I'd remembered my Jedi training. It's good to be humbled on a semi-regular basis.
