Solaris Jedi: 2007

Monday, October 29, 2007

The nature of systems engineering

I was reading an interesting passage from the Tao Te Ching this morning which, I believe, has great applicability to the nature of systems engineering.

We join spokes in a wheel,
but it is the center hole
that makes the wagon move.

We shape clay into a pot,
but it is the emptiness inside
that holds whatever we want.

We hammer wood for a house,
but it is the inner space
that makes it livable.

We work with being,
but non-being is what we use.

What does this mean to those of us who wield keyboards at the battle of the command line? Probably more things than can be said. To start the pondering I'd like to offer two thoughts.

(1) The external perspective: Remember that the business your systems support doesn't think in terms of IOPS, MB/sec, or LoC. The IT organization does not exist to amaze itself. It exists to enable a business process. As you learn about Perl, Zones, and ZFS, are you also learning the business those technologies support?

(2) The internal perspective: Have you ever met an administrator or engineer whose wall is decorated with certifications, and yet you would not trust them to configure IPMP on a server you were responsible for? Have you met anyone who could write code as fluently as you speak your native language, and yet they could not effectively translate business requirements into functionality without great effort? As you learn the technologies you need to execute your job, are you also learning universal skills such as troubleshooting and communication?

What other areas of our trade does this parable apply to?

Thursday, September 20, 2007

Oracle's writing on the wall

I received an email newsletter this morning with the headline, "Oracle support betrays a preference for Linux and x86." Sun and Oracle seem to have a love-hate relationship driven primarily by their symbiosis rather than their ideals. This appears to be another chapter in that long story.

The article referenced by the newsletter mentions the fact that Oracle 11g is currently only available for Linux. That's a very interesting move considering the size of the Oracle installed base on Solaris. Not only the population size, but the class of customer. More than one global enterprise is running Oracle on enterprise class Solaris hardware.

I can't help but speculate that we're leading up to a boost in Sun's emphasis on PostgreSQL. First we saw its inclusion in the base Solaris 10 software. This is no small thing; even compilers are distributed separately. Postgres' own FAQ recommends use of Sun's compilers over GCC on the Sparc platform. It's practically heresy to recommend an open source product be compiled on anything other than GCC, so again this is not to be dismissed. Finally, I'll draw your attention to the release announcement for Solaris 10, Update 4 where enhancements to PostgreSQL DTrace probes are released. If this doesn't look like building up a rebellion, I don't know what does.

I give Sun a lot of credit for investing heavily in PostgreSQL and bringing some serious competition to Oracle. Evolution is based upon competition, and I'm happy to see the Sun species evolving into a new predator.

The trouble with packages and auto-pilot

I stumbled into a very interesting problem and resolution this morning which I think deserves some attention. I didn't work on the diagnosis and research, so I'm summarizing from an email thread. We use a Citrix server to share out GNOME environments from our development server. It's particularly nice when you're working from home and the VPN kicks you out, or if you're using public wifi and your connection is spotty.

At some point a week or two ago people began to notice that they couldn't connect to GNOME. This took a little while to unfold because some people keep sessions opened for extended periods of time, but eventually we discovered that it was dead for everyone. After eliminating license server issues there was only one thing we could come up with that had been done to the server.

A colleague had installed a current version of FireFox on the server because Sun's desktop environment is often very slow to integrate application software updates. He used the packages from Blastwave.org. Note that I say packages: a plural word. Indeed, FireFox turned out to be more than twenty packages when delivered by Blastwave.

The foundation of Blastwave is their packaging system, pkg-get. If you have any stick time in the Linux world you're probably familiar with something like Yum, apt-get, or up2date. These tools know how to connect to software servers through http, https, ftp, firewalls, proxies, etc. They also know how to resolve package dependencies. This can be very convenient on a Linux system where a single source handles the OS packaging and application packaging.

In contrast, Solaris provides pkgadd. Pkgadd cannot resolve dependencies. It only knows how to retrieve packages from a specified URL, and has no ability to retrieve packages from a Sun resource on its own. Pkgadd is a bit antiquated by modern UNIX standards unless coupled with the Sun Connection, which is not quite the same thing. This huge gap between Linux packaging systems and Sun's pkgadd inspired Blastwave's packaging system and repository.

Blastwave provides many packages that are also provided by the Solaris OS. The difference is that they provide more frequent and convenient updates. If you need bleeding edge features in the tools you install, Sun's /usr/sfw/* and /opt/sfw/* packages will probably not help. I tend to think that it's more the exception than the norm to require updates that frequently. I know there are exceptions here and there, but overall, how often do you really need a new version of wget, or gtar? Although I love having the latest and greatest "stuff", I even use the old Mozilla browser in Solaris and rarely have any problems.

When my colleague innocently asked Blastwave to install the latest FireFox package, it installed a fairly significant list of packages. One of them was fam, the file alteration monitor. For those who may not be familiar with FAM, it is described as follows (from the FAM web site):

GUI tools should not mislead the user; they should display the current state of the system, even when changes to the system originate from outside of the tools themselves. FAM helps make GUI tools more usable by notifying them when the files they're interested in are created, modified, executed, and removed.

We eventually discovered that fam installs an inetd service. I don't know, or care, what that service is doing. What I do know is that I did not want a new service running. As a result of installing the Blastwave FireFox package and its slew of dependencies we ended up with a new service running and had absolutely no warning that it was happening. That service somehow conflicts with, and breaks, GNOME. It turns out that there is an OpenSolaris bug describing the same symptoms.

Ignoring the obvious concerns about a simple desktop web browser requiring 20 package dependencies and breaking GNOME, I have a much larger concern. Turning up an inetd service creates a new attack vector for a server. Whether or not that is acceptable is a question of risk management. In many cases it doesn't matter. In our data center, servers must pass an external probe scan to be in production and adding services requires change requests. So for our purposes, the changes are not acceptable, and we will need to back them out. We are also imposing a ban on Blastwave within our data center servers. It's simply not an acceptable framework for a mission critical server environment.

Whether or not you deem it reasonable to install an inetd service to run FireFox, it's hard to argue that a web browser requiring an inetd service is intuitive. Note that fam is NOT a FireFox dependency in other distribution channels. Of course, this kind of thing can be caught with good change management using a promote-to-production path, which is how we found this issue on our development server.
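
One low-tech way to catch this sort of surprise is to snapshot the service list before and after a package install and compare the two. The following is only a sketch under assumptions, not something from the original incident: the function name and file paths are made up for illustration, and it assumes the snapshots are captured on Solaris 10 with something like inetadm (or svcs -a), one service per line.

```shell
# new_services BEFORE AFTER
# Print every line that appears in the AFTER snapshot but not in BEFORE.
# (Hypothetical helper; both arguments are plain text files.)
new_services() {
    sort "$1" > "/tmp/svc_before.$$"
    sort "$2" > "/tmp/svc_after.$$"
    # comm -13 keeps only the lines unique to the second (sorted) file.
    comm -13 "/tmp/svc_before.$$" "/tmp/svc_after.$$"
    rm -f "/tmp/svc_before.$$" "/tmp/svc_after.$$"
}

# Example workflow (paths are placeholders):
#   inetadm > /var/tmp/inetd.before
#   ...install the packages...
#   inetadm > /var/tmp/inetd.after
#   new_services /var/tmp/inetd.before /var/tmp/inetd.after
```

Anything it prints is a service entry that appeared, or changed state, during the install, and deserves a second look before the box goes anywhere near production.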

While Solaris' pkgadd facility is not as convenient as some of the Linux systems, it forces you to make conscious changes to a system rather than hitting auto-pilot and hoping for the best. I would love to see Solaris' packaging facility evolve into a tool with the capabilities of its Linux counterparts, but only for the freeware / OSS packages that are built and distributed by Sun (of which there are quite a few). I'd also like to see the ability to configure additional repositories (such as a local server for custom packages), as long as it's not set that way out of the box. I guess it's time for me to start exploring Update Connection's capabilities.

My suggestions are as follows: First, beware the autopilot. Second, keep Blastwave on the workstations, and as far away as possible from the critical servers.

Thursday, August 23, 2007

DVD upgrade adventures

I had an irresistible opportunity to rescue an Ultra 60 workstation from a trash nap recently. This is the sort of thing I really shouldn't do because I'm trying to reduce my data center footprint. On the other hand, it's such a cool workstation that I had to do it. This box was reported to be unable to boot, but I'm pretty good with hardware repairs, so I decided to go for it.

Although it took forever to get through the process, the classic method worked. I can't count how many systems in this era seemed to have problems that turned out to be solved by reseating memory or CPUs. I did both, and it came to life like a resuscitated drowning victim.

Next stop, storage. I replaced the 9GB disks with 36GB disks from the unused half of my D1000 array. This was going too easy. As I was poking around the drive bay I noticed that the cable had been removed from the CD-ROM. Not a good sign. Tracing to the other end of that ribbon I noticed that someone must have been having a bad day as it was half ripped from the daughter board's crimping. Confirmed ugliness.

Being the fatal optimist I grabbed my tool kit and carefully pressed the ribbon back down onto its pins. Next stop, the drive bay. I reconnected the CD-ROM thinking that it might work... Nope. This one had a bad case of indigestion and spit out any disks I inserted. What's worse, once it spit them out, the drive tray could not be closed. Stick a fork in it - it's toasted.

I borrowed a Sun DVD from my 420r just to test out the SCSI channel, and successfully loaded Solaris 10, so it looks like the drive needs to be replaced. Next stop: eBay. I picked up a Pioneer DVD-302, which is one of the few remaining SCSI DVD options out there. I could have bought a Sun DVD, but they are all grey, and this case is beige. Can't compromise the aesthetics. (I'm really in bad shape, aren't I?). The drive arrived, looking shiny and new. I managed to get the thing installed, but it's not happy.

Booting from a DVD results in error messages like "Short read. 0x0 chars read". Eventually the retries end, and it complains about errors finding interpreter, and "Elf64 read error". Booting from a CD-ROM gets a little farther along before it spits out "incomplete read- retrying", and "vn_rdwr failed with error 0x5". Oddly, it does seem to be working once the OS is loaded, so this appears to be an incompatibility at the OBP level.

What annoyed me the most in this whole exercise was not finding anything in an hour of Google searches that indicated anyone had even attempted such an upgrade. I know there are quite a few U60s still kicking around out there, and I'd have to think their owners would be looking for DVD capability and higher speeds. I must have thought wrong. If you happen to be reading this post and have experience with a SCSI DVD-ROM being bootable in a Sun Ultra workstation I'd love to hear about it.

I guess I'll just have to keep looking for a beige Sun DVD-ROM on eBay, but so far the pickings are slim. Wish me luck.

Thursday, August 16, 2007

IBM Sees the Light?

Wow, I didn't see this coming.

IBM and Sun today jointly announced that IBM will offer and fully support Solaris on their compatible hardware lines. This raises some interesting dust clouds.

What does this mean to AIX, IBM's flagship UNIX? Personally, I think it means little. Sun supports Windows and Linux on their hardware, but those of us who have been with Sun for a long time still prefer Sparc in most cases. I believe the same will be true of IBM and Solaris.

How will Solaris compete with the investment IBM has already made in optimizing their previously supported operating environments? There's no way it will be on the same level right out of the gates, but when you consider the OpenSolaris model, it becomes clear that IBM will not have to jump through hoops to make it happen, and I believe they will. When IBM announced that they would be supporting Linux it was initially a bit of a surprise because of the inherent undermining of AIX. And yet, they have contributed some incredible advances to Linux's abilities in the enterprise data center. I think of IBM as the mature mentor that helped Linux to grow up.

Now Solaris is no padawan looking for a master to study under, so that makes for a different game. But there's no question in my mind that IBM will have a serious group of Jedi coders participating openly and actively in the OpenSolaris community, and that can only help Sun and Solaris.

Of course, the down side is the precedent this sets leading towards too broad a foundation. Using Linux as an example we see a massive code base that tries to support as much hardware as possible. There are basic laws of software engineering, just as there are laws of physics, and the more lines of code you have, the more potential you have for bugs, integration issues, and regression failures. Doesn't matter how good your developers are, the probability still goes up. I'd hate to see Solaris supporting everything Linux does; I'd like it to stay focused in its sweet spot of quality hardware, which despite my preferences, I think IBM hardware is in alignment with.

What does this mean for Linux in the Enterprise? Well, I think Linux has a tough climb ahead of it as it stares up the cliff at Solaris' backside. Linux was developed on PCs by people who aren't typically in an enterprise. You could argue that the coders went to Linux because they couldn't afford at home what they had at work, but the bottom line is still the same.

Linux does not have a lot of "stick time" on servers built at the scale of Sun's high end servers like the Enterprise 20k. On the other hand, Solaris has been running on multiprocessor systems since before Linux was a twitch in Linus Torvalds' ear. You have to spend time working with servers that have 20GB of RAM and 64 processors before you can even anticipate the kinds of problems that occur. Linux just doesn't have that kind of time in a data center. I'm not saying they can't get there, I'm just saying you have to pay your dues to provide stability at the high end.

Keeping all that in mind, put yourself in IBM's shoes. AIX is not gaining market share, although it's a rock solid enterprise class operating environment. Linux brought IBM a huge customer base, and helped them to sell Intel hardware. Unfortunately, it didn't really put them in the data center where they belong. Along comes Solaris with the openness of Linux, and the opportunity to leverage it quickly - just as they did with Linux. But this time, they start at the upper end of scalability and bypass that climb altogether. Where would you put your resources in the long run?

Wearing my purely speculative hat, I think this announcement was a big strike against Linux in the Enterprise, and a foreshadowing of Solaris' long term viability. As more and more products come on-line with the Internet, data centers are only going to grow. And as that continues to happen, consolidation will be the only way to drive utilization up and costs down. The natural extension of this prophecy is that the operating environment that scales best and stays stable is going to be the evolutionary top of the food chain. And I think that Solaris will be in that seat.

IPMP, anyone?

While scanning various news feeds this morning I ran into a story regarding a computer breakdown at the LAX airport. The first article suggested the problem was a network card failure, and the second article suggested the problem was a switch failure.

In either case, the result was 17,000 - 20,000 (varies by article) international passengers being stranded for a fairly significant duration. But wait, it gets better... "The system was restored about nine hours later, only to give out again late Sunday for about 80 minutes, until about 1:15 a.m. Monday." Two failures, both stopping passengers at an incredibly busy airport.

I'd like to offer my consulting services to LAX for free, and recommend that they move an obviously critical function over to servers running the Solaris operating environment where they can enjoy the benefits of IP MultiPathing (IPMP). A properly architected system would have had redundant switches, and multiple network interfaces, each connected to a unique switch. The failures indicated would have caused no interruption to service. This is server design 101.

What, you may ask, would be the cost of this highly advanced architecture? Well, of course it depends on the cost of the switches you run because you'd need two, but on the server side it's free, and included with Solaris. I run IPMP on the servers in my basement, and my wife can assure any who may ask, my IT budget is far less than that of the mighty LAX airport.

Friday, August 03, 2007

When will Wall Street wake up?

In case you had any lingering doubts as to whether Sun has been doing the right thing by embracing the Open Source model, take a moment to peruse an entry from Jonathan Schwartz' Blog. I'll quote the part that caught my attention:

As you may have seen, we've announced our fourth quarter and full fiscal year results ... We grew revenue, expanded gross margins, streamlined our operating expenses - and closed the year with an 8% operating profit in Q4, more than double what some thought to be an aggressive target a year ago.

We did this while driving significant product transitions, going after new markets and product areas, and best of all, while aggressively moving the whole company to open source software (leading me to hope we can officially put to rest the question, "how will you make money?").

It is extremely frustrating to me that public companies must deal with putting their fate in the hands of a group of analysts who have such limited understanding of the ecosystems of information technology. Wall Street has been so timid about Sun since the bubble burst, largely because their fear of the past clouds their ability to see the future (or the present, for that matter).

Today Sun has the best product portfolio I've ever seen. They also have the financial metrics to prove their strategy is good. I have invested more than ten years of my life in their products, and I can say with no hesitation that I plan to continue that investment for the next ten years as well. The only question I have is when the rest of the industry will catch up.

Friday, July 27, 2007

Picking a terminal server

My plan to equip the lab with older, but solid equipment has been going very well thus far. It's not cheap, but it's going to be very functional. The two Netra X1 servers are doing a great job, and I'm really enjoying having a LOM. I wish my "big iron" 420R had a LOM, but a Sun serial port still beats an x86 BIOS program. And what could be cooler than accessing those serial LOM devices through a terminal server? (Yes, I suppose a modern Sun server with an Ethernet LOM would be cooler, but don't burst my bubble, ok?).

So now that I've accumulated these boxes and am beginning to use them on a regular basis, you can imagine that patching wasn't far behind. Patching is one of many activities where a console connection comes in pretty handy. To make a long story short, I quickly grew tired of trucking my laptop downstairs, attaching a serial cable to it, and then performing an elaborate contortion routine to find the LOM port in the back of my rack while pressing my face through cobwebs. Been there before? Yes. I have decided that I need a terminal server.

So, what is my ideal terminal server? Well, there are a few requirements. It must be a quiet, low power device - no giant noisy fans need apply. I need an 8-port device, but 16 would give me room to grow if the price is right. I don't care too much about security protocols - this is a home lab that sits behind a firewall, and all my systems can be reprovisioned from a flash archive in a heartbeat. Should be easy, right?

The first thing I learned is there are a LOT of 32-48 port high end (not old!) term servers available, primarily Cyclades devices. These look like Ferraris to me, and I dream of winning an auction for about $50 and attaching that puppy to my rack. Not going to happen... The next thing I noticed is a bunch of really old Xyplex and Perle devices. These rack up, but I read a bunch of horror stories, and got the idea from a few USENET postings that they are loud. I found a few other older devices, but they all had something that didn't seem right to me. It was time to get drastic...

I went with plan "C". In this case, the C stands for Cisco. Turns out that with some auction patience, a properly equipped Cisco 2509 (8 port) or 2511 (16 port) can be had with cables for around $150 or less. That's right at my pain threshold, but acceptable given what it provides. This solution appears to be hit or miss with the issue of spontaneous break signals halting the Sparc machines, which usually happens if the TS powers down, but the kbd command can be used to configure an alternate break sequence and avoid the issue.

The other appealing feature seems to be that I can configure reverse-telnet. This would allow me to run a command like "telnet termserver 2001" to get to port 1. Much more convenient than authenticating to a termserver and navigating annoying menus. And finally, being a full size 19" box I can rack it up without coming up with some combination of plywood and duct-tape. Suh-weet.
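
The port arithmetic is simple enough to wrap in a couple of shell functions. This is a sketch under assumptions: it relies on the usual Cisco convention that async line N answers reverse telnet on TCP port 2000 + N, it assumes a POSIX-style shell such as ksh or bash, and the function names and the TERMSERVER variable are invented for illustration rather than anything the router provides.

```shell
# ts_port LINE
# Map an async line number to its reverse-telnet TCP port (2000 + LINE).
ts_port() {
    echo $((2000 + $1))
}

# console LINE
# Open the console attached to the given line. TERMSERVER is a
# hypothetical variable; it defaults to the "termserver" hostname
# used in the example above.
console() {
    telnet "${TERMSERVER:-termserver}" "$(ts_port "$1")"
}
```

With these in a shell profile, "console 1" would be shorthand for "telnet termserver 2001".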

The downside? Well, ssh would be more cool than Telnet, but I can swallow my pride. Who knows? Maybe there's a Cisco update that would provide this. It might be a loud device. I have no idea. Another issue which decrements the coefficient of cool: It requires an AUI adapter to convert to an Ethernet RJ45 port. On the other hand, there are probably a lot of new SAs in the world who would look at that like a vintage muscle car... "Whoa - is that a REAL aui adapter, dude? You must be hard core." Um, yeah. Maybe not. Although the loudness and power consumption concern me, I think I can live with these issues if it works, which I'm reasonably confident it will.

Now, to set up an eBay search and begin the hunt...

Thursday, July 26, 2007

Learning to think in Z

In the traditional disk mounting world we had a device under the /dev directory which is mounted on an (aptly named) mount point. For example:


# mount /dev/dsk/c0t2d0s0 /export/install

On a large database server you might see the common convention of mounting disks with /uXX names...


# ls -1d /u*
/u01
/u02
/u03
/u04

This is the frame of reference I used when walking into the building of my new JumpStart server. My goal was to stick as close as possible to standard mount points. The first file system was to be mounted on /export/install. The second file system would serve as my home directory, and I didn't much care where it lived since I'd use the auto mounter.

The default ZFS configuration is to mount a complete pool under its pool name. I tried to be creative in coming up with a naming convention, but slipped into mediocrity with a "z##" name. Hey, I'm tired of seeing /u##; it's amazing what a difference one letter can make in spicing up a server. Having come up with my name, I created the pool from my second disk:


# zpool create z01 c0t2d0 
# zfs create z01/install
# zfs create z01/home
# Hmm, why not make my home its own fs?
# zfs create z01/home/cgh

Wow. That was easy!

But now there's a sort of a problem. I can't quite get past seeing the JumpStart directory under /z01. It's not intuitive there. The world of Solaris sysadmins looks for JumpStart files in /export/install. So, how can we get this sweet ZFS file system to show up where I want it? Turns out this is pretty easy as well.


# zfs set mountpoint=/export/install z01/install

It even unmounts and remounts the file system for me. Oh yes, I'm a fan at this point.

One thing that's interesting is that once you move a mountpoint from its default, it can be easy to "lose" that file system. For example, if I list the contents of z01 at this point, I only see home. "install" no longer shows up there because it's mounted on /export/install. In this example it's hard to lose anything, but on a large production server there could be many pools and many file systems. As you would expect, there's an easy command to list the file systems and their mount points:


# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
z01                   1.61M  36.7G  26.5K  /z01
z01/home              1.49M  36.7G  1.45M  /z01/home
z01/home/cgh          35.5K  36.7G  35.5K  /z01/home/cgh
z01/install           28.5K  36.7G  28.5K  /export/install
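
On a server with many pools, the same listing can be filtered down to just the file systems that have wandered from their default location. The helper below is a rough sketch, not part of ZFS: the function name is made up, it assumes the five-column default output shown above, and it treats "/" plus the dataset name as the default, so anything mounted elsewhere (including file systems that inherit a relocated parent) gets reported.

```shell
# relocated_mounts
# Read `zfs list` output on standard input and print the file systems
# whose mount point is not "/<dataset name>". (Hypothetical helper.)
relocated_mounts() {
    # NR > 1 skips the header line; $1 is NAME, $5 is MOUNTPOINT.
    awk 'NR > 1 && $5 != ("/" $1) { print $1, "->", $5 }'
}

# Example:
#   zfs list | relocated_mounts
```

Fed the listing above, this would report only z01/install and its /export/install mount point.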

I decided to leave the z01/home in place and just repoint the auto-mounter. From zero to "get it done!" in about 20 minutes with some play time. I love it.

First impressions of ZFS

If you're anything like me, you cling to that which you know while yearning for that which you haven't yet dabbled in. Tonight was a small victory for my self discipline, and a great example of why I think I'm going to be good friends with ZFS.

I've been mentally moving forward with a new JumpStart server layout for a while now. This server would have very little need for horsepower with storage space being what I really needed. Its main purpose is to help me consistently provision lab environments here at home for projects. I ended up selecting a Netra X1, which is very inexpensive on eBay. It's a nice low power draw platform that has plenty of power, and one less common feature among the Sun lines: IDE (PATA) drives. Yes, I mean that in a good way.

I was able to load it up with a 40GB boot drive and 120GB data disk to house install media images, flash archives, home directories, and some crude backups for the rest of the lab environment. The cost of a SCSI disk in that size is insane by comparison, and would provide no advantage for the tiny demand it would be charged with. I jumpstarted the hardware from another Sun machine, then loaded the JumpStart Enterprise Toolkit (JET) and prepared to boogie.

Ahh, but now the moral dilemma rears its ugly head. How to manage that data disk? I haven't spent much time playing with Solaris Volume Manager (SVM) soft partitions, but enough to know it was a snap and would do the job. On the other hand, I've been twitching to learn ZFS, and this could be just the excuse I needed to get started.

The hard part about this decision was deciding whether or not I perceived ZFS to be an abyss, or a simple technology. I can't count the number of times I've done something silly like saying, "Oh sure, we could write a quick Perl script to do that." Only to find that two months later I'd grossly underestimated the complexity. I'm a chronic and pathological optimist.

I'm happy to report ZFS was painless and a pleasure to use. I'm still in shock from the simplicity. This is fun... I don't miss Linux at all.

Monday, July 16, 2007

Inconsistency in prtdiag output

I've been doing a lot of work recently writing Perl scripts to mine data from local Explorer repositories. It's a phenomenal resource as a sort of RAW input to a configuration DB, and with Perl it's a snap to pull out data. My latest exercise was pretty trivial. I needed to yank out the memory size field from prtdiag for each system, then dump it into an XML feed that serves one of our databases.

The information resides in the prtdiag-v.out file, and looks something like this:


fooserver{sysconfig}$ more ./prtdiag-v.out
System Configuration:  Sun Microsystems  sun4u Sun Fire E20K
System clock frequency: 150 MHz
Memory size: 65536 Megabytes

So, we throw together a little Perl script that does this:


sub get_memory_size {
   my $explodir = shift;
   my $prtdiagfile = "$explodir/sysconfig/prtdiag-v.out";
   my $memsize = 0;

   # Return 0 if the prtdiag file is missing or cannot be opened.
   open(my $prtdiag, '<', $prtdiagfile) or return 0;

   # Scan for the first "Memory size:" line and keep only its value.
   while (my $line = <$prtdiag>) {
      chomp $line;
      next unless $line =~ /^Memory size:\s/;
      $line =~ s/^Memory size:\s+//; # Kill the label
      $line =~ s/\s+$//;  # Remove any trailing whitespace
      $memsize = $line;
      last;
   }
   close($prtdiag);

   # Still 0 if no "Memory size:" line was found.
   return $memsize;

} #end get_memory_size

No problem!

Then I put together a simple loop to check what I'd found... Now help me understand why this can't be simple and consistent? Here's some of the variety:


[2GB]
[6144 Megabytes]
[512MB]

Can't we just agree to use either MB or GB? Or if we're in a verbose frame of mind, Megabytes or Gigabytes. My response is to normalize the exceptions I can locate so that it comes out consistently with GB or MB, but I wonder whether this will remain a stable interface?
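
A minimal sketch of that normalization, reducing every variant to whole megabytes, might look like the following. It is written in shell with awk rather than Perl, the function name is made up, and it only understands the three spellings shown above (MB, GB, and the spelled-out Megabytes/Gigabytes); anything else is rejected with a non-zero exit status rather than guessed at.

```shell
# normalize_mem STRING
# Convert a prtdiag memory size such as "2GB", "512MB", or
# "6144 Megabytes" to a plain number of megabytes. (Hypothetical helper.)
normalize_mem() {
    printf '%s\n' "$1" | awk '
        {
            value = $0
            sub(/[^0-9.].*$/, "", value)        # keep the leading number
            unit = $0
            sub(/^[0-9.]+ */, "", unit)         # keep what follows the number
            unit = toupper(substr(unit, 1, 1))  # "G" or "M"
            if (value == "") { exit 1 }
            if (unit == "G")      { printf "%d\n", value * 1024 }
            else if (unit == "M") { printf "%d\n", value }
            else                  { exit 1 }
        }'
}
```

For example, normalize_mem "2GB" prints 2048, while "6144 Megabytes" and "512MB" come back as 6144 and 512.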

What I find even more entertaining is a daydream of an engineering team sitting around a table having a serious debate about changing the output from Megabytes to MB. With such a controversial topic, I'd imagine the debate was heated.

Sunday, June 17, 2007

Sun Certification: To dig, or not to dig? (part 1 of 2)

Certification can be an almost religious debate amongst the technical community. One faction believes wholeheartedly that the measure of a technologist is his ambition and list of accomplishments. The other camp believes to death that certifications are a demonstration of professional commitment and a common ground from which to base skill assessments. I am currently a Sun Certified System Administrator (SCSA) for Solaris 7 and Solaris 9, as well as a Sun Certified Network Administrator (SCNA) for Solaris 7. I have been studying avidly for the upgrade exam, which will add Solaris 10 to my SCSA listing. I hope to pass it soon so I can reclaim the studying hours for other, more interesting tasks.

I think just about everyone who has been in the field for a reasonable length of time has encountered the certification specter... You know, the guy who has his Masters degree in CS or IS/IT, MCSE, CCNA, SCSA, and a few others tossed in for good measure. They look like a lesser god on paper, but then you notice that once they log on to a system they can't write a script to save their life, and forget that shutting off the SSH daemon during business hours is a bad thing. These academic savants are a big reason why certifications have a bad name. In my mind they demonstrate the basis for the phrase, "just because you CAN doesn't mean you SHOULD." A certification, in my mind, is a commitment to understand the best practices and core tools within a product and apply that knowledge actively to your solutions and daily work. A classic example is the proper use of init scripts - something that the majority of system administrators I have crossed paths with never learned. This information is found easily in the Solaris System Administration documentation collection, so why is no one practicing it? In this case, it has nothing to do with it being a bad practice to follow... It's just a topic people do not bother to understand beyond the minimum required to make it work.

On the other hand, I have known many top-notch Solaris professionals who are not certified. They can run circles around me in both theory and practice, but never took the additional step. I don't respect them any less because they have demonstrated a commitment to their field through practice. What I don't respect is the "average" SA who believes they could write the kernel scheduler in half the lines of code, but hasn't accomplished anything more advanced than setting up Apache Virtual Servers and using Veritas' Volume Manager to unencapsulate a root disk.

I've listened to this type of person lecturing from their soapbox about how they don't need a certification to prove their skills. Uh huh. But it might take the edge off the cowboy hat, and create a spark of thought-discipline. You see, being certified does not mean that you have to practice everything you learned. It means you have taken the time to understand in depth one way of doing things. The alternative is spending no time studying, and simply absorbing that which you cross paths with.

Another reason certifications have a bad name is that they do not address the real world. Exactly how could a one-hour exam possibly compress all the operational knowledge one gathers by the time they are ready to be certified? By now anyone reading my blog should be free of any doubt that I love Sun Microsystems. Having reminded you of that point first, I will now say that I am not a fan of Sun's certification strategy in the Solaris Operating System Track. I am basing my study on Sun's Web Based training curriculum, which I find generally outstanding as a substitute for instructor-led education. My beef is not with the vehicle, but the curriculum.

As an example, I would estimate that one third of the training materials consumed my time with how to accomplish a task in the Solaris Management Console (SMC). SMC is an interesting idea which flew about as well as a snail tied to a brick. It's not all bad, but it's not all that useful. I don't mind the option to use a GUI, but the amount of time spent on it in the curriculum is ridiculous when considered against the amount of use SMC gets in the real world.

Is it good to know how to use SMC? Of course! Especially for its ability to manage local accounts (but it stinks for network information systems like NIS+ or LDAP). But let's not worry about memorizing all of its menus and screens. One of UNIX's advantages is its ability to be remotely managed over a serial connection. I'd never hire a UNIX SA who couldn't do his job proficiently over a 9600-8-n-1 connection.

Here's another sore spot for me... One of Solaris 10's most incredible features is ZFS. I have not begun to expand in my mind the full effect it will have on the industry, and it's not just a series of commands to memorize - it's an entirely new way to manage storage. And yet, there is NO coverage of it on the Solaris 10 exam. Are you KIDDING me?

Thankfully zones are covered, and I'm told that the exam had a good number of related questions. However, the coverage isn't very deep, and sticks to the commands more than the theory. That's unfortunate because it's easy to look up a man page, but hard to design a well thought out consolidation platform. I'd say that sentence sums up my thoughts on certification strategies on many levels.

Resource management is another feature which seems conspicuously absent from the certification curriculum. Although it is very complex (aren't zones as well?), it is a very powerful Solaris feature which I believe is a competitive advantage for Sun. So why not expect a certified administrator to know how to use it? The idea isn't to make everyone feel good with a title on their business card, it is to demonstrate that someone has differentiated themselves by demonstrating a defined level of skill.

What else would be important for an SA to have cursory knowledge of? DTrace, anyone? I don't expect every competent Solaris administrator to be able to write advanced D scripts, or memorize the seemingly infinite number of probes available in Solaris, but for the love of McNealy, can't we even expect them to know what kind of problem it solves? Can't we even establish what a probe is, and why Solaris is WAY ahead of Linux in that respect?

Finally, the emphasis on memorizing obscure command line options really grates on me. This is really what undermines the technical merit of Sun's Solaris Certifications. There are so many commands and concepts that deserve coverage, it seems a shame to take up space with questions like, "What is the option you must give to ufsdump in order to ensure /etc/dumpdates is updated when using UFS snapshots?" I don't know anyone practicing in the field who wouldn't look that up in the man pages even if they THOUGHT they knew the option.

I could go on, but I think the point is made: The exam is lacking in strategic substance. As a result, most folks describe it as a memorization exercise. That's a shame, because the exam COULD be a differentiating ground for Solaris professionals as well as a way for Sun to ensure the compelling features of Solaris are being leveraged to their fullest. And yet, I'm getting ready to take my third SCSA exam. Why?

I maintain the currency of my Solaris Certifications because I believe a professional seeks to understand standards in their field, whether good or bad. As a professional Solaris system architect, the SCSA and SCNA exams are at the core of my practice whether I choose to follow or deviate from their content. I also believe that a certification tells my customers (or employers) that I demonstrate a certain level of competence, even if the bar is not as high as I would like to see it.

I believe deeply in the importance of standards and certifications as a vehicle for advancing the maturity of Systems Engineering practices as applied to system administration. And although Sun's certifications are not there yet, I will continue to support them for what they do provide, and what I hope they will provide in the future: a vehicle to advance the maturity of the industry.

Part 2 of this article will discuss my recommendations for improving Sun's Certifications. As usual I have a few ideas up my sleeves. Stay tuned...

Monday, May 21, 2007

Who's on first? Identifying port utilization in Solaris

Setting up Apache2 on Solaris 10 is normally about as challenging as brushing your teeth. But in this case, I was humbled by an unexpected troubleshooting adventure. I needed to transfer a TWiki site from an Apache server running on Solaris 9 to an Apache2 server running on Solaris 10. Sounds pretty straightforward, but I abandoned discipline at one point in the game and that detour came back to bite me.

I started out carelessly, thinking that to make things simple I would just use the legacy Apache. This would save any initial headaches with module incompatibilities (if any existed). So, I started out copying the config file in place and trying to start the daemons. It didn't work, and after a few minutes of fiddling with the new httpd.conf I changed course. My reasoning went something like this, "If I'm going to spend much time fiddling, I might as well fiddle with Apache2 and have something better than I started with." And so it began.

I stopped the legacy Apache daemons and followed a similar process with Apache2, ending with the same result: No daemons. I did some fiddling and located a minor typo I'd made in the configuration which is not of consequence to this story. I issued a "svcadm restart apache2" command. Yeeha! Now I had five httpd processes just chomping at the bit for a chance to serve those Wiki pages.

Or did I? It turned out that no matter what I did with my web browser remotely or locally I couldn't get a response. So, I tried a quick telnet to port 80 to see what there was to see... And of course I received a response, so all must be well. Somewhere in my troubleshooting process I made two mistakes:

First, I didn't remove the httpd.conf file from /etc/apache, which means the legacy Apache starts up and conflicts with Apache2 on a reboot. I've already written an article that goes into some detail about why the current legacy Apache's integration isn't ideal, so I won't expand on my frustration in this one. This problem was quickly solved, and could have been avoided if I had adhered to my Jedi training.

Second, I assumed that when I directed a Telnet session to port 80 it was reaching the Apache2 server. In fact, it was not. I shut down the Apache2 server and again issued the Telnet command to port 80. Surprise! The same greeting appeared. So, some process on the system had claimed port 80 before Apache could do so. Now... To find it!

Linux distributions typically ship with the lsof utility. This provides a quick and convenient way to identify what process is using what TCP port. Solaris doesn't have lsof in the integrated Open Source software (/usr/sfw) or the companion CD (/opt/sfw). It's not hard to obtain and compile, but it's just inconvenient enough that I'm inclined not to do it. My next logical question became, "what is the Solaris way to accomplish my goal?"

Solaris has no way to natively solve this issue without a shell script. There are a number of similar scripts available on-line through a quick Google search. None are particularly complex, but complex enough that you wouldn't want to write them every time you need it. Here's what I ended up with:


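#!/bin/sh
# portpid: report which processes have a socket bound to a given port.

# pfiles can only inspect other users' processes when run as root
if [ `/usr/xpg4/bin/id -u` -ne 0 ]; then
   echo "ERROR: This script must run as root to access pfiles command."
   exit 1
fi

# Take the port from the command line, or prompt for it
if [ $# -eq 1 ]; then
   port=$1
else
   printf "which port?> "
   read port
   echo "Searching for processes using port $port...";
   echo
fi

# Walk every PID and keep the socket names that end in the requested port
for pid in `ps -ef -o pid | tail +2`
do
   foundport=`/usr/proc/bin/pfiles $pid 2>&1 | grep "sockname:" | egrep "port: $port$"`
   if [ "$foundport" != "" ];
   then
      echo "proc: $pid, $foundport"
   fi
done

exit 0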

When executed, it will produce output similar to the following. Note that it requires root permissions to traverse the proc directories...

cgh@testbox{tmp}$ sudo ./portpid 80
proc: 902,      sockname: AF_INET 0.0.0.0  port: 80
        sockname: AF_INET 192.168.1.4  port: 80
        sockname: AF_INET 127.0.0.1  port: 80

A quick "ps -ef" command told me that our Citrix server was to blame for the port conflict...


cgh@testbox{tmp}$ ps -ef |  nawk '$2 ~ /^902$/ {print $0}'
 ctxsrvr   902     1   0   May 18 ?           7:00 /opt/CTXSmf/slib/ctxxmld

Ah ha! Problem solved. I'd like to see the Solaris engineering team add a "p" command, or an option to an existing command to make this functionality a standard part of Solaris. Another option would be to integrate the Linux syntax for the fuser command to make this possible.
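
The filter at the heart of the script can be tried without root by feeding it captured text. What follows is only a sketch: match_port is a hypothetical wrapper, and the sample lines are made up in the style of the pfiles output shown above. It illustrates why the pattern is anchored at the end of the line, so that a search for port 80 does not also report 8080.

```sh
#!/bin/sh
# Sketch only: match_port is a hypothetical wrapper around the script's filter.

# Read pfiles-style text on stdin and print the sockname lines bound to port $1.
match_port() {
    grep "sockname:" | grep "port: $1\$"
}

# Made-up sample input: only the first line should survive the filter.
printf '%s\n' \
    "        sockname: AF_INET 0.0.0.0  port: 80" \
    "        sockname: AF_INET 0.0.0.0  port: 8080" \
    "        peername: AF_INET 10.0.0.5  port: 80" | match_port 80
```

The peername line is dropped by the first grep, and the 8080 line by the trailing anchor in the second.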

Friday, May 18, 2007

Apache in Solaris 10: 3 Simple Things I Would Change

The Apache legacy run control script in Solaris 10 (/etc/init.d/apache) provides an excellent example of a few practices to avoid when writing init scripts.

Take a look at the code snippet below:


if [ ! -f ${CONF_FILE} ]; then
       exit 0
fi

Are you kidding me? Of course this is easy to debug, but let's look at what it does anyway: if you ask to start Apache and the /etc/apache/httpd.conf file is missing, the script exits silently with a code of zero. In case you didn't catch the first four words of this paragraph I'll repeat them. Are you kidding me?

Here's a simple improvement...


if [ ! -f ${CONF_FILE} ]; then
       echo "ERROR: ${CONF_FILE} not found.  Exiting."
       exit 1
fi

The first change was to exit with a non-zero status. Zero is the UNIX standard exit code representing successful completion. If the configuration file is missing and you request a startup, it should NOT exit with a zero status.

The second change is to provide a concise error message indicating why the exit code is going to be non-zero. There is no benefit to bolstering the cryptic nature of UNIX. In my mind the best systems are designed such that a tired SA at 4AM has a reasonable chance of accurate debugging and corrective action.
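
To show why the exit status matters to whatever calls the script, here is a minimal sketch. The check_conf function and the path passed to it are hypothetical and are not part of the Solaris init script; the point is only that a non-zero status lets the caller branch, which a silent zero never could.

```sh
#!/bin/sh
# Sketch only: check_conf is a hypothetical helper, not part of /etc/init.d/apache.

# Return 0 if the configuration file exists, 1 (with a message on stderr) if not.
check_conf() {
    conf_file=$1
    if [ ! -f "$conf_file" ]; then
        echo "ERROR: $conf_file not found.  Exiting." >&2
        return 1
    fi
    return 0
}

# The caller can now tell "started" apart from "never tried".
if check_conf /nonexistent/httpd.conf 2>/dev/null; then
    echo "config present, safe to start"
else
    echo "start refused, status $?"
fi
```

Sending the message to stderr is a design choice in this sketch; it keeps error text out of any output a wrapper might be capturing.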

Having said all this, the reason the code is necessarily convoluted is that the not-yet-configured service has an active set of init scripts in the run control directories.


cgh@testbox{etc}$ ls -i /etc/init.d/apache
     21813 /etc/init.d/apache*
cgh@testbox{etc}$ find /etc/rc?.d -inum 21813
/etc/rc0.d/K16apache
/etc/rc1.d/K16apache
/etc/rc2.d/K16apache
/etc/rc3.d/S50apache
/etc/rcS.d/K16apache

So the root cause of our problem is that someone decided to make it easy for someone who doesn't understand the Solaris Run Control facility to start Apache by simply creating the httpd.conf file. Is that really a good idea? I would argue that for many reasons it's a bad practice. If a service is not configured to run, it should not be active in any run level.

The third detail I would change is Solaris' default behavior of installing active sym-links in the legacy rc directories. I would instead use an SMF manifest that adheres to standards.

None of this impacts the otherwise excellent web server that Sun has integrated into their OS, and I'm grateful that Sun has provided it in their standard OS rather than leaving it to the semi-integrated Companion CD. I would, however, like to see that integration brought up to Jedi standards.

5/21/07 Postscript: I probably should have made it clear that the Apache2 server is implemented nicely using SMF, and is probably what you ought to be using on Solaris 10 if you've decided to forego the JES Web Server. I don't think that excuses the older Apache server from maintaining Jedi discipline, but it does move the issue a bit toward the background.

Monday, May 07, 2007

Turn off the LAMP and Reuse Acronyms

I've never been a fan of the LAMP acronym because it's too restrictive. It gives the impression that to be socially responsible in the Linux community one needs to be a LAMP developer.

In this month's Linux Journal magazine I found an article explaining one perspective on why PostgreSQL is a more desirable database than MySQL. I've had the exact same thought process for years now. Truth be known, I also prefer Perl development to PHP, and I prefer running the stack on Solaris over Linux. I guess that SAPP doesn't have the same sexy ring as LAMP. There's probably an odd trademark thing with an ERP company as well.

Now before you get too bent out of shape, I am aware that the acronym has some poetic license with it, and people often swap Perl and PHP, and in theory any other letter can be swapped out. Why invent a new acronym that doesn't convey the real idea when a perfectly good acronym already exists?

There is nothing wrong with simply stating that an application is built on an Open Source Stack. The acronym OSS (Open Source Software) is well known and conveys a lot more than LAMP. It stands for a methodology rather than a point solution, and embraces the foundation that made "LAMP" so successful. Why limit yourself to MySQL and PHP? Wouldn't you be more valuable as an architect capable of leveraging the most appropriate components Open Source has to offer?

Tuesday, April 17, 2007

Another Round with the Laptop

Having recently switched a bunch of older Fedora Core servers to CentOS 4, I became very excited to see the announcement that CentOS 5 has been released. I am very pleased with the polish and stability of CentOS 4 and thought 5 would be a perfect update for my (then) Ubuntu laptop. My laptop is a fairly old IBM Thinkpad T23 - certainly not bleeding edge hardware.

The display was maxed out at 800x600 rather than the native 1400x1050 I'm accustomed to and my NetGear wifi card was recognized, but not functional. It comes as no surprise that none of my Thinkpad buttons worked. I spent the past 10 years of my hacking career tweaking Linux boxes, and I was using X windows in the days when you actually risked toasting your monitor with a bad config. So, yes, I could make X work. And I have gone through the Windows wifi card firmware dance, so yes, I could make the Wifi card work as well. I could also recompile the kernel and add Thinkpad buttons and ACPI events. Yuck. My rebuild just turned into a lot of research and work.

On a whim I thought I'd try Fedora Core 6 just in case my problems stemmed from the more conservative approach CentOS takes. Same problems, although I love the eye candy in Fedora - their art is great.

I didn't bother to try Solaris x86 because I know it won't detect any of my special hardware. Someday I'd love to be able to use Solaris outside of work, but that day hasn't come yet on the desktop.

So here I am, back with Ubuntu. It detects everything and works right out of the box. Within an hour I had a fully functional laptop, and despite my 1 GHz CPU in a 5GHz world it performs great. I suppose a year from now I'll try it again since I prefer a Red Hat based distribution to a Debian base. But what matters most comes down to two words: "It Works."

Thanks, Ubuntu.

Sunday, April 01, 2007

Does Your Virtualization Strategy Have a Blind Spot?

I just finished reading a very interesting article about virtualization. It described the test results of running two sample web applications under virtual and physical environments. The idea was simply to check how the virtualization affected the application's performance. The results are interesting.

Before I go further, let's clarify that this article is about Windows IIS and Win2k3 server, so it is not about Solaris, or any other flavor of UNIX beyond the fact that the underlying OS for VMWare in this case was CentOS. Nonetheless I believe the story's moral is highly applicable to any operating environment, definitely including UNIX.

There is a lot of detail in the article, but if we can assume their methodology to be relevant, we can reduce it down to a short extract from its conclusion: "a virtualized server running a typical web application may experience a 43% loss of total capacity when compared to a native server running on equivalent hardware."

Wow.

Of course, I have no reason to think that this function carries over directly to Zones, Xen, or even VMWare running Linux. To believe this without research would be to abandon one's Jedi training. It would be a very valuable experiment to try such an exercise though, because I have no reason (at this moment) to think it would not have some relevance.

The big point I want to make here is that I would bet there are a great many sites out there who diligently track low-hanging fruit metrics like CPU utilization, and use that metric both in planning and assessing their virtualization projects. Server "A" runs an application and on average sits at 10% utilization. Let's call it 100 MHz and find it a home on the consolidation farm. Not so fast.

The problem is that if you completely ignore business metrics, you will fail to identify the flaw in virtualization that this article describes. Hey, we moved from 8 servers down to one, and it's at 80% all the time. Success! But what if that new server is running at 80% and only processing 50% of the volume it used to? I would suggest that the article I referenced might be a great reason to reflect on your site's metric strategy and see if you have the ability to accurately assess a consolidation strategy. If not, you could be making an expensive transition.
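
The arithmetic behind that warning is simple enough to sketch. Everything here is hypothetical: the function name, the transaction counts, and the idea of dividing hourly business volume by CPU percentage are illustrative assumptions, not figures from the referenced article. The sketch only shows how the same 80% utilization can hide a halving of useful work.

```sh
#!/bin/sh
# Sketch only: the function name and all figures are hypothetical.

# Print business transactions handled per 1% of CPU, to two decimals.
# usage: work_per_cpu <transactions_per_hour> <cpu_percent>
work_per_cpu() {
    awk -v t="$1" -v c="$2" 'BEGIN { printf "%.2f\n", t / c }'
}

before=`work_per_cpu 8000 80`   # assumed volume before consolidation
after=`work_per_cpu 4000 80`    # same CPU figure, half the volume
echo "before: $before tx/hour per CPU%"
echo "after:  $after tx/hour per CPU%"
```

A CPU graph alone would show both cases as identical; only the business metric in the numerator exposes the difference.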

Tuesday, March 06, 2007

On Proper Use of English

While my mind was entering a virtual screen-saver (yes, oddly enough my mind seems to run a GUI over its CLI interface) I overheard a conversation in the next cubicle. A project manager said, "...and then we give it to the engineers for solutioning". I started to think about how often I'd heard that expression and quickly realized the count was quite high.

Shortly after that exciting revelation (much more exciting than the status bar I watched during an in-progress installation) I recalled another expression I'm finding has become prevalent on our site: The act of "dialoguing". I'm not completely sure of the spelling of this colloquialism, but I'm hearing it all the time.

If I'm understanding them correctly, my job as an engineer entails dialoguing with customers to gain requirements which I can use in the solutioning process. Ain't that it?

From the Merriam Webster Dictionary, we learn that "solution" is actually a noun. Just to remove any ambiguity from my position on this matter, let me also state that a noun in this scenario cannot moonlight as a verb and still retain its dignity.

Continuing on our lexical journey, the MW Dictionary also has an entry for dialogue. It's simply amazing to find such an artifact considering how few people have been able to study this elusive part of speech. But, in the interest of open sharing of knowledge I'd like to share what it contains. Dialogue is also a noun. Amazing!

Having solved this perplexing grammar mystery I can now return to designing a solution to the problem of how to virtualize our Directory Service. I hope you've enjoyed this dialogue.

Wednesday, February 28, 2007

JET: Controlling custom_files with a custom extension

Any site running Sun hardware with more than one system should be looking at JumpStart to ensure that systems can be rebuilt consistently. The corollary to this is that any site running a JumpStart environment should be using Sun's JumpStart Enterprise Toolkit (JET). JET provides a consistent framework for accomplishing most common tasks, and a consistent framework to write extensions within. Standards and discipline are good.

One of the modules which comes with JET is called, simply enough, custom. The custom module allows you to specify either packages or files which should be added to a server during any of N predetermined reboots. This allows you to ensure that a change which requires a reboot can be made prior to a dependent process being started. Sounds good so far.

Following a recent Solaris 9 server build I was perusing the system for problems by auditing log files. In the messages file I discovered some lines indicating that a Kerberos problem was rearing its ugly head:
Kerberos mechanism library initialization error: No profile file open.
Our site does not use Kerberos, so it had to be a recent configuration change - not surprising considering we had just updated the patch set. After some research I arrived at BugID 5020096. This bug indicates that the issue can be resolved by removing some offending lines from /etc/krb5/krb5.conf.

This should be easy enough to fix in future builds. Just add the modified krb5.conf to the JET template's custom_files variable, and we'll be in good shape. Ahh, not so fast. How will we know what the file originally contained? A true Solaris Jedi will always manage an audit trail of his activities. If I were making the change manually I would copy the file to file.orig, or file.datestamp. Automation is not an excuse for abandoning discipline.

The trouble with JET is that its custom module's functionality for installing files is limited to two operations: overwrite or append. Overwrite simply clobbers any file which may exist. For example, to install the /etc/motd file I would place my custom file in the configured JET file location, then add a line like this to the JET template:
custom_files_1="motd:o:/etc/motd"

motd is a fairly harmless little file, but knowing little about Kerberos, I didn't want to blindly whack the original file. The right solution to this problem lies in creating a simple extension to the JET toolkit. I began by examining the code from the custom module. Two scripts specifically are relevant to this project: install, and postinstall. Within them is a simple case statement which handles the "o" or "a" functionality:


case ${mode} in
   a) case ${fn2} in
      /etc/hosts) JS_merge_hosts ${filefound};;
               *) JS_cat ${filefound} ${ROOTDIR}${fn2};;
      esac;;
   o) JS_cp ${filefound} ${ROOTDIR}${fn2};;
esac

So, when I use an "o" in my custom_files entry, it calls JS_cp. I now needed to find the library which contains these core functions. Eventually, a colleague and I traced it back to /opt/SUNWjet/utils/lib. Looking at the JS_cp function revealed exactly what I expected: a simple copy routine wrapped in some voodoo.
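
To make the dispatch concrete, here is a sketch of how a custom_files entry could be split into the fields the case statement consumes. The parse_custom_entry function is hypothetical, and JET's own parsing may well differ; the variable names mode and fn2 are borrowed from the snippet above, and the function only reports which helper would be called rather than calling it.

```sh
#!/bin/sh
# Sketch only: parse_custom_entry is a hypothetical helper; JET's parsing may differ.

# Split "source:mode:destination" on colons and report which JET helper
# the case statement would dispatch to for that mode.
parse_custom_entry() {
    entry=$1
    old_ifs=$IFS
    IFS=:
    set -- $entry            # word-split the entry on ":" into $1 $2 $3
    IFS=$old_ifs
    src=$1
    mode=$2
    fn2=$3
    case ${mode} in
        a) case ${fn2} in
               /etc/hosts) action="JS_merge_hosts" ;;
               *)          action="JS_cat" ;;
           esac ;;
        o) action="JS_cp" ;;
        *) action="unsupported" ;;
    esac
    echo "${src} ${mode} ${fn2} ${action}"
}

parse_custom_entry "motd:o:/etc/motd"
```

Any mode other than "a" or "o" falls through to "unsupported" in this sketch, which is the gap a new operation would have to fill.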

Feeling a bit optimistic, I copied JS_cp to JS_cp_preserve and modified the code a bit so it would first check to see if the destination file exists, and if so, back up the file with a datestamp. Once the backup was in place, the original copy operation was performed. This was very trivial shell scripting. Here's what I ended up with:


if [ "$#" != "2" ]; then
        JS_error "`basename $0`: Illegal Arguments. Usage:  "
fi

JS_FROM=$1
JS_TO=$2

JS_display "Copying file `echo ${JS_FROM} | sed -e \"s?^${SI_CONFIG_DIR}/??\"` to ${JS_TO}"

if [ -f "${JS_TO}" ] ; then
   # Preserve the existing file under a datestamped name before overwriting it
   datestamp="`/usr/bin/date +%Y%m%d`"
   /bin/cp -p "${JS_TO}" "${JS_TO}.jet.${datestamp}"
   case $? in
      0) # Success
         JS_display "Successfully preserved ${JS_TO}.jet.${datestamp}"
         ;;
      *) # Any non-zero status is a failure
         JS_display "WARNING: Failed to preserve original file ${JS_TO}"
         ;;
   esac
fi

/bin/cp -p "${JS_FROM}" "${JS_TO}"

if [ "$?" != "0" ]; then
        JS_error "JS_cp_preserve:\t\tError occurred while copying ${JS_FROM} to ${JS_TO}"
fi

Next, I returned to the install and postinstall code, and modified the case statements to accept a "b" operation (b for backup). I then executed a test Jump and was very pleased to see my JET extension had worked! I can now have custom_files install the workaround krb5.conf, and maintain a backup of the original. Here's the modified code:


case ${mode} in
   a) case ${fn2} in
      /etc/hosts) JS_merge_hosts ${filefound};;
               *) JS_cat ${filefound} ${ROOTDIR}${fn2};;
      esac;;
   o) JS_cp ${filefound} ${ROOTDIR}${fn2};;
   b) JS_cp_preserve ${filefound} ${ROOTDIR}${fn2};;
esac

Note that you need to make this modification in both /opt/SUNWjet/Products/custom/install and postinstall.

Now, all I need to do is specify something in the custom_files module like this:
custom_files_N="krb5.workaround:b:/etc/krb5/krb5.conf"

And I will get a clean backup of the original file. Such a simple tweak - I hope the Sun folks who maintain JET will add something similar. While some limitations of JET can be frustrating, its intuitive layout and ease of extension make it something I grow more fond of each time I use it.

Thursday, February 01, 2007

Frustration with Solaris Packages

I have a love / hate relationship with Solaris packaging. When you need to crank out a simple package I find it much easier to deal with than RPM. I also like the simple file system or streams based structure vs. the binary mode of RPM. All things considered, it gets the job done, and has been a tremendous help in standardizing our provisioning system. There are, however, a few things that I'm not crazy about.

In Sun's model, the package is used to release functionality while the patch is a vehicle to fix existing functionality. If I want to add feature X to my software, I need to release a new package version. In contrast, if feature X is broken in a package then I need to release a patch. Seems simple on the surface.

One place this model gets sketchy is if I have the following situation: Package FOO needs to be updated to a new revision, but package BAR depends on it, and is required for system operations. In this case I need to first remove package BAR, then update package FOO, and finally, reinstall package BAR. In my mind this causes an unjustified level of system disruption. An RPM or dpkg based system would use an update option to perform this in-place. I'm told that there's an "in place upgrade" capability in the Solaris packaging system, but I haven't yet discovered it or found it documented. I will be looking though.
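
Spelled out as commands, the sequence looks roughly like the sketch below. FOO, BAR, and the spool directory are placeholders, and plan_upgrade is a hypothetical function that only prints the pkgrm and pkgadd invocations instead of running them. It also assumes that "updating" FOO means removing the old revision and adding the new one, which is an interpretation rather than something stated above.

```sh
#!/bin/sh
# Sketch only: FOO, BAR and the spool directory are placeholders, and the
# function prints the commands rather than executing them.

# usage: plan_upgrade <pkg_to_update> <dependent_pkg> <spool_dir>
plan_upgrade() {
    echo "pkgrm $2"            # remove the dependent package first
    echo "pkgrm $1"            # remove the old revision (assumed update method)
    echo "pkgadd -d $3 $1"     # add the new revision from the spool directory
    echo "pkgadd -d $3 $2"     # finally, put the dependent package back
}

plan_upgrade FOO BAR /var/spool/pkg
```

Four package operations to change one package is the disruption being complained about.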

I have also noticed documentation gaps in the use of patches. Sun does provide instructions on how to produce a patch-package, but they omit naming conventions. Clearly, it would be a bad thing to produce patch 123456-01 and then have Sun release the same one. This conflict could be very disruptive to a patch process. It seems that by selecting an upper range (i.e. 90001-01) you can have safety similar to selecting a 10.0.0.0 network address. I'd feel quite a bit better if Sun would explicitly define this range so we'd know it was safe. In the interim, I've been fixing bugs by creating minor revisions of packages rather than using patches.

The last point I wanted to touch on in this article is the use of package prototypes. In packaging nomenclature, a prototype file is the list of files included in the package, and their ownership and permission attributes. Here's an example of a prototype I'm currently working on for a custom sendmail solution:


d none etc 0755 root sys
d none etc/mail 0755 root mail
f none etc/mail/foo-client-v10sun.cf 0644 root bin
f none etc/mail/foo-server-v10sun.cf 0644 root bin
d none usr 0755 root sys
d none usr/lib 0755 root bin
d none usr/lib/mail 0755 root mail
d none usr/lib/mail/cf 0755 root mail
f none usr/lib/mail/cf/proto.m4 0444 root mail
f none usr/lib/mail/cf/foo.m4 0644 root mail
f none usr/lib/mail/cf/foo-client-v10sun.mc 0644 root mail
f none usr/lib/mail/cf/foo-server-v10sun.mc 0644 root mail

Pay particular attention to what I call placeholder lines. Those are lines in the prototype referring to directories which this package depends on, but which are really part of another package by virtue of already being registered. Of course, a directory like /usr is chock-full of nested package dependencies:


# grep ^/usr /var/sadm/install/contents | head
/usr d none 0755 root sys FJSVvplu SUNWctlu SUNWcsr TSBWvplu SUNWocfd SUNWncft SUNWGlib SUNWgcmn SUNWGtku SUNWctpls SUNWxwdv SUNWpl5u SUNWcpp FJSVcpc SUNWopl5p FJSVcpcx FJSVmdb FJSVmdbx IPLTadman SUNWowbcp SUNWpamsc SUNWpamsx SUNWpcmcu IPLTdsman SUNWadmj SUNWmcdev SUNWjsnmp SUNWtftp SUNWbsu SUNWpd SUNWsckmu SUNWpdx SUNWpiclh SUNWuxflu SUNWuxfl1 SUNWeurf SUNW1251f SUNWuxfl2 SUNWuxfl4 SUNWuxfle SUNWmgapp SUNWrmui SUNWpiclx SUNWpl5p SUNWTcl SUNWjpg SUNWTiff SUNWTk SUNWaccu SUNWaclg SUNWadmap SUNWpng SUNWpool SUNWpoolx SUNWant SUNWrcmdc SUNWpppd SUNWpppdu SUNWpppdt SUNWpppdx SUNWpppg SUNWfns SUNWsadml SUNWapct SUNWascmn SUNWasac SUNWqosu SUNWjaf SUNWjmail SUNWxsrt SUNWxrgrt SUNWxrpcrt SUNWiqfs SUNWiqjx SUNWiqu SUNWiquc SUNWiqum SUNWjaxp SUNWasu SUNWasdem SUNWrmodu SUNWrmwbx SUNWrpm SUNWrsg SUNWfnsx SUNWrsgx SUNWdfbh SUNWsadmi SUNWi15cs SUNWsadmx SUNWi1cs ... (lines omitted)

That was a tiny fraction of the list...

I don't think there is anything wrong with declaring a package as being dependent on a pre-existing directory, but I have a problem with how easy it is for a new package to overwrite the intended attributes of that directory. Note that in my custom package's prototype I need to declare the attributes for /usr. This typically means that I need to look at a clean operating system on the platform I intend to deploy on (i.e. a consistent Solaris revision) and pick the attributes from there.

I'd like to see the packaging facility accept a prototype entry that has no attributes, and instead inherit the attributes from the package which initially registered the directory. This would minimize the chances of stray patches and packages conflicting with intended system permissions.
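
Until something like that exists, the lookup can at least be scripted. The following is a sketch under stated assumptions: proto_line_for_dir is a hypothetical helper, it relies only on the directory entry layout visible in the contents listing above (path, type, class, mode, owner, group, then package names), and it takes the contents file as an argument so it can be tried against a copy rather than the live /var/sadm/install/contents.

```sh
#!/bin/sh
# Sketch only: proto_line_for_dir is a hypothetical helper, not a packaging tool.

# usage: proto_line_for_dir <contents_file> <absolute_dir>
# Prints a prototype "d" line whose class, mode, owner and group are copied
# from the directory's existing entry. Exits non-zero if there is no entry.
proto_line_for_dir() {
    awk -v dir="$2" '
        $1 == dir && $2 == "d" {
            rel = substr($1, 2)      # drop the leading "/" for a relative path
            print "d", $3, rel, $4, $5, $6
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "$1"
}

# Example invocation on a Solaris host (left commented out here):
# proto_line_for_dir /var/sadm/install/contents /usr
```

Generating the placeholder lines this way keeps them in step with whatever the target system already has registered, instead of relying on attributes copied by hand.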

Having spent all this time complaining, let me end on a positive note by reinforcing how much efficiency we have gained by moving from tarballs and custom scripts to version controlled packages. I'd do it again in a heartbeat. I'm hoping Jedi discipline will eventually reverse the chaos inherent to the current packaging architecture.