Wednesday, December 20, 2006

Perl: Beware when counting array elements

Sometimes it's the simple things that put a halt to my productivity. In fact, it's rarely the big things. Master Yoda gave us an excellent phrase to consider when addressing a programming problem: "Judge me by my size do you? And well you should NOT!".

When I need to check how big an array is in Perl, it's quite intuitive to access that value using the $#array_name special variable. It works like a charm. But what happens when the array you need to count elements in is a referenced array?

A typical de-reference operation would be @$array_reference. This gives us the underlying array. Forgetting my Jedi training, I allowed the Dark Side to influence me as I flailed through syntactic permutations.


Much to my dismay, these futile expressions continued to plague me. I then focused my energies, and put the blast shield down.

...From WikiQuote

Ben: Remember a Jedi can feel the Force flowing through him.
Luke: You mean it controls your actions?
Ben: Partially, but it also obeys your commands.
Han: Hokey religions and ancient weapons are no match for a good blaster at your side, kid.
Luke: [to Han] You don't believe in the Force, do you?
Han: Kid, I've flown from one side of this galaxy to the other. I've seen a lot of strange stuff, but I've never seen anything to make me believe there's one all-powerful Force controlling everything. There's no mystical energy field controls my destiny! It's all a lot of simple tricks and nonsense.
Ben: I suggest you try it again Luke. This time, let go your conscious self and act on instinct.
[Ben places a helmet on Luke's head with the blast-shield down to blind him]
Luke: But with the blast shield down I can't see a thing!
Ben: Your eyes can deceive you. Don't trust them. Stretch out with your feelings.
[Luke calmly evades and deflects three pulses from the remote, successfully using the Force]
Han: I call it luck.
Ben: In my experience, there's no such thing as luck.
Han: Look. Good against remotes is one thing. Good against the living…that's something else.
Ben: [to Luke] You've taken your first step into a larger world.

I wrote a quick code fragment to isolate the problem and experiment with it. In doing so, I was reminded of the true nature of dereferencing and array contexts.

#!/usr/bin/perl -w
use strict;

my @a = qw(one two three four five six);
my $aref = \@a;

print "Number of elements in the array: $#a\n";
print "Number of elements in the array (ref): " . @$aref . "\n";

Looking at this code, ask yourself what values will be produced in each print statement. With carelessness likened to use of a blaster, one may assume they will produce the same integer. In fact they will not; a fact whose presence will be known through the Force. Let's take a closer look at these expressions.

@a is a literal array. The $#a special variable holds the index of the last element of that array, not a count of its elements. Knowing that Perl indexes its arrays beginning at zero, we then know this line will print the integer 5.

@$aref is a de-referencing operation. It represents the literal array, rather than the scalar that points to it. In this case, we are using the array in scalar context (the concatenation operator imposes it), which evaluates to the number of elements in the array: 6. The reference equivalent of $#a would be $#{$aref}, which also yields 5.

A simple problem with a not-so-simple explanation. What if your code had thousands of elements in it... Would your test plan have covered this condition when you decided to use referenced arrays rather than literals in one of your code branches? Remember that the Jedi Code tells us ...there is no ignorance; there is knowledge. Code, and learn.

Friday, October 20, 2006

Packaging delinquency in 3rd party software

We're working hard at the moment on migrating unmanaged scripts and solutions into Solaris pkgadd format, a.k.a. "packages". This has a number of benefits. First, it avoids the need to have complicated manual installation routines; by including preinstall / postinstall scripts, installation can be automated. Second, it is easy to know what revision of a software environment is installed right down to the installation process. Third, not only can I ensure software is installed with the right attributes and in the right places, but I can also validate at a later time that things are still as intended using pkgchk. I refer to this as managed files vs. unmanaged files.

The easy part was packaging our custom site scripts. We standardized on a hierarchy under /opt which contained bin, man, etc, lib, and sbin subdirectories. We then inventoried the unmanaged scripts and determined which were still valid and which could be discarded. The "keepers" were then incorporated into packages which were fairly simple overall.

Phase two has been an assessment of what unmanaged files for third party applications are being deployed. We found a lot of opportunity here and decided to start with the utility software like monitoring, security, and other non-revenue generating software. Here is where we have been uncovering nothing less than a mess.

For some reason, third party software providers in the UNIX space seem determined to make it impossible to manage their files. We've seen many interesting perversions of best practices that I thought would be interesting to collect in one place.

One product chose to adopt a package management solution called LSM, which I believe stands for Linux Software Manager. Note that this solution is for Solaris, which has a perfectly good vendor provided and supported standard for software management. It turns out to be quite a technical feat to reverse engineer the format of LSM and directly convert it to packages.

Another product did not use any software management, but went so far as to encrypt their pre-installation bundles so as to make it impossible to install via a standards-based management system. This really blew my mind. What could be of such critical intellectual property in an installation routine that it justified encryption? And wouldn't the real IP be available once the software was installed anyway?

We routinely encounter software that has highly interactive installation processes that become cumbersome to integrate into packages because on each new release the routines would need to be re-ported to pre/post install scripts. The idea of managing software is to reduce work - not increase maintenance.

The biggest thorn in our side is Oracle. It's deployed everywhere in our environment and is installed manually each time because we're told that's just how it's done, and from observation it seems to be in the PITA bucket as far as automation goes. Contrast this to PostgreSQL which as of Solaris 10 (6/06) is integrated into the Operating Environment in clean packages.

So here's a message to all you third party software developers who provide Solaris solutions: Sun publishes an excellent guide to software packaging that any reasonably technical person could use to master the process in a few hours. Let me summarize a few key points in advance:

(1) You don't need to have a conversation to install software. Just copy the files, then configure it later.

(2) Sometimes I don't want to have a conversation. Let me put the answers in a file and feed that instead.

(3) Some of your customers have too many systems to install manually.

(4) Don't use a non-standard solution when the OS vendor provides a perfectly usable solution.
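Points (1) through (3) map onto facilities pkgadd already provides. Here is a minimal sketch, assuming a hypothetical package and file names of my own choosing: an admin file (see admin(4)) answers pkgadd's own questions, and a response file recorded once with pkgask answers whatever the package's request script would have asked.

```shell
#!/bin/sh
# Sketch only: the function names and every path passed to them are
# hypothetical; the admin file keywords are the ones listed in admin(4).

# Write an admin file that tells pkgadd not to stop and ask questions.
# $1 = path of the admin file to create
write_admin_file () {
    cat > "$1" <<'EOF'
mail=
instance=overwrite
partial=nocheck
runlevel=nocheck
idepend=nocheck
rdepend=nocheck
space=nocheck
setuid=nocheck
conflict=nocheck
action=nocheck
basedir=default
EOF
}

# Install without a conversation.
# $1 = datastream file, $2 = package name, $3 = admin file,
# $4 = response file recorded earlier with: pkgask -r "$4" -d "$1" "$2"
quiet_install () {
    pkgadd -n -a "$3" -r "$4" -d "$1" "$2"
}
```

The conversation happens once, on one machine, when pkgask records the response file. Every other system just replays it.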

Thursday, October 19, 2006

To err is human

I am writing this post as a catharsis and purification. A centering of my spiritual engineering energy that may otherwise be out of balance. Three days ago I made a typo which eliminated the /etc directory on a fairly important server. It was amazing how long that server continued to plug away after being lobotomized. Let me take you through the story as I relive the moment, and ensure that I learn from it.

Like many of the tasks I juggle, this was to be a short time-slice effort. I needed a distraction from a longer term project, and wanted to bite off a small piece of something that didn't require significant thought. Part of our Jumpstart environment deploys a tar archive to the client which is later unpacked and massaged by a custom script. My task was to eliminate the usr/local/etc directory from that archive and then recreate it. As my fingers systematically hit the keys, one extraneous finger made its imprint on the keyboard.

"r" "m" "-" "r" "." "/" "e" "t" "c".

The world slowed down as my finger hit enter, and I felt my heart stop beating. I believe I actually flat-lined that morning. Could it be? Had I really deleted /etc? Yes. I had. The command I entered was: "rm -r . /etc". I removed the current working directory and the server's /etc directory.

Why was I using elevated privileges for mundane work? The tarball had root-owned files in it. This is a downfall of our approach at the moment. When using pkgadd format, anyone can own the files which are given attributes at installation time. This makes day to day maintenance much safer. Ironically, I was editing the archive because I had just created a package to replace the files I was deleting. It was almost as if the prior bad practice were vomiting on me as I exorcised it from the server.

Fortunately we had an excellent SA on hand to boot from CD and restore the missing file system, and it was back in business a relatively short while later. Engineering and operations are segregated in duties at my current site, so I was unable to clean up my own mess. A very humbling experience indeed, and this is what it taught me:

(1) Mirrored operating system disks are a good thing, but they don't protect you from human error propagating mistakes across both disks. While I've been a bit critical of maintaining a third contingency disk, there are other similar solutions which I have a heightened respect for.

(2) Whenever executing commands using RBAC, sudo, or the root account, count to three before hitting enter. No matter how much longer it takes to get your work done, no matter how good you are with UNIX, and no matter how long it has been since you made a mistake, counting to three will always be quicker than restoring a file system from tape.
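Counting to three can be partly mechanized. What follows is only a sketch of the idea, not something I had in place: careful_rm and its list of protected paths are hypothetical. It refuses any target that is the current directory or a top-level system directory, and prints each remaining target on its own line so that a stray space shows up as two separate arguments.

```shell
#!/bin/sh
# Sketch only: careful_rm, is_protected, and the protected list are
# hypothetical names chosen for illustration.

# Return 0 if the argument is a path we never want to remove recursively.
is_protected () {
    case "$1" in
        .|..|/|/etc|/etc/|/usr|/usr/|/var|/var/) return 0 ;;
        *) return 1 ;;
    esac
}

# Refuse protected targets, list the rest one per line, then remove them.
careful_rm () {
    [ $# -gt 0 ] || return 1
    for target in "$@"; do
        if is_protected "$target"; then
            echo "refusing to remove: $target" >&2
            return 1
        fi
    done
    printf 'will remove: %s\n' "$@"
    rm -r -- "$@"
}
```

With this in place, "careful_rm . /etc" stops at the first argument and removes nothing, while "careful_rm ./etc" does what was intended.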

Wednesday, October 18, 2006

Sun loves Oracle, Sun loves PostgreSQL

Sun and Oracle have announced they will work together for another ten years. Not only that, but there's a new bundle in town that includes Oracle Enterprise with Sun servers. I haven't exactly figured out what it means to have software included for free that will require a support contract; is that still free? But there were words indicating that processor count may not be relevant and that's probably where the savings lie. Maybe it's just saving you download time?

I don't really care about the pricing because big companies don't seem to hesitate to throw down dollars for Oracle licensing. What made this interesting to me was that Solaris, as of the 6/06 update, now includes PostgreSQL natively, and there are no catches there. If you perform a "full distribution" install you already have an RDBMS. What's more, if you want to take that database into the critical waters of the production pool, Sun will offer their world-class software support, which means the company that knows their own operating system better than anyone else will also know the RDBMS sitting on top of it. Très chic, n'est-ce pas?

I'd imagine with Oracle's market share Sun has to play nice in the short term, but I give them a lot of credit for including PostgreSQL and picking a side. Right or wrong, in the age of mediocrity they made a decision. PostgreSQL is a phenomenal database that competes aggressively with Oracle in many venues.

I'm fascinated with what the future will hold for relational databases on Solaris now that Sun has picked a side. This isn't just another open source database running on Linux farms - this battle will take place in the big data centers that Linux is just starting to scratch the surface of. I love Linux as much as the next guy, but how many sites do you know of running systems as large as an Enterprise 25K with Linux under the hood? Not too many - it's not in the heritage of the Linux kernel - at least not yet.

So, where will this take Postgres? Methinks Oracle had better keep close tabs on Postgres over the next five years or so.

Thursday, September 28, 2006

Adventures on eBay

I'm finally creating a Sparc based lab outside my place of employment. I want to make sure I can work on projects without any hint of conflict of interest or proprietary intellectual capital nonsense. My work is for hire from 8-5, but my ideas remain my own, and I want to be able to implement them outside of work with freedom.

I've been using an old AMD based system with Solaris x86 for a while now, but since my real interest is in Enterprise Engineering I really wanted to have some Sparc based equipment and at least one disk array. This will open the doors to many projects that would otherwise only be feasible on expensive modern x86 hardware. Funny how those same features have been on Sparc hardware since the early days...

For budget reasons I've gone with a v240 as my workhorse server. It doesn't have the sweet LOM that many Sparc systems enjoy, but it does have four CPUs and four GB RAM. That will give me plenty of horsepower to run a few Oracle databases and work on some Zone projects. A v440 would have been ideal for my goals, but the price point on those systems is WAY too high for the investment returns I'll be looking at in the short-term.

I have the v240 server in my office right now, and have been playing with it a bit before it goes to the basement rack. After attaching a console cable and booting I was met with a hostname from the domain. I was a bit surprised to find that AOL doesn't wipe their disks before sending servers to auction, but after a quick look around I discovered that there isn't really anything on the disks anyway; just a Solaris 8 OS image. I almost skipped checking it out... I have zero interest in cracking, least of all a Solaris 8 image. I did have a small interest in their best practices, but I'm far more interested in my own projects, so the box will be getting a fresh load of Solaris 10 (6/06) this weekend.

My storage array of choice is the D1000. I'm not going to do much that requires massive expansion or throughput. I just need a bunch of disks I can put into ZFS, share for a cluster, and run databases on. The D1000 has a 12 disk Ultra-SCSI backplane that provides plenty of throughput for my needs. And hey, there's not much worry about driver obsolescence. It's such a simple device that it's just going to work as long as Solaris continues to support SCSI. The D200 is very cool, but I just couldn't find a value-add for my project list, so I went with budget. I won the bid for this unit this morning, so I'll have it in a week or so. Then I need to fill it with disks.

I upgraded my measly 16 port 10 Mb SuperStack II hub to a 24 port SuperStack II Switch. This will ensure the servers have plenty of bandwidth to talk amongst themselves. I'm particularly proud of this purchase - only $1 plus shipping. Works for me.

A few white papers from now I'm planning to add a second v240 for work on SunCluster and alternative clustering solutions. I'll also be adding a pair of Netra T1 AC200 servers for a directory services project.

And that should just about round out the new data center. I'm amazed at how much capability old servers have. CPU technology has progressed so much faster than the remaining system bottlenecks that many of today's systems simply will never show significant CPU utilization. For me, this means that if I'm willing to run servers at a level that makes them sweat, I can accomplish the same work for a fraction of the cost. This is the premise for a white paper I'll be working on that explores the use of old enterprise class hardware in not-for-profit or small business shops. Stay tuned!

Wednesday, September 27, 2006

Shells and the path of standards

It seems that shells evoke almost as much controversy as religion. Everyone has a favorite, and for each favorite, there's another that would sooner grind their knuckles on the back of a circuit board than follow.

As long as I've been using Solaris systems I've never found common ground with the C-shell. Its programming capabilities are more limited compared to other options and its syntax is just too far from standards to be worth learning to me. But C is a standard UNIX language, you say. How can I call C-shell non-standard?

I'm a systems engineer, not an application developer (and now's not the time to debate that grey area!). If I were to write a script to traverse the entire Operating Environment and calculate a distribution for the various shells, I think we'd find 97% written in Bourne shell, 2% written in ksh, and 1% written in Bash. Please remember that I'm talking about Solaris here, and not Linux. When in Rome, do as the Romans do.
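For the curious, the script I have in mind would look something like the sketch below. It is untested against a full Solaris install, shell_census is a hypothetical name, and it assumes a POSIX shell such as ksh (the old Bourne shell's read has no -r option). It tallies the interpreter named on the "#!" line of every regular file under a directory.

```shell
#!/bin/sh
# Sketch only: shell_census is a hypothetical helper, not a Solaris tool.

# Print a count of each interpreter named on the "#!" line of the regular
# files found under the directory given as $1, most common first.
shell_census () {
    find "$1" -type f -print | while IFS= read -r file; do
        # Look at line 1 only: print the interpreter path if the line
        # starts with "#!", then quit without reading further.
        sed -n -e '1s/^#![ ]*\([^ ]*\).*$/\1/p' -e '1q' "$file" 2>/dev/null
    done | sort | uniq -c | sort -rn
}
```

Scripts with no "#!" line are not counted, and anything launched through env shows up as env rather than as the shell it names, so treat the output as a rough distribution.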

Because I create a LOT of shell scripts, I'm also painfully aware of Bourne shell's limitations. Not the least of which is its anemic data structure capability. For that reason, I started to use the Korn shell, or ksh. I have used ksh on Solaris for about 10 years both as my interactive shell and as my programming shell. It's still painful for data structures when compared to Perl, but it's MUCH nicer to work with than the Bourne Subset.

In addition to my Solaris career, I have a life in a parallel universe as a Linux engineer, volunteering at a local not-for-profit organization. There we use Linux for servers because Solaris was too expensive when we chose the initial environment. In the Linux world, the Bourne Again Shell, or Bash, is the de-facto standard for good reason. It's a very capable shell with great programming constructs, and a pleasant demeanor for interactive shell use. I dig it.

In fact, I dug it so much that I recently changed my Solaris interactive shell to it. Bash has been included in Solaris for quite a while now, and I consider it a standard-enough component that it's safe to learn. I love the file completion and history browsing with tab and arrow keys. It's so much more intuitive than the equivalent in ksh. I've been happily bashing around for a few months on Solaris, but recently did an about face in my thinking while reading up on Role Based Access Control, or RBAC.

My current site has a sudo to RBAC conversion on the roadmap, which should be a great project. In the RBAC world, you assume a role to complete an activity that is outside your default privileges. Those roles will have one of three shells: pfcsh, pfsh, or pfksh. Did you notice that Bash is not amongst them? Not for the foreseeable future.

So, in a Solaris-standard world that leverages the powerful RBAC facility, you will not have the option of working in a bash compatible shell if you want to perform an activity outside your normal rights. That's enough reason for me to drop the bells and whistles of Bash and go back to ksh. I'll probably miss it for a few weeks before I forget completely.

This experience proved to be yet another reminder for me that staying on the path of standards is full of temptations. While Bash is stable and common, it is not a native Solaris standard. That's fine for an end-user or software developer that operates outside the OS internals, but as a systems engineer my job is to live in the internals, and at this point in time Korn shell seems to give me the best functionality, programming, and standards compliance.

Thursday, September 21, 2006

Revision control bites... Unless you train it!

I've recently begun the task of placing my world under revision control. Keeping track of edits, versions, builds, etc. had become too much of a chore considering that Solaris has a facility to do it for me. The Source Code Control System, or SCCS, is the Solaris standard for revision control in just about every release I'm familiar with. There are many other revision control systems out there, each with pros and cons. Popular alternatives include RCS, CVS, and Subversion. In my case, I want to choose a system that I know will be available in all releases of Solaris on all sites, even in galaxies far, far away. SCCS is the only one that meets my key criteria.

Having chosen my platform for version control, I began moving projects into its protective custody. All went well for the first few weeks, and I began to develop the comfort level that usually precedes a problem. The most recent project I put under SCCS control is a package with a simple preinstall script. The script is responsible for checking for the existence of a file prior to package installation, and making a backup before overwriting it. The backup is named FILENAME.PACKAGENAME.DATESTAMP. To implement this datestamp I set a variable in a backup subroutine as follows:

backup_file () {
DATESTAMP=`date +%Y%m%d%H%M`
test -f "${1}" && /usr/bin/cp "${1}" "${1}.CGHfoopkg.${DATESTAMP}"
return ${?}
} #end backup_file

After testing successfully on my development machine, I checked the file back in (sccs delget) and rebuilt the package. When I transferred the package to my staging server, the preinstall failed miserably with strange syntax originating at lines two and three of the above segment.

The first thing I questioned was whether there was some difference in versions of the date command between my development server and staging server, but it only took a second to prove that theory null and void.

After some mucking around, I discovered that the problem only existed when the source code had been checked in to SCCS. This quickly led me to realize that SCCS was expanding keywords in my DATESTAMP variable.

SCCS keywords allow you to have SCCS dynamically insert a revision, filename, check-in time, and other metadata when a file is checked in. By placing this into my checked out code:
# SCCS Revision Control:
# %M% %I% %H% %T%

I end up with this when the code is checked-in:
# SCCS Revision Control:
# preinstall 1.5 09/21/06 09:01:46

The problem is, SCCS by default does not stop at the header. It goes all the way through the code. In this case, it was hitting the date format in my DATESTAMP variable, where the sequences %Y% and %H% both happen to be SCCS keywords, and changing its value, thus breaking my backup_file function.

It took some digging to find the solution, but I finally discovered the sccs admin command. Using this command I was able to specify that only the first ten lines of the file should be considered for keyword expansion using the following command:
# sccs admin -fs10 preinstall

After making this change, and checking the file back in, my problems were gone. The source stayed clean, and I was once again blissfully coding. My newly trained SCCS repository has been behaving wonderfully ever since.
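Another way around the problem, which does not depend on any SCCS flag, is to make sure no percent-letter-percent sequence appears in the source at all. This is a minimal sketch rather than what I deployed: the datestamp helper is a hypothetical addition, and it uses plain cp rather than a full path. Empty quotes break up the date format in the file, and the shell glues the pieces back together before date ever sees them.

```shell
#!/bin/sh
# Sketch only: datestamp is a hypothetical helper; CGHfoopkg is the same
# example package name used in the original function.

# Build the timestamp without any literal percent-letter-percent sequence
# in the source, so SCCS keyword expansion has nothing to match.
datestamp () {
    date +%Y""%m""%d""%H""%M
}

# Back up the file given as $1 if it exists, as FILENAME.PACKAGENAME.DATESTAMP.
backup_file () {
    test -f "$1" && cp "$1" "$1.CGHfoopkg.`datestamp`"
}
```

The trade-off is readability: the next person to touch the script needs to know why the quotes are there, which is an argument for the admin flag instead.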

Monday, September 18, 2006

To install links, or not to install links... That is the Question!

I'm working on a package at the moment that integrates Oracle 9i / 10g startups with Solaris Resource Manager (SRM). One of its functions is providing init scripts to start and stop the database and listeners during reboots. Seems simple at first, but I ran into an interesting dilemma.

At my current site, Oracle is not an automated installation, and not part of our Jumpstart framework. The challenge this presents is that we do not want to install active run control links for software that is not yet installed on the system, but we want to be able to install these packages during the jumpstart.

Ideally, I'd like to see the rc scripts registered via pkgadd so they can be easily identified, and cleanly removed if the package is de-installed. In the end, I had to compromise, but I think it turned out pretty safe because I followed standards in use of those rc / init scripts and used a hard link.

Although it may sound odd at first glance, I chose not to include the rc links in my package prototype. The links are installed in a disabled mode (pre-pended underscore character) by a postinstall script. I chose this route because it creates a placeholder for the startup order I selected rather than hoping whoever does the installation gets it right. This eliminates inconsistency and human error.

Next, I created a preremove script. The preremove works by issuing a find command on the /etc/rc?.d scripts that looks for any files with an inode matching the init script's inode. If proper convention for using hard-links was followed, this method will find the associated rc scripts even if their startup order or link name are changed over the life of the system.

LINKS=`find /etc/rc?.d -inum ${INITSCRIPTINODE} -print`

The final touch was to create a simple script that can activate all the links at once. It leverages the preremove's inode hunting strategy, finds entries that have an initial underscore, and renames them using a simple sed expression:

NEWLINKNAME=`echo $LINKNAME | sed -e 's/^_//'`
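Put together, the enabler looks roughly like the sketch below. The function names and arguments are hypothetical stand-ins for the real script, and it assumes a POSIX shell such as ksh. Because find prints full paths, the sed expression is applied to the base name of each link rather than to the whole path.

```shell
#!/bin/sh
# Sketch only: inode_of and enable_links are hypothetical names.

# Print the inode number of the file given as $1.
inode_of () {
    ls -i "$1" | awk '{ print $1 }'
}

# Activate every disabled hard link (leading underscore) to an init script.
# $1 = the init script; remaining arguments = rc directories to search
enable_links () {
    INITSCRIPTINODE=`inode_of "$1"`
    shift
    find "$@" -inum "${INITSCRIPTINODE}" -print |
    while IFS= read -r LINKNAME; do
        DIR=`dirname "$LINKNAME"`
        BASE=`basename "$LINKNAME"`
        NEWBASE=`echo "$BASE" | sed -e 's/^_//'`
        # Links that are already active have no underscore to strip.
        test "$BASE" = "$NEWBASE" || mv "$LINKNAME" "$DIR/$NEWBASE"
    done
}
```

It would be called along the lines of "enable_links /etc/init.d/SCRIPTNAME /etc/rc?.d". Inode numbers are only unique within one file system, which is fine here because hard links cannot cross file systems anyway.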

I'll be adding a reverse-function for the enabler script that can disable as well, and this will live in our NFS-accessible admin scripts repository for convenience. No need to distribute software that isn't critical to operation when the network is down.

The only downfall to my solution is not being able to search for the rc links in the Solaris software registry (/var/sadm/install/contents). This would have allowed someone unfamiliar with the solution to identify its components. The problem with registering RC links is that they change a lot over time and it's very difficult to keep the registry current. A link may change its name, start order, may be active, or inactive.

The effort required to manage that kind of a dynamic file would have been almost as large as the project I was packaging, so I decided to keep my scope tight and assume that as an acceptable risk. When we move from Solaris 9 to Solaris 10 this problem will be eliminated by integration with SMF, so no point in getting wrapped around the axles.

Wednesday, September 13, 2006

Auditing challenged with SSH

I recently had a real dilemma thrown at me by the security team at a site I work on. The site's policy dictates that no authentication should take place without a password. Any exceptions require both a business case justification on file, and an expiration. This presents a real challenge with SSH. SSH allows public key authentication as an alternative to passwords, and the private key can be created without a passphrase. In addition, there is no way to enforce a key expiration at a server level (at least, not that I've been able to find).

SSH supports two primary means of authentication. Password authentication is essentially the same process that occurs through Telnet, except we use a secure tunnel instead. Public Key authentication bypasses the system's password channel completely. At first glance, it's easy to say, "so disable public key authentication and be done with it."

There are two benefits to using public key authentication with SSH. First, you have the ability to configure your private key (residing on the client) without a passphrase. In doing so, you increase the risk of private key compromise, but enable passwordless authentication. This makes batch jobs much easier because it eliminates an interactive session.

The second benefit is that public key authentication is a two-factor authentication. It combines something you have, a private key, and something you know: a passphrase. As such, it's much more secure than traditional password based authentication which only uses single factor authentication. Even if someone compromises your private key file, it can't be used unless they know the passphrase, which SHOULD be a complex phrase.

The problem is that there is no way to control who on a server is authorized to use public key authentication, no way to enforce passphrase complexity, and no way to expire a public key. I could create a public / private key pair on a machine under my desk that isn't subject to production data center security scans and audits. Let's next say that I decided not to use a passphrase, or that the passphrase was weak and easily guessed. Next, assume I'm the DBA for the company's Oracle Financials environment. It would require very little effort for someone to compromise my under-desk system and gain passwordless access to the company's critical systems. This is a low-risk, high payoff scenario that people who know what they're doing would be likely to attempt.

Another issue with the public key authentication capabilities of SSH is that you can enter any passphrase (or no passphrase). With my system password, I can install a crack module into the Pluggable Authentication Module (PAM) stack and enforce very complex passwords. Unfortunately, because key pairs are generated on clients, there is no way to enforce sanity at the server level.

After some research I determined that the version of Sun SSH which ships with Solaris 9 is far less capable than the OpenSSH releases one could obtain and build. Using OpenSSH it is possible to move the default $HOME/.ssh to another base directory (like /var). In doing so, it is much easier to create a root controlled environment where someone cannot use the authorized_keys file unless authorized by a superuser. Unfortunately, the maintenance issues created by doing this are not justified, and our policy is to stick with vendor provided and supported software in all but the most extenuating circumstances. Under Solaris 9, there's just no safe and auditable way to allow public key authentication. In an ideal world, I should be able to not just configure public key authentication at a server level, but specify which keys were respected by the server's SSH daemon.
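For reference, the OpenSSH approach we decided against would look something like this sketch. The AuthorizedKeysFile directive and its %u token are real OpenSSH sshd_config features; the /var/ssh/keys base directory and the install_key helper are hypothetical. Run as root, the helper leaves each user's key file root-owned and read-only to the user, so only a superuser decides which keys the daemon respects.

```shell
#!/bin/sh
# Sketch only: install_key and the /var/ssh/keys layout are hypothetical.
#
# Matching line for OpenSSH's sshd_config (not Sun SSH on Solaris 9):
#   AuthorizedKeysFile /var/ssh/keys/%u/authorized_keys

# Install a vetted public key for one user under a root-controlled base.
# $1 = base directory, $2 = user name, $3 = public key file to authorize
install_key () {
    KEYDIR="$1/$2"
    mkdir -p "$KEYDIR" || return 1
    cp "$3" "$KEYDIR/authorized_keys" || return 1
    # Readable by the daemon and the user, writable only by the owner.
    chmod 755 "$KEYDIR" && chmod 644 "$KEYDIR/authorized_keys"
}
```

This addresses who may use a key on the server. It does nothing about passphrase strength, which is still decided entirely on the client.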

There is no solution I can see to the issue of enforcing passphrase complexity, or auditing use of non-interactive key pairs, because that part of the process is handled entirely by the client. It's very difficult to convince a security officer that a key generated on an uncontrolled device can be trusted for authentication against Sarbox servers.

Our decision, much to my dismay, was to disable public key authentication site-wide. I feel like we're throwing the baby out with the bath water, but at the same time I understand the need to audit system access, and be able to enforce policy. I'm anxious for our Solaris 9 fleet to turn over to Solaris 10 so we can begin using the more capable version of SSH it includes.

Friday, September 08, 2006

AppleCare not up to Sun Service

One of the big reasons I was so excited to move my desktop environment to Mac OS X was its underlying UNIX operating environment. Being a UNIX guy, I'm well aware of how well instrumented it is, and how surgically it can (often) be debugged. This, of course is in contrast to the Windows world where it has become not only common, but almost accepted that troubleshooting step #1 is to reboot and sacrifice a chicken.

Over time my G5 was starting to crash with increasing frequency. At first it was once in a rare while, although I was still surprised that it happened. Recently it accelerated to the point where it crashed almost once per day. Given how much I paid to have rock solid hardware, and AppleCare behind it, this was not acceptable to me. So, I finally grew tired of my Mac's grey text-box of death and called my friendly AppleCare representative.

I began by telling my story, and adding a detail I felt was critical. Each time the system crashed I generated a crash dump report, and the stack trace always pointed back to the USB driver. I grabbed the text below from another site as an example, but it's very similar to what I was seeing.

Unresolved kernel trap(cpu 0): 0x300 - Data access DAR=0xdeadbeef PC=0x0e692550
Latest crash info for cpu 0:
Exception state (sv=0x0EB5DA00)
PC=0x0E692550; MSR=0x00009030; DAR=0xDEADBEEF; DSISR=0x42000000; LR=0x0E692530;
R1=0x081DBC20; XCP=0x0000000C (0x300 - Data access)
Backtrace:
0x0E6924A8 0x00213A88 0x00213884 0x002141D4 0x00214830
0x00204CB0 0x00204C74
Kernel loadable modules in backtrace (with dependencies):
dependency:
Proceeding back via exception chain:
Exception state (sv=0x0EB5DA00)
previously dumped as "Latest" state. skipping...
Exception state (sv=0x0EB64A00)
PC=0x00000000; MSR=0x0000D030; DAR=0x00000000;
DSISR=0x00000000; LR=0x00000000; R1=0x00000000; XCP=0x00000000 (Unknown)

Note that USB line? I was seeing it in EVERY crash. This tends to be something worth investigating. In my case I have a USB card reader attached to the USB ports in the bottom of my CinemaDisplay, and a Palm Pilot USB cable plugged into the front of my case. Another observation I made was that each time my system crashed it was essentially idle. It usually happened at night, or while I was at work. I would return to the sound of a jet engine coming from that silver box.

My first suggestion was that we look at the panic logs and try to identify the faulty components, but this didn't get too much traction. AppleCare is set up so that if the basic rubber stamp checks (slightly better than a reboot, but not by much) fail, they redirect you to a local store. In my case this wasn't appealing. It's a 40 minute drive from here, and the issue is intermittent. I could end up being without my Mac for more than a week if things went well.

So, we went through and erased all the caches and preferences, then reset the NVRAM. My system was brought back to factory specs, although I really did almost nothing abnormal to it. I don't use funky extensions or other hacks; I use mainstream well supported stuff.

Much to my surprise, the system has been stable since we did all this. I'll be the first to eat my words, but I'm not used to voodoo troubleshooting. This was like chemotherapy where we just bombard the system in hopes of getting all the cancerous code. I'm used to working in a surgical environment where we see CPU 0 corrupting data on an interval that indicates it needs replacing.

As much as I'm a hopeless fan of Solaris, I have to say that I don't think it's a huge quality difference between Sun and Apple that gives me this uneasy feeling about this experience; I think it's the quality of Sun Service. They are used to dealing with mission critical servers more than art-critical desktops. No offense to the Mac world - I'm one of you... But it's a very different world.

Thursday, August 31, 2006

Initology 101: A lesson in proper use of Solaris run control scripts

Starting and stopping applications through init scripts ought to be a simple thing that doesn't cause much debate, but in fact it's just the opposite. I routinely see servers with functional but non-standard artifacts nested in the rc directories. I also hear many justifications for these configurations; some reasonable, others somewhat less so. But in the end, I believe that a systems engineering approach to using init scripts will filter the options, and this article intends to do just that.

There are three specific conventions that I want to address:

1. Which run levels should be used for starting and stopping typical applications?
2. Should a symbolic link (sym-link) or hard link be used?
3. How should a link be disabled?

Let's begin with identifying the correct run levels to start and stop a common application. By common application I mean something that is not a core part of the operating system, but rather in the application layer that depends on the operating environment's core features. Oracle and web servers are common examples of what I consider common applications. Knowing that the Solaris Operating Environment has well defined run level states, the first step is to consult the web site for your particular Solaris version and refer to those definitions. Let's take the case of Solaris 9 (9/05), which is the last release in the Solaris 9 series. I am not going to address Solaris 10 in this context because it uses the new Service Management Facility as part of the new Predictive Self Healing feature to replace init scripts.

According to the Solaris 9 (9/04) System Administration Guide: Basic Administration, Section 8: Run Levels and Boot Files, we have the following run levels and explanations:

Run Level   Description
0           Shut down all processes and power down to ok> prompt (sparc).
S           Run as a single user with some file systems mounted and accessible.
1           Administrative state with access to all file systems, but no user logins permitted.
2           Multi-user state. For normal operations. Multiple users can access the system and all file systems. All daemons are running except for the NFS server daemons.
3           Multi-user state: For normal operations with NFS resources shared. This is the default run level for the Solaris environment.
4           Alternative MU state. This is not used by Solaris, but is available for site customization if needed. I recommend NOT using it.
5           Power down after shutting down all processes.
6           Reboot the system.

In theory, we need to consider that a system may transition from any run level to any other run level. This means that when the system enters run level S, if our application is running, we need to ensure it is stopped. The same thing goes for 0, 1, and 2. Run level three is the conventional system state associated with end user applications being loaded. Putting this into practice, we will need to install the following links to fully integrate with Solaris' run levels:
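
As a hedged sketch of that link set, assume a hypothetical init script named fooapp and illustrative sequence numbers (S95 and K05 are assumptions, not values taken from the guide). RC_ROOT is also hypothetical: it defaults to a scratch directory so the sketch can be tried without touching a live /etc.

```shell
#!/bin/sh
# Sketch only: "fooapp", S95 and K05 are illustrative assumptions.
# RC_ROOT defaults to a scratch directory; on a real server it would be /etc.
RC_ROOT=${RC_ROOT:-${TMPDIR:-/tmp}/rc-sketch.$$}
APP=fooapp

mkdir -p "$RC_ROOT/init.d"
for level in 0 1 2 3 S; do
    mkdir -p "$RC_ROOT/rc$level.d"
done

# Stand-in init script so the links have something to point at.
if [ ! -f "$RC_ROOT/init.d/$APP" ]; then
    printf '#!/bin/sh\necho "$0 $1"\n' > "$RC_ROOT/init.d/$APP"
    chmod 744 "$RC_ROOT/init.d/$APP"
fi

# Start the application when entering run level 3 ...
ln "$RC_ROOT/init.d/$APP" "$RC_ROOT/rc3.d/S95$APP"

# ... and stop it when entering any of the other run levels.
for level in 0 1 2 S; do
    ln "$RC_ROOT/init.d/$APP" "$RC_ROOT/rc$level.d/K05$APP"
done
```

That gives one start link in rc3.d and a kill link in each of rc0.d, rc1.d, rc2.d and rcS.d, all sharing the inode of the script in init.d.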


These will ensure that our application is started in run level 3, and stopped in any other run level. This contrasts with what I see in most data centers where rc scripts are installed to run level 2 or 3 for start up, and 0 for shut down. While this approach can work for reboots it has a downfall. How many times have you been told that before patching you need to reboot a server into single user mode? This is because kill scripts are not installed for all applications for all run level transitions. I still advocate rebooting into single user mode to be safe, but in a perfect world this would not be necessary.

Having selected the run control directories, you are now ready to put the links in place. But wait! You have another decision to make. Should you use a symbolic link or a hard link? There are all kinds of reasons for and against either method if you approach the question from an emotional standpoint. However, as a Solaris Jedi, you do not allow your emotions to control you. You look for standards.

Referring again to the web site, we return to the Solaris 9 (9/04) System Administration Guide: Basic Administration. This time, to Section 8, How to Add a Run Control Script. The examples on the page clearly show how to use the ln command to create a hard link. This is where the discussion should end. You didn't write Solaris, and you didn't do the integration testing. You are disciplined, and you follow standards; this is the way of the Jedi.

I have heard numerous arguments for using sym-links in place of hard links, and I believe each of them stems from not fully understanding how UNIX file system inodes work, and how Solaris commands can be used to understand them. Using the "ls -i" command you can prove that the files reference the same inode, and are thus the same.

cgh@soleil{/etc/rc0.d}# ls -li /etc/rc3.d/S90samba
9731 -rwxr--r-- 6 root sys 324 Jan 14 2006 /etc/rc3.d/S90samba*
cgh@soleil{/etc/rc0.d}# ls -li /etc/init.d/samba
9731 -rwxr--r-- 6 root sys 324 Jan 14 2006 /etc/init.d/samba*

Notice the first field in each record shows the integer, 9731? That is the inode number. The next field to attend to is the third. In this case, a "6" for each record. This refers to the link count, or number of links that point to the same piece of data.

Another approach to observing all rc links associated with an init script is to use the find command to search a branch of the file system for the inode number matching the init script. Let's look at the standard Samba service included with Solaris 10. We know from the prior example that inode #9731 references the samba script. The following command will seek out all of the hard links:

cgh@soleil{/etc/rc0.d}# find /etc/rc?.d -inum 9731

If these links were symbolic the task would not be as simple, and we would not have the benefit of a link counter to ensure the integrity of our boots.
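
To try the inode reasoning without touching a live system, here is a small sketch that builds a scratch tree (the directory layout and the name "demo" are assumptions for illustration), hard-links one script into two rc directories, and then lets find locate every name that shares the inode.

```shell
#!/bin/sh
# Sketch only: the scratch directory and the "demo" script are hypothetical.
SCRATCH=${TMPDIR:-/tmp}/inode-sketch.$$
mkdir -p "$SCRATCH/init.d" "$SCRATCH/rc0.d" "$SCRATCH/rc3.d"
echo '#!/bin/sh' > "$SCRATCH/init.d/demo"

# Two hard links: same inode, and the link count rises to 3.
ln "$SCRATCH/init.d/demo" "$SCRATCH/rc3.d/S90demo"
ln "$SCRATCH/init.d/demo" "$SCRATCH/rc0.d/K10demo"

# The first field of "ls -i" is the inode number.
INODE=`ls -i "$SCRATCH/init.d/demo" | awk '{print $1}'`

# Every path that shares that inode, one per line.
LINKS=`find "$SCRATCH" -inum "$INODE" | sort`
echo "$LINKS"
```

The listing should show three paths: the script in init.d plus its start and kill links.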

The last facet of initology I want to discuss is proper convention for disabling an init script on a Solaris server. As with the above examples, the correct process comes right out of the Basic Administration Guide, Section 8. The init scripts only process files that begin with an "S" or a "K". I most often see the upper-case letter replaced with lower case. The number two method I've observed is to remove the links altogether, leaving (hopefully) the init script in place.

The correct process for disabling an init script is almost always to prepend an underscore. The underscore stands out clearly in the list while lowercase characters tend to have less contrast next to the upper case entries. It sounds trivial, but how good is your eyesight at 3am after your pager goes off? Another benefit is the grouping of all disabled scripts in the directory listing so you can tell at a glance what is turned off. Finally, by not removing it altogether we can preserve the ordering of the scripts, which in some cases is critical. Take a look at the example below, and hopefully my suggestions will be apparent:

cgh@soleil{/etc/rc3.d}# ls -l
total 44
-rw-r--r-- 1 root sys 1285 Jan 21 2005 README
-rwxr--r-- 6 root sys 474 Jan 21 2005 S16boot.server*
-rwxr--r-- 6 root sys 1649 Jan 8 2005 S50apache*
-rwxr--r-- 6 root sys 5840 Jan 29 2004 S52imq*
-rwxr-xr-x 1 root sys 491 Apr 10 12:49 S75seaport*
-rwxr--r-- 6 root sys 685 Jan 21 2005 S76snmpdx*
-rwxr--r-- 6 root sys 1125 Jan 21 2005 S77dmi*
-rwxr--r-- 6 root sys 344 Jan 21 2005 S80mipagent*
-rwxr--r-- 6 root sys 513 May 15 19:21 S81volmgt*
-rwxr-xr-x 5 root sys 2225 Apr 10 12:49 S82initsma*
-rwxr--r-- 5 root sys 824 May 26 2004 S84appserv*
-rwxr--r-- 6 root sys 324 Jan 14 2006 S90samba*
-rw-r--r-- 1 root root 0 Aug 31 21:31 _S92foodb
-rw-r--r-- 1 root root 0 Aug 31 21:31 _S95fooapp
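
A minimal sketch of the rename itself, assuming a scratch directory standing in for /etc/rc3.d and using the S92foodb name from the listing above. The loop at the end is only an illustration of a start-script glob, not the actual Solaris rc code; it shows that a name beginning with an underscore no longer matches S*.

```shell
#!/bin/sh
# Sketch only: RC_DIR is a scratch stand-in for /etc/rc3.d.
RC_DIR=${TMPDIR:-/tmp}/rc3-sketch.$$
mkdir -p "$RC_DIR"
: > "$RC_DIR/S90samba"
: > "$RC_DIR/S92foodb"

# Disable foodb by prepending an underscore; the sequence number survives.
mv "$RC_DIR/S92foodb" "$RC_DIR/_S92foodb"

# Illustrative start loop: only names beginning with "S" are picked up.
STARTED=""
for f in "$RC_DIR"/S*; do
    [ -f "$f" ] || continue
    name=`basename "$f"`
    STARTED="$STARTED $name"
done
echo "would start:$STARTED"
```

Re-enabling is the reverse mv, and because the number was kept the script drops back into its original place in the ordering.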

Henceforth, you will properly integrate your scripts with the entire run level facility using hard links. When those magical links need to be disabled you will prepend underscores to them. You are now a master of the Solaris init scripts, and ready to carry this knowledge to others. You are also ready to explore the Solaris 10 SMF and enjoy all that it has to offer.

Wednesday, August 23, 2006

Spotlight on Richard McDougall

If you haven't yet visited Richard McDougall's Blog you should fire up your browser and head over. I had the pleasure of meeting Richard at a SunUP Network Conference in Singapore where we were both giving presentations. We met up again at later conferences in Sydney, Australia and Boston in the same scenario, and he hit it out of the park every time he got in front of customers. It's very rare to meet someone as brilliant as Richard who is also so down to Earth and generous with his knowledge; he is a true Jedi Master in the land of Solaris.

One characteristic you observe early in Richard's presentations is his enthusiasm for Solaris and its potential. This article on Chip Multi-Threading is a classic example of his style, and was what inspired me to write this entry. I remember him speaking about some of Solaris Volume Manager's (SVM) new (at the time) features which were specifically designed to address the reasons customers had chosen Veritas Volume Manager. Rather than attacking the message with the technical nuts and bolts, he hit on a few topics and delivered the message that Sun was listening. I am certain that more people re-examined SVM after his delivery than any speeds and feeds preso would have motivated. A true Jedi master delivers important messages without patronizing, by understanding the intended recipients.

The other item I want to draw your attention to is his new set of books: Solaris Internals and Solaris(TM) Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris (Hardcover) which were just delivered to me from Amazon. First of all, I hate poorly bound books. I buy books to use as reference manuals - tools of my trade. These books feel like professional tools that you will appreciate returning to. Remember buying that Calculus book in college? The one that weighs 25 lbs? This is that book. I love it! They are expensive, but good books aren't cheap, and the investment the author made in sharing his skills isn't cheap either. The first edition of Solaris Internals is well known as the authoritative reference on Solaris plumbing, and with all of the exciting changes Solaris 10 brings, this book is timely. I'm anxious to dig into it and post a review, but in the meantime please check the books out. This page has more information, and sample content.

I'd like to start paying tribute to some of the Jedi Masters I've benefited from, and this post serves as the first. Please take a moment to read Richard's Blog. Check back frequently - if he posts it, you should know about it. And if you need a diversion from Solaris, he's also a great photographer.

Sunday, August 20, 2006

The verdict: Ubuntu Linux is a keeper

I'm writing now from the keyboard of my reborn laptop, having just completed the installation and configuration of Ubuntu Linux on it and happily retired its installation of Windows XP. Since this blog is really about Sun Solaris and Systems Engineering I don't want to spend too much time talking about this from a technical standpoint. It does have relevance to the theme as we all need a portable means of working on Solaris systems. If you use UNIX as your primary operating environment, you know how awkward it is to depend on Windows as your interface to the systems you support.

So far, Ubuntu "just works" with no headaches at all; it auto-detects and configures my Netgear WG511 "G" Network Card, and can successfully enter and exit both hibernate and suspend modes. These were the two big headaches for me under Fedora Core. I am really impressed that the special volume and mute keys worked as well. Those used to require installing a separate Thinkpad buttons package called tpb. The boot screens look slick, the theme is very clean and coherent, and the desktop is clean and EMPTY. I love that! I give it two thumbs up. It looks like I'm finally going to learn the Debian flavor of Linux after years of being a die-hard Red-Hat camper.

Now let me clarify this position; I'm not changing my opinion about Mac OS being the ultimate desktop. But, I can obtain an old IBM Thinkpad T20 for a fraction of the cost of a PowerBook or MacBook. I wouldn't want to process my photographs on it, but for a tool that lets me perform systems work, and use typical Office Software, I'm very happy.

Saturday, August 19, 2006

Microsoft's Genuine Advantage

I have an older IBM Thinkpad T23 laptop which I purchased after sending most of my scrapyard through eBay. I bought it with the intention of running 90% Linux, and occasionally using Windows when I need some odd utility, or have to connect to something that only speaks Windows. The T23 is a rock solid machine, and being far from the bleeding edge, also has pretty decent hardware compatibility. With 1GB of RAM and a 1GHz CPU, the machine is plenty fast for its intended role as a terminal, web browser, email client, and occasional OpenOffice platform. I bet the most used application on it was gnome-terminal if I really analyzed the accounting records; nothing stressful.

When I first loaded Fedora Linux it was a simple process to get the machine usable. Usable and optimal turned out to be divided by a full-strength, adult size, bang-a-roo of a headache. The little things like getting it to play MP3s didn't faze me too much. In fact, taken one by one the entire list isn't anything that can't be handled. The problem is that I'm tired of having to handle things. I just want my computers to let me do what I want without HAVING to hack. I'd rather hack by choice than for base survival.

I was able to get Wireless ethernet working after some digging, but what really sent me over the edge was ACPI. My power consumption was awful, and getting it to be even close to Windows proved as complex as tuning 100 Oracle instances fighting for the resources of a SparcStation 5. Not fun at all. Eventually, I decided that despite my inability to mentally mesh with Windows' gears I would dump Linux and stick to the mainstream.

I bought a copy of Windows XP Pro off eBay, complete with hologram media, funky sticker, and all of those gimmicky little things they do to make you think you're getting something official and important. I downloaded all the updates, I filled out the registration, I did all the things someone would do when they are an IT professional who wants to be legitimate. I used it for about a year with no issues, including the "Windows Genuine Advantage" thingy, which used to think I had a legitimate copy.

Today, after a long hiatus, my laptop was booted and it informed me that Windows Genuine Advantage had changed its mind. Warning boxes were popping up left and right, and graciously giving me the opportunity to "purchase genuine Windows". You know what? I already did. It was shrink wrapped, and had so many gimmicky little security things that it was gaudy. And now you want to give me an opportunity to do it again? No thanks. From a quick Google search it looks like I'm not the first person to be annoyed.

I did notice that my system clock was really goofed up, and I've heard that the validation process involves hardware checks, so maybe something in my configuration triggered it. I don't know, but frankly I don't care. I don't want to know why it happened. I'm going back to a world that doesn't include helpful paper clips and other ridiculous instantiations of a help system.

Since my battery is nearly dead, I've decided not to worry about ACPI. Windows is being scrapped tonight and I'm going to run either Ubuntu or Fedora Linux. I'm not crazy about Solaris on the desktop because it has less standard productivity software, and the updates for non-Solaris software are not convenient. I'm the #1 fan for servers, but on the desktop I'm a Linux guy until I can afford a PowerBook or MacBook.

I'll post more about the final choice I make, but I felt it necessary to document this eve of liberation for all to see. And now I must end this post as I have a date with fdisk to catch...

Thursday, August 03, 2006

IM Ruining Grammar?

This one is a bit off theme, but I couldn't resist. Apparently it has been found that the prolonged use of IM does not truly impair one's grammatical ability. Thank the University of Toronto for this piece of knowledge...

Are you serious? It's not grammar that gets hurt when IM is abused. It's one's social skills. The article I mention above is really talking about kids who get carried away, but living in a cube farm, I've seen the adult version as well. At some point, we've all been guilty of IM'ing someone close enough to hit with a paper airplane.

Between email, voicemail, wikis, and IM, not to mention remote work, just about everything is getting in the way of developing the personal relationships that foster good working environments. When I've met someone in person I immediately feel more at ease trusting them for the role they will play in a project.

Body language plays a HUGE role in our ability to communicate effectively, and to dismiss it for the "efficiency" of electronic communication is naive at best. The next time you need to talk to someone, walk to the other side of the building, or schedule the time to drive to their site. You'll be glad you did, and they probably will too.

The demise of Linux

I read a quick editorial which suggests that Ubuntu Linux is going to be the downfall of Red Hat. The premise being that as Red Hat grew more commercial, and abandoned its community in favor of its stockholders, the sys-admins who used Red Hat as their desktop OS drifted away from the Red Hat camp. As Ubuntu came into being, and did so in a very strong way, those SA-types who were driven away from Red Hat will now want to put what they are more familiar with (Ubuntu) on their servers when they have the choice.

This whole discussion brought me right back down memory lane. I started using Linux before Red Hat existed with a few early Slackware distributions. I remember writing all those 3" diskettes - somewhere around 80 of them by the time I burned the X-windows distribution as well. Shortly after came Red Hat, and at that time I was helping to set up the first Linux environment at SUNY Plattsburgh. We switched over from Slackware to Red Hat, and loved it.

I ended up sticking with Red Hat for the next 10 years. It was the OS of choice when I led a project to build servers for our local Boy Scouts of America council, and remained there until Red Hat went totally commercial, burning the bridges out from under us. Make no mistake, I was extremely disappointed with their decision. We switched over to Fedora Core, and for the most part it has been a smooth transition. Despite its big red "DEVELOPMENT" stamp, Fedora has been very good to our availability. In fact, at the moment we've got an impressive uptime on an FC2 system:

# uptime
21:41:15 up 412 days, 2:37, 2 users, load average: 0.00, 0.00, 0.00

But thinking back on the experience I have an entirely different memory of what direction I was forced in, and where I wanted to be. The places I've worked would not consider using Linux for their production environments. Linux has made some great inroads into the corporate world, but there aren't a lot of Fortune 500 companies running their SAP central instance on Linux. I'm sorry, it's just not happening. HP-UX, AIX, and Solaris are king in the land of mission-critical highly scalable UNIX servers.

Although Solaris never made a particularly compelling desktop, it's what I've wanted to use on every server I've ever built. It's rock-solid, well documented, and very cohesive. When you use Solaris and Sun products, you rarely get the impression that 10,000 individual developers all tried to do it "their way" when the final build was cast in stone. The first thing that always stopped me was support. Solaris x86 had pathetic support and commitment in the past. It's so incredibly painful to migrate between operating environments that I never wanted to risk Solaris x86 being yanked - which it was.

The second big barrier was cost. If you went with Sparc, you had to have money. Lots of money. Oodles of money! What I had was a basement full of x86 architecture hardware, and the not-for-profits I volunteer at had the same. There was simply no funding for shiny Sun hardware no matter how badly we wanted it.

And then the sleeping giant awoke. After being pummelled by the dot-com crash, Sun figured out what went wrong, and executed one of the most amazing feats of corporate inertia change I've ever seen. In a very short time frame, support for Solaris x86 was restored at a full commitment level. And it was made free. Then they continued to make their Java Enterprise System free to download and use as well.

So, the making of a fantastic Linux in Ubuntu may hurt Red Hat, but it's not what will deliver the killing blow. Red Hat has an opportunity right now to try to pull off a corporate inertia swing of Sun's magnitude. They need to restore faith in the community, restore the religion they destroyed, and find some kind of innovation to draw people back in. Solaris has done all of this and created an affordable support model that doesn't intimidate the small businesses who were once driven to Linux.

The first blood has been drawn by Solaris, but the second wound is far deeper. This second wound is bleeding internally and missing a lot of coverage. Mac OS-X is the killer desktop. If you have a reason to be using UNIX on a desktop, then using anything other than Mac OS-X is a tough sell in my book. Hardware is a bit more expensive, sure. But it's the best of every world, and solid as a rock. It doesn't hurt that it looks great either.

A recent seminar I attended talked about business models and knowing when to have the guts to drop a design. The idea was that you need to look at things you're developing and ask whether or not they give you a long-term sustainable advantage. I have to use that same litmus test to examine Linux. In the server world I can get free and open Solaris which is out-innovating Linux in my observation. And on the desktop, while Linux continues to improve, it's not even close to Mac OS-X.

In the end, these observations mean little to the tech-hobbyist who loves Linux for its religion. But in the business world, religion doesn't make IT choices. Competitive advantage does.

Sunday, July 23, 2006

Solaris - the Wine

Solaris Wine Label
Originally uploaded by cghubbell.
I believe this may be one of the best system tools for evenings when a long day of UNIX has left the Force unbalanced in your mind. I stumbled on this wine while perusing a liquor store in Horseheads, NY. I hardly drink at all these days, but definitely enjoy a nice glass of wine when I unwind after a long day. If you're into this kind of thing and enjoy seeing Solaris outside the data center, check out the Solaris Winery!

Friday, July 21, 2006

Did you hear what I sed?

When battling the dark side of UNIX, it is critical that you not let your eyes betray your instincts. Windows teaches you to trust what you see, which is in itself a good reason to be wary. Today's lesson will involve our old friend sed, and a not so well known friend, od (octal dump).

I was working on a section of code, which decided whether or not arguments were passed by checking in a case statement for an empty string, or anything other than an empty string. It looks like this:

case "$SID_LIST" in
"" ) # No arguments passed, go with default.
    echo "Stopping all configured Oracle databases."
    su oracle -c "$ORACLE_HOME/bin/dbshut"
    ;;
* ) # SID list passed - pass it on to dbshut
    echo "Stopping specified Oracle database(s)."
    su oracle -c "$ORACLE_HOME/bin/dbshut $SID_LIST"
    ;;
esac

This structure comes from the Oracle 10g dbshut script which I'm applying some customizations to. As a result, I'm trying not to completely restructure the script. If I were to write it myself, I'd be more tempted to put this in an if statement, and test for a null string (test -z). But, since I'm working with someone else's code, I'm trying to stick to minimizing my impact.
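
For comparison, here is a sketch of the if / test -z form mentioned above. The function name stop_databases is hypothetical, and the su command is echoed rather than executed so the sketch can be tried on any machine.

```shell
#!/bin/sh
# Sketch only: stop_databases is a hypothetical name, and the su call is
# echoed rather than executed.
stop_databases() {
    SID_LIST="$*"
    if [ -z "$SID_LIST" ]; then
        # No arguments passed, go with default.
        echo "Stopping all configured Oracle databases."
        echo "su oracle -c \"\$ORACLE_HOME/bin/dbshut\""
    else
        # SID list passed - pass it on to dbshut.
        echo "Stopping specified Oracle database(s)."
        echo "su oracle -c \"\$ORACLE_HOME/bin/dbshut $SID_LIST\""
    fi
}

stop_databases
stop_databases TESTDBA TESTDBB
```

Note that test -z, just like the "" pattern in the case statement, only matches a truly empty string; a value holding a single space counts as non-empty.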

If you call this particular script with arguments (an argument is a token that follows the command, like do_something RED BLUE) I detect the extra arguments from the command line, and put them into a variable called SID_LIST as follows:

ACTION=$1 # Assign first argument to action
shift; # shift arg pointer past $1 (action)
SID_LIST="$*" # Assign any remaining args to the argument list

So, when I call the script with a command like, "dbshut TESTDBA TESTDBB" I expect to see SID_LIST end up with the values "TESTDBA TESTDBB". Good enough! But what if someone repeats an argument? We don't want to iterate through arguments we have already processed, so I decided to add my own personal garnish of ensuring the list is unique. And this little detour is where the fun began...

The modification I made looked like this:

SID_LIST=`echo $SID_LIST | tr " " "\n" | sort | uniq | tr "\n" " "`

Let's break this down into logical steps:
First, translate any spaces into newlines because the next commands in the pipeline will expect to see things in multi-line form. This turns "one two one three" into:

one
two
one
three

Next, sort the output alphabetically to ensure similar items are immediately next to each other, which is necessary for the following piece of the command. Now we send the sorted list to a program called uniq which removes duplicates. The output now looks something like this (remember, it's alphabetical):

one
three
two

Finally, we need to get it back into a single-line format, so we send the output into the reverse of the first tr command which replaces any newlines with spaces. Our final output looks like this:

"one three two"

Having conquered that challenge, I integrated the code fragment and observed its behavior. Oddly, I discovered that whether or not I supplied arguments, the case statement always resolved my input to be in the "*" branch rather than the "" branch. After taking a closer look, I discovered that my output was not what it appeared... In fact, the final newline had been replaced with a space by the last tr command, and with no arguments supplied my string looked like this:

" "

Because SID_LIST did not match "", the case statement selected the "*" branch instead. Feeling quite impressed with my mastery of the debugging arts, I surmised that a simple sed statement could whack my terminating space, and leave me with the desired empty string that would set my logic free. But alas, it was not to be...

I left my editor, and started playing on the command line. First, I created a simulation by setting a variable to contain a series of pretend arguments:

testbox{cgh}$ A="one two two three three four"

Next, I simulated my script's pipeline to make sure I could duplicate the problem. I surrounded the output with brackets to make the trailing space more obvious...

testbox{cgh}$ B=`echo $A | tr " " "\n" | sort | uniq | tr "\n" " "`
testbox{cgh}$ echo "[$B]"
[four one three two ]

Excellent, now we can test a fix... I put a sample string with a trailing space into a variable, and sent it into a sed command. The sed script is pretty straightforward: search for a space character immediately before the end of the line, and replace it with nothing. This breaks down to the three divisions between slashes: [s]ubstitute/[space]$(end of line)/replace_with_nothing/.

testbox{cgh}$ X="four one three two "
testbox{cgh}$ echo "[`echo $X | sed -e 's/ $//'`]"
[four one three two]

And behold, it worked! I now take the tested sed script, and attach it to the end of the pipeline...

testbox{cgh}$ A="one two two three three four"
testbox{cgh}$ B=`echo $A | tr " " "\n" | sort | uniq | tr "\n" " " | sed -e 's/ $//'`
testbox{cgh}$ echo "[$B]"
[]

What happened to my string? I copied and pasted the code, and it should have worked! Here is the part where we learn to trust our instincts, and not what we see. Let's revisit our input variables using The Force...

Earlier, we set $X to contain a sample set of arguments with a trailing space, and that input string worked nicely. Maybe the input changed somewhere in the pipeline to not exactly reflect the test conditions in our experiment... Here's how we can compare them:

testbox{cgh}$ echo $X | od -c
0000000   f   o   u   r       o   n   e       t   h   r   e   e       t
0000020   w   o  \n
testbox{cgh}$ echo $A | tr " " "\n" | sort | uniq | tr "\n" " " | od -c
0000000   f   o   u   r       o   n   e       t   h   r   e   e       t
0000020   w   o

Do you see it? The difference is that our experiment's $X string is terminated by a newline character, while our pure pipeline string has lost its newline. This becomes a problem for the sed command which removes our trailing space. Sed acts when it sees an input terminator like a newline or ctrl-D character. In this pipeline, sed is never getting what it needs.

The solution is fairly simple, although not pretty. I broke this pipeline into two statements, and sent my sed script its input from an echo command rather than directly through the pipeline. This allows echo to put a newline onto the string and make sed happy. Here's what it looks like:

SID_LIST=`echo $SID_LIST | tr " " "\n" | sort | uniq | tr "\n" " "`
SID_LIST=`echo $SID_LIST | sed -e 's/ $//'`
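
Another way to sidestep the missing newline, sketched here on the assumption that the shell supports the ${var% } pattern-removal expansion (ksh and POSIX shells do; the original Bourne shell does not), is to let the shell trim the trailing space itself. The function name dedupe_list is hypothetical, and sort -u stands in for sort | uniq.

```shell
#!/bin/sh
# Sketch only: dedupe_list is a hypothetical helper name.
dedupe_list() {
    # $* is left unquoted on purpose so the shell splits it into words,
    # which printf then emits one per line.
    list=`printf '%s\n' $* | sort -u | tr '\n' ' '`
    # Remove one trailing space, if present, without involving sed.
    printf '%s\n' "${list% }"
}

SID_LIST=`dedupe_list one two two three three four`
echo "[$SID_LIST]"
EMPTY=`dedupe_list`
echo "[$EMPTY]"
```

With the sample arguments this prints [four one three two], and with no arguments it prints [], so the empty-string branch of the case statement can match again.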

This could be performed in other ways, my personal favorite being to reincarnate this script in Perl and eliminate all these pipelines and separate commands. But, by leaving it as-is I can keep the user base more comfortable with the language. It also serves as a great lesson for Jedi training, and so shall it remain.

Wednesday, July 19, 2006

Poor grammar isn't always a bad thing

If you write enough shell scripts you will eventually fall prey to your own comments. Unless you read my blog of course, in which case you will have saved hours of frustration!

Let's take a fictitious problem... You need to print the first and third columns of the /etc/passwd file so that a report can be generated correlating user IDs to user names. Being the UNIX monk that you are, you assure your management that a shell script can meet their every need, and there is really no reason to have an ODBC link from Microsoft Access to the passwd file.

You throw together some code, and it looks like this:

nawk 'BEGIN { FS=":" }
# We don't want to print anything but
# the first and third column
{print $1,$4}' /etc/passwd
exit 0

Looks like a nice tight algorithm, well commented, and generally a job well done. You pat yourself on the back and refill your coffee, ready for the next challenge. Not so fast... First you decide to test that script, and you see the following:

testbox{cgh}$ ./comtst.ksh
./comtst.ksh[6]: syntax error at line 6 : `'' unmatched

But how can this be? It's a simple script, and the logic is flawless! Let's test it to be sure...

testbox{cgh}$ nawk 'BEGIN { FS=":" } {print $1,$4}' /etc/passwd
root 1
daemon 1
bin 2
sys 3
adm 4
lp 8
uucp 5
nuucp 9
ftp 60001
smmsp 25
listen 4
nobody 60001
noaccess 60002
nobody4 65534
cgh 1000

It works... What is the problem here?

It turns out that the comments in the embedded nawk code are the problem. In this case, the apostrophe in "don't" closes the opening apostrophe at the beginning of the nawk statement, and the shell interprets the code like this:

#!/usr/bin/ksh
nawk 'BEGIN { FS=":" }
# We don'

So what we really do is pass nawk a syntactically incorrect program. Having figured it out, we re-write the code as follows:

nawk 'BEGIN { FS=":" }
# We do not want to print anything but
# the first and third column
{print $1,$4}' /etc/passwd
exit 0

There are two morals to this story: First, at the risk of repeating myself like a broken record, don't use multiple shells unless it's absolutely necessary because you run the risk of obscure interpretation problems. In this case, we could solve the problem by writing in Perl where there's no need to embed a second language.

The second moral is to always avoid using contractions and meta-characters in your comments. It makes for slightly longer comments, but if you strictly avoid the temptation, it is one less thing to worry about. This example was so simple that the bug is not hard to locate, but if you had a complex nawk script with its own subroutines buried in a complex shell script, it can be very frustrating trying to track it down.
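
If you cannot bear to give up the contraction, one possible workaround is to keep the awk program out of the shell's quoting entirely by writing it to a file and running it with -f. This is a sketch, not what the script above does. The temporary file name and the sample passwd line are made up for illustration, and I use plain awk on the assumption that it accepts the same program as nawk.

```shell
# Write the awk program to a temporary file. The quoted 'EOF' delimiter
# stops the shell from interpreting anything inside the here-document,
# so the apostrophe in the comment is harmless.
PROG=/tmp/cols.$$.awk
cat > "$PROG" <<'EOF'
BEGIN { FS = ":" }
# We don't want to print anything but
# the two fields selected below
{ print $1, $4 }
EOF

# Hypothetical passwd-style line used as test input.
RESULT=`echo "cgh:x:1000:1000:Test User:/home/cgh:/bin/ksh" | awk -f "$PROG"`
echo "$RESULT"

rm -f "$PROG"
```

The price is a second file to keep track of, which is its own kind of maintenance burden.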

The dark side will tempt you with contractions, but now your Jedi training has equipped you to calm your mind and type out those extra few characters. Until next time, may the code be with you.

Tuesday, July 11, 2006

Don't Shed Your Shell

I've said it before, and will say it again: switching interpreters in mid-code is a practice to avoid whenever possible. There are times when it cannot be avoided, but there are many more when you can sacrifice a bit of elegance for simpler maintenance.

As with most bugs, I was recently bitten by a dumb mistake. I needed the ability to look up Solaris Resource Manager Project information using tags embedded in the description field. For example, SID=TESTDB is how I would specify an Oracle database SID. I wrote a Korn shell function called getprojbyattrib() which accomplished this very thing. Tested on its own, it worked wonderfully. When I went to integrate it with the existing Oracle start-up scripts I ran into some problems. They turned out to be easy to debug, but the root cause was my old enemy of incompatible interpreters.

This new shell library function is used to figure whether or not an SRM project is configured for a given Oracle database. If one and only one match is returned, then the database is started in a project container. Any other condition means that the database is started without SRM. To help in this cause, I embedded a counter in the function to return how many matches were found. The code in question was simple:

# Keep track of the number of projects we find while outputting
# them so the final tally can be used as a success indicator.
echo "$PRJ"

Make note of the seventh line of code which does the incrementing. This is a Korn shell specific operation. When the calling code from the oracle startup script referenced this, it gave an error which told me that it had interpreted line #7 at "PRJCOUNT=$". This is because the Bourne shell doesn't understand the operation.

The fix is simple. Either switch the calling script to use the Korn shell interpreter because Korn is a superset of Bourne, or change the increment code to be Bourne-friendly by using either bc or expr.
PRJCOUNT=`/usr/bin/expr $PRJCOUNT + 1`
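
To show the Bourne-friendly counter in context, here is a minimal sketch of the pattern. The project names in the loop are hypothetical stand-ins for whatever the real lookup returns, and I call expr without the full path so the sketch is not tied to one system's layout.

```shell
# Hypothetical list of matching projects; the real function would
# produce these from its lookup.
MATCHES="proj.testdb"

PRJCOUNT=0
for PRJ in $MATCHES; do
    echo "$PRJ"
    # Bourne-friendly increment. The Korn shell spelling would be
    # PRJCOUNT=$((PRJCOUNT + 1)), which an old Bourne shell rejects.
    PRJCOUNT=`expr $PRJCOUNT + 1`
done

# One and only one match means the database gets a project container.
if [ "$PRJCOUNT" -eq 1 ]; then
    DECISION="start in project"
else
    DECISION="start without SRM"
fi
echo "$DECISION"
```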

Interestingly, the library function was written with a header that specified Korn shell as its interpreter:


This becomes irrelevant when you are sourcing functions or variables as the whole point is to have your calling shell get access to these objects.
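
A quick way to convince yourself of this is the following sketch. The library file name and the count_up function are invented for the demonstration. The interpreter line at the top of the library is treated as an ordinary comment when the file is sourced, so whatever shell does the sourcing is the shell that runs the function.

```shell
# Build a throwaway library whose header names the Korn shell.
LIB=/tmp/projlib.$$
cat > "$LIB" <<'EOF'
#!/bin/ksh
# The line above is only a comment when this file is sourced.
count_up () {
    expr "$1" + 1
}
EOF

# Source the library: its functions now run inside the calling shell,
# no matter what the header says.
. "$LIB"
NEXT=`count_up 6`
echo "next value: $NEXT"

rm -f "$LIB"
```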

So what did I do? At first I switched the calling code, but some afterthought led me to work with the underlying Bourne shell subset so the library would be more portable. I don't really like Bourne shell as Korn is much more capable, but in this case portability is weighted more heavily than elegance.

Repeat after me: Switching interpreters in mid-code is something to be avoided whenever possible.

Friday, June 02, 2006

Solaris Date command and epoch time

I can't count over the years how often I've wanted to output a date stamp in seconds since the epoch to make duration calculations simple.

Just to prove that I'm not 100% biased towards Solaris let me point out that my Linux scripts all enjoy the ability to call /bin/date with a simple switch that outputs time in my desired format: date +%s.

In Solaris, neither /usr/bin/date nor /usr/xpg4/bin/date supports output in the "seconds since epoch" format. This is what we call low hanging fruit as enhancements go. Unfortunately, the fruit still hangs.

Perl does a very nice job of handling date math and epoch conversion, but that requires a separate interpreter, and when I'm in shell I don't like to jump in and out of other interpreters. I found a pretty cool hack that seems to avoid an external interpreter, and gets me what I want...

Since we know that the system keeps track in the format we want, we need to find a utility that uses a system call... In this case I made a crazy guess and found the time() call. Here's what it looks like:

testbox# man -s 2 time
System Calls time(2)

time - get time

time_t time(time_t *tloc);

The time() function returns the value of time in seconds
since 00:00:00 UTC, January 1, 1970.

So, making a second wild guess I assumed that our beloved /usr/bin/date command uses the time() system call. Let's take a look... If we use the truss command to check out system calls and returns we should find what we're looking for. We'll use the grep command to look for the time() call.

There's a catch though... Truss is going to dump output to stderr, and grep looks for input on stdin. Those paths won't cross. So, we need to redirect stderr into the stdout stream before piping it all over to grep.

testbox# truss /usr/bin/date 2>&1 | grep ^time
time() = 1149275766

Cool! The 2>&1 takes stderr (file descriptor 2) and redirects (>) it to the same place as stdout (&1, file descriptor 1), so both streams flow down the pipe. The grep then cuts the 40+ lines of system calls down to the one we care about.
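
If the redirection still feels like magic, this small sketch shows the effect without needing truss. The noisy function is a made-up stand-in that writes one line to each stream, and grep -c simply counts how many lines made it through the pipe.

```shell
# Hypothetical command that writes one line to stdout and one to stderr.
noisy () {
    echo "to stdout"
    echo "to stderr" 1>&2
}

# With 2>&1 both streams travel down the pipe, so grep sees two lines.
BOTH=`noisy 2>&1 | grep -c "^to std"`

# Without it only stdout reaches grep. Here stderr is discarded so the
# stray line does not clutter the terminal.
ONLY_STDOUT=`noisy 2>/dev/null | grep -c "^to std"`

echo "with 2>&1:    $BOTH"
echo "without 2>&1: $ONLY_STDOUT"
```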

It's not a standard interface, so any time we use it we run the risk of any OS patch breaking our algorithm. Perl would be a safer way to go, but it does require more overhead in terms of firing up the interpreter for such a simple thing. You'll have to decide for yourself whether or not this hack is useful to you, but I think it's a good one.

To clean it up and make it a bit better behaved we'll need to get rid of leading and trailing spaces, and output just what we need. Here's a quick script you can call or source...

/usr/bin/truss /usr/bin/date 2>&1 | nawk -F= '/^time\(\)/ {gsub(/ /,"",$2);print $2}'
exit $?

And finally, let's see it in action:

testbox# ./edate

So there you have it. A way to get epoch time without writing a single line of C.
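
For scripts that have to run on both Linux and Solaris, a wrapper along these lines may be worth considering. It is only a sketch: the epoch function name is my own invention, it tries date +%s first, and it falls back to Perl (assumed to be installed) when the result is not a plain number. You could just as well put the truss hack in the fallback branch.

```shell
# Hypothetical helper: print seconds since the epoch.
epoch () {
    NOW=`date +%s 2>/dev/null`
    case "$NOW" in
        ''|*[!0-9]*)
            # date did not understand %s, so fall back to Perl's time().
            perl -e 'print time, "\n"'
            ;;
        *)
            echo "$NOW"
            ;;
    esac
}

# The payoff: duration calculations become simple subtraction.
START=`epoch`
sleep 1
END=`epoch`
ELAPSED=`expr $END - $START`
echo "elapsed seconds: $ELAPSED"
```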

Veritas Foundation Suite goes free

Veritas Volume Manager is free. I'm not joking! Check it out for yourself! Of course, there are some restrictions involved.

  • Only for Linux and Solaris x64

  • <= 2 CPU cores

  • Max of 4 user volumes

I'm not sure I "get" this move. Historically, there has been a philosophical battle going on in the enterprise engineering space over whether 'tis best to manage your root disks with Sun's Solaris Volume Manager (SVM) or Veritas' Volume Manager (VxVM).

On one side, VxVM had the strength of a volume management solution that scaled to infinity and beyond. It's not much more difficult to manage 1000 disks on VxVM than it is to manage 10. The corollary to this point is that if you require VxVM for your data farm, then you can reduce system architecture complexity by standardizing on use of VxVM for management of root disks as well. Simple is good, right?

On the other side, SVM is extremely easy to use, and extremely easy to recover from problems with. Given its simple layering approach, it's also much more difficult to get into trouble with. SVM has long been a favorite with Sys Admins because it means they never have to worry about the dreaded unencapsulation dance. Again, simple is good, right?

From a purely academic standpoint, the complexity of using multiple Volume Managers seems unnecessary. In practice, I believe the opposite is true. Root disks are a very different beast than data disks. For one thing, they tend to be configured and then forgotten (barring a disk failure). In contrast, the data farm is always churning with reconfiguration or expansion. Given the opposing dynamics of these types of volumes, it makes sense to use tools with different strengths.

When a root mirror pair needs work, you want a simple and reliable way to solve the problem and move on with very little chance of making an unrecoverable mistake. This is SVM in a nutshell. Its tight OS integration, and shallow learning curve mean that even a junior SA has a great probability of pulling off that disk replacement.

Unfortunately, if you have terabytes of disk storage hooked to your Enterprise 20k you want a volume manager that is very flexible and agile. Historically, SVM has not been that tool. Its GUI does not display huge data farms efficiently, and its volume naming conventions get unruly when the disk count grows. This is where VxVM comes in.

VxVM uses abstraction and well designed (although complex) interfaces that allow mountains of data to be displayed quickly through either command line interface or GUI. Historically, it also had much more power and flexibility than Sun's offerings. This meant you needed VxVM for data farms.

The best of both worlds, in my opinion, is using SVM for Operating System volumes and internal disks, and VxVM for all external storage whether SAN or DAS. If you only have limited external storage then just use SVM. Really, simple is good!

Returning to Veritas' recent change in marketing strategy, it seems that they may be trying to counter some of the arguments for using Sun's integrated solution. It's clear that ZFS is going to provide a very powerful facility for storage provisioning within Solaris 10, and SVM already supports many of the core features which used to be Veritas' key selling points. Veritas has one key advantage in that they have a time tested solution available NOW. ZFS isn't mature enough yet for the mission critical enterprise, but that's a very short term disadvantage. The OpenSolaris model means that when ZFS makes it into a hardware update of Solaris 10, it's going to be 95% there. I'll give it six months before they master that remaining 5% which wider distribution will open up.

Is the right strategy to pick two seldom-used platforms to make VxVM free on, and then limit use to four volumes on a product which is only advantageous when volumes are plentiful? I don't think so. If you have a simple system you're not going to WANT to use VxVM because SVM is so much simpler.

To me it looks like Veritas is trying to use a seeding strategy for a market that it has no chance of enticing. While Veritas has a great product, it seems that they don't fully understand their niche. I'm putting my money on SVM and ZFS.

Tuesday, May 30, 2006

Testing for correct usage in shell functions

Here's a simple touch you can apply to your shell scripts to aid in debugging when they grow to become monstrous and you can't remember the syntax of all your subroutines any better than you can remember the 10th digit in Pi, which happens to be 3 for those who care about such things.

Although not strictly required to take advantage of this tweak, I recommend you begin by using good headers for each subroutine. I won't go into each one, but a specific entry I always make is usage. For example, if a subroutine do_foo takes arguments arg_one and arg_two, the header would look like this:

# ------
# do_foo
# ------
# USE: do_foo ARG_ONE ARG_TWO
# DESC: Execute foo functionality
# PRE: na
# POST: na
# ERR: na
do_foo () {
    : # function body goes here
} #end do_foo

The line I want you to pay attention to in the above code begins with "USE:" (4th line). This line specifies the interface which a user of your code should be aware of. You are telling them that this code expects TWO arguments. Now, you can get fancy and use EBNF like syntax to identify optional arguments, but let's keep it simple for this example and just recognize that we have established an interface.

What can we do as a developer to make sure that when someone calls our code, they do not get something unexpected? We can check to make sure they follow our instructions. It's simple enough, although you can certainly take it to greater depths. Let's go back to our do_foo example and put a check in place...

do_foo () {
    test $# -eq 2 || exit 1
} #end do_foo

Let's break down the line I just added... test lives in /usr/bin and should be a fluent part of your shell vocabulary. We are "testing" to see if the number of arguments ($#) is equal to the integer 2. If not (symbolized by ||) then we exit with non-zero status, which is the UNIX convention for something other than success. The next level of effort would include writing a shell equivalent to Perl's die subroutine. This would allow an error message to accompany the exit. We'll save that for another article.
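
Here is a small sketch of the check in action. The echo in the function body and the sample arguments are placeholders of my own. Each call runs inside backticks, which means a subshell, so the exit only ends that subshell and we can inspect its status afterwards.

```shell
do_foo () {
    # Refuse to run unless exactly two arguments were supplied.
    test $# -eq 2 || exit 1
    echo "foo: $1 $2"
} #end do_foo

# Correct usage: two arguments.
GOOD_RC=0
GOOD=`do_foo alpha beta` || GOOD_RC=$?

# Incorrect usage: one argument, so the check fires before any work is done.
BAD_RC=0
BAD=`do_foo alpha` || BAD_RC=$?

echo "good call: status $GOOD_RC, output [$GOOD]"
echo "bad call:  status $BAD_RC, output [$BAD]"
```

Called directly rather than through backticks, the failing check would end the whole script, which is exactly the loud failure we are after.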

So, what's the benefit of adding this code-bloat to our subroutine? It's common to have a function that uses optional arguments and acts differently depending on what arguments it receives. If the function expects ARG_ONE and ARG_TWO, and you call it with only ARG_ONE, it may assume that ARG_TWO is equal to "". In that case, the output may be "object not found" rather than "Whoa! You made a mistake calling me!". If you were depending on a specific output, this could cause later code blocks to break.

Here's a more specific example. If we are using the ldaplist command to check on project information, we will get two totally different sets of output if we omit a second argument. Pay particular attention to the command and arguments in the examples below:

testbox# ldaplist project
dn: solarisprojectname=srs,ou=projects,dc=mydomain,dc=com
dn: solarisprojectname=bar,ou=projects,dc=mydomain,dc=com
dn: solarisprojectname=foo,ou=projects,dc=mydomain,dc=com
dn: solarisprojectname=group.staff,ou=projects,dc=mydomain,dc=com
dn: solarisprojectname=default,ou=projects,dc=mydomain,dc=com
dn: solarisprojectname=noproject,ou=projects,dc=mydomain,dc=com
dn: solarisprojectname=user.root,ou=projects,dc=mydomain,dc=com

In contrast, what we REALLY wanted was only one line that matches our criteria, not the whole set of data.

testbox# ldaplist project solarisprojectname=user.root
dn: solarisprojectname=user.root,ou=projects,dc=mydomain,dc=com

If we use an argument checker, the error would be caught immediately rather than passing on a long list of irrelevant data to whatever we do next. In this case it's particularly ugly because both outputs are identically formatted. Maybe you'd find the problem quickly, maybe you wouldn't.

When your code gets to be hundreds of lines long and you need to start debugging obscure behavior, it can save you a lot of time to write self-policing code. Chances are that if you make a simple mistake calling that subroutine it will fail immediately rather than doing the wrong thing in a hard to find way. A line of prevention is worth an hour of debugging!

Thursday, May 25, 2006

Using syslog with Perl

I recently had an occasion to write a fairly simple Perl script that checks for rhosts files in any home directory which is configured on a system. Nothing fancy, but very useful. After getting through the file detection logic I was left with the question, what now? Should I write a custom log file? Should I call /usr/bin/logger?

As always, I looked for precedents and standard facilities. The first thing that came to mind was syslog. And of course, the fact that I was using Perl led me to believe that I wasn't going to need to execute an external process (the "duct tape hack" as I call it). I view the shell as another language, and something never really feels right when I need to embed one language within another. Don't even get me started about embedding big awk scripts inside shell scripts... That's going to be a future topic.

The duct tape method is bad for a number of reasons. There is overhead associated with forking and executing a new child process from your main script. If you are running awk and sed, or other tools thousands or millions of times against a file then you are forcing Solaris to execute far more system calls than necessary. By keeping it all inside Perl and using modules, you can let the interpreter do the work, and realize a good part of the efficiency that C system programming gives you. I'll save the specifics of this for a later time - we need to dig into the syslog example.

In this case I quickly found the standard Sys::Syslog module. This little gem makes it a snap to log output. I won't go into the Solaris syslog facility here, but suffice it to say that you'll need to arrive at your intended Facility and Priority before going farther. For my purposes I went with User and LOG_NOTICE.

To begin with, we need to include some libraries...

use Sys::Syslog;

When we want to set up the connection with syslog we do the following:

openlog($progname, 'pid', 'user');

The above line specifies that we will use the 'user' facility, which is typically what you should be using if you don't have a specific reason to go with one of the other options. It also specifies that we want to log the pid of the logging process with each entry. Logging the pid is a convention that isn't always necessary, but I like it. The first part, $progname is a variable that stores the name of the script. This deserves a little extra attention.

Since I'm known to change the name of my scripts on occasion I don't like to hard code the name. In shell scripts I usually set a progname variable using /usr/bin/basename with the $0 argument. $0 always contains the first element in the array of command line variables. So, if I called a script named foo with the arguments one, two, three, the command would look something like this:

# /home/me/foo one two three

The resulting array $* would be:


To identify our program name we want the first array element. However, we don't want all that extra garbage of the path. It makes for a messy syslog. The basename UNIX utility helps us to prune the entry. Here's an example in shell:

$ basename /home/me/foo
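
To make that concrete, here is a short sketch of the same idea as it would appear inside a script. Since the sketch is not really installed as /home/me/foo, a variable stands in for $0 and set -- stands in for the command line. Both are assumptions made purely for the demonstration.

```shell
# Simulate the invocation "/home/me/foo one two three".
SCRIPT_PATH=/home/me/foo    # stands in for $0 in a real script
set -- one two three        # stands in for the command line arguments

# basename prunes the directory part and leaves just the program name.
PROGNAME=`basename "$SCRIPT_PATH"`

echo "program:   $PROGNAME"
echo "arg count: $#"
echo "first arg: $1"
```

In a real script the assignment would simply read PROGNAME=`basename "$0"`.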

If we want to do the equivalent in Perl without spawning an external process we can use the File::Basename module. Again, with a simple include at the top of our script this function becomes available to us:

use File::Basename;

Now we can put it all together and create an easily referenced identity check:

my $progname=basename("$0");

Why don't we just hard code the script name? After all, not everyone likes to refactor their code for fun. Besides the idea that we want our code to be maintenance free, there are times when one set of code may be called from links which have different names than the primary body. For example, let's assume that the script foo performs three functions: geta, getb, and getc. To make it easier to call these functions we want to be able to call these directly without duplicating code. Here's how we could do that:

# ls -l ~/bin
-r-xr-xr-x 1 root root 5256 Jun 8 2004 /usr/local/bin/foo
# ln ~/bin/foo ~/bin/geta
# ln ~/bin/foo ~/bin/getb
# ln ~/bin/foo ~/bin/getc

We can now call any of geta, getb, getc and actually call foo. With some simple logic blocks based on what $progname evaluates to we are able to create a convenient interface to a multi-functional program with centralized code. Nice! But I digress - let's get back to looking at syslog...
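
The "simple logic blocks" might look something like the sketch below. It is written in shell to match the other examples, although the same dispatch works in Perl with the $progname we just built. The scratch directory and the messages are invented, and the links are run through sh so the sketch does not depend on execute permissions.

```shell
# Build a scratch copy of foo that dispatches on the name it was called by.
DIR=/tmp/multi.$$
mkdir "$DIR"
cat > "$DIR/foo" <<'EOF'
progname=`basename "$0"`
case "$progname" in
    geta) echo "running geta" ;;
    getb) echo "running getb" ;;
    getc) echo "running getc" ;;
    *)    echo "usage: call me as geta, getb, or getc"; exit 1 ;;
esac
EOF

# Hard links give the one body of code several names.
ln "$DIR/foo" "$DIR/geta"
ln "$DIR/foo" "$DIR/getb"

OUT_A=`sh "$DIR/geta"`
OUT_B=`sh "$DIR/getb"`
echo "$OUT_A"
echo "$OUT_B"

rm -rf "$DIR"
```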

We have opened a connection to the syslog, and now is the moment of truth. Let's write a syslog entry...

syslog($priority, $msg);

Let's recap... I used a facility of user, and a priority of notice. I want to record the pid, and write a message. What does this look like when it's executed?

May 25 11:01:25 testbox rhostck[833]: rhosts file found at /u01/home/cgh

That was really easy, and it's much cleaner than executing the external logger utility because it's all inside Perl.

Tuesday, May 23, 2006

A plethora of ldapsearches...

If you're going to deploy a directory service for Solaris systems, and you are really lucky, your server and clients will all be using a Solaris version greater than 9. LDAP works nicely in 9, but it's a bit of a transition release. Only in Solaris 10 is Sun's commitment to LDAP clear. Let's take a look at one of the more frustrating examples of Solaris 9's transitionary status: The ldapsearch command.

ldapsearch comes in many different flavors. First is the native Solaris version which lives in /usr/bin. On Solaris 9 this version does not support SSL (-Z option). In Solaris 10 SSL is nicely supported through this client. Next we have the iPlanet flavor which lives in a dark and gloomy path: /usr/iplanet/ds5/shared/bin. This is installed by default with Solaris 9 and happily supports SSL despite its gloomy path. But wait, there's still one more! After installing the JES Directory Server you will find one more flavor of ldapsearch living in /usr/sadm/mps/admin/v5.2/shared/bin. Now that's an intuitive path. This last flavor will only be on your server, but I'd hate to leave it out of the fun.

As if having too many to choose from isn't enough, two of the ldapsearch flavors require proper setting of the LD_LIBRARY_PATH variable. When a dynamically linked binary requires a library that lives somewhere other than the system default (usually /usr/lib variants) it needs the LD_LIBRARY_PATH variable to tell it where to look.

Here's an example of a binary that needs the extra help from LD_LIBRARY_PATH:

testbox$ /usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch
ldapsearch: fatal: open failed: No such file or directory

So what happened? Let's take a closer look...

testbox$ truss /usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch
execve("/usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch", 0xFFBFFB54, 0xFFBFFB5C) argc = 1
resolvepath("/usr/lib/", "/usr/lib/", 1023) = 16
resolvepath("/usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch", "/usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch", 1023) = 46
stat("/usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch", 0xFFBFF928) = 0
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT
stat("../", 0xFFBFF430) Err#2 ENOENT
stat("../lib/", 0xFFBFF430) Err#2 ENOENT
stat("../../lib/", 0xFFBFF430) Err#2 ENOENT
stat("../../../lib/", 0xFFBFF430) Err#2 ENOENT
stat("../../../../lib/", 0xFFBFF430) Err#2 ENOENT
stat("../lib-private/", 0xFFBFF430) Err#2 ENOENT
stat("/usr/lib/", 0xFFBFF430) Err#2 ENOENT
ldapsearch: fatal: open failed: No such file or directory
write(2, " l d . s o . 1 : l d a".., 81) = 81
lwp_self() = 1

Here we can see Solaris trying to find the required dynamically linked library. It traverses 8 directories, each time returning the ENOENT error, which means "No such file or directory". So, job #1 is finding that library and acquainting it with the binary that's lost its way...

testbox$ grep /var/sadm/install/contents
/usr/appserver/lib/ f none 0755 root bin 380348 45505 1052289104 SUNWasu
/usr/dt/appconfig/SUNWns/ f none 0755 root sys 450716 23095 1032825102 SUNWnsb
/usr/iplanet/ds5/lib/ f none 0755 root bin 361976 55632 1013353620 IPLTdsu
/usr/lib/mps/ f none 0755 root bin 392416 44988 1100692806 SUNWldk
/usr/lib/mps/sparcv9/ f none 0755 root bin 433976 29179 1100692807 SUNWldkx

In this case, we know that the needed library is going to be used with the JES ldapsearch, so we'll guess that appserver's offering isn't quite what we want. /usr/iplanet looks tempting, and will probably work, but what we want is the /usr/lib/mps directory which is distributed with the Sun LDAP C SDK.

So now that we've found the missing library, let's plug it into the LD_LIBRARY_PATH and see what happens. I'm using the Korn shell, so if you're a C-Shell type you'll just have to translate on the fly.

testbox$ export LD_LIBRARY_PATH=/usr/lib/mps:/usr/lib/mps/sasl2
testbox$ sudo /usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch [...]
version: 1
dn: dc=foo,dc=com
objectClass: top
objectClass: domain
objectClass: nisDomainObject
dc: apps

It worked! (You didn't doubt me did you?) You may have noticed that I actually added two paths. After fixing the first missing library you would have discovered a second missing one which was identified and fixed the same way. I love a problem with multiple layers... Especially when layer #2 is the same solution I needed to peel layer #1. The other thing to note is that I abridged the command line for ldapsearch. Executing an ldap query with SSL can be like writing a book so I cut it short.

So, not only do you need to pick the right ldapsearch flavor, but you also need to set LD_LIBRARY_PATH accordingly. If you are using the Solaris native versions you don't need to do anything. But for JES and iPlanet versions, here's what you need:

  • iPlanet: LD_LIBRARY_PATH=/usr/lib/mps

  • JES: LD_LIBRARY_PATH=/usr/lib/mps:/usr/lib/mps/sasl2
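
Rather than exporting the variable for your whole session, you may prefer to set it for the one command that needs it. The wrapper below is a sketch under that assumption. The jes_ldapsearch name is my own, and the demonstration uses a harmless child shell in place of ldapsearch so it can run anywhere.

```shell
# Values taken from the list above for the JES flavor.
JES_LIBS=/usr/lib/mps:/usr/lib/mps/sasl2
JES_BIN=/usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch

# Hypothetical wrapper: an assignment in front of a command applies to
# that one process only, so the rest of the session is left untouched.
jes_ldapsearch () {
    LD_LIBRARY_PATH=$JES_LIBS "$JES_BIN" "$@"
}

# Show the scoping with a child shell standing in for ldapsearch.
BEFORE=${LD_LIBRARY_PATH:-}
SEEN=`LD_LIBRARY_PATH=$JES_LIBS sh -c 'echo "$LD_LIBRARY_PATH"'`
AFTER=${LD_LIBRARY_PATH:-}

echo "child saw: $SEEN"
```

This keeps the extra library directories away from every other binary you run, which is one less thing to trip over later.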

So which one should you use? Here's a quick flow to make that decision. If you are using Solaris 10, just go with /usr/bin/ldapsearch. It does everything without any hassle. If you are on 9, then a decision emerges. If you have an SSL-secured directory server you cannot use /usr/bin/ldapsearch. Typically, you will use the iPlanet version on Solaris 9, and if you are on the server itself, go with the JES version.
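
That flow can be written down as a small function. Treat it as a sketch of my reading of the paragraph above rather than a rule. The function name and its two arguments are invented: the first is the release string as printed by uname -r (5.9 for Solaris 9, 5.10 for Solaris 10), and the second says whether we are on the directory server itself. Anything other than 5.9 is assumed to be Solaris 10 or later.

```shell
# Hypothetical helper: print the ldapsearch path suggested by the flow.
pick_ldapsearch () {
    release=$1      # e.g. the output of uname -r
    on_server=$2    # "yes" when running on the directory server itself
    case "$release" in
        5.9)
            if [ "$on_server" = "yes" ]; then
                echo /usr/sadm/mps/admin/v5.2/shared/bin/ldapsearch
            else
                echo /usr/iplanet/ds5/shared/bin/ldapsearch
            fi
            ;;
        *)
            # Assumption: Solaris 10 or later, where the native client
            # handles SSL.
            echo /usr/bin/ldapsearch
            ;;
    esac
}

CLIENT_9=`pick_ldapsearch 5.9 no`
SERVER_9=`pick_ldapsearch 5.9 yes`
CLIENT_10=`pick_ldapsearch 5.10 no`
echo "Solaris 9 client:  $CLIENT_9"
echo "Solaris 9 server:  $SERVER_9"
echo "Solaris 10 client: $CLIENT_10"
```

Remember that the two non-native flavors still need the LD_LIBRARY_PATH settings listed earlier.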

So there you have it, a lot of hassle can be saved by deploying on Solaris 10 rather than 9. Most of what you'll need the Directory for will be handled by Solaris internals, so you won't need ldapsearch, for example, to authenticate users against the Directory Server. Where you will need ldapsearch is if you are storing custom entries in the directory, or executing a special query against it.