Sunday, July 23, 2006

Solaris - the Wine

[Image: Solaris Wine Label (soalriswine.jpg), originally uploaded by cghubbell.]
I believe this may be one of the best system tools for evenings when a long day of UNIX has left the Force unbalanced in your mind. I stumbled on this wine while perusing a liquor store in Horseheads, NY. I hardly drink at all these days, but definitely enjoy a nice glass of wine when I unwind after a long day. If you're into this kind of thing and enjoy seeing Solaris outside the data center, check out the Solaris Winery!

Friday, July 21, 2006

Did you hear what I sed?

When battling the dark side of UNIX, it is critical that you not let your eyes betray your instincts. Windows teaches you to trust what you see, which is in itself a good reason to be wary. Today's lesson will involve our old friend sed, and a not so well known friend, od (octal dump).

I was working on a section of code, which decided whether or not arguments were passed by checking in a case statement for an empty string, or anything other than an empty string. It looks like this:

case "$SID_LIST" in
    "" ) # No arguments passed, go with default.
        echo "Stopping all configured Oracle databases."
        su oracle -c "$ORACLE_HOME/bin/dbshut"
        ;;
    * ) # SID list passed - pass it on to dbshut
        echo "Stopping specified Oracle database(s)."
        su oracle -c "$ORACLE_HOME/bin/dbshut $SID_LIST"
        ;;
esac


This structure comes from the Oracle 10g dbshut script which I'm applying some customizations to. As a result, I'm trying not to completely restructure the script. If I were to write it myself, I'd be more tempted to put this in an if statement, and test for a null string (test -z). But, since I'm working with someone else's code, I'm trying to stick to minimizing my impact.
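For comparison, here is a minimal sketch of the if / test -z form mentioned above. The function name describe_shutdown is hypothetical, and the su and dbshut calls are replaced by plain echo so the fragment can run anywhere; it is not the Oracle script's code.

```shell
# Hypothetical helper: report which branch would run for a given SID list.
# The real script would call dbshut through su instead of just echoing.
describe_shutdown() {
    SID_LIST="$1"
    if test -z "$SID_LIST"; then
        # Empty string: no SIDs were supplied, go with the default.
        echo "Stopping all configured Oracle databases."
    else
        # Anything else: treat it as a list of SIDs.
        echo "Stopping specified Oracle database(s): $SID_LIST"
    fi
}

describe_shutdown ""
describe_shutdown "TESTDBA TESTDBB"
```

Note that test -z has the same blind spot as the case statement: a string holding a single space is not empty.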

If you call this particular script with arguments (an argument is a token that follows the command, like do_something RED BLUE), I detect the extra arguments on the command line and put them into a variable called SID_LIST as follows:

ACTION=$1 # Assign first argument to action
shift; # shift arg pointer past $1 (action)
SID_LIST="$*" # Assign any remaining args to the argument list
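To watch shift and $* at work without the full script, the positional parameters can be set by hand with set --. This is only a sketch: the action name stop is an assumption, since the post does not list the actions the script accepts.

```shell
# Simulate calling the script as: script stop TESTDBA TESTDBB
# ("stop" is a made-up action name used only for this illustration.)
set -- stop TESTDBA TESTDBB

ACTION=$1        # first argument is the action
shift            # discard $1; the old $2 becomes the new $1
SID_LIST="$*"    # whatever is left, joined by single spaces

echo "ACTION=[$ACTION] SID_LIST=[$SID_LIST]"
```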


So, when I call the script with a command like, "dbshut TESTDBA TESTDBB" I expect to see SID_LIST end up with the values "TESTDBA TESTDBB". Good enough! But what if someone repeats an argument? We don't want to iterate through arguments we have already processed, so I decided to add my own personal garnish of ensuring the list is unique. And this little detour is where the fun began...

The modification I made looked like this:

SID_LIST=`echo $SID_LIST | tr " " "\n" | sort | uniq | tr "\n" " "`


Let's break this down into logical steps:
First, translate any spaces into newlines because the next commands in the pipeline will expect to see things in multi-line form. This turns "one two one three" into:

one
two
one
three

Next, sort the output alphabetically to ensure similar items are immediately next to each other, which is necessary for the following piece of the command. Now we send the sorted list to a program called uniq, which removes adjacent duplicates. The output now looks something like this (remember, it's alphabetical):

one
three
two

Finally, we need to get it back into a single-line format, so we send the output into the reverse of the first tr command which replaces any newlines with spaces. Our final output looks like this:

"one three two"

Having conquered that challenge, I integrated the code fragment and observed its behavior. Oddly, I discovered that whether or not I supplied arguments, the case statement always resolved my input to be in the "*" branch rather than the "" branch. After taking a closer look, I discovered that my output was not what it appeared... In fact, the final newline had been replaced with a space by the last tr command, and my string looked like this:

"one[space]three[space]two[space]"

Because SID_LIST did not match "", the case statement selected the "*" branch instead. Feeling quite impressed with my mastery of the debugging arts, I surmised that a simple sed statement could whack my terminating space, and leave me with the desired empty string that would set my logic free. But alas, it was not to be...
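The branch selection is easy to reproduce in isolation. In this sketch the function name classify is hypothetical and the Oracle calls are left out; it only shows that an empty SID_LIST comes out of the pipeline as a single space, which no longer matches "".

```shell
# Hypothetical helper mirroring the two branches of the case statement.
classify() {
    case "$1" in
        "" ) echo "empty" ;;
        * )  echo "non-empty" ;;
    esac
}

EMPTY=""
# Same pipeline as in the script; with no words at all, the lone newline
# from echo is turned into a lone space by the final tr.
PIPED=`echo $EMPTY | tr " " "\n" | sort | uniq | tr "\n" " "`

classify "$EMPTY"
classify "$PIPED"
echo "[$PIPED]"
```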

I left my editor and started playing on the command line. First, I created a simulation by setting a variable to contain a series of pretend arguments:

testbox{cgh}$ A="one two two three three four"


Next, I simulated my script's pipeline to make sure I could duplicate the problem. I surrounded the output with brackets to make the trailing space more obvious...

testbox{cgh}$ B=`echo $A | tr " " "\n" | sort | uniq | tr "\n" " "`
testbox{cgh}$ echo "[$B]"
[four one three two ]


Excellent, now we can test a fix... I put a sample string with a trailing space into a variable, and sent it into a sed command. The sed script is pretty straightforward: look for a space character immediately before the end of the line, and replace it with nothing. This breaks down into the three divisions between slashes: [s]ubstitute/[space]$(end of line)/replace_with_nothing/.

testbox{cgh}$ X="four one three two "
testbox{cgh}$ echo "[`echo $X | sed -e 's/ $//'`]"
[four one three two]


And behold, it worked! I now take the tested sed script, and attach it to the end of the pipeline...

testbox{cgh}$ A="one two two three three four"
testbox{cgh}$ B=`echo $A | tr " " "\n" | sort | uniq | tr "\n" " " | sed -e 's/ $//'`
testbox{cgh}$ echo "[$B]"
[]


What happened to my string? I copied and pasted the code, and it should have worked! Here is the part where we learn to trust our instincts, and not what we see. Let's revisit our input variables using The Force...

Earlier, we set $X to contain a sample set of arguments with a trailing space, and that input string worked nicely. Maybe the input changed somewhere in the pipeline to not exactly reflect the test conditions in our experiment... Here's how we can compare them:

testbox{cgh}$ echo $X | od -c
0000000   f   o   u   r       o   n   e       t   h   r   e   e       t
0000020   w   o  \n
0000023
testbox{cgh}$ echo $A | tr " " "\n" | sort | uniq | tr "\n" " " | od -c
0000000   f   o   u   r       o   n   e       t   h   r   e   e       t
0000020   w   o    
0000023


Do you see it? The difference is that our experiment's $X string is terminated by a newline character, while our pure pipeline string has lost its newline. This becomes a problem for the sed command which removes our trailing space. Sed works on complete lines, and a line is only complete once it ends with a newline. In this pipeline the final tr has turned every newline into a space, so this sed never gets the terminator it needs and prints nothing.
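When reading od output feels like squinting at tea leaves, the same question can be asked more directly. The sketch below uses tail -c 1 instead of od, and the function name ends_with_newline is hypothetical; it only reports whether the last byte of its input is a newline.

```shell
# Hypothetical helper: does the data on stdin end with a newline?
# tail -c 1 keeps only the last byte. Command substitution strips a
# trailing newline, so an empty result means the last byte was "\n".
# (Completely empty input also reports "yes"; good enough for a sketch.)
ends_with_newline() {
    if [ -z "`tail -c 1`" ]; then
        echo "yes"
    else
        echo "no"
    fi
}

A="one two two three three four"

# Plain echo terminates its output with a newline.
echo $A | ends_with_newline

# The final tr has replaced that newline with a space.
echo $A | tr " " "\n" | sort | uniq | tr "\n" " " | ends_with_newline
```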

The solution is fairly simple, although not pretty. I broke this pipeline into two statements, and sent my sed script its input from an echo command rather than directly through the pipeline. This allows echo to put a newline onto the string and make sed happy. Here's what it looks like:

SID_LIST=`echo $SID_LIST | tr " " "\n" | sort | uniq | tr "\n" " "`
SID_LIST=`echo $SID_LIST | sed -e 's/ $//'`


This could be performed in other ways, my personal favorite being to reincarnate this script in Perl and eliminate all these pipelines and separate commands. But, by leaving it as-is I can keep the user base more comfortable with the language. It also serves as a great lesson for Jedi training, and so shall it remain.
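One such alternative that stays in the shell is sketched below. It swaps sort | uniq for sort -u and drops the second tr and the sed entirely, relying instead on the fact that an unquoted expansion is split into words and echo rejoins them with single spaces. The function name unique_words is hypothetical, and the sketch assumes the SIDs contain no whitespace, backslashes, or glob characters.

```shell
# Hypothetical helper: print the unique arguments, sorted, on one line.
unique_words() {
    # The inner pipeline emits one word per line. Because the backtick
    # result is unquoted, the shell splits it on whitespace and the outer
    # echo joins the words with single spaces: no trailing blank, and an
    # empty list yields an empty string.
    echo `echo $* | tr " " "\n" | sort -u`
}

SID_LIST=`unique_words one two two three three four`
echo "[$SID_LIST]"

SID_LIST=`unique_words`
echo "[$SID_LIST]"
```

The same word splitting is at work in the two-statement fix above: the unquoted echo $SID_LIST has already dropped the trailing space by the time sed sees the line.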

Wednesday, July 19, 2006

Poor grammar isn't always a bad thing

If you write enough shell scripts you will eventually fall prey to your own comments. Unless you read my blog of course, in which case you will have saved hours of frustration!

Let's take a fictitious problem... You need to print the first and third columns of the /etc/passwd file so that a report can be generated correlating user IDs to user names. Being the UNIX monk that you are, you assure your management that a shell script can meet their every need, and there is really no reason to have an ODBC link from Microsoft Access to the passwd file.

You throw together some code, and it looks like this:

#!/usr/bin/ksh
nawk 'BEGIN { FS=":" }
# We don't want to print anything but
# the first and third column
{print $1,$4}' /etc/passwd
exit 0


Looks like a nice tight algorithm, well commented, and generally a job well done. You pat yourself on the back and refill your coffee, ready for the next challenge. Not so fast... First you decide to test that script, and you see the following:

testbox{cgh}$ ./comtst.ksh
./comtst.ksh[6]: syntax error at line 6 : `'' unmatched
testbox{cgh}$


But how can this be? It's a simple script, and the logic is flawless! Let's test it to be sure...

testbox{cgh}$ nawk 'BEGIN { FS=":" } {print $1,$4}' /etc/passwd
root 1
daemon 1
bin 2
sys 3
adm 4
lp 8
uucp 5
nuucp 9
ftp 60001
smmsp 25
listen 4
nobody 60001
noaccess 60002
nobody4 65534
cgh 1000


It works... What is the problem here?

It turns out that the comments in the embedded nawk code are the problem. In this case, the apostrophe in "don't" closes the opening apostrophe at the beginning of the nawk statement, and the shell interprets the code like this:

#!/usr/bin/ksh
nawk 'BEGIN { FS=":" }
# We don'
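The quoting failure can be checked without running anything by asking a shell to parse the text only, using the -n option. This is a sketch under a few assumptions: the script text is held in variables rather than a file, awk stands in for nawk, and check_syntax is a hypothetical helper name.

```shell
# The broken version: the apostrophe in "don't" ends the quoted awk program.
BROKEN="awk 'BEGIN { FS=\":\" }
# We don't want to print anything but
# the first and third column
{print \$1,\$4}' /etc/passwd"

# The repaired version: no apostrophe inside the quoted program.
FIXED="awk 'BEGIN { FS=\":\" }
# We do not want to print anything but
# the first and third column
{print \$1,\$4}' /etc/passwd"

# Hypothetical helper: parse the given text with sh -n, which reads
# commands but does not execute them, and report the result.
check_syntax() {
    if echo "$1" | sh -n 2>/dev/null; then
        echo "parses cleanly"
    else
        echo "syntax error"
    fi
}

check_syntax "$BROKEN"
check_syntax "$FIXED"
```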


So what we really do is hand the shell an unterminated quoted string, and nawk never receives the program we intended. Having figured it out, we re-write the code as follows:

#!/usr/bin/ksh
nawk 'BEGIN { FS=":" }
# We do not want to print anything but
# the first and third column
{print $1,$4}' /etc/passwd
exit 0


There are two morals to this story: First, at the risk of repeating myself like a broken record, don't use multiple shells unless it's absolutely necessary because you run the risk of obscure interpretation problems. In this case, we could solve the problem by writing in Perl where there's no need to embed a second language.

The second moral is to always avoid using contractions and meta-characters in your comments. It makes for slightly longer comments, but if you strictly avoid the temptation, it is one less thing to worry about. This example was so simple that the bug is not hard to locate, but if you had a complex nawk script with its own subroutines buried in a complex shell script, it could be very frustrating trying to track it down.
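If a contraction simply must stay, another option is to move the comment out of the quoted program and into the surrounding shell, where an apostrophe in a comment is harmless. This is a minimal sketch: print_fields is a hypothetical function name, and awk stands in for nawk so the fragment runs outside Solaris.

```shell
# We don't want to print anything but the two fields the script selects.
# This comment belongs to the shell, outside the quoted awk program,
# so its apostrophe never takes part in quote matching.
print_fields() {
    # Same program as in the post, with the comments removed from it.
    awk 'BEGIN { FS=":" } { print $1, $4 }' "$@"
}

# Sample passwd-style line on stdin instead of the real /etc/passwd.
printf 'root:x:0:1:Super-User:/:/sbin/sh\n' | print_fields
```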

The dark side will tempt you with contractions, but now your Jedi training has equipped you to calm your mind and type out those extra few characters. Until next time, may the code be with you.

Tuesday, July 11, 2006

Don't Shed Your Shell

I've said it before, and will say it again: switching interpreters in mid-code is a practice to avoid whenever possible. There are times when it can't be avoided, but there are many times when you can sacrifice a bit of elegance for simpler maintenance.

As with most bugs, this one came down to a dumb mistake. I needed the ability to look up Solaris Resource Manager Project information using tags embedded in the description field. For example, SID=TESTDB is how I would specify an Oracle database SID. I wrote a Korn shell function called getprojbyattrib() which accomplished this very thing. Tested on its own, it worked wonderfully. When I went to integrate it with the existing Oracle start-up scripts I ran into some problems. They turned out to be easy to debug, but the root cause was my old enemy of incompatible interpreters.

This new shell library function is used to figure whether or not an SRM project is configured for a given Oracle database. If one and only one match is returned, then the database is started in a project container. Any other condition means that the database is started without SRM. To help in this cause, I embedded a counter in the function to return how many matches were found. The code in question was simple:

# Keep track of the number of projects we find while outputting
# them so the final tally can be used as a success indicator.
PRJCOUNT=0
for PRJ in $PRJLIST
do
    echo "$PRJ"
    PRJCOUNT=$(($PRJCOUNT+1))
done


Make note of the seventh line of code which does the incrementing. This is a Korn shell specific operation. When the calling code from the oracle startup script referenced this, it gave an error which told me that it had interpreted line #7 at "PRJCOUNT=$". This is because the Bourne shell doesn't understand the operation.

The fix is simple. Either switch the calling script to use the Korn shell interpreter because Korn is a superset of Bourne, or change the increment code to be Bourne-friendly by using either bc or expr:

PRJCOUNT=`/usr/bin/expr $PRJCOUNT + 1`
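Dropped back into the loop, the Bourne-friendly increment looks like the sketch below. The project names in PRJLIST are made up for illustration, and expr is called without its full path so the fragment is not tied to Solaris.

```shell
# Hypothetical project list; the real one comes from the project lookup.
PRJLIST="proj.testdba proj.testdbb proj.testdbc"

PRJCOUNT=0
for PRJ in $PRJLIST
do
    echo "$PRJ"
    # expr wants each operand and operator as a separate argument,
    # so the spaces around + are required.
    PRJCOUNT=`expr $PRJCOUNT + 1`
done

echo "Projects found: $PRJCOUNT"
```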

Interestingly, the library function was written with a header that specified Korn shell as its interpreter:

#!/bin/ksh


This becomes irrelevant when you are sourcing functions or variables as the whole point is to have your calling shell get access to these objects.
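A small experiment makes the point. In this sketch the library file, its path, and the function count_args are all hypothetical; the file carries a Korn shell header, yet once it is sourced with the dot command its contents are read by whichever shell is doing the sourcing.

```shell
# Write a throwaway library file with a Korn shell header.
LIB="${TMPDIR:-/tmp}/srmlib.$$"
cat > "$LIB" <<'EOF'
#!/bin/ksh
# When this file is sourced, the line above is just another comment.
count_args() {
    echo $#
}
EOF

# Source the library: no new interpreter is started, and the function
# is defined in the current shell.
. "$LIB"
rm -f "$LIB"

count_args TESTDBA TESTDBB TESTDBC
```

Because the calling shell does the parsing, any Korn-only syntax in the library would fail here exactly as it did in the Oracle start-up script.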

So what did I do? At first I switched the calling code, but on reflection I decided to work within the underlying Bourne shell subset so the library would be more portable. I don't really like Bourne shell, as Korn is much more capable, but in this case portability outweighs elegance.

Repeat after me: Switching interpreters in mid-code is something to be avoided whenever possible.