Monday, May 21, 2007

Who's on first? Identifying port utilization in Solaris

Setting up Apache2 on Solaris 10 is normally about as challenging as brushing your teeth. But in this case, I was humbled by an unexpected troubleshooting adventure. I needed to transfer a TWiki site from an Apache server running on Solaris 9 to an Apache2 server running on Solaris 10. Sounds pretty straightforward, but I abandoned discipline at one point in the game and that detour came back to bite me.

I started out carelessly, thinking that to make things simple I would just use the legacy Apache. This would save any initial headaches with module incompatibilities (if any existed). So, I started out copying the config file in place and trying to start the daemons. It didn't work, and after a few minutes of fiddling with the new httpd.conf I changed course. My reasoning went something like this: "If I'm going to spend much time fiddling, I might as well fiddle with Apache2 and have something better than I started with." And so it began.

I stopped the legacy Apache daemons and followed a similar process with Apache2, ending with the same result: No daemons. I did some fiddling and located a minor typo I'd made in the configuration which is not of consequence to this story. I issued a "svcadm restart apache2" command. Yeeha! Now I had five httpd processes just chomping at the bit for a chance to serve those Wiki pages.

Or did I? It turned out that no matter what I did with my web browser remotely or locally I couldn't get a response. So, I tried a quick telnet to port 80 to see what there was to see... And of course I received a response, so all must be well. Somewhere in my troubleshooting process I made two mistakes:

First, I didn't remove the httpd.conf file from /etc/apache, which means the legacy Apache starts up and conflicts with Apache2 on a reboot. I've already written an article that goes into some detail about why the current legacy Apache's integration isn't ideal, so I won't expand on my frustration in this one. This problem was quickly solved, and could have been avoided if I had adhered to my Jedi training.

Second, I assumed that when I directed a Telnet session to port 80 it was reaching the Apache2 server. In fact, it was not. I shut down the Apache2 server and again issued the Telnet command to port 80. Surprise! The same greeting appeared. So, some process on the system had claimed port 80 before Apache could do so. Now... To find it!

Linux distributions typically ship with the lsof utility. This provides a quick and convenient way to identify what process is using what TCP port. Solaris doesn't have lsof in the integrated Open Source software (/usr/sfw) or the companion CD (/opt/sfw). It's not hard to obtain and compile, but it's just inconvenient enough that I'm inclined not to do it. My next logical question became, "What is the Solaris way to accomplish my goal?"

Solaris has no native way to solve this issue without a shell script. There are a number of similar scripts available on-line through a quick Google search. None are particularly complex, but they are complex enough that you wouldn't want to rewrite one every time you need it. Here's what I ended up with:

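#!/bin/sh
# portpid: report which processes hold a socket bound to a given TCP port.

# pfiles can only inspect other users' processes when run as root.
if [ `/usr/xpg4/bin/id -u` -ne 0 ]; then
    echo "ERROR: This script must run as root to access pfiles command." >&2
    exit 1
fi

# Take the port from the command line, or prompt for it.
if [ $# -eq 1 ]; then
    port=$1
else
    printf "which port?> "
    read port
    echo "Searching for processes using port $port..."
    echo
fi

# Walk every PID (tail +2 skips the ps header line) and look for a
# "sockname:" entry whose line ends in the requested port number.
for pid in `ps -ef -o pid | tail +2`
do
    foundport=`/usr/proc/bin/pfiles $pid 2>&1 | grep "sockname:" | egrep "port: $port$"`
    if [ -n "$foundport" ]; then
        echo "proc: $pid, $foundport"
    fi
done

exit 0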
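
The matching step in the script above can be tried on its own, without root or pfiles. The following is a minimal sketch, assuming input shaped like the sockname lines shown in this post; the helper name match_port and the sample text are made up for illustration and are not a real capture.

```shell
#!/bin/sh
# match_port PORT: read pfiles-style text on stdin and print only the
# "sockname:" lines whose port is exactly PORT.
match_port() {
    # The trailing $ anchors the match to end of line, so asking for
    # port 80 does not also report 8080.
    grep "sockname:" | grep "port: $1\$"
}

# Hypothetical sample lines modeled on the output format in this post.
sample='sockname: AF_INET 0.0.0.0 port: 80
peername: AF_INET 10.0.0.5 port: 8080
sockname: AF_INET 127.0.0.1 port: 8080'

printf '%s\n' "$sample" | match_port 80
```

Only the first sample line should come back: the peername entry is dropped by the first grep, and the 8080 listener fails the end-of-line anchor.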


When executed, it will produce output similar to the following. Note that it requires root permissions to traverse the proc directories...

cgh@testbox{tmp}$ sudo ./portpid 80
proc: 902, sockname: AF_INET 0.0.0.0 port: 80
sockname: AF_INET 192.168.1.4 port: 80
sockname: AF_INET 127.0.0.1 port: 80


A quick "ps -ef" command told me that our Citrix server was to blame for the port conflict...


cgh@testbox{tmp}$ ps -ef | nawk '$2 ~ /^902$/ {print $0}'
ctxsrvr 902 1 0 May 18 ? 7:00 /opt/CTXSmf/slib/ctxxmld


Ah ha! Problem solved. I'd like to see the Solaris engineering team add a "p" command, or an option to an existing command to make this functionality a standard part of Solaris. Another option would be to integrate the Linux syntax for the fuser command to make this possible.

Friday, May 18, 2007

Apache in Solaris 10: 3 Simple Things I Would Change

The Apache legacy run control script in Solaris 10 (/etc/init.d/apache) provides an excellent example of a few practices to avoid when writing init scripts.

Take a look at the code snippet below:

if [ ! -f ${CONF_FILE} ]; then
exit 0
fi


Are you kidding me? Of course this is easy to debug, but let's look at what it does anyway: if the configuration file (/etc/apache/httpd.conf) is missing when you ask to start Apache, the script exits silently with a code of zero. In case you didn't catch the first four words of this paragraph I'll repeat them. Are you kidding me?

Here's a simple improvement...

if [ ! -f ${CONF_FILE} ]; then
echo "ERROR: ${CONF_FILE} not found. Exiting."
exit 1
fi


The first change was to exit with a non-zero status. Zero is the UNIX standard exit code representing successful completion. If the configuration file is missing and you request a startup, it should NOT exit with a zero status.

The second change is to provide a concise error message indicating why the exit code is going to be non-zero. There is no benefit to bolstering the cryptic nature of UNIX. In my mind the best systems are designed such that a tired SA at 4AM has a reasonable chance of accurate debugging and corrective action.
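
To see what a caller would observe with both changes in place, here is a small sketch that wraps the check in a function. The name require_conf and the path /nonexistent/httpd.conf are invented for the demonstration, and sending the message to stderr is my own preference rather than something the stock script does.

```shell
#!/bin/sh
# require_conf FILE: succeed only if FILE exists; otherwise explain why
# on stderr and return a non-zero status.
require_conf() {
    if [ ! -f "$1" ]; then
        echo "ERROR: $1 not found." >&2
        return 1
    fi
    return 0
}

# Hypothetical path, used only to trigger the failure branch.
status=0
require_conf /nonexistent/httpd.conf || status=$?
echo "exit status: $status"
```

Anything that checks the status (a wrapper script, a monitoring job, a tired SA typing "echo $?") now gets a truthful answer along with the reason.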

Having said all this, the reason the code is necessarily convoluted is that the not-yet-configured service has an active set of init scripts in the run control directories.

cgh@testbox{etc}$ ls -i /etc/init.d/apache
21813 /etc/init.d/apache*
cgh@testbox{etc}$ find /etc/rc?.d -inum 21813
/etc/rc0.d/K16apache
/etc/rc1.d/K16apache
/etc/rc2.d/K16apache
/etc/rc3.d/S50apache
/etc/rcS.d/K16apache
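
The two-step inode lookup above can be folded into one helper. This is a rough sketch, assuming a find that supports -inum (both the Solaris and GNU versions do); the function name find_links is my own invention.

```shell
#!/bin/sh
# find_links FILE DIR...: list every path under the given directories
# that shares an inode number with FILE (that is, a hard link to it).
find_links() {
    target=$1
    shift
    # ls -i prints "<inode> <name>"; keep only the inode number.
    inode=`ls -i "$target" | awk '{print $1}'`
    find "$@" -inum "$inode" -print
}
```

Usage would look like "find_links /etc/init.d/apache /etc/rc?.d". Keep in mind that inode numbers are only unique within a single file system, so a match found on a different file system is a coincidence, not a link.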


So the root cause of our problem is that someone decided to make it easy for someone who doesn't understand the Solaris Run Control facility to start Apache by simply creating the httpd.conf file. Is that really a good idea? I would argue that for many reasons it's a bad practice. If a service is not configured to run, it should not be active in any run level.

The third detail I would change is Solaris' default behavior of installing active sym-links in the legacy rc directories, and instead use an SMF manifest that adheres to standards.

None of this impacts the otherwise excellent web server that Sun has integrated into their OS, and I'm grateful that Sun has provided it in their standard OS rather than leaving it to the semi-integrated Companion CD. I would, however, like to see that integration brought up to Jedi standards.

5/21/07 Postscript: I probably should have made it clear that the Apache2 server is implemented nicely using SMF, and is probably what you ought to be using on Solaris 10 if you've decided to forego the JES Web Server. I don't think that excuses the older Apache server from maintaining Jedi discipline, but it does move the issue a bit toward the background.

Monday, May 07, 2007

Turn off the LAMP and Reuse Acronyms

I've never been a fan of the LAMP acronym because it's too restrictive. It gives the impression that to be socially responsible in the Linux community one needs to be a LAMP developer.

In this month's Linux Journal magazine I found an article explaining one perspective on why PostgreSQL is a more desirable database than MySQL. I've had the exact same thought process for years now. Truth be known, I also prefer Perl development to PHP, and I prefer running the stack on Solaris over Linux. I guess that SAPP doesn't have the same sexy ring as LAMP. There's probably an odd trademark thing with an ERP company as well.

Now before you get too bent out of shape, I am aware that the acronym has some poetic license with it, and people often swap Perl and PHP, and in theory any other letter can be swapped out. Why invent a new acronym that doesn't convey the real idea when a perfectly good acronym already exists?

There is nothing wrong with simply stating that an application is built on an Open Source Stack. The acronym OSS (Open Source Software) is well known and conveys a lot more than LAMP. It stands for a methodology rather than a point solution, and embraces the foundation that made "LAMP" so successful. Why limit yourself to MySQL and PHP? Wouldn't you be more valuable as an architect capable of leveraging the most appropriate components Open Source has to offer?