Monday, May 21, 2007

Who's on first? Identifying port utilization in Solaris

Setting up Apache2 on Solaris 10 is normally about as challenging as brushing your teeth. But in this case, I was humbled by an unexpected troubleshooting adventure. I needed to transfer a TWiki site from an Apache server running on Solaris 9 to an Apache2 server running on Solaris 10. Sounds pretty straightforward, but I abandoned discipline at one point in the game and that detour came back to bite me.

I started carelessly, thinking that to keep things simple I would just use the legacy Apache. This would avoid any initial headaches with module incompatibilities (if any existed). So, I started out by copying the config file into place and trying to start the daemons. It didn't work, and after a few minutes of fiddling with the new httpd.conf I changed course. My reasoning went something like this: "If I'm going to spend much time fiddling, I might as well fiddle with Apache2 and have something better than I started with." And so it began.

I stopped the legacy Apache daemons and followed a similar process with Apache2, ending with the same result: no daemons. I did some fiddling and located a minor typo I'd made in the configuration, which is of no consequence to this story. I issued an "svcadm restart apache2" command. Yeeha! Now I had five httpd processes just chomping at the bit for a chance to serve those Wiki pages.

Or did I? It turned out that no matter what I did with my web browser remotely or locally I couldn't get a response. So, I tried a quick telnet to port 80 to see what there was to see... And of course I received a response, so all must be well. Somewhere in my troubleshooting process I made two mistakes:

First, I didn't remove the httpd.conf file from /etc/apache, which means the legacy Apache starts up and conflicts with Apache2 on a reboot. I've already written an article that goes into some detail about why the current legacy Apache's integration isn't ideal, so I won't expand on my frustration in this one. This problem was quickly solved, and could have been avoided if I had adhered to my Jedi training.

Second, I assumed that when I directed a Telnet session to port 80 it was reaching the Apache2 server. In fact, it was not. I shut down the Apache2 server and again issued the Telnet command to port 80. Surprise! The same greeting appeared. So, some other process on the system had claimed port 80 before Apache could do so. Now... To find it!

Linux distributions typically ship with the lsof utility. This provides a quick and convenient way to identify which process is using which TCP port. Solaris doesn't have lsof in the integrated Open Source software (/usr/sfw) or the companion CD (/opt/sfw). It's not hard to obtain and compile, but it's just inconvenient enough that I'm inclined not to do it. My next logical question became, "What is the Solaris way to accomplish my goal?"
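
For comparison, here is roughly what the lookup looks like where lsof is available. This is a minimal sketch that assumes lsof is on the PATH and a POSIX shell; the function name lsof_port is only an illustrative wrapper, not part of lsof.

```shell
#!/bin/sh
# Hypothetical wrapper: list processes with a TCP socket on the given port.
# -n and -P skip host and port name lookups so the output shows raw numbers.
lsof_port() {
    if ! command -v lsof >/dev/null 2>&1; then
        echo "lsof is not installed" >&2
        return 1
    fi
    lsof -n -P -i "TCP:$1"
}
```

Calling lsof_port 80 as root lists every process holding a TCP socket on port 80, established connections as well as listeners.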

Solaris has no single native command that answers this question; it takes a shell script wrapped around the pfiles command. There are a number of similar scripts available on-line through a quick Google search. None are particularly complex, but they are complex enough that you wouldn't want to write one every time you need it. Here's what I ended up with:

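#!/bin/sh
# portpid: list the processes holding a socket bound to a given port.

# pfiles needs root to inspect processes owned by other users.
if [ `/usr/xpg4/bin/id -u` -ne 0 ]; then
    echo "ERROR: This script must run as root to access pfiles command."
    exit 1
fi

# Take the port from the first argument, or prompt for it.
if [ $# -eq 1 ]; then
    port=$1
else
    printf "which port?> "
    read port
    echo "Searching for processes using port $port..."
    echo
fi

# Check every process's open files for a local socket on that port.
for pid in `ps -ef -o pid | tail +2`
do
    foundport=`/usr/proc/bin/pfiles $pid 2>&1 | grep "sockname:" | egrep "port: $port$"`
    if [ "$foundport" != "" ]; then
        echo "proc: $pid, $foundport"
    fi
done

exit 0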


When executed, it will produce output similar to the following. Note that it requires root permissions to traverse the proc directories...

cgh@testbox{tmp}$ sudo ./portpid 80
proc: 902, sockname: AF_INET 0.0.0.0 port: 80
sockname: AF_INET 192.168.1.4 port: 80
sockname: AF_INET 127.0.0.1 port: 80
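
The filtering step of the script can be tried without root or pfiles by feeding it saved output. This is a small sketch; match_port is a hypothetical helper name, and it assumes the input looks like the sockname lines shown above.

```shell
#!/bin/sh
# Hypothetical helper: read pfiles-style text on stdin and print only the
# local socket lines ("sockname:") whose port equals $1.
# The trailing $ anchors the match so port 80 does not also match 8080.
match_port() {
    grep "sockname:" | grep "port: $1\$" | sed 's/^[[:space:]]*//'
}
```

On a live system it would be used as "pfiles 902 | match_port 80". Lines that describe the remote end of a connection (peername) are dropped, so only sockets bound locally to the port are reported.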


A quick "ps -ef" command told me that our Citrix server was to blame for the port conflict...


cgh@testbox{tmp}$ ps -ef | nawk '$2 ~ /^902$/ {print $0}'
ctxsrvr 902 1 0 May 18 ? 7:00 /opt/CTXSmf/slib/ctxxmld
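
The same lookup can be done without filtering the whole process table, since ps can be asked about a single PID. This is a sketch using the POSIX -p and -o options; describe_pid is just an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: print owner, PID, and full command line for one PID.
# The trailing "=" on each column name suppresses the header line.
describe_pid() {
    ps -o user= -o pid= -o args= -p "$1"
}
```

For the case above, "describe_pid 902" would print one line for the Citrix daemon.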


Ah ha! Problem solved. I'd like to see the Solaris engineering team add a "p" command, or an option to an existing command to make this functionality a standard part of Solaris. Another option would be to integrate the Linux syntax for the fuser command to make this possible.
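
For reference, the Linux fuser syntax looks roughly like this. It is a sketch that assumes the fuser from the psmisc package, as shipped by most Linux distributions; fuser_port is an illustrative wrapper name.

```shell
#!/bin/sh
# Hypothetical wrapper around the Linux (psmisc) fuser.
# -n tcp selects the TCP port namespace; -v adds user and command columns.
fuser_port() {
    if ! command -v fuser >/dev/null 2>&1; then
        echo "fuser is not installed" >&2
        return 1
    fi
    fuser -v -n tcp "$1"
}
```

Calling fuser_port 80 is equivalent to the shortcut form "fuser -v 80/tcp".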
