Most data centers I've encountered tackled their patching strategy a long time ago. Some may have revisited it when Live Upgrade was introduced, but in general the process doesn't change much once it is created. Why? Patching isn't glorious and exciting. We tend to take it for granted when it works, and "deal with it" when it doesn't. I have to admit I have been guilty of not paying a lot of attention to the guts of Solaris patching for years because all the sites I've worked at had a process and I was busy doing other things. Until now, that is.
I'm currently tasked with designing an Enterprise patching strategy for Solaris servers. What started out as a project I considered pretty dry turned into something I'm really glad to have the opportunity to work on. Why? Because I'm excited about the approach Sun is recommending. I think a lot of the things I used to dislike about patching Sun systems are on their way out.
If you haven't already seen it, Sun's On-Line Learning Center has a new course: Solaris 10 Patching Best Practices (WS-2700-S10). It's free, so even in the current climate of slashed training budgets you can still learn the new way of approaching updates. You should be able to get through it in an average work day and still keep up with email.
For a long time, sites with more advanced Sun support have been able to leverage a patch baseline known as EIS, or Enterprise Installation Standards. However, if you don't have some form of advanced relationship with Sun, or the xVM Operations Center (xVMOC), you don't have regular access to EIS. That leaves you with maintenance updates/upgrades, the Recommended cluster, the SunAlert cluster, or the "Dim Sum" approach of running an analysis off a current patchdiag.xref and installing the patchlist-du-jour. Which path is the right one?
Here's what you don't want to do: research all of Sun's white papers and best-practice documents, many of which remain available long after they've grown long in the tooth. The patching strategies and recommendations they offer are a snarled mess of contradictions that leads to confusion, frustration, and eventually rolling your own process because it's better than nothing. The good news is that Sun's new training course brings some sanity to the table.
The high-level recommendation from Sun is straightforward. Start with the patch/package utilities updates from SunSolve to ensure your patching tools aren't going to introduce problems. Then install either the latest maintenance upgrade (ideally) or the latest maintenance patch set; this gives you a clean, well-integrated baseline. Next, apply the SunAlert recommended cluster to pick up any critical fixes released since the last maintenance release. The training course implies that Sun plans to merge the Recommended and SunAlert clusters to reduce confusion, which would be another welcome improvement.
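To make that concrete, here is a rough sketch of the sequence as I would script it. Treat it as illustrative only: the patch ID, cluster file name, and installer script name are examples from bundles I've downloaded, and the README that ships with whatever you pull from SunSolve is the final word.

#!/bin/sh
# Illustrative ordering only; adjust IDs, paths, and script names to
# match the bundles you actually download from SunSolve.

# 1. Update the patch/package utilities first so the tooling itself
#    doesn't introduce problems (119254-xx is the SPARC patch utilities
#    patch; use the current revision, or the x86 equivalent).
patchadd /var/tmp/patches/119254-xx

# 2. Establish the baseline: ideally the latest maintenance upgrade,
#    otherwise the latest maintenance patch set.

# 3. Layer the SunAlert/Recommended cluster on top of the baseline.
#    Older bundles ship install_cluster, newer ones installcluster;
#    follow the cluster's README.
cd /var/tmp && unzip 10_Recommended.zip
cd 10_Recommended && ./install_cluster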
What's great about this approach? First, it's simple. I can grab a few clusters and put together an easy-to-understand, easy-to-implement, repeatable process. Second, I'm a huge fan of baselines. By minimizing the use of one-off patches, we move to grabbing a baseline which includes the required fix. That means that while I'm introducing more change, I'm introducing a set of changes that went through QA at Sun. It doesn't remove my testing responsibility, but it means I'm standing on the shoulders of giants rather than hoping for the best. Even if I have a phenomenal test suite, it's not going to be as mature or comprehensive as Sun's internal processes. Third, my environment is going to be more consistent, because all the Solaris 10 servers will eventually end up on the same MU. Today I have servers at similar patch levels sitting on a wild assortment of MUs.
While there's a lot more to the training content, the other big point made throughout is that you need to use Live Upgrade. It's not just a feature you may want to try; it's how you should be patching Sun systems. The catch, of course, is that not all systems are configured in a way that lends itself to LU. But the writing is on the wall, and my interpretation tells me I need to start (1) updating our site's reference architectures to move toward being LU-friendly, and (2) using LU on those systems that already support it conveniently so we start building site knowledge.
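To give a flavor of what that looks like in practice, here is a minimal LU patch cycle on a UFS-root box with a spare slice. The device name, boot environment name, and patch directory are placeholders, and luupgrade(1M) is the authority on the exact arguments your cluster's README recommends.

# Build an alternate boot environment on a spare slice (placeholder device).
lucreate -n patchedBE -m /:/dev/dsk/c0t1d0s0:ufs
# Apply a patch (or list of patches) to the inactive boot environment.
luupgrade -t -n patchedBE -s /var/tmp/patches 119254-xx
# Activate the new boot environment and reboot into it; reactivating the
# old one is the fallback path.
luactivate patchedBE
init 6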
Sunday, June 17, 2007
Sun Certification: To dig, or not to dig? (part 1 of 2)
Certification can be an almost religious debate in the technical community. One faction believes wholeheartedly that the measure of a technologist is his ambition and list of accomplishments. The other camp believes just as fervently that certifications are a demonstration of professional commitment and a common ground from which to base skill assessments. I am currently a Sun Certified System Administrator (SCSA) for Solaris 7 and Solaris 9, as well as a Sun Certified Network Administrator (SCNA) for Solaris 7. I have been studying avidly for the upgrade exam that will add Solaris 10 to my SCSA listing; I hope to pass it soon so I can reclaim the study hours for more interesting tasks.
I think just about everyone who has been in the field for a reasonable length of time has encountered the certification specter... You know, the guy who has his Master's degree in CS or IS/IT, plus MCSE, CCNA, SCSA, and a few others tossed in for good measure. They look like a lesser god on paper, but then you notice that once they log on to a system they can't write a script to save their life, and forget that shutting off the SSH daemon during business hours is a bad thing. These academic savants are a big reason why certifications have a bad name; they demonstrate the basis for the phrase, "just because you CAN doesn't mean you SHOULD." A certification, in my mind, is a commitment to understand the best practices and core tools within a product and to apply that knowledge actively to your solutions and daily work. A classic example is the proper use of init scripts, something the majority of system administrators I have crossed paths with never learned. This information is found easily in the Solaris System Administration documentation collection, so why is no one practicing it? In this case, it has nothing to do with it being a bad practice to follow... It's just a topic people do not bother to understand beyond the minimum required to make it work.
On the other hand, I have known many top-notch Solaris professionals who are not certified. They can run circles around me in both theory and practice, but never took the additional step. I don't respect them any less, because they have demonstrated a commitment to their field through practice. What I don't respect is the "average" SA who believes he could write the kernel scheduler in half the lines of code, but hasn't accomplished anything more advanced than setting up Apache virtual hosts or using Veritas Volume Manager to unencapsulate a root disk.
I've listened to this type of person lecturing from their soapbox about how they don't need a certification to prove their skills. Uh huh. But it might take the edge off the cowboy hat, and create a spark of thought-discipline. You see, being certified does not mean that you have to practice everything you learned. It means you have taken the time to understand in depth one way of doing things. The alternative is spending no time studying, and simply absorbing that which you cross paths with.
Another reason certifications have a bad name is that they do not address the real world. Exactly how could a one-hour exam possibly capture all the operational knowledge one gathers by the time they are ready to be certified? By now anyone reading my blog should be free of any doubt that I love Sun Microsystems. Having reminded you of that point first, I will now say that I am not a fan of Sun's certification strategy in the Solaris Operating System track. I am basing my study on Sun's web-based training curriculum, which I find generally outstanding as a substitute for instructor-led education. My beef is not with the vehicle, but with the curriculum.
As an example, I would estimate that one third of the training materials consumed my time with how to accomplish a task in the Solaris Management Console (SMC). SMC is an interesting idea which flew about as well as a snail tied to a brick. It's not all bad, but it's not all that useful. I don't mind the option to use a GUI, but the amount of time spent on it in the curriculum is ridiculous when considered against the amount of use SMC gets in the real world.
Is it good to know how to use SMC? Of course! Especially for its ability to manage local accounts (though it stinks for network information services like NIS+ or LDAP). But let's not worry about memorizing all of its menus and screens. One of UNIX's advantages is its ability to be managed remotely over a serial connection; I'd never hire a UNIX SA who couldn't do his job proficiently over a 9600-8-n-1 connection.
Here's another sore spot for me... One of Solaris 10's most incredible features is ZFS. I have not even begun to grasp the full effect it will have on the industry, and it's not just a series of commands to memorize; it's an entirely new way to manage storage. And yet, there is NO coverage of it on the Solaris 10 exam. Are you KIDDING me?
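For anyone who hasn't seen it, the core of that new model fits in a handful of commands, which is exactly the kind of conceptual shift an exam ought to probe. The pool and dataset names below are just examples of mine:

# One command replaces format, newfs, and vfstab editing.
zpool create tank mirror c1t0d0 c1t1d0
# Filesystems are cheap, share the pool's space, and carry their own properties.
zfs create tank/home
zfs set compression=on tank/home
zfs snapshot tank/home@before-change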
Thankfully zones are covered, and I'm told that the exam had a good number of related questions. However, the coverage isn't very deep, and sticks to the commands more than the theory. That's unfortunate because it's easy to look up a man page, but hard to design a well thought out consolidation platform. I'd say that sentence sums up my thoughts on certification strategies on many levels.
Resource management is another feature which seems conspicuously absent from the certification curriculum. Although it is very complex (aren't zones as well?), it is a very powerful Solaris feature which I believe is a competitive advantage for Sun. So why not expect a certified administrator to know how to use it? The idea isn't to make everyone feel good with a title on their business card; it is to demonstrate that someone has differentiated themselves through a defined level of skill.
What else would be important for an SA to have cursory knowledge of? DTrace, anyone? I don't expect every competent Solaris administrator to be able to write advanced D scripts, or to memorize the seemingly infinite number of probes available in Solaris, but for the love of McNealy, can't we at least expect them to know what kind of problem it solves? Can't we even establish what a probe is, and why Solaris is WAY ahead of Linux in that respect?
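Establishing what a probe is takes exactly one line. Here is a sketch of the sort of question I'd love to see on the exam; the probe name is standard DTrace, and the action body is my own example:

# The probe tuple is provider::function:name. This one fires on entry to
# the open(2) family of syscalls and prints who is opening what.
dtrace -n 'syscall::open*:entry { printf("%s opened %s", execname, copyinstr(arg0)); }'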
Finally, the emphasis on memorizing obscure command line options really grates on me. This is what really undermines the technical merit of Sun's Solaris certifications. There are so many commands and concepts that deserve coverage, it seems a shame to take up space with questions like, "What option must you give to ufsdump in order to ensure /etc/dumpdates is updated when using UFS snapshots?" I don't know anyone practicing in the field who wouldn't look that up in the man pages even if they THOUGHT they knew the option.
I could go on, but I think the point is made: the exam is lacking in strategic substance. As a result, most folks describe it as a memorization exercise. That's a shame, because the exam COULD be a differentiating ground for Solaris professionals as well as a way for Sun to ensure the compelling features of Solaris are being leveraged to their fullest. And yet, I'm getting ready to take my third SCSA exam. Why?
I maintain the currency of my Solaris Certifications because I believe a professional seeks to understand standards in their field, whether good or bad. As a professional Solaris system architect, the SCSA and SCNA exams are at the core of my practice whether I choose to follow or deviate from their content. I also believe that a certification tells my customers (or employers) that I demonstrate a certain level of competence, even if the bar is not as high as I would like to see it.
I believe deeply in the importance of standards and certifications as a vehicle for advancing the maturity of Systems Engineering practices as applied to system administration. And although Sun's certifications are not there yet, I will continue to support them for what they do provide, and for what I hope they will provide in the future: a vehicle to advance the maturity of the industry.
Part 2 of this article will discuss my recommendations for improving Sun's certifications. As usual, I have a few ideas up my sleeve. Stay tuned...
Monday, May 21, 2007
Who's on first? Identifying port utilization in Solaris
Setting up Apache2 on Solaris 10 is normally about as challenging as brushing your teeth. But in this case, I was humbled by an unexpected troubleshooting adventure. I needed to transfer a TWiki site from an Apache server running on Solaris 9 to an Apache2 server running on Solaris 10. Sounds pretty straightforward, but I abandoned discipline at one point in the game and that detour came back to bite me.
I started out carelessly, thinking that to keep things simple I would just use the legacy Apache. This would save any initial headaches with module incompatibilities (if any existed). So, I copied the config file into place and tried to start the daemons. It didn't work, and after a few minutes of fiddling with the new httpd.conf I changed course. My reasoning went something like this: "If I'm going to spend much time fiddling, I might as well fiddle with Apache2 and have something better than I started with." And so it began.
I stopped the legacy Apache daemons and followed a similar process with Apache2, ending with the same result: No daemons. I did some fiddling and located a minor typo I'd made in the configuration which is not of consequence to this story. I issued a "svcadm restart apache2" command. Yeeha! Now I had five httpd processes just chomping at the bit for a chance to serve those Wiki pages.
Or did I? It turned out that no matter what I did with my web browser remotely or locally I couldn't get a response. So, I tried a quick telnet to port 80 to see what there was to see... And of course I received a response, so all must be well. Somewhere in my troubleshooting process I made two mistakes:
First, I didn't remove the httpd.conf file from /etc/apache, which means the legacy Apache starts up and conflicts with Apache2 on a reboot. I've already written an article that goes into some detail about why the current legacy Apache's integration isn't ideal, so I won't expand on my frustration in this one. This problem was quickly solved, and could have been avoided if I had adhered to my Jedi training.
Second, I assumed that when I directed a Telnet session to port 80 it was reaching the Apache2 server. In fact, it was not. I shut down the Apache2 server and again issued the Telnet command to port 80. Surprise! The same greeting appeared. So, some process on the system had claimed port 80 before Apache could do so. Now... to find it!
Linux distributions typically ship with the lsof utility. This provides a quick and convenient way to identify what process is using what TCP port. Solaris doesn't have lsof in the integrated Open Source software (/usr/sfw) or the companion CD (/opt/sfw). It's not hard to obtain and compile, but it's just inconvenient enough that I'm inclined not to do it. My next logical question became, "what is the Solaris way to accomplish my goal?".
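For comparison, on a box that does have lsof the whole hunt is a one-liner; run it as root if you want to see other users' sockets:

# List every process holding an Internet socket on port 80.
lsof -i :80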
Solaris has no single native command to answer this question; you need a short shell script wrapped around the proc tools. There are a number of similar scripts available online through a quick Google search. None are particularly complex, but they are complex enough that you wouldn't want to rewrite one every time you need it. Here's what I ended up with:
#!/bin/sh
# Find the process(es) bound to a given port by walking every PID and
# checking its open sockets with pfiles.

# pfiles needs privileges to inspect other users' processes.
if [ `/usr/xpg4/bin/id -u` -ne 0 ]; then
    echo "ERROR: This script must run as root to access the pfiles command."
    exit 1
fi

# Take the port as an argument, or prompt for it.
if [ $# -eq 1 ]; then
    port=$1
else
    printf "which port?> "
    read port
    echo "Searching for processes using port $port..."
    echo
fi

# Walk every PID on the system (tail strips the PID header line) and
# report any process with a socket bound to the requested port.
for pid in `ps -ef -o pid | tail +2`
do
    foundport=`/usr/proc/bin/pfiles $pid 2>&1 | grep "sockname:" | egrep "port: $port$"`
    if [ "$foundport" != "" ]; then
        echo "proc: $pid, $foundport"
    fi
done
exit 0
When executed, it will produce output similar to the following. Note that it requires root permissions to traverse the proc directories...
cgh@testbox{tmp}$ sudo ./portpid 80
proc: 902, sockname: AF_INET 0.0.0.0 port: 80
sockname: AF_INET 192.168.1.4 port: 80
sockname: AF_INET 127.0.0.1 port: 80
A quick "ps -ef " command told be that our Citrix server was to blame for the port conflict...
cgh@testbox{tmp}$ ps -ef | nawk '$2 ~ /^902$/ {print $0}'
ctxsrvr 902 1 0 May 18 ? 7:00 /opt/CTXSmf/slib/ctxxmld
Ah ha! Problem solved. I'd like to see the Solaris engineering team add a "p" command, or an option to an existing command to make this functionality a standard part of Solaris. Another option would be to integrate the Linux syntax for the fuser command to make this possible.
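For reference, the Linux (psmisc) fuser syntax I have in mind looks like this; the Solaris fuser(1M) has no -n tcp equivalent today:

# Show the PIDs (and, with -v, owners and commands) holding TCP port 80.
fuser -v -n tcp 80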
Labels: apache, discipline, linux, Solaris 10, troubleshooting