Thursday, November 19, 2009

Solaris Patching Made Simple

Most data centers I've encountered tackled their patching strategy a long time ago. Some may have revisited it when Live Upgrade was introduced, but in general the process doesn't change much once it is created. Why? Patching isn't glorious and exciting. We tend to take it for granted when it works, and "deal with it" when it doesn't. I have to admit I have been guilty of not paying a lot of attention to the guts of Solaris patching for years because all the sites I've worked at had a process and I was busy doing other things. Until now, that is.

I'm currently tasked with designing an Enterprise patching strategy for Solaris servers. What started out as a project I considered pretty dry turned into something I'm really glad to have the opportunity to work on. Why? Because I'm excited about the approach Sun is recommending. I think a lot of the things I used to dislike about patching Sun systems are on their way out.

If you haven't already seen it, Sun's On-Line Learning Center has a new course: Solaris 10 Patching Best Practices (WS-2700-S10). It's free, so even in the current climate of slashed training budgets you can still learn the new way of approaching updates. You should be able to get through it in an average work day and still keep up with email.

For a long time, sites with more advanced Sun support have been able to leverage a patch baseline known as EIS, or Enterprise Installation Standards. However, if you didn't have some form of advanced interaction with Sun, or the xVM Operations Center (xVMOC), you didn't have regular access to EIS. That left you with maintenance updates/upgrades, recommended clusters, the SunAlert cluster, or the "Dim Sum" approach of grabbing an analysis off a current patchdiag.xref and installing the patchlist-du-jour. Which path is the right one?

Here's what you don't want to do: Research all of Sun's white papers and best practices that remain available well after growing long in the tooth. The patching strategies and recommendations are a snarled mess of contradictions that lead to confusion, frustration, and eventually rolling your own because it's better than nothing. The good news is that Sun's new training course brings some sanity to the table.

The high-level recommendation from Sun is very straightforward. Start with the patch/package utilities updates from SunSolve to ensure your patching system is not going to introduce problems. Then install either the latest maintenance upgrade (ideally), or the latest maintenance patch set. This gives you a clean and well-integrated baseline. Next, apply the SunAlert recommended cluster to attack any critical fixes that have become necessary since the last maintenance release. The training course implies that Sun plans to merge the Recommended and SunAlert clusters to reduce confusion - another great improvement.
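
To make the ordering concrete, here is a minimal Python sketch of the three steps as I read them. The names (Step, follows_recommended_order) are my own invention for illustration and are not part of any Sun tool; the only thing taken from the course is the order itself.

```python
from enum import IntEnum


class Step(IntEnum):
    """Patching steps in the order recommended above (names are hypothetical)."""
    PATCH_UTILITIES = 1       # patch/package utilities updates from SunSolve
    MAINTENANCE_BASELINE = 2  # latest maintenance upgrade, or maintenance patch set
    SUNALERT_CLUSTER = 3      # SunAlert recommended cluster


def follows_recommended_order(steps):
    """Return True if each step appears exactly once, in the recommended order."""
    return list(steps) == sorted(Step)


if __name__ == "__main__":
    plan = [Step.PATCH_UTILITIES, Step.MAINTENANCE_BASELINE, Step.SUNALERT_CLUSTER]
    print(follows_recommended_order(plan))
```

A check like this could sit at the front of a site's patch wrapper script so that nobody applies the SunAlert cluster on top of stale patch utilities.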

What's great about this approach? First, it's simple. I can grab a few clusters and put together an easy to understand, easy to implement, repeatable process. Second, I'm a huge fan of the use of baselines. By minimizing the use of one-off patches we move to grabbing a baseline which includes the required fix. This means that while I'm introducing more change, I'm introducing a set of changes that went through QA at Sun. That doesn't remove my testing responsibility, but it means I'm standing on the shoulders of giants rather than hoping for the best. Even if I have a phenomenal test suite, it's not going to be as mature or comprehensive as Sun's internal processes. Third, my environment is going to be more consistent. Why? Because all the Solaris 10 servers will eventually end up on the same MU. Today I have similar patch levels on a wild assortment of MUs.

While there's a lot more to the training content, the other big point made throughout is that you need to use Live Upgrade. It's not just a feature you may want to try. It's how you should be patching Sun systems. The catch, of course, is that not all systems are configured in a way that lends itself to LU. But the writing is on the wall, and my interpretation tells me I need to (1) start updating our site's reference architectures to move toward being LU-friendly, and (2) begin using LU on those systems which will support it conveniently so we start building site knowledge.
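
For anyone who hasn't patched with LU before, the basic flow is roughly: create an alternate boot environment, patch it while the live system keeps running, activate it, and reboot. Below is a small Python sketch that only builds that command sequence as a list so it can be reviewed before anything runs. The function name, the boot environment name, the patch directory, and the patch ID in the example are all placeholders of mine. I'm also assuming a ZFS root, where a bare lucreate -n is enough; on UFS you would need to add -m options to tell lucreate where the new boot environment's file systems live.

```python
def live_upgrade_patch_plan(new_be, patch_dir, patch_ids):
    """Build (but do not run) the Live Upgrade commands for a patch cycle.

    new_be    -- name for the alternate boot environment (placeholder)
    patch_dir -- directory holding the unpacked patches (placeholder)
    patch_ids -- patch IDs to apply to the alternate boot environment
    """
    if not patch_ids:
        raise ValueError("at least one patch ID is required")
    return [
        # Clone the running boot environment (assumes ZFS root; UFS needs -m).
        ["lucreate", "-n", new_be],
        # Apply the patches to the inactive boot environment.
        ["luupgrade", "-t", "-n", new_be, "-s", patch_dir, *patch_ids],
        # Mark the patched boot environment as the one to boot next.
        ["luactivate", new_be],
        # Reboot with init rather than reboot so the activation is honored.
        ["init", "6"],
    ]


if __name__ == "__main__":
    # Example values are made up purely for illustration.
    for cmd in live_upgrade_patch_plan("patched_be", "/var/tmp/patches", ["123456-01"]):
        print(" ".join(cmd))
```

Printing the plan rather than executing it is deliberate: while a site is still building LU experience, having a human read the exact commands first seems like the safer default.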