Wednesday, December 23, 2009

Free the Support Tools Bundle!

If you aren't already familiar with the Support Tools Bundle, you probably ought to check it out. It contains many very useful tools, at least one of which you absolutely need if you support more than one Solaris server.

I consider many of these tools to be critical components of our current Solaris architecture. As such, updating the tools is a part of our regular patch process. The tools are also integrated in our JumpStart JET templates. And herein lies my frustration.

You can only get the support tools as a bundle. If I want to get the latest SNEEP, I need to download the whole bundle. It's only ~ 40MB, so I can live with that given today's bandwidth. Unfortunately, when you unzip the shiny new file you are faced with something I consider a monstrosity. A shell archive. Why?

The next design flaw we encounter is the extraction method. The shell script exits unless you run it as root. If all I want to do is extract files, why should I be root? This undermines the principle of least privilege if I just need to put files in my home directory, or /var/tmp.

So let's say we recklessly assume the role of root and execute the shell archive. We are presented with a choice to install or extract the files. Hopefully you want those files in /var/tmp/stb, because that's your only choice. Again I ask: why? Is there some flaw in using gzipped tarballs? I'm not a big fan of using zip, but it accomplishes a similar goal and would be acceptable.

How about a simple plan? Use a gzipped tarball that extracts to one directory per product, with an installer at the top level. That way I can just extract it and get the product updates into my JET server without having to go through an extra step. If you are skilled enough to know why you need the tools in STB, you can handle a tar.gz file. UNIX has survived the test of time by leveraging simplicity and standards. When we get too fancy we undermine the platform's greatest strengths.
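
To make that concrete, here's roughly what the workflow could look like (the file and directory names here are purely illustrative, not Sun's actual layout):

# download once, extract anywhere, as an ordinary user
gzcat stb_latest.tar.gz | tar xf -
ls
install.sh   sneep/   explorer/   ...
# drop the piece I actually wanted straight into the JET media tree
cp -r sneep /export/install/pkgs/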

As with any feature (and the use of a shell archive is indeed a feature), we should ask the question: what is the value of this extra complexity? I would suggest the answer is "none". Let's whack it and get back to standards, Sun.

Recommendations for you!


I just have to ask the question... Is anyone else who spends a LOT of time on Sun's web sites getting ready to rip out their fingernails when the site pops up the "Recommendations for you" box and forces you to close it? I'm a paying customer with a valid contract. I don't need to be treated like a mass marketing target.

The Internet provides great new business opportunities, and wild possibilities for creative marketing. But let's hope the personal customer relationships that were once so important haven't been replaced by marketing shotguns designed to bug 1,000 customers as long as one or two of them click on the shiny links.

Thursday, November 19, 2009

Solaris Patching Made Simple

Most data centers I've encountered tackled their patching strategy a long time ago. Some may have revisited it when Live Upgrade was introduced, but in general the process doesn't change much once it is created. Why? Patching isn't glorious and exciting. We tend to take it for granted when it works, and "deal with it" when it doesn't. I have to admit I have been guilty of not paying a lot of attention to the guts of Solaris patching for years because all the sites I've worked at had a process and I was busy doing other things. Until now, that is.

I'm currently tasked with designing an Enterprise patching strategy for Solaris servers. What started out as a project I considered pretty dry turned into something I'm really glad to have the opportunity to work on. Why? Because I'm excited about the approach Sun is recommending. I think a lot of the things I used to dislike about patching Sun systems are on their way out.

If you haven't already seen it, Sun's On-Line Learning Center has a new course: Solaris 10 Patching Best Practices (WS-2700-S10). It's free, so even in the current climate of slashed training budgets you can still learn the new way of approaching updates. You should be able to get through it in an average work day and still keep up with email.

For a long time, sites with more advanced Sun support have been able to leverage a patch baseline known as EIS, or Enterprise Installation Standards. However, if you don't have some form of advanced interaction with Sun, or the xVM Operations Center (xVMOC), you don't have regular access to EIS. That leaves you with maintenance updates/upgrades, the Recommended cluster, the SunAlert cluster, or the "Dim Sum" approach of grabbing an analysis off a current patchdiag.xref and installing the patch-list-du-jour. Which path is the right one?

Here's what you don't want to do: research all of Sun's white papers and best-practice documents, many of which remain available long after they've grown long in the tooth. The patching strategies and recommendations are a snarled mess of contradictions that lead to confusion, frustration, and eventually rolling your own because it's better than nothing. The good news is that Sun's new training course brings some sanity to the table.

The high level recommendation from Sun is very straightforward. Start with the patch/package utilities updates from SunSolve to ensure your patching system is not going to introduce problems. Then install either the latest maintenance upgrade (ideally), or the latest maintenance patch set. This gives you a clean and well integrated baseline. Next, apply the SunAlert recommended cluster to attack any critical fixes that have become necessary since the last maintenance release. The training course implies that Sun plans to merge the Recommended and SunAlert clusters to reduce confusion - another great improvement.

What's great about this approach? First, it's simple. I can grab a few clusters and put together an easy to understand, easy to implement, repeatable process. Second, I'm a huge fan of the use of baselines. By minimizing the use of one-off patches we move to grabbing a baseline which includes the required fix. This means that while I'm introducing more change, I'm introducing a set of changes that went through QA at Sun. That doesn't remove my testing responsibility, but it means I'm standing on the shoulders of giants rather than hoping for the best. Even if I have a phenomenal test suite, it's not going to be as mature or comprehensive as Sun's internal processes. Third, my environment is going to be more consistent. Why? Because all the Solaris 10 servers will eventually end up on the same MU. Today I have similar patch levels on a wild assortment of MUs.

While there's a lot more to the training content, the other big point made throughout is that you need to use Live Upgrade. It's not just a feature you may want to try. It's how you should be patching Sun systems. The catch, of course, is that not all systems are configured in a way that lends itself to LU. But the writing is on the wall, and my interpretation tells me I need to start (1) updating our site's reference architectures to be LU-friendly, and (2) using LU on those systems that will support it conveniently so we start building site knowledge.
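
To give a flavor of what that looks like in practice, here's a minimal Live Upgrade sketch (the BE name and cluster path are illustrative, on a UFS root lucreate also needs -m arguments, and the cluster README remains the authoritative word on the exact luupgrade invocation):

lucreate -n s10-patched                          # clone the running boot environment
luupgrade -t -n s10-patched -s /var/tmp/10_Recommended \
    `cat /var/tmp/10_Recommended/patch_order`    # patch the inactive BE
luactivate s10-patched                           # make the patched BE active on next boot
init 6                                           # reboot into it; luactivate the old BE to fall back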

Tuesday, September 29, 2009

xVM OpsCenter and overbundling

I've been spending a fair amount of time assessing the patching strategy on my current assignment. My primary focus is on Solaris systems, although there is a Linux population to take care of as well. My recommendation has always been to stick with vendor recommended solutions when it comes to patching because in the Enterprise it's a lot more complicated than clicking on Windows update and hoping for the best.

With that in mind, I browsed over to Sun.com to see what the latest recommendation is. xVM OpsCenter pops out in neon lights. It will even wash my dishes. For what it's capable of, I think it's possible to make an argument that the price is tolerable. Unfortunately, if you are a practitioner of Solaris and need a patching solution you may not need your dishes washed. Then what?

If you aren't going to need full blown provisioning, monitoring, audit, and other cool features you are left with precious little in the way of keeping up on what I call "oil changes". Most of the historical tools are now on their death beds, no doubt to encourage the herd to graze on xVM. Note that I'm only talking about Enterprise level patching which requires some degree of configuration management.

When you dig into xVM you see that there are two options. The basic option does very little that most sites don't already do, although it's wrapped in a nice package. I don't think it's doing anything worth the price of admission at that level though. The advanced package adds what everyone wants: patching. So, you can buy your car with or without tires.

I think this is a bad idea.

Patching is a vital component of the customer experience. It's a way to ensure that Sun doesn't have a CNN moment because a major bug was too difficult to patch and a highly visible site didn't get the hole plugged in time. It's also the bane of most admins' existence. It takes a lot of time, causes our customers to suffer downtime, and occasionally takes a server to the happy hunting grounds. To be the best operating system, you need to have a great update strategy.

I have no problem with the xVM framework being an expensive Cadillac, as long as I can still buy a Chevy that does the job. In other words, as long as the Solaris operating environment includes a decent functional framework for patching, then charge all you want for xVM. Today, even with a support contract I don't have access to a proper patching framework from Sun, which means all those third party solutions start getting traction on something that ought to come from Sun.

A basic level of functionality should be part of the environment, so what would the base requirements be? Call it xVM-lite, or call it part of Solaris. Either way, here's a stab at it:

- An on-site proxy option so all hosts don't talk directly to SunSolve. Why not include it in Solaris? This would save Sun bandwidth costs and probably help them to sell some storage.

- Integration with Explorer. Wouldn't it be nice to use that same patching server as the site's Explorer repository for pre-planning patching sessions? We're talking trivial shell scripting here.

- Ability to leverage SunSolve baselines for the SunAlert, Security, and Recommended bundles, as well as to manage site-specific custom patch lists (see the sketch after this list for the flavor I have in mind).

- Basic auditing of who patched what, and when.

- No GUI necessary. Just a well thought out command line.
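
As a rough ksh sketch of the patch list and audit pieces (the list location and format are my own invention, and in real life you'd wrap the loop in ssh for remote hosts):

#!/bin/ksh
# Compare a site-specific patch list against what this host reports as installed.
PATCHLIST=/export/patches/lists/patch_order_5.10_2009q4
for p in `awk '{print $1}' $PATCHLIST`; do
    if showrev -p | grep "Patch: $p " > /dev/null; then
        echo "installed: $p"
    else
        echo "MISSING:   $p"
    fi
done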

What's the precedent? Look at JET. Sun will offer you xVM if you want an Easy Button solution in a GUI, or you can use the JET framework. Personally, I prefer JET. It has nothing to do with the price... I just believe it's a well thought out, very reliable design. What I appreciate most is that when it comes to provisioning I have a choice, and as part of Solaris there is an included option that gets the job done.

Including patching functionality for customers with valid SunSolve entitlements would be a huge improvement in Solaris' usability. Forcing us to buy a 12 course meal when we only need lunch feels like something that happens when you let a marketing department without industry experience make key decisions.

Monday, September 14, 2009

Default routes on 7x00 series Open Storage

I've been having a very enlightening time with our new 7310 Open Storage array. Because it's a totally new product, and one that hasn't yet reached ubiquity, the normal resources are a bit shy of what I'm used to. Put simply, Google hasn't yet learned how to manage these arrays.

We're in the process of deploying a reasonably complex network scenario on ours using two link aggregations, then layering tagged VLANs for administrative access and the dedicated storage net. Each VLAN is to be redundant via IP Multi-pathing (IPMP). This configuration is just about the only option for high capacity and redundancy when you have multiple VLANs involved.

The good news is, Sun's Open Storage, or Fishworks, has a very well designed command line interface. It's quite comprehensive, and from what I can see, it allows you to lose the GUI and still have a workable device. Which is good, because I managed to decapitate the GUI, or BUI (browser user interface, as Sun calls it).

The kiss of death for the BUI came when I attempted to replace a simple datalink on nge0 with an aggregation of nge0 and nge1. In doing so the default route was removed and not replaced. No problem on the dedicated storage VLAN because it was a non-routed private subnet. Big problem on the public side where I was trying to find the BUI.

It turns out to be a simple problem to fix, but the fix itself is not very intuitive. Because the BUI is dead, you have no choice but to use the CLI. For this reason alone, I strongly encourage anyone using 7x00 series storage to make sure that EVERYTHING you implement in the BUI has an equivalent process via CLI. You never know when you'll need it.

After logging in to the CLI, head over to configuration services routing. What you'll probably see is a bunch of routes for each interface, but no default route. To add the default route and reanimate the BUI you will need to create a route as follows:

7310array:configuration services routing > create
7310array:configuration services routing > set family=IPv4
7310array:configuration services routing > set destination=0.0.0.0
7310array:configuration services routing > set mask=0
7310array:configuration services routing > set gateway=192.168.1.1
7310array:configuration services routing > set interface=ipmp1
7310array:configuration services routing > commit

Note, of course, that you'll need to plug in the appropriate gateway and device according to your configuration.
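
For comparison, on a plain Solaris host the same change is the familiar:

route add default 192.168.1.1               # takes effect immediately, not persistent
echo "192.168.1.1" > /etc/defaultrouter     # picked up at the next boot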

If you are used to adding a default route in Solaris, it isn't all that intuitive to type in 0.0.0.0/0, and it sure as heck wasn't documented anywhere I could find. All's well that ends well though; the change immediately brought back my BUI.

7310array:configuration services routing> show
Properties:
<status> = online

Routes:

ROUTE DESTINATION GATEWAY INTERFACE TYPE
route-000 0.0.0.0/0 192.168.1.1 ipmp1 static
route-001 10.151.1.0/24 10.1.1.46 ipmp2 dynamic
route-002 10.151.1.0/24 10.1.1.47 ipmp2 dynamic
route-003 13.151.249.0/24 192.168.1.46 ipmp1 dynamic
route-004 13.151.249.0/24 192.168.1.47 ipmp1 dynamic

7310array:configuration services routing>


All is at balance in the universe. My work here is done.

Monday, June 29, 2009

A farewell to Solaris 9... Already?

My flock still has a large Solaris 9 community within it. It's hard to believe it's already time to start the long march to EOSL, but alas, the announcement is clear, as is the Solaris 9 Transition FAQ. The bell is ringing.

Looking back at other Solaris EOLs, I always recall thinking that the outgoing revision had really grown long in the tooth and the replacement OS was badly needed. In this case, Solaris 10 has a long list of what I consider "dreams come true" to make you want to upgrade. However, I have a lot of experience watching Solaris 9 boxes take some incredible abuse and keep ticking. In my mind, it may have fewer bells and whistles, but it really did its job well.

So let's raise a glass of Solaris and toast to the legacy of 5.9, and to the enterprise evolution that is 5.10 and OpenSolaris. Cheers!

Sunday, June 21, 2009

OpenSolaris on the ThinkPad

After a long run of just dealing with Windows on my personal laptop I have finally managed to get OpenSolaris running on it. I've had a continuous hassle with my old WiFi card, which seemed to be truly happy only under Windows. After a few years of that I took a chance on a new card from eBay and found that it... WORKED!

I started out trying the latest Ubuntu desktop, which has a great library of packages available for it and fantastic integration. Unfortunately, its driver configuration seemed to work, then sent my wifi into a coma after some period of time. Didn't diagnose it. Didn't care to. My laptop isn't a science project for me; it's a tool I want to just work when I dump a new OS onto its disk.

Next stop was the one I was more excited about: OpenSolaris.

root@saphyra:~# uname -a
SunOS saphyra 5.11 snv_111b i86pc i386 i86pc Solaris
root@saphyra:~# wificonfig showstatus
linkstatus: connected
active profile: none
essid: <>
bssid: <>
encryption: wep
signal strength: medium(10)
root@saphyra:~#

Yes, that's right, it's all working. At the moment I'm able to work on a zone / LDAP project from the comfort of my couch enjoying my reborn Thinkpad T23. This thing works like a charm despite being a dinosaur by modern standards.

Before rebuilding it I had pretty much stopped using it because Windows XP was unable to boot in under 5 minutes, and it took almost as long to launch an Acrobat Reader session for the simple PDF stories I was reading. At the moment everything I do, including web browsing, works well and is responsive with only 1 GB of RAM and a 1.1 GHz processor. Sweet.

Wednesday, June 03, 2009

Solaris Web Console on Windows... Ouch.

I've been spending quite a bit of time lately running the Sun Directory Service Control Center (DSCC) via the Solaris Web Console (port 6789). When I first started the project I was running Firefox on a Sun workstation. Everything was snappy, the engineer was happy.

Somehow along the way I started using my Windows box to access the console. Still running Firefox, I discovered unbelievable slowness. It takes about three full minutes to process the initial login. Once I'm in DSCC everything runs acceptably, but that first login is murder.

One of my co-workers stopped by my cube today and suggested I try Internet Explorer. Perish the thought! How could that bloated pig possibly out-perform my Firefox browser? OK, I tried it. He was right.

Internet Explorer provides almost instantaneous response to Webconsole logins while Firefox churns its butter for three minutes. This isn't some dot-net application that's clearly Microsoft slanted. It's a Sun web application. Open stuff that would never have a Microsoft bias. I'm not running dead hardware either; this is on a sweet Core Duo at 1.83 GHz with 1 GB RAM. Handling an initial login to Webconsole ought to be cake for this hardware.

My observations are based on stock out-of-the-box configurations, so I'm sure there's some Firefox flag to tweak which will optimize it. It just seems mind-boggling that a Sun Microsystems web application would perform dramatically better on Internet Explorer and unacceptably slowly on Firefox.

Me? I'm going back to running the browser on my UNIX box. It's way too frustrating trying to be a UNIX Engineer via the Windows platform.

Thursday, May 21, 2009

Adding UNIX users to DS6


I seem to be digging up rants this week. I'm a pretty positive guy; you just wouldn't know it by reading my blog lately. I'm currently working on deploying a fresh Sun Directory Server environment using version 6.3.1. This is to replace an aging 5.2 environment that's ready to retire. Overall I've been very impressed with how much more mature and polished the new version is. A few learning curves to get through, but once I found the right way I was pleased with the product. Unfortunately, today I hit something that just can't be right. Worse, it seems to be confirmed by a bunch of Google hits, so I'm not the only one.

When you use Directory Services Control Center (DSCC) to add a user it doesn't provide any of the POSIX fields you need from the posixAccount class. So, your new users pretty much have a user name and a first / last name. No home directory, no user ID, no group ID, and hey... You didn't need a shell did you? Are you kidding me?

The workaround, and I use the term loosely, appears to be adding the record without the necessary information, then editing the record after it is created. You then switch the record to "text mode" and manually insert the following lines into the editable section:

objectclass: posixAccount
loginshell: /bin/ksh
homeDirectory: /home/username
uidNumber: 1234
gidNumber: 10
gecos: John Smith


Ok, so that gets us an account, but isn't it moderately annoying to have to go through all that? Why in the name of Scott McNealy didn't anyone make the wild and unruly assumption that once in a freakishly rare moon someone might use DSEE to centralize the administration of their Solaris users? After all, NIS and NIS+ are deprecated and no one digs local file editing. So, wouldn't that assumption have been somewhere around the top ten of their user requirements?

I did a quick dig to see if I could find a simple configuration file that specifies which schema object(s) are used when adding a user, or what populates the "common objects" menu, but came up dry. I'll have to do a deeper search when time allows. I know it's sitting in some XML file somewhere, but there's more than a few to look through.

So what are my options? Well, there's always the LDIF plan. Which is pretty much useless to the folks who typically manage user account maintenance. Way too error-prone. It's also pretty aggravating for day to day administration. LDIF is pretty much intended for batch loading and sitting behind various automations. I shouldn't need to write an automation solution to add simple UNIX accounts since that capability was standard in the 5.x Directory Servers.
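
For completeness, here's roughly what the LDIF route looks like (the DN, values, and bind credentials are illustrative, and yes, the password belongs in a file rather than on the command line):

$ cat jsmith.ldif
dn: uid=jsmith,ou=people,dc=example,dc=com
objectClass: top
objectClass: person
objectClass: posixAccount
uid: jsmith
cn: John Smith
sn: Smith
uidNumber: 1234
gidNumber: 10
homeDirectory: /home/jsmith
loginShell: /bin/ksh
gecos: John Smith

$ ldapadd -h ldap01 -p 389 -D "cn=Directory Manager" -w dmpassword -f jsmith.ldif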

Another option is to use Sun's Directory Editor which is part of DSEE. This path leads to some entertainment as well. If you try to download DSE, the web form will not let you select a platform, and thus prevents you from downloading the component. So, you need to download the ZIP distribution of DSEE instead. Then you just need to deploy Sun's Application Server, or Tomcat. Yeah, just what I needed - another component. Doesn't webconsole already sit on an app server? The best part is, DSE is left over from the 2005Q1 JES distribution from what I can see. Obviously, not a high priority for maintenance. Very encouraging indeed.

So, while Sun's Directory Server continues to be a phenomenal data repository, it appears that Sun views its user base as application / identity developers rather than the legions of system administrators / engineers out there trying to implement a well supported central management strategy. Come on guys and gals, it's not that hard to make us happy. Lose the web 2.0 bling and give us core functionality. Hmm, then add the bling back in! The DSCC interface really is very nice, but what good is a hot car without a steering wheel?

Tuesday, May 19, 2009

ldaplist: Why so much white space?

Sometimes little things drive me nuts. So nuts, it's almost tempting to get into some code and make it right. Of course, that would have absolutely no return on investment for a significant amount of hassle, but I have to admit I think about it from time to time. What has rubbed me the wrong way?

The complete lack of either [1] aesthetic engineering, or [2] experience with traditional 80x24 console screens, on the part of the developers of the ldaplist utility. It's as if someone had just finished a grade school term paper when they wrote the output format. Here's the default output:

testbox# ldaplist
dn: cn=Directory Administrators, dc=example,dc=com

dn: cn=nsAccountInactivationTmp,dc=example,dc=com

dn: ou=Timezone,dc=example,dc=com

dn: automountMapName=auto_home,dc=example,dc=com

dn: automountMapName=auto_direct,dc=example,dc=com

dn: automountMapName=auto_master,dc=example,dc=com

dn: ou=projects,dc=example,dc=com

dn: ou=group-ldap,dc=example,dc=com

dn: automountMapName=auto_shared,dc=example,dc=com

dn: ou=SolarisAuthAttr,dc=example,dc=com

dn: ou=SolarisProfAttr,dc=example,dc=com

dn: ou=people,dc=example,dc=com

dn: ou=group,dc=example,dc=com

dn: ou=rpc,dc=example,dc=com

dn: ou=protocols,dc=example,dc=com

dn: ou=networks,dc=example,dc=com

dn: ou=netgroup,dc=example,dc=com

dn: ou=printers,dc=example,dc=com

dn: ou=hosts,dc=example,dc=com

dn: ou=services,dc=example,dc=com

dn: ou=ethers,dc=example,dc=com

dn: ou=profile,dc=example,dc=com

dn: ou=aliases,dc=example,dc=com


Forty-seven lines? That takes up WAY too many lines and provides no value for the white space incurred, not to mention requiring me to scroll my terminal window when I'm on the console. This actually annoys me enough that I run the command this way:

testbox# ldaplist | sed '/^$/d'

dn: cn=Directory Administrators, dc=example,dc=com
dn: cn=nsAccountInactivationTmp,dc=example,dc=com
dn: ou=Timezone,dc=example,dc=com
dn: automountMapName=auto_home,dc=example,dc=com
dn: automountMapName=auto_direct,dc=example,dc=com
dn: automountMapName=auto_master,dc=example,dc=com
dn: ou=projects,dc=example,dc=com
dn: ou=group-ldap,dc=example,dc=com
dn: automountMapName=auto_shared,dc=example,dc=com
dn: ou=SolarisAuthAttr,dc=example,dc=com
dn: ou=SolarisProfAttr,dc=example,dc=com
dn: ou=people,dc=example,dc=com
dn: ou=group,dc=example,dc=com
dn: ou=rpc,dc=example,dc=com
dn: ou=protocols,dc=example,dc=com
dn: ou=networks,dc=example,dc=com
dn: ou=netgroup,dc=example,dc=com
dn: ou=printers,dc=example,dc=com
dn: ou=hosts,dc=example,dc=com
dn: ou=services,dc=example,dc=com
dn: ou=ethers,dc=example,dc=com
dn: ou=profile,dc=example,dc=com
dn: ou=aliases,dc=example,dc=com


Ahhh, that's better. And at 1/2 the screen real estate I rarely need to scroll. Come on, what on Earth would motivate someone to add extra newlines to an output like this? Next thing you know they'll offer CSS templates so your output can have the right "user experience" complete with standard fonts.
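
If the extra typing gets old, a one-line ksh alias (the name is mine, pick your own) in your profile takes care of it:

alias ldl='ldaplist | sed "/^$/d"'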

Ok, I feel better now... Really. I'm ok.

Tuesday, May 05, 2009

#@$@#$# Spammers

Gotta love those spammers. I wish I could talk the way they write - it would be entertaining at a party to sound like one of the Cylon Hybrids. I'm going to go out on a limb and assume there are at least a few sci-fi fans out there reading a blog like this one.

The real purpose of this post is to apologetically announce that I've turned on comment moderation to keep everything clean after a long wave of spammers hit me. I'm not big on censoring, so rest assured that if you post a rational comment I will be happy to release it and continue encouraging dialog.

JET and the Recommended Cluster

JET is bugging me. I'm a bit of a pack rat when it comes to installation media, and that extends to patch sets. Hey, you never know, you might get a request to Jumpstart Solaris 2.4, right? Ok, I'm not really that bad.

But you may well be using a certain recommended cluster for a certain OS, and then need to jump a box to test the next recommended cluster, right? Surely it shouldn't be necessary to make a global change to your production server build configuration just to implement a test cluster.

As far as I can tell, base_config's use of recommended clusters is not handled in a manner that encourages good revision management. For each major OS revision (e.g., 10, 9, 8) there can be only one cluster. For example, in today's JET software, if we were using /export/install as our JET media base there would be a directory called /export/install/patches. Under that we can store one patch cluster for each major OS revision:

/export/install/patches/10_Recommended
/export/install/patches/9_Recommended
/export/install/patches/8_Recommended


That works nicely until the next recommended cluster is released. At that point you can no longer have a repeatable build process, because you need to replace the single instance for each OS with the new cluster. Not a good plan. We want the patch configuration to live in the template so that it's managed and can be under source code control. Managing patch configurations outside the template is pretty much impossible to audit.

Here's an alternative approach I think would be a step in the right direction: Create a hierarchy to organize recommended clusters:

/export/install/patches/recommended/5.10/sparc/2009-04-22
/export/install/patches/recommended/5.10/sparc/2009-01-foo
/export/install/patches/recommended/5.10/x86
/export/install/patches/recommended/5.9/sparc/2009-04-22
/export/install/patches/recommended/5.9/x86/2009-01-foo


We need to be able to add recommended clusters in the same way we add other products. I'd like to see a new command called "list_recommended_clusters" which would have an output something like this:

# list_recommended_clusters
Version Location
------ ---------------
5.10_sparc_200901 /export/install/patches/recommended/sparc/5.10_sparc_200901
5.10_sparc_200902 /export/install/patches/recommended/sparc/5.10_sparc_200902
5.10_sparc_200903 /export/install/patches/recommended/sparc/5.10_sparc_200903
5.9_sparc_200901 /export/install/patches/recommended/sparc/5.9_sparc_200901
5.9_sparc_200902 /export/install/patches/recommended/sparc/5.9_sparc_200902
5.9_sparc_200903 /export/install/patches/recommended/sparc/5.9_sparc_200903


These clusters could then be specified in the JET template using a variable like base_config_recommended_cluster. In addition, the check routine used during a make_client invocation would ensure that the directory exists, and perhaps ensure that each patch on the patch_list was represented. Bingo! Now we can use good revision control to manage the integration of patch clusters with our server build process.
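
For illustration, a template would then need nothing more than a line like this (the variable is the one proposed above; it does not exist in JET today):

base_config_recommended_cluster="5.10_sparc_200902"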

But I think we can take it one step further. How about we add the ability to include arbitrary patch sets? Here's a first cut at how it could work: We start by creating a patch repository, say /export/install/patch_repo. Under that directory we may have subdirectories for 5.10, 5.9, etc. Patches are simply added to that directory by copying them into place. Nothing fancy. The nice thing about this approach is its economy of space.

The recommended clusters will have a lot of overlap between them, with the potential for storing the same patch in many different directories. By having one patch repository, we simply store each necessary patch once and refer to it in a patch_order file. It would be trivial to write a few scripts that operate on or query a set of patches according to a given patch list, or cull out patches not referenced in any current patch list. I could take or leave this feature. There are some good arguments to be made for just storing each patch set and ignoring the storage space. I'm ok with either approach, and even happier if this flexibility were accounted for.
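
Here's a rough ksh sketch of the culling idea (the repository and list paths are illustrative):

#!/bin/ksh
# Report patches sitting in the repository that no current patch_order list references.
REPO=/export/install/patch_repo/5.10
LISTS=/export/install/patch_lists/patch_order_5.10_*
for p in `ls $REPO`; do
    grep "^$p" $LISTS > /dev/null 2>&1 || echo "unreferenced: $p"
done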

Having established a patch repository, we now need a place to manage patch lists. These would be typical patch_order-formatted lists; no need to reinvent the wheel. Each would need to be named with a unique identifier. For example,

patch_order_5.10_2009q1
patch_order_5.9_dev-servers
patch_order_5.10_test01


These patch lists could then be specified within base_config as an alternative to using the Recommended clusters. Why?

  • The site has a known incompatibility with a patch or two in the common cluster.

  • The site wants to deploy other patches in the early part of the install as part of a managed list rather than manual entries in a template (e.g., custom_patches).

  • Using these lists allows a configuration to be frozen in time for configuration management, and provides a convenient record of exactly what a server was deployed with.



I think these would be some very beneficial enhancements to the JET framework. I'd like to work on some of them, but I wanted to get the idea out there before I got wrapped up in something else and forgot about it. I'd be interested in hearing any thoughts on this topic - especially if someone has a better idea!

By the way, I do know about EIS baselines. But I think it's pretty rare for any enterprise to never have need for managing custom patch sets. It would be great if JET could come through with some help in this space.

Monday, March 30, 2009

Finding those pesky HBA cards

I was given a mission yesterday of finding out how many host bus adapter (HBA) cards were in a set of servers. At first glance it seemed like an easy task, but then I remembered that Solaris servers have never had a nice convenient output to tell us what card is in what slot in a way that normal humans could benefit from. It's sort of like playing charades; you have to put together a bunch of clues. Here's how I went about it.

The first place I stopped was prtdiag. That's my go-to configuration summary in most cases. Here's a subset of what I saw (probably going to look bad unless your browser is really stretched...):

FRU Name   IO Type  Port ID  Bus Side  Slot  Bus Freq MHz  Max Bus Freq  Dev,Func  State  Name  Model
---------- ---- ---- ---- ---- ---- ---- ---- ----- -------------------------------- ----------------------
/N0/IB6/P1 PCI 25 B 4 100 100 1,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI 25 B 4 100 100 1,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI 25 A 6 100 100 2,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI 25 A 6 100 100 2,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI 27 B 4 100 100 1,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI 27 B 4 100 100 1,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI 27 A 6 100 100 2,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI 27 A 6 100 100 2,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0 PCI 28 A 3 100 100 1,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0 PCI 28 A 3 100 100 1,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1 PCI 29 B 4 100 100 1,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1 PCI 29 B 4 100 100 1,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1 PCI 29 A 6 100 100 2,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1 PCI 29 A 6 100 100 2,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI 31 B 4 100 100 1,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI 31 B 4 100 100 1,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI 31 A 6 100 100 2,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI 31 A 6 100 100 2,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462


Of course, there were a great many other lines, but this is what the Fibre Channel card lines look like. I picked these out because I recognized the qlc driver; I'm not sure what someone would do if they didn't know that. In this case, there were 18 lines with this output, which indicates there are 9 cards because each slot is represented twice (two ports on each device). That squared with my being reasonably sure we had dual-ported cards in this server.
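
Rather than counting lines by hand, a quick pipeline gets to the card count directly (the model string comes from the output above, and the awk fields assume prtdiag's FRU Name and Slot columns as shown):

prtdiag | grep QLA2462 | awk '{print $1, $5}' | sort -u | wc -l

Counting unique FRU/slot pairs instead of raw lines collapses the two ports per card, which on this box should come back as 9.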

The next place I looked for confirmation was prtconf. This output tends to be more complete, but far more verbose, and generally annoying to get summaries from. To be more precise, the output contains a lot of information...

foobox: prtconf -v | wc -l
9358


That was a complete moment of frustration. The output was too busy and didn't look helpful. Note to self: Why is this not simple? I'm looking for a simple answer, not an excuse to write a Nawk script. No matter how I skinned the output I ended up with 18 matching lines. I'm right back at the prtdiag output.

My last stop was a more obscure one, but a tool which is very helpful: prtpicl. Ok, I'll admit, this one is still ugly.

foobox: prtpicl -v | wc -l
11183


But, at this point I just wanted to get it done, so I dug in a little bit and checked out what it had to say. The easily parsed format provides a convenient Vendor ID and Device ID for each connected device. That's good news because those PCI IDs are easy to look up on the Internet. Knowing our site standards I was able to identify the Vendor ID of the cards we order and look for them:

foobox: prtpicl -v | egrep -e '0x1077' | grep -v subsystem
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077

Please, no comments about how this could be done in a Perl one-liner. We're going to ignore the indented items because they belong to a different hierarchy of data. If we count up the leftmost, non-indented items, we see that there are again 18 instances of PCI devices with the relevant vendor ID. So, is this a port 0 / port 1 deal which requires me to divide by two?

Again, I'm not sure because the output is cryptic. Yes, I know there are ways to make sense of it with hardware knowledge, but let's assume we're dealing with an average SA, and not a device driver developer.

The last tool I tried is a device path decoder which is sort of an unsupported toy developed inside Sun. I don't know where we obtained it, but we happened to have it here so I ran the path_to_inst file through it. What did it tell me? That I had nine of the HBA cards in the box. It had a very simple, easy to read format which used indentation to clearly show the system's layout.

So, it looks like prtdiag was the most direct way to surmise an answer. I would like to see Solaris give me a hardware diagnostic which provides a physical model rather than a logical one. Just tell me there is a card in slot 4 with its vendor / device ID. I don't care to sort out its ports. I just want the device. There are plenty of other tools which provide the logical view, or device driver hierarchy.

Monday, February 09, 2009

Solaris LDAP Integration Void

Yikes, that was a harsh post title from a self-proclaimed advocate of Sun's products. I can't count the number of times I've had conversations with people about two related topics: First, how critical it is that sites begin to adopt LDAP and stop managing boxes independently. Second, how immature the administrative side of Sun's LDAP is.
It appears that Ben Rockwood, a much respected voice in the OpenSolaris community, has observed the same.

These topics each deserve a series of posts because they are complex. I mean it. Until you've tried, it's hard to understand the documentation dichotomy of Sun's Directory Server Enterprise Edition. The best way I can describe it would be to imagine you have been asked to learn English given a dictionary as your only resource.

There is phenomenal depth to the documentation in the form of resource guides. In other words, once you "get it" you can do anything with Sun's documentation. But the number of concepts you need to master to deploy LDAP in an Enterprise is staggering, and the number of real-world cases available from Google is small. You really need a few weeks of Instructor-Led Training, but how many companies are on that track these days? Not too many. There are a few outdated books as well, but they only get you to the starting gate for a basic environment.

So now let's assume that you have learned the system and properly architected your Directory Servers. Your next challenge is managing the data. I worked on a project which integrated Oracle instances with Solaris Resource Manager (SRM). The central LDAP project ID repository allowed us to ensure no Project IDs were duplicated around the environment, and minimized the amount of management associated with application migrations. Seems simple, right?

The first issue we encountered was that there is no facility for entering records into the Directory. Don't even talk to me about the documented solution of using the Sun Management Console (SMC). It's cute for local files, but it is worthless for naming services, and even Sun's solution center thinks it's insane to try using it. No, really. I opened a case, and they asked me why I would ever try to use it.

There should be a set of CLI interfaces for managing this data. Period. It's a simple thing, and by now the Directory Services have been around long enough that this is sorely overdue. They should follow the standard usage model that tools like useradd or usermod provide. People understand this, and the precedent should be respected.
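
Something along these lines, mirroring the useradd(1M) conventions, would cover the bulk of day-to-day work (these commands are completely hypothetical; nothing like them ships today):

ldapuseradd -u 1234 -g 10 -d /home/jsmith -s /bin/ksh -c "John Smith" jsmith
ldapusermod -s /bin/bash jsmith
ldapuserdel jsmith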

The only other option is a Directory Editor. You pick a third party one, or a Sun one. But in the end you are responsible for reverse-engineering whether a directory attribute is a list, or a collection of attributes. This is not appropriate. For standard Solaris maps like netmasks, auto_master, hosts, etc. there should be interface dialogs which provide reasonable levels of sanity checking. I shouldn't need to scan through cryptic attributes. What's even scarier is the idea of handing over a full directory editor to, say, someone on the first-tier help desk who may not fully understand how terrifying the wrong right-click can be.

This was a bit of a rant, but it is primarily intended to scream out in support of Ben's post. This is a huge opportunity to improve Solaris' administrative scalability and I think all too often LDAP projects get dropped during internal evaluations because the local staff has too many issues getting it working.