Thursday, September 28, 2006

Adventures on eBay

I'm finally creating a Sparc-based lab outside my place of employment. I want to make sure I can work on projects without any hint of conflict-of-interest or proprietary-intellectual-capital nonsense. My work is for hire from 8 to 5, but my ideas remain my own, and I want to be able to implement them outside of work with freedom.

I've been using an old AMD-based system with Solaris x86 for a while now, but since my real interest is in Enterprise Engineering I really wanted to have some Sparc-based equipment and at least one disk array. This will open the doors to many projects that would otherwise only be feasible on expensive modern x86 hardware. Funny how those same features have been on Sparc hardware since the early days...

For budget reasons I've gone with a v240 as my workhorse server. It doesn't have the sweet LOM that many Sparc systems enjoy, but it does have four CPUs and 4 GB of RAM. That will give me plenty of horsepower to run a few Oracle databases and work on some Zone projects. A v440 would have been ideal for my goals, but the price point on those systems is WAY too high for the returns I'll be looking at in the short term.

I have the v240 server in my office right now, and have been playing with it a bit before it goes to the basement rack. After attaching a console cable and booting I was met with a hostname from the test.aol.com domain. I was a bit surprised to find that AOL doesn't wipe their disks before sending servers to auction, but after a quick look around I discovered that there isn't really anything on the disks anyway; just a Solaris 8 OS image. I almost skipped checking it out... I have zero interest in cracking, least of all a Solaris 8 image. I did have a small interest in their best practices, but I'm far more interested in my own projects, so the box will be getting a fresh load of Solaris 10 (6/06) this weekend.

My storage array of choice is the D1000. I'm not going to do much that requires massive expansion or throughput; I just need a bunch of disks I can put into ZFS, share for a cluster, and run databases on. The D1000 has a 12-disk Ultra-SCSI backplane that provides plenty of throughput for my needs. And hey, there's not much worry about driver obsolescence: it's such a simple device that it's just going to work as long as Solaris continues to support SCSI. The D200 is very cool, but I just couldn't find a value-add for my project list, so I went with the budget option. I won the bid for this unit this morning, so I'll have it in a week or so. Then I need to fill it with disks.

I upgraded my measly 16-port 10 Mb SuperStack II hub to a 24-port SuperStack II switch. This will ensure the servers have plenty of bandwidth to talk amongst themselves. I'm particularly proud of this purchase: only $1 plus shipping. Works for me.

A few white papers from now I'm planning to add a second v240 for work on SunCluster and alternative clustering solutions. I'll also be adding a pair of Netra T1 AC200 servers for a directory services project.

And that should just about round out the new data center. I'm amazed at how much capability old servers still have. CPU technology has progressed so much faster than the remaining system bottlenecks that many of today's systems will simply never show significant CPU utilization. For me, this means that if I'm willing to run servers at a level that makes them sweat, I can accomplish the same work for a fraction of the cost. This is the premise for a white paper I'll be working on that explores the use of old enterprise-class hardware in not-for-profit or small business shops. Stay tuned!

Wednesday, September 27, 2006

Shells and the path of standards

It seems that shells evoke almost as much controversy as religion. Everyone has a favorite, and for each favorite there's someone who would sooner grind their knuckles on the back of a circuit board than use it.

As long as I've been using Solaris systems I've never found common ground with the C-shell. Its programming capabilities are limited compared to the other options, and its syntax strays too far from the standards to be worth learning. But C is a standard UNIX language, you say. How can I call the C-shell non-standard?

I'm a systems engineer, not an application developer (and now's not the time to debate that grey area!). If I were to write a script to traverse the entire Operating Environment and calculate a distribution of the various shells in use, I think we'd find 97% written in Bourne shell, 2% in ksh, and 1% in Bash. Please remember that I'm talking about Solaris here, and not Linux. When in Rome, do as the Romans do.
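
If you're curious, a rough tally isn't hard to produce. Here's the kind of quick-and-dirty sweep I have in mind (a sketch only; the directory list and shebang patterns are my own guesses at where the interesting scripts live):

#!/bin/sh
# Count scripts by interpreter, based on the first (shebang) line of each file.
for DIR in /usr/bin /usr/sbin /etc/init.d
do
    find ${DIR} -type f -print 2>/dev/null
done | while read FILE
do
    FIRSTLINE=`head -1 "${FILE}" 2>/dev/null`
    case "${FIRSTLINE}" in
        '#!/sbin/sh'* | '#!/bin/sh'* | '#!/usr/bin/sh'*) echo bourne ;;
        '#!/bin/ksh'* | '#!/usr/bin/ksh'*)               echo ksh ;;
        '#!/bin/bash'* | '#!/usr/bin/bash'*)             echo bash ;;
        '#!/bin/csh'* | '#!/usr/bin/csh'*)               echo csh ;;
    esac
done | sort | uniq -c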

Because I create a LOT of shell scripts, I'm also painfully aware of the Bourne shell's limitations, not the least of which is its anemic data structure capability. For that reason, I started to use the Korn shell, or ksh. I have used ksh on Solaris for about 10 years, both as my interactive shell and as my programming shell. It's still painful for data structures when compared to Perl, but it's MUCH nicer to work with than the Bourne subset.
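
To make the data-structure point concrete, here's the sort of thing ksh88 handles easily and the plain Bourne shell simply can't (a trivial illustration; the node names mean nothing):

#!/usr/bin/ksh
# ksh88 arrays -- in /bin/sh you'd be stuck with positional parameters or eval tricks
set -A NODES mars venus jupiter
print "cluster has ${#NODES[@]} nodes; the first is ${NODES[0]}"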

In addition to my Solaris career, I have a life in a parallel universe as a Linux engineer, volunteering at a local not-for-profit organization. We use Linux for servers there because Solaris was too expensive when we chose the initial environment. In the Linux world, the Bourne Again Shell, or Bash, is the de facto standard for good reason. It's a very capable shell with great programming constructs and a pleasant demeanor for interactive use. I dig it.

In fact, I dug it so much that I recently changed my Solaris interactive shell to it. Bash has been included in Solaris for quite a while now, and I consider it a standard-enough component that it's safe to learn. I love the file completion and history browsing with the tab and arrow keys; it's so much more intuitive than the equivalent in ksh. I'd been happily bashing around on Solaris for a few months, but recently did an about-face in my thinking while reading up on Role Based Access Control, or RBAC.

My current site has a sudo to RBAC conversion on the roadmap, which should be a great project. In the RBAC world, you assume a role to complete an activity that is outside your default privileges. Those roles will have one of three shells: pfcsh, pfsh, or pfksh. Did you notice that Bash is not amongst them? Not for the foreseeable future.
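
For illustration, here's roughly how a role ends up in one of those profile shells (the role, profile, and user names are made up, and flags may vary by release, so treat this as a sketch):

# Create a role whose shell is pfksh, attach a rights profile, and grant it to a user
roleadd -m -d /export/home/oradba -s /usr/bin/pfksh -P "Oracle Administration" oradba
passwd oradba
usermod -R oradba cgh
# Later, from cgh's session:
su - oradba        # and you're sitting in pfksh, not bash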

So, in a Solaris-standard world that leverages the powerful RBAC facility, you will not have the option of working in a Bash-compatible shell when you need to perform an activity outside your normal rights. That's enough reason for me to drop the bells and whistles of Bash and go back to ksh. I'll probably miss it for a few weeks before I forget completely.

This experience proved to be yet another reminder that the path of standards is lined with temptations. While Bash is stable and common, it is not a native Solaris standard. That's fine for an end user or software developer who operates outside the OS internals, but as a systems engineer my job is to live in those internals, and at this point in time the Korn shell gives me the best balance of functionality, programmability, and standards compliance.

Thursday, September 21, 2006

Revision control bites... Unless you train it!

I've recently begun the task of placing my world under revision control. Keeping track of edits, versions, builds, etc. had become too much of a chore, considering that Solaris has a facility to do it for me. The Source Code Control System, or SCCS, is the Solaris standard for revision control in just about every release I'm familiar with. There are many other revision control systems out there, each with pros and cons; popular alternatives include RCS, CVS, and Subversion. In my case, I want a system that I know will be available in all releases of Solaris on all sites, even in galaxies far, far away. SCCS is the only one that meets my key criteria.

Having chosen my platform for version control, I began moving projects into its protective custody. All went well for the first few weeks, and I began to develop the comfort level that usually precedes a problem. The most recent project I put under SCCS control is a package with a simple preinstall script. The script is responsible for checking for the existence of a file prior to package installation and making a backup before overwriting it. The backup is named FILENAME.PACKAGENAME.DATESTAMP. To implement the datestamp I set a variable in a backup subroutine as follows:

backup_file () {
    # Timestamp for the backup copy: YYYYMMDDhhmm
    DATESTAMP=`date +%Y%m%d%H%M`
    # If the target file exists, copy it aside as FILE.CGHfoopkg.DATESTAMP
    test -f "${1}" && /usr/bin/cp "${1}" "${1}.CGHfoopkg.${DATESTAMP}"
    return ${?}
} #end backup_file
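
In the preinstall itself the function gets called with whatever file the package is about to overwrite; the path here is just an example:

# Back up the existing config file, if there is one
backup_file /etc/opt/CGHfoopkg/foo.conf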


After testing successfully on my development machine, I checked the file back in (sccs delget) and rebuilt the package. When I transferred the package to my staging server, the preinstall failed miserably, with strange syntax errors originating at the DATESTAMP assignment and the cp line in the segment above.

The first thing I questioned was whether there was some difference in the date command between my development server and staging server, but it only took a second to prove that theory null and void.

After some mucking around, I discovered that the problem only existed when the source code had been checked in to SCCS. This quickly led me to realize that SCCS was expanding keywords inside my DATESTAMP variable.

SCCS keywords allow you to have SCCS dynamically insert a revision, filename, check-in time, and other metadata when a file is checked in (the expansion actually happens on the read-only get that follows the delta). By placing this into my checked-out code:
#
# SCCS Revision Control:
# %M% %I% %H% %T%
#


I end up with this when the code is checked-in:
#
# SCCS Revision Control:
# preinstall 1.5 09/21/06 09:01:46
#


The problem is that, by default, SCCS does not stop at the header; it scans all the way through the code. In this case it was hitting the format string in my DATESTAMP assignment and mangling it, thus breaking my backup_file function.

It took some digging to find the solution, but I finally discovered the sccs admin command. Using it, I was able to specify that only the first ten lines of the file should be considered for keyword expansion:
# sccs admin -fs10 preinstall


After making this change, and checking the file back in, my problems were gone. The source stayed clean, and I was once again blissfully coding. My newly trained SCCS repository has been behaving wonderfully ever since.
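
For anyone setting up something similar, the whole cycle ends up looking roughly like this (double-check the admin flags against your sccs(1) man page; this is just what worked for me):

sccs create preinstall          # place the file under SCCS control
sccs admin -fs10 preinstall     # limit keyword expansion to the first ten lines
sccs edit preinstall            # check out with a lock for editing
vi preinstall
sccs delget preinstall          # check in the delta, then retrieve a read-only copy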

Monday, September 18, 2006

To install links, or not to install links... That is the Question!

I'm working on a package at the moment that integrates Oracle 9i / 10g startups with Solaris Resource Manager (SRM). One of its functions is providing init scripts to start and stop the database and listeners during reboots. Seems simple at first, but I ran into an interesting dilemma.

At my current site, Oracle is not an automated installation and is not part of our Jumpstart framework. The challenge this presents is that we do not want to install active run control links for software that is not yet installed on the system, yet we want to be able to install these packages during the Jumpstart.

Ideally, I'd like to see the rc scripts registered via pkgadd so they can be easily identified and cleanly removed if the package is de-installed. In the end I had to compromise, but I think the result is pretty safe because I followed the standard convention for rc / init scripts and used hard links.

Although it may sound odd at first glance, I chose not to include the rc links in my package prototype. Instead, the links are installed in a disabled state (with a prepended underscore character) by a postinstall script. I chose this route because it creates a placeholder for the startup order I selected, rather than hoping whoever does the installation gets it right. This eliminates inconsistency and human error.
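
The postinstall boils down to a couple of hard links created with the underscore already in place. This sketch uses a made-up script name and sequence numbers; the real package hard-codes its own:

# rc directories only act on files starting with S or K, so a leading
# underscore parks the links safely until someone enables them.
INITSCRIPT=/etc/init.d/oraclesrm
ln ${INITSCRIPT} /etc/rc3.d/_S90oraclesrm
ln ${INITSCRIPT} /etc/rc0.d/_K10oraclesrm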

Next, I created a preremove script. The preremove works by issuing a find command against the /etc/rc?.d directories that looks for any file whose inode matches the init script's inode. As long as the proper convention of hard-linking rc scripts was followed, this method will find the associated rc links even if their startup order or link names have changed over the life of the system.

LINKS=`find /etc/rc?.d -inum ${INITSCRIPTINODE} -print`
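
Fleshed out, the preremove amounts to only a few lines (same hypothetical init script name as above; error handling omitted):

# Derive the init script's inode, then remove every rc link that shares it
INITSCRIPTINODE=`ls -i /etc/init.d/oraclesrm | awk '{print $1}'`
LINKS=`find /etc/rc?.d -inum ${INITSCRIPTINODE} -print`
for LINK in ${LINKS}
do
    /usr/bin/rm -f ${LINK}
done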

The final touch was to create a simple script that can activate all the links at once. It leverages the preremove's inode-hunting strategy, finds entries that begin with an underscore, and renames them using a simple sed expression:

NEWLINKNAME=`echo $LINKNAME | sed -e 's/^_//'`
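
Put together, the activation loop looks something like this (again a sketch, reusing the hypothetical script name from above):

INITSCRIPTINODE=`ls -i /etc/init.d/oraclesrm | awk '{print $1}'`
for LINK in `find /etc/rc?.d -inum ${INITSCRIPTINODE} -name '_*' -print`
do
    DIR=`dirname ${LINK}`
    LINKNAME=`basename ${LINK}`
    # Strip the leading underscore to activate the link
    NEWLINKNAME=`echo ${LINKNAME} | sed -e 's/^_//'`
    /usr/bin/mv ${DIR}/${LINKNAME} ${DIR}/${NEWLINKNAME}
done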

I'll be adding a reverse-function for the enabler script that can disable as well, and this will live in our NFS-accessible admin scripts repository for convenience. No need to distribute software that isn't critical to operation when the network is down.

The only downside to my solution is not being able to search for the rc links in the Solaris software registry (/var/sadm/install/contents). That would have allowed someone unfamiliar with the solution to identify its components. The problem with registering rc links is that they change a lot over time, and it's very difficult to keep the registry current: a link's name or start order may change, and it may be active or inactive at any given time.
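
For comparison, had the links been registered, identifying them later would have been a one-liner against the registry (link name is the same hypothetical one as above):

# Which package delivered this rc link?  Only works for registered paths.
pkgchk -l -p /etc/rc3.d/S90oraclesrm
# Or go straight to the contents file:
grep '^/etc/rc3.d/S90oraclesrm ' /var/sadm/install/contents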

The effort required to manage that kind of dynamic file would have been almost as large as the project I was packaging, so I decided to keep my scope tight and accept that as a reasonable risk. When we move from Solaris 9 to Solaris 10 this problem will be eliminated by integration with SMF, so there's no point in getting wrapped around the axle.

Wednesday, September 13, 2006

Auditing challenges with SSH

I recently had a real dilemma thrown at me by the security team at a site I work on. The site's policy dictates that no authentication should take place without a password. Any exceptions require both a business case justification on file, and an expiration. This presents a real challenge with SSH. SSH allows public key authentication as an alternative to passwords, and the private key can be created without a passphrase. In addition, there is no way to enforce a key expiration at a server level (at least, not that I've been able to find).

SSH supports two primary means of authentication. Password authentication is essentially the same process that occurs through Telnet, except we use a secure tunnel instead. Public Key authentication bypasses the system's password channel completely. At first glance, it's easy to say, "so disable public key authentication and be done with it."

There are two benefits to using public key authentication with SSH. First, you have the ability to configure your private key (residing on the client) without a passphrase. In doing so, you increase the risk of private key compromise, but enable passwordless authentication. This makes batch jobs much easier because it eliminates the interactive session.
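
As an illustration of why batch folks love this (the host name, key file, and script path are all made up):

# Generate a key pair with an empty passphrase, push the public half to the
# target host, and the nightly job can connect with nobody at the keyboard.
ssh-keygen -t rsa -N "" -f ${HOME}/.ssh/batch_key
cat ${HOME}/.ssh/batch_key.pub | ssh dbhost 'cat >> .ssh/authorized_keys'
ssh -i ${HOME}/.ssh/batch_key dbhost /export/scripts/nightly_export.sh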

The second benefit is that public key authentication is two-factor authentication: it combines something you have, a private key, with something you know, a passphrase. As such, it's much more secure than traditional password-based authentication, which uses only a single factor. Even if someone compromises your private key file, it can't be used unless they know the passphrase, which SHOULD be a complex phrase.

The problem is that there is no way to control who on a server is authorized to use public key authentication, no way to enforce passphrase complexity, and no way to expire a public key. I could create a public / private key pair on a machine under my desk that isn't subject to production data center security scans and audits. Let's say that I decided not to use a passphrase, or that the passphrase was weak and easily guessed. Next, assume I'm the DBA for the company's Oracle Financials environment. It would require very little effort for someone to compromise my under-desk system and gain passwordless access to the company's critical systems. This is a low-risk, high-payoff scenario that people who know what they're doing would be likely to attempt.

Another issue with SSH's public key authentication is that you can enter any passphrase (or none at all). With my system password, I can install a crack-style module into the Pluggable Authentication Module (PAM) stack and enforce very complex passwords. Unfortunately, because key pairs are generated on clients, there is no way to enforce that kind of sanity at the server level.
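
About the only server-side visibility you get is discovering who has keys installed at all. A sweep like this (a sketch) will report them, but it can't tell you anything about the passphrases protecting the matching private keys:

# Report every account that has an authorized_keys file and how many keys it holds
getent passwd | awk -F: '{print $1, $6}' | while read USER HOMEDIR
do
    for KEYFILE in ${HOMEDIR}/.ssh/authorized_keys ${HOMEDIR}/.ssh/authorized_keys2
    do
        if [ -f "${KEYFILE}" ]; then
            COUNT=`grep -c -v '^#' "${KEYFILE}"`
            echo "${USER}: ${COUNT} key(s) in ${KEYFILE}"
        fi
    done
done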

After some research I determined that the version of Sun SSH that ships with Solaris 9 is far less capable than the OpenSSH releases one could obtain and build. Using OpenSSH it is possible to move the default $HOME/.ssh to another base directory (like /var). In doing so, it is much easier to create a root-controlled environment where someone cannot use an authorized_keys file unless authorized by a superuser. Unfortunately, the maintenance issues created by doing this are not justified, and our policy is to stick with vendor-provided and supported software in all but the most extenuating circumstances. Under Solaris 9, there's just no safe and auditable way to allow public key authentication. In an ideal world, I should be able not just to configure public key authentication at the server level, but to specify which keys are respected by the server's SSH daemon.

There is no solution I can see to the issue of enforcing passphrase complexity, or auditing the use of non-interactive key pairs, because that part of the process is handled entirely by the client. It's very difficult to convince a security officer that a key generated on an uncontrolled device can be trusted for authentication against Sarbox servers.

Our decision, much to my dismay, was to disable public key authentication site-wide. I feel like we're throwing the baby out with the bath water, but at the same time I understand the need to audit system access, and be able to enforce policy. I'm anxious for our Solaris 9 fleet to turn over to Solaris 10 so we can begin using the more capable version of SSH it includes.
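
For the record, the change itself was the easy part; the policy discussion was the hard part. These are the Solaris 9 defaults as I recall them, so verify the paths on your own systems:

# In /etc/ssh/sshd_config:
#     PubkeyAuthentication no
# Then bounce sshd so it re-reads its configuration:
/etc/init.d/sshd restart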

Friday, September 08, 2006

AppleCare not up to Sun Service

One of the big reasons I was so excited to move my desktop environment to Mac OS X was its underlying UNIX operating environment. Being a UNIX guy, I'm well aware of how well instrumented it is, and how surgically it can (often) be debugged. This, of course, is in contrast to the Windows world, where it has become not only common but almost accepted that troubleshooting step #1 is to reboot and sacrifice a chicken.

Over time my G5 started to crash with increasing frequency. At first it was once in a rare while, though even that surprised me. Recently it accelerated to the point where it crashed almost once per day. Given how much I paid to have rock-solid hardware, and AppleCare behind it, this was not acceptable to me. So, I finally grew tired of my Mac's grey text-box of death and called my friendly AppleCare representative.

I began by telling my story, and adding a detail I felt was critical: each time the system crashed it generated a crash dump report, and the stack trace always pointed back to the USB driver. I grabbed the text below from another site as an example, but it's very similar to what I was seeing.


Unresolved kernel trap(cpu 0): 0x300 - Data access DAR=0xdeadbeef PC=0x0e692550
Latest crash info for cpu 0:
Exception state (sv=0x0EB5DA00)
PC=0x0E692550; MSR=0x00009030; DAR=0xDEADBEEF; DSISR=0x42000000; LR=0x0E692530; R1=0x081DBC20; XCP=0x0000000C (0x300 - Data access)
Backtrace:
0x0E6924A8 0x00213A88 0x00213884 0x002141D4 0x00214830 0x00204CB0 0x00204C74
Kernel loadable modules in backtrace (with dependencies):
com.apple.dts.driver.PanicDriver(1.0)@0xe691000
dependency: com.apple.iokit.IOUSBFamily(1.9.2)@0xed9c000
Proceeding back via exception chain:
Exception state (sv=0x0EB5DA00)
previously dumped as "Latest" state. skipping...
Exception state (sv=0x0EB64A00)
PC=0x00000000; MSR=0x0000D030; DAR=0x00000000; DSISR=0x00000000; LR=0x00000000; R1=0x00000000; XCP=0x00000000 (Unknown)


Note that USB line? I was seeing it in EVERY crash. This tends to be something worth investigating. In my case I have a USB card reader attached to the USB ports in the bottom of my CinemaDisplay, and a Palm Pilot USB cable plugged into the front of my case. Another observation I made was that each time my system crashed it was essentially idle. It usually happened at night, or while I was at work. I would return to the sound of a jet engine coming from that silver box.

My first suggestion was that we look at the panic logs and try to identify the faulty component, but that didn't get much traction. AppleCare is set up so that if the basic rubber-stamp checks (slightly better than a reboot, but not by much) fail, they redirect you to a local store. In my case this wasn't appealing: it's a 40-minute drive from here, and the issue is intermittent. I could end up being without my Mac for more than a week even if things went well.

So, we went through and erased all the caches and preferences, then reset the NVRAM. My system was brought back to factory specs, although I had really done almost nothing abnormal to it. I don't use funky extensions or other hacks; I use mainstream, well-supported stuff.

Much to my surprise, the system has been stable since then. I'll be the first to eat my words, but I'm not used to voodoo troubleshooting. This was like chemotherapy, where we just bombard the system in hopes of killing all the cancerous code. I'm used to working in a surgical environment where we can see that CPU 0 is corrupting data on an interval that indicates it needs replacing.

As much as I'm a hopeless fan of Solaris, I have to say that I don't think it's a huge quality difference between Sun and Apple that gives me this uneasy feeling about the experience; I think it's the quality of Sun Service. They are used to dealing with mission-critical servers more than art-critical desktops. No offense to the Mac world - I'm one of you... But it's a very different world.