Monday, March 30, 2009

Finding those pesky HBA cards

I was given a mission yesterday: find out how many host bus adapter (HBA) cards were in a set of servers. At first glance it seemed like an easy task, but then I remembered that Solaris servers have never had a nice, convenient output that tells us which card is in which slot in a way normal humans can benefit from. It's sort of like playing charades: you have to put together a bunch of clues. Here's how I went about it.

The first place I stopped was prtdiag. That's my go-to configuration summary in most cases. Here's a subset of what I saw (probably going to look bad unless your browser is really stretched...):

                               Bus  Max
           IO   Port Bus       Freq Bus  Dev,
FRU Name   Type ID   Side Slot MHz  Freq Func State Name                             Model
---------- ---- ---- ---- ---- ---- ---- ---- ----- -------------------------------- ----------------------
/N0/IB6/P1 PCI  25   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI  25   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI  25   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI  25   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI  27   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI  27   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI  27   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1 PCI  27   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0 PCI  28   A    3    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0 PCI  28   A    3    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1 PCI  29   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA24
/N0/IB8/P1 PCI  29   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA24
/N0/IB8/P1 PCI  29   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1 PCI  29   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI  31   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI  31   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI  31   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1 PCI  31   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462


Of course, there were a great many other lines, but this is what the Fibre Channel card lines look like. I picked these out because I recognized the qlc driver name; I'm not sure what someone who didn't know it would look for. In this case, there were 18 matching lines. That indicates 9 cards, because each slot is represented twice, once per port (the ,0 and ,1 entries in the Dev,Func column). This was supported by my being reasonably sure that we had dual-ported cards in this server.
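For what it's worth, the "pair up the ports" reasoning can be sketched in a few lines of Python run against saved prtdiag output. This is only a sketch under assumptions: the function name count_hba_cards and the driver_hint parameter are made up for illustration, and the field positions match the column layout shown above, which differs between platforms.

```python
def count_hba_cards(prtdiag_text, driver_hint="qlc"):
    """Count distinct physical cards among prtdiag I/O lines.

    Assumes the column order shown above:
    FRU, type, port ID, bus side, slot, MHz, max MHz, dev,func, state, name, model.
    """
    cards = set()
    for line in prtdiag_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or driver_hint not in line:
            continue
        fru, bus_side, slot = fields[0], fields[3], fields[4]
        # Both ports of one card share the same FRU, bus side and slot.
        cards.add((fru, bus_side, slot))
    return len(cards)


sample = """\
/N0/IB6/P1 PCI 25 B 4 100 100 1,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI 25 B 4 100 100 1,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI 25 A 6 100 100 2,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1 PCI 25 A 6 100 100 2,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0 PCI 28 A 3 100 100 1,0 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0 PCI 28 A 3 100 100 1,1 ok SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
"""
print(count_hba_cards(sample))  # prints 3
```

Grouping on FRU, bus side and slot avoids hard-coding "divide by two", so a single-ported card in the mix would still be counted once.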

The next place I looked for confirmation was prtconf. Its output tends to be more complete, but it is far more verbose and generally annoying to summarize. To be more precise, the output contains a lot of information...

foobox: prtconf -v | wc -l
9358


That was a moment of complete frustration. The output was too busy and didn't look helpful. Note to self: Why is this not simple? I'm looking for a simple answer, not an excuse to write a nawk script. No matter how I skinned the output, I ended up with 18 matching lines. I'm right back at the prtdiag output.

My next stop was a more obscure but very helpful tool: prtpicl. OK, I'll admit, this one is still ugly.

foobox: prtpicl -v | wc -l
11183


But at this point I just wanted to get it done, so I dug in a little and checked out what it had to say. The easily parsed format provides a convenient Vendor ID and Device ID for each connected device. That's good news, because those PCI IDs are easy to look up on the Internet. Knowing our site standards, I was able to identify the Vendor ID of the cards we order (0x1077, which is QLogic's PCI vendor ID) and look for it:

foobox: prtpicl -v | egrep -e '0x1077' | grep -v subsystem
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077

Please, no comments about how this could be done in a Perl one-liner. We're going to ignore the more deeply indented items because they belong to a different level of the device hierarchy. If we count only the leftmost items, we again see 18 instances of PCI devices with the relevant vendor ID. So, is this a port 0 / port 1 deal that requires me to divide by two?

Again, I'm not sure because the output is cryptic. Yes, I know there are ways to make sense of it with hardware knowledge, but let's assume we're dealing with an average SA, and not a device driver developer.
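The "count only the leftmost items" step can at least be made mechanical. Here is a minimal Python sketch that buckets the matching :vendor-id lines by how far they are indented. The function name vendor_hits_by_indent and the sample text are invented for illustration, and the assumption that indentation depth tracks the device hierarchy is my reading of the output, not a documented guarantee. It does not answer the divide-by-two question; it only separates the levels.

```python
from collections import Counter


def vendor_hits_by_indent(prtpicl_text, vendor_id="0x1077"):
    """Count ':vendor-id <vendor_id>' lines, grouped by indentation width."""
    hits = Counter()
    for raw in prtpicl_text.splitlines():
        line = raw.expandtabs()
        stripped = line.lstrip()
        # Matching on the exact property name also skips
        # ':subsystem-vendor-id', like the 'grep -v subsystem' above.
        if not stripped.startswith(":vendor-id"):
            continue
        if stripped.split()[-1] != vendor_id:
            continue
        hits[len(line) - len(stripped)] += 1
    return hits


# Hypothetical excerpt; real prtpicl -v output has many more properties.
sample = (
    "      :vendor-id  0x1077\n"
    "      :subsystem-vendor-id  0x1077\n"
    "      :vendor-id  0x1077\n"
    "          :vendor-id  0x1077\n"
    "      :vendor-id  0x108e\n"
)
hits = vendor_hits_by_indent(sample)
shallowest = min(hits)   # smallest indentation seen among the matches
print(hits[shallowest])  # prints 2
```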

The last tool I tried is a device path decoder which is sort of an unsupported toy developed inside Sun. I don't know where we obtained it, but we happened to have it here so I ran the path_to_inst file through it. What did it tell me? That I had nine of the HBA cards in the box. It had a very simple, easy to read format which used indentation to clearly show the system's layout.
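I can't share that tool, and the following is not it, but the basic idea of counting cards from /etc/path_to_inst can be sketched in Python. Each line of that file holds a quoted physical device path, an instance number, and a quoted driver name. The sketch assumes that both ports of a dual-ported card show up as two functions of the same PCI device (unit addresses like @1 and @1,1 under the same parent); the function name count_cards and the sample paths are hypothetical. Keep in mind that path_to_inst can also retain entries for hardware that has since been removed, so treat the result as a hint rather than an inventory.

```python
import re

# "physical path" instance-number "driver name"
LINE_RE = re.compile(r'^"([^"]+)"\s+(\d+)\s+"([^"]+)"\s*$')


def count_cards(path_to_inst_text, driver="qlc"):
    """Count distinct (parent path, PCI device number) pairs for a driver."""
    cards = set()
    for line in path_to_inst_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group(3) != driver:
            continue  # comments, blank lines, and other drivers
        parent, _, leaf = m.group(1).rpartition("/")
        _, _, unit = leaf.partition("@")
        device = unit.split(",")[0]  # "1,1" is device 1, function 1
        cards.add((parent, device))
    return len(cards)


# Hypothetical entries, made up for illustration.
sample = '''
#       Caution! This file contains critical kernel state
"/ssm@0,0/pci@19,700000/SUNW,qlc@1" 0 "qlc"
"/ssm@0,0/pci@19,700000/SUNW,qlc@1/fp@0,0" 0 "fp"
"/ssm@0,0/pci@19,700000/SUNW,qlc@1,1" 1 "qlc"
"/ssm@0,0/pci@19,700000/SUNW,qlc@2" 2 "qlc"
"/ssm@0,0/pci@19,700000/SUNW,qlc@2,1" 3 "qlc"
'''
print(count_cards(sample))  # prints 2
```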

So, it looks like prtdiag was the most direct way to surmise an answer. I would like to see Solaris give me a hardware diagnostic which provides a physical model rather than a logical one. Just tell me there is a card in slot 4 with its vendor / device ID. I don't care to sort out its ports. I just want the device. There are plenty of other tools which provide the logical view, or device driver hierarchy.