Monday, March 30, 2009

Finding those pesky HBA cards

I was given a mission yesterday of finding how many host bus adapter (HBA) cards were in a set of servers. At first glance it seemed like an easy task, but then I remembered that Solaris servers have never had a nice, convenient output that tells us what card is in what slot in a way that normal humans could benefit from. It's sort of like playing charades: you have to put together a bunch of clues. Here's how I went about it.

The first place I stopped was prtdiag. That's my go-to configuration summary in most cases. Here's a subset of what I saw (probably going to look bad unless your browser is really stretched...):

                                Bus  Max
            IO   Port Bus       Freq Bus  Dev,
FRU Name    Type ID   Side Slot MHz  Freq Func State Name                              Model
----------  ---- ---- ---- ---- ---- ---- ---- ----- --------------------------------- ----------------------
/N0/IB6/P1  PCI  25   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1  PCI  25   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1  PCI  25   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB6/P1  PCI  25   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1  PCI  27   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1  PCI  27   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1  PCI  27   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB7/P1  PCI  27   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0  PCI  28   A    3    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P0  PCI  28   A    3    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1  PCI  29   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA24
/N0/IB8/P1  PCI  29   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA24
/N0/IB8/P1  PCI  29   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB8/P1  PCI  29   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1  PCI  31   B    4    100  100  1,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1  PCI  31   B    4    100  100  1,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1  PCI  31   A    6    100  100  2,0  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462
/N0/IB9/P1  PCI  31   A    6    100  100  2,1  ok    SUNW,qlc-pci1077,141.1077.141.2/+ QLA2462


There were a great many other lines, of course, but this is what the Fibre Channel card lines look like. I picked these out because I recognized the qlc driver; I'm not sure what someone would do if they didn't know that. In this case, there were 18 lines with this output. That indicates 9 cards, because each slot is represented twice (two ports on each device). It also matched what I remembered: I was reasonably sure we had dual-ported cards on this server.
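For anyone who would rather not count by eye, here is a minimal sketch of the same arithmetic. It assumes the column order shown in the listing (FRU name first, slot fifth), that every port of one card shares the same FRU name and slot, and that matching on the string "qlc" is enough to pick out the HBA lines. The function name count_qlc_cards is made up for illustration.

```shell
# Count distinct HBA cards in prtdiag-style output read from stdin.
# Assumption: field 1 is the FRU name and field 5 is the slot, as in the
# listing above, and all ports of one card share that FRU name and slot.
count_qlc_cards() {
    awk '/qlc/ { seen[$1 " " $5] = 1 }                      # one key per FRU/slot pair
         END   { n = 0; for (k in seen) n++; print n }'     # number of distinct pairs
}

# Possible usage on a live system (not run here):
#   prtdiag | count_qlc_cards
```

Fed the 18 lines above, this collapses them to the 9 distinct FRU/slot pairs. It would miscount if a server mixed single-port and dual-port cards in a way that breaks the one-pair-per-card assumption.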

The next place I looked for confirmation was prtconf. This output tends to be more complete, but far more verbose, and generally annoying to get summaries from. To be more precise, the output contains a lot of information...

foobox: prtconf -v | wc -l
9358


That was a complete moment of frustration. The output was too busy and didn't look helpful. Note to self: Why is this not simple? I'm looking for a simple answer, not an excuse to write a Nawk script. No matter how I skinned the output I ended up with 18 matching lines. I'm right back at the prtdiag output.

My next stop was a more obscure tool, but a very helpful one: prtpicl. OK, I'll admit, this one is still ugly.

foobox: prtpicl -v | wc -l
11183


But, at this point I just wanted to get it done, so I dug in a little bit and checked out what it had to say. The easily parsed format provides a convenient Vendor ID and Device ID for each connected device. That's good news because those PCI IDs are easy to look up on the Internet. Knowing our site standards I was able to identify the Vendor ID of the cards we order and look for them:

foobox: prtpicl -v | egrep -e '0x1077' | grep -v subsystem
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077
:vendor-id 0x1077

Please, no comments about how this could be done in a Perl one-liner. We're going to ignore the indented items because they belong to a different hierarchy of data. If we count up the leftmost items, we again see 18 instances of PCI devices with the relevant vendor ID. So, is this a port 0, port 1 deal which requires me to divide by two?

Again, I'm not sure because the output is cryptic. Yes, I know there are ways to make sense of it with hardware knowledge, but let's assume we're dealing with an average SA, and not a device driver developer.

The last tool I tried is a device path decoder which is sort of an unsupported toy developed inside Sun. I don't know where we obtained it, but we happened to have it here so I ran the path_to_inst file through it. What did it tell me? That I had nine of the HBA cards in the box. It had a very simple, easy to read format which used indentation to clearly show the system's layout.

So, it looks like prtdiag was the most direct way to surmise an answer. I would like to see Solaris give me a hardware diagnostic which provides a physical model rather than a logical one. Just tell me there is a card in slot 4 with its vendor / device ID. I don't care to sort out its ports. I just want the device. There are plenty of other tools which provide the logical view, or device driver hierarchy.

7 comments:

TonyT said...

Another command you might want to check out, if you haven't already, is /usr/sbin/fcinfo. I believe this is Solaris 10 only.

Brett said...

Having had the same problem, I ended up writing a script with hard-coded dev path -> PCI slot (and a bunch more). It works great, but every time we get new servers I have to open a case with Sun to get the mapping for that server (or use InfoDoc 208209 for older servers... but it doesn't seem to be available any longer). I don't mind too terribly as I only have to do it once. If you are interested, I could email it to you. It currently supports about 19-20 different Sun servers (V120 to E2900 with multiple IO boats, and also some x86 boxes such as the X4600, X4500, etc.).

Mathilde said...

using luxadm -e port, you can see the path of the cards:

Found path to 4 HBA ports

/devices/pci@8,600000/SUNW,qlc@2/fp@0,0:devctl CONNECTED
/devices/pci@9,700000/SUNW,qlc@4/fp@0,0:devctl CONNECTED
/devices/pci@9,600000/SUNW,qlc@1/fp@0,0:devctl CONNECTED
/devices/pci@9,600000/SUNW,qlc@2/fp@0,0:devctl CONNECTED

In this case, 1 dual port (9,600000) and 2 single ports.
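The same reading of the output can be sketched as a small filter. This is only an illustration of the grouping described above: it assumes each HBA node sits directly under a top-level pci@ node, as in this listing, and the function name ports_per_pci_node is hypothetical. Two ports under one parent node suggest a dual-port card but do not prove it.

```shell
# Tally HBA ports per parent PCI node from `luxadm -e port` output on stdin.
# Assumption: paths look like /devices/<pci node>/<hba node>/..., so the
# parent node is the third "/"-separated field.
ports_per_pci_node() {
    awk -F/ '$2 == "devices" { count[$3]++ }                 # skip header and blank lines
             END { for (node in count) print node, count[node] }' | sort
}

# Possible usage on a live system (not run here):
#   luxadm -e port | ports_per_pci_node
```

On the four paths above, pci@9,600000 is the only node with a count of 2.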

Christopher Hubbell said...

If you have no idea what type of card is in that slot and your data center isn't physically available to you, Solaris has not historically been the friendliest place to live.

Why is it so hard for the OS to provide something like:
device_path Emulex model# fcode 1.2.3

Until then, I'll just keep dreaming!

BK said...

You can also use cfgadm. E.g. try:

cfgadm -s "select=type(fc-fabric)"

or

cfgadm -alv -s "select=type(fc-fabric)"

Brian

EricB said...

How to find out what physical PCI slot maps to the WWpN of the card in that slot on an M5000. (worked for me anyway)

start with…

# fcinfo hba-port

You'll see all the WWpNs mapped to symlinks in /dev/cfg. Each of these links points to a path under /devices. Go to the /devices directory

# cd /devices

and do this:

# ls -l *iou*

You'll see output that looks like this:

crw------- ... pci@3....:iou#0-pci#4


Look for the line that corresponds to the link pointed to by the /dev/cfg line you got from fcinfo.

Translate as follows for physical card location:

iou#0 is the IOU tray on the right, iou#1 is the tray on the left.

pci#0 is the bottom PCI slot (in the corresponding IOU), pci#1 is the next one up, etc.
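As a rough sketch, the translation above could be wrapped in a helper. It takes only the part of the name after the colon (for example iou#0-pci#4) and applies the M5000 rules just described; the function name describe_slot is made up, and the regular-expression field separator needs a POSIX awk (on Solaris that would likely be nawk or /usr/xpg4/bin/awk rather than the default awk).

```shell
# Describe the physical location for a name such as "iou#0-pci#4".
# Assumption: the M5000 layout described above (iou#0 = right tray,
# iou#1 = left tray, pci#0 = bottom slot). Other names print nothing.
describe_slot() {
    printf '%s\n' "$1" | awk -F'[#-]' '
        NF == 4 && $1 == "iou" && $3 == "pci" && ($2 == 0 || $2 == 1) {
            tray = ($2 == 0) ? "right" : "left"
            printf "IOU tray: %s, PCI slot: %d (0 is the bottom slot)\n", tray, $4
        }'
}

describe_slot 'iou#0-pci#4'
```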

Cori said...

You can use:
luxadm qlgc