SAN Loop Failure

Here’s an example of how important it is for your VMware Engineers to either have visibility into the SAN infrastructure or work very closely with the SAN Admins.

We lost a fiber loop on one of our NetApp FAS3160 SANs yesterday. If my team had not received the critical failover emails, we would not have known for a couple of vital hours that there were storage issues.

When your customers are complaining that there is something wrong with their VMs, or that they have lost access to them, it is imperative to investigate and start troubleshooting. If we had started troubleshooting without knowledge of the SAN issue, we would have started working on the VMs, which would have quickly led back to the ESX hosts they resided on. Troubleshooting the ESX hosts could have made our outage A LOT worse.

This particular environment consists of vCenter Server 4.1 and eight ESX 4.0 U2 hosts. There happens to be a bug in ESX 4.x that occurs when a rescan is issued while an all-paths-down state exists for any LUN in the vCenter Server cluster. As a result, a virtual machine on one LUN stops responding (temporarily or permanently) because a different LUN in the vCenter Server cluster is in an all-paths-down state. (Yeah, I cut and pasted that from the VMware KB, which you can read here.) The KB also mentions that the bug was fixed in ESX 4.1 U1.
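
If you are not sure whether a host already carries the fix, the service console will report the exact version and build. This is just a quick sanity check; the host prompt shown is a placeholder for whatever yours happens to be:

[root@esxhost]# vmware -v

The output lists the ESX version and build number, which you can then compare against the build the KB lists for ESX 4.1 U1.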

Since we were receiving the outage emails, we knew that something was up with our storage. This allowed us to work closely with our storage admins to understand the full extent of the outage.

The details of our recovery go something like this: a fail-over from Node A to Node B was made during the outage; however, Node B did not have access to the failed loop, so the aggregate on that loop was down. Node B carried the load for all of the other working aggregates, giving our storage guys and the NetApp technician time to work on the loop. When repairs were completed, a fail-back (giveback) was done to allow Node A to take back over its share of the load. We confirmed with the NetApp tech that all paths and LUNs were presented to the ESX hosts. We then rescanned each ESX host in the cluster so that the hosts would recognize the downed LUNs once more. After the scan, we viewed the properties of each LUN to ensure all paths were up. Once that was verified, we QA’d our VMs. Of 45 affected VMs, we had one casualty: the VMX file of one VM was corrupted.
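
For reference, the rescan and path check can also be done from the ESX service console instead of the vSphere Client. This is only a sketch; the adapter name (vmhba1) and the host prompt are placeholders for whatever your environment actually uses:

[root@esxhost]# esxcfg-rescan vmhba1
[root@esxhost]# esxcfg-mpath -b
[root@esxhost]# esxcfg-scsidevs -m

esxcfg-rescan rescans a single HBA, esxcfg-mpath -b gives a brief listing of the paths to each device so you can see their state, and esxcfg-scsidevs -m maps the VMFS volumes back to their devices so you can confirm the datastores are visible again.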

The situation could have been worse, much worse. But I’m very glad that we stayed smart and calm and worked closely with our storage admins.

NetApp nSANity

I was asked a few days ago to run a tool against one of my ESX hosts. The tool is called nSANity and was supplied by my NetApp vendor. The application is designed to collect details of all SAN and fibre connectivity for end-to-end diagnosis, and it packages that data into an XML report. I was asked to run it so that data could be collected on a remote site and how it currently connects to one of our NetApp SANs, in preparation for a SAN data move to be performed by NetApp Professional Services.

Fig 1

The link for the software was supplied and I went to retrieve the tool. I’m providing the link here, but you will have to be a NetApp NOW member to download it. I got to the URL, started reading, and noticed this right off the bat (see Fig 1). Hmm… NetApp wants me to run this tool that they provide, yet they cannot support you if you’re having trouble with it. Which I read as, “if this poops your machine, you’re on your own.”

Further reading gave me no indication of what kind of data would be collected. In addition, the documentation mentions that you have to run the tool against the ESX host over SSH as root. My company’s environment does not allow root access over SSH. It falls under the “No Fly Zone.” It’s forbidden, taboo, etc. Unfortunately, nSANity cannot be run locally on the ESX host, so I had to get approval from Congress to allow root over SSH. (Ok, that’s an exaggeration, but that’s what it feels like.) Once I had the green light, I broke out the fire extinguishers and prepared for the rapture of this ESX guinea pig.
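
For anyone in the same boat, temporarily allowing root over SSH on a classic ESX 4.x host is just a change to the service console’s sshd configuration. This is a sketch, assuming you’re working from the physical console or from a non-root account with su, since root can’t SSH in yet:

[root@esxhost]# vi /etc/ssh/sshd_config      (change PermitRootLogin to yes)
[root@esxhost]# service sshd restart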

The Windows version of the tool runs very well on Windows 7. To fill in the blanks on how I ran it, I opened a command prompt (Run as Administrator) and used the following command:

c:\> nsanity -d c:\temp vmware://root:*@esxhost.domain.com

The app runs for approximately ten minutes and collects configuration data on firewall rules, virtual switches, virtual NICs, drives, CPU, processes, lspci, HBAs, etc. Once the data was collected, root access over SSH had to be plugged back up. The tool seems harmless to the environment, and the data collected doesn’t appear to pose any security risk.
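
Closing that hole is just the reverse of opening it; again a sketch, assuming the same classic ESX service console:

[root@esxhost]# vi /etc/ssh/sshd_config      (set PermitRootLogin back to no)
[root@esxhost]# service sshd restart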

If you have to run nSANity and aren’t familiar with the tool, I hope my experience with it provides some insight into how it works.