Disable Flow Control

HP C7000 Blade Chassis
I was onsite this past weekend doing a vSphere 5.1 turnkey install. The turnkey consisted of a C7000 Blade Chassis and two ESXi 5.1 U3 hosts. With this new installation, the customer was adding a new 10GbE environment to his network. There were also a few settings that needed to be changed, specifically disabling Flow Control on the ESXi hosts for the Mgmt, Data, and vMotion networks.

The Flex-10 Virtual Connect modules in the chassis do not offer a way to disable flow control, so each ESXi host would auto-negotiate with Flow Control on for both transmit and receive. Personally, I’ve never had to worry about Flow Control within the environments that I’ve set up. But on this engagement, I was working side by side with a guy who lives, breathes, and retransmits network. Trust me, he knows his stuff. It was his recommendation to disable Flow Control. Realizing I had to figure this out quickly, I turned to the number one tool in my toolbag – Professor Google.

Flexing my “Google-Fu”, I found my answer right away: VMware KB Article 1013413. I opened an SSH session and ran this command against each uplink (substituting the real vmnic name for vmnicX):

# ethtool --pause vmnicX tx off rx off

That’s great! One problem: when I reboot, Flow Control is going to re-enable itself because of the auto-negotiate. So how do I make this persistent? Another well-aimed Google search brought me to VMware KB Article 2043564. Once this change was applied, reboots did not re-enable Flow Control. My network guy was happy, and hey – I learned something new.
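I’ll defer to the KB for the supported procedure, but as a rough sketch of the general idea (not necessarily the KB’s exact method): on ESXi 5.x, commands placed in /etc/rc.local.d/local.sh above its final “exit 0” get re-run at every boot, so lines like these would re-apply the setting. Here vmnic0 and vmnic1 are just placeholders for whichever uplinks carry Mgmt, Data, and vMotion.

# re-disable flow control at boot, since auto-negotiation turns it back on
ethtool --pause vmnic0 tx off rx off
ethtool --pause vmnic1 tx off rx off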



So the next question, then, is how do I add this change to a kickstart file? Hmm… Stay tuned, folks. I’ll see if there’s a way to do that.
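If I had to guess at a starting point, it would be an untested %firstboot stanza in the kickstart that folds the same commands into /etc/rc.local.d/local.sh so they run on every boot. Again, vmnic0 and vmnic1 are placeholders for the real uplinks, and this is a sketch, not something I’ve tested yet.

%firstboot --interpreter=busybox
# sketch: rebuild local.sh with the ethtool commands ahead of its final "exit 0"
grep -v '^exit 0' /etc/rc.local.d/local.sh > /tmp/local.sh
echo 'ethtool --pause vmnic0 tx off rx off' >> /tmp/local.sh
echo 'ethtool --pause vmnic1 tx off rx off' >> /tmp/local.sh
echo 'exit 0' >> /tmp/local.sh
mv /tmp/local.sh /etc/rc.local.d/local.sh
chmod +x /etc/rc.local.d/local.sh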

VMware VirtualCenter Server Service hung on “starting”

Discovered something interesting earlier today. I went to work in vCenter and found that it was unresponsive. The machine had recently “auto-installed” a patch and rebooted, so I first thought that maybe the vCenter service hadn’t started. I opened services.msc and found the VMware VirtualCenter Server Service stuck on “starting”. I had no way to stop or restart it. I rebooted the machine and hoped that the ‘universal’ fix would work – no gas.


Maximum Switchover Timeout

I recently ran into an issue where I had to svmotion some rather large VMs (1–2 TB) that stretched over multiple datastores. During the svmotion, the VMs would time out at various percentages, presenting this error:
svmotion timeout error (screenshot)

Consulting with Prof. G (Google) turned up VMware KB Article 1010045. That article states: “This timeout occurs when the maximum amount of time for switchover to the destination is exceeded. This may occur if there are a large number of provisioning, migration, or power operations occurring on the same datastore as the Storage vMotion. The virtual machine’s disk files are reopened during this time, so disk performance issues or large numbers of disks may lead to timeouts.” Yep, this was me. I was having to svmotion VMs from one datastore to another during a vSphere 5 upgrade.

The KB article discusses adding a timeout value, “fsr.maxSwitchoverSeconds”, to the VM’s VMX file to prevent the timeout.

To modify the fsr.maxSwitchoverSeconds option using the vSphere Client:

1.) Open vSphere Client and connect to the ESX/ESXi host or to vCenter Server.
2.) Locate the virtual machine in the inventory.
3.) Power off the virtual machine.
4.) Right-click the virtual machine and click Edit Settings.
5.) Click the Options tab.
6.) Select the Advanced: General section.
7.) Click the Configuration Parameters button.

Note: The Configuration Parameters button is disabled when the virtual machine is powered on.

8.) From the Configuration Parameters window, click Add Row.
9.) In the Name field, enter the parameter name:

fsr.maxSwitchoverSeconds

10.) In the Value field, enter the new timeout value in seconds (for example: 150).
(I chose a value of 200.)
11.) Click the OK buttons twice to save the configuration change.
12.) Power on the virtual machine.
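For reference, once the change is saved, the parameter ends up in the VM’s .vmx file as a single line (the 200 below is simply the value I chose):

fsr.maxSwitchoverSeconds = "200"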

From personal experience, this was a home run. It resolved my problem.

Missing NIC

Had an interesting event yesterday. A new application build was underway that wasn’t going very smoothly. Snapshots were being made and reverted rather frequently. After one snapshot reversion, progress on the install was being made. Approximately three hours went by, and then a high-priority ticket was raised. Something had gone wrong, and no one could access the VM. It was unresponsive to pings, RDP, etc. Eventually, it was discovered that the NIC was missing. Once the NIC was re-added, access to the VM was restored and the installation group was off and running.

A forensic investigation was conducted into the root cause of the missing NIC. It was suggested that one of the snapshots was corrupted. The event logs within vCenter were vague – they provided the timeline of what had been occurring, but failed to indicate what had transpired with the NIC. I downloaded the VM’s logs from the datastore and examined them. I wanted to see if there had been problems during a snapshot capture or a snapshot reversion. There were no problems or failures with snapshot creation or any rollbacks of the snapshots. I did, however, come across an entry that had me puzzled.

E1000: Syncing with mode 6.

Of course, Professor Google was up for my challenge and provided me with this bit of info within the VMware Communities: Network card is removed without any user intervention. I struck gold. One of the commenters, “NMNeto”, had hit the nail on the head with this comment:
“The problem is not in VMWARE ESX, it is in the hot plug technology of the network adapter. This device (NIC) can be removed by the user like a usb drive. … When you click to remove the NIC, VMWare ESX removes this device from the virtual machine hardware.”

“Safely Remove Hardware” icon (screenshot)

From here, I had a strong lead. Now that I knew what I was looking for, I opened the Virtual Machine’s Event Viewer and started itemizing each entry – looking for the device removal. After about five minutes, I found the who, when, where, and how. I felt like I was winning a game of “Clue”.

“NMNeto” had also posted an adjoining link within the community to a thread with a resolution that prevents the issue going forward. That thread gives step-by-step instructions on how to prevent this on other VMs; the gist is sketched below.
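If I recall the thread correctly (and I’d verify against that link before rolling anything out), the prevention comes down to a single advanced configuration parameter that disables hot-plug for the VM’s devices, added the same way as the fsr.maxSwitchoverSeconds parameter above (VM powered off, Configuration Parameters, Add Row):

devices.hotplug = "false"

With hot-plug off, the NIC should no longer show up under the guest’s “Safely Remove Hardware” icon, so nobody can eject it like a USB drive.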

I will take this data and propose a change to add this entry to our VMs so that this event does not recur.

SAN Loop Failure

Here’s an example of how important it is for your VMware Engineers to either have visibility into the SAN infrastructure, or work very intimately with the SAN Admins.

We lost a fiber loop on one of our NetApp FAS3160 SANs yesterday. If my team had not received the critical failover emails, we would not have known for a couple of vital hours that there were storage issues.

When your customers are complaining that there is something wrong with their VMs, or that they have lost access to them, it is imperative to investigate and start troubleshooting. If we had started troubleshooting without knowledge of the SAN issue, we would have started working on the VMs, which would have quickly led back to the ESX hosts they resided on. Troubleshooting the ESX hosts could have potentially made our outage A LOT worse.

This particular environment consists of vCenter Server 4.1 and eight ESX 4.0 U2 hosts. There happens to be a bug in ESX 4.x that occurs when a rescan is issued while an all-paths-down state exists for any LUN in the vCenter Server cluster. As a result, a virtual machine on one LUN stops responding (temporarily or permanently) because a different LUN in the cluster is in an all-paths-down state. (Yeah, I cut and pasted that from the VMware KB, which you can read here.) The KB also mentions that the bug was fixed in ESX 4.1 U1.

Since we were receiving the outage emails, we knew that something was up with our storage. This allowed us to work closely with our storage admins to understand the full extent of the outage.

The details of our recovery go something like this: a failover from Node A to Node B was made during the outage; however, Node B did not have access to the failed loop, so the aggregate on that loop was down. Node B carried the load for all of the other working aggregates, giving our storage guys and the NetApp technician time to work on the loop. When repairs were completed, a failback (giveback) was done to allow Node A to take back over its share of the load. We confirmed with the NetApp tech that all paths and LUNs were presented to the ESX hosts. We then went in and rescanned each ESX host in the cluster so the hosts would recognize the downed LUNs once more. After the scan, we viewed the properties of each LUN to ensure all paths were up. Once that was verified, we QA’d our VMs. Of 45 affected VMs, we had one casualty: the VMX file of one VM got corrupted.
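For reference, the rescan itself can be kicked off from the vSphere Client under Configuration > Storage Adapters, or from the ESX 4.x service console with something like the command below, repeated for each fibre channel adapter (the vmhba number will vary from host to host):

# rescan this adapter so the host picks the LUNs back up
esxcfg-rescan vmhba1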

The situation could have been worse, much worse. But I’m very glad that we stayed smart and calm and worked closely with our storage admins.