Sunday, December 22, 2013

VMFS heap depletion

Over the past couple of days, we've had a VM that has crashed a number of times. When you try to open the VM console you get a black screen and a yellow MKS error at the top of the console. Strangely enough the VM can still be pinged. After powering it off and on again it boots but after not too long the same thing happens again. Also, vMotion did not work for a number of VMs and fails with the following error:

"The operation is not allowed in the current state"

In the vmkwarning log there are the following entries:

"WARNING: HBX: 1889: Failed to initialize VMFS3 distributed locking on volume 50d136be-62d92875-869a-10604bace2cc: Out of memory"

"WARNING: Fil3: 2034: Failed to reserve volume f530 28 1 50d136be 62d92875 6010869a cce2ac4b 0 0 0 0 0 0 0"

"WARNING: Heap: 2525: Heap vmfs3 already at its maximum size. Cannot expand."

"WARNING: Heap: 2900: Heap_Align(vmfs3, 6160/6160 bytes, 8 align) failed.  caller: 0x41800d8e84e9"

I found a KB article and a post from Cormac Hogan that explains the issue.

In ESXi 5.0 U1 the default VMFS heap size is set to 80 which means that the maximum total size of open vmdk files is 8 TB. When that limit is reached, then VMs can't access their disks.

There are two ways to fix this:

  • Upgrade to ESXi 5.1 U1 (or ESX 5.0 patch 5)
  • Increase the VMFS3.MaxHeapSizeMB to 256 (default is 80) in Configuration -> Advanced Settings and reboot the host

Upgrading to ESXi 5.1 U1 increases the maximum file total size of open vmdk's to 60 TB in stead of 8 TB.

Increasing the heap size to 256 will increase the maximum to 25 TB.




Monday, July 22, 2013

Practical commands for the ESXi console

For troubleshooting on ESX(i), I always end up in the console even on ESXi 5.x. There are a number of practical commands that I can usually remember, but not always. So here they are for reference in random order:

# tail -n 50 'file name'
Shows last 50 lines of a file

# tail -f 'file name'
Outputs continuously what is being written to the file

# more 'file name' | grep -C 10 'search string'
Outputs the line with the word you search for including 10 lines on each side of the entry.

# less 'file name'
A good alternative to 'more' and 'cat'. Lets you navigate back and forth with the keyboard arrows. Use 'w' for page up and 'z' for page down

# find / -name 'search string'
Search for something. Further described in this post.

# find -iname "*-flat.vmdk" -mtime +7 |  xargs ls –alh
Finds files older than 7 days and list them including when they've last been changed. See this post for more info on xargs.

Using the vi text editor. See this post.

Typen characters with ASCII code (hold the Alt key while inputting the number on the numeric keyboard). See more here.
@ - Alt+64
| - Alt+124

# esxcfg-scsidevs –a
# esxcli storage core adapter list
Both commands show info on the SCSI controller type, HBA type, WWNs. Here's more info.

# esxcfg-scsidevs –l
# esxcli storage core device list
Both commands show various info on LUNs including exact size

# esxcfg-mpath -l
# esxcli storage core path list
Shows info about the storage paths. Will show naa device id, LUN id, and state of the paths. You can grep for the word 'dead' for finding dead paths.

# dcui
Will show the yellow/grey ESXi console menu in a Putty session

# esxcli software vib list
# esxcli software vib update -v 'VIB file name'
Updates the VIB package. See here for more info

# pwd
Show working directory

# passwd
Change password for current user (can be used for root as well)

# date
Show date and time.


Friday, July 19, 2013

False CIM warning on the SCSI controller - vSphere 5.

We have had a number of CIM warnings in the Hardware Status tab in vCenter on the ESXi 5 hosts related to the SCSI controller. We have seen it on HP servers with a P220i and a P410i SCSI controller.



However, there is nothing wrong with the SCSI controller.

Turns out it is a false alarm generated by the hp-smx-provider and it can be fixed by applying a driver update (which effectively changes the hp-smx-provider with the hp-smx-limited driver, both are CIM drivers). The difference between the two is, to my understanding, that with the limited driver you cannot update the ILO firmware via the ESXi console.

There are two ways of installing the four VIBs in the driver package. Either you can uninstall the existing four VIBs and then install the driver package, the zip file, which will install all four VIBs. This requires two reboots.

Or you can unpack the zip file and update only the VIBs that are actually newer than the ones you already have installed. In our case, we only needed to update two of the VIBs. This requires one reboot.

Method 1 (uninstall/install)

List the relevant VIBs

# esxcli software vib list | grep hp

Remove the installed VIBs

# esxcli software vib remove -n hp-smx-provider
# esxcli software vib remove -n char-hpcru
# esxcli software vib remove -n char-hpilo
# esxcli software vib remove -n hp-ams

Reboot

Download the driver package fromt the HP Support Center. The file is called "hp-ams-esxi5.0-bundle-9.3.5-3.zip".

Upload it to the ESXi host and place it in /tmp/.

# cd /tmp/


# esxcli software vib install -d /tmp/hp-ams-esxi5.0-bundle-9.3.5-3.zip


Reboot

Verify that the new VIBs have been installed

# esxcli software vib list | grep hp

Method 2 (update)

List the relevant VIBs

# esxcli software vib list | grep hp

Unpack the "hp-ams-esxi5.0-bundle-9.3.5-3.zip" file and copy the following VIB files to the /var/log/vmware folder:

  • hp-ams-esx-500.9.3.5-02.434156.vib
  • hp-smx-limited-500.03.02.10.3-434156.vib


If you unsure about whether the other two VIBs are newer than the ones already installed, then just copy all four VIBs.

Update the VIBs

# esxcli software vib update -v hp-ams-esx-500.9.3.5-02.434156.vib
# esxcli software vib update -v hp-smx-limited-500.03.02.10.3-434156.vib



Reboot

Verify that the new VIBs have been installed

# esxcli software vib list | grep hp

When the host comes back online after the reboot it will still have the warning. Leave it for about 10-15 minutes and the warning will disappear. Also, you can try to reset the sensors and refresh the page.

Thanks to Anders Mikkelsen for finding this fix and writing the guide to method 1 above.

Thursday, July 18, 2013

Updating the firmware and driver on an HBA on ESXi 5.0

When updating firmware on an HBA, it will update the driver at the same time as both parts are included in the installation VIB file.

First step is to verify the type of HBA in your ESXi host:

Run the following command:

# esxcfg-scsidevs -a

Or see this KB article for more info

In the example below, the HBA is a Qlogic ISP2532 PCI-Express



However, this name will not always be used to (at least not in the case of HP) in the documentation when you're looking for a specific model.

Another way to get info on the HBA type can be combined with locating the firware version (see next step)

Next step is to identify the current firmware level of the HBA:

For a Qlogic HBA, use the following command:

# more /proc/scsi/qla2xxx/0
(The '0' can also be a '1' or a '2')

For other types of HBAs, see this KB article


In the screendump above, you can see firmware and driver version and also another name for the HBA, QMH2562.

Now that we know the model and firmware version, we can look for the latest recommended version.

Go to this KB article to find the latest version.

For HP, it takes you to the following site: http://vibsdepot.hp.com/hpq/recipes/ where you should choose the PDF called: "HP-VMware-Recipe.pdf" which is the latest one.

In the PDF, you can search for you model, in this case QMH2562:


Find your model and follow the link. It will let you download a zip file. In this example the file is called:

qla2xxx-934.5.6.0-887798.zip

Unpack the zip file and locate the .vib file which will be used for the update.

In this case the .vib file is called:

scsi-qla2xxx-934.5.6.0-1OEM.500.0.0.472560.x86_64.vib

Now, transfer the .vib file to the following folder on the ESXi host:

/var/log/vmware

Run the following command:

# esxcli software vib update -v scsi-qla2xxx-934.5.6.0-1OEM.500.0.0.472560.x86_64.vib


Reboot to finish updating.

And then verify that the update has properly installed by re-running:

# more /proc/scsi/qla2xxx/0

You can also get detailed VIB package info by running:

# esxcli software vib list | grep scsi
# esxcli software vib get -n scsi-qla2xxx


Wednesday, July 17, 2013

Locating orphaned vmdk's

When you have had a VMware environment running for several years and have had many admins interacting with this environment, there is a fair chance that there will be a number of orphaned vmdk's on the LUNs which are taking up valuable space.

I found a post on yellow-bricks.com explaining two different ways of locating these vmdk's:

The first option is to simply look for all flat files that haven't been changed during the past seven days. This is easy and it works on ESXi 5.0. However, be careful before you start deleting files as a powered off VM, for example, also will show up in such a search (as well as probably also VMs with snapshots).

# find -iname "*-flat.vmdk" -mtime +7


Alternatively, you can pipe the output of the find command to ls see when each file has been modifed:

# find -iname "*-flat.vmdk" -mtime +7 | xargs ls –alh

And if you feel very safe that all the files can be deleted, you delete them all at once:

# find -iname "*-flat.vmdk" -mtime +7 | xargs rm -f

The second option which is a more thorough way of doing it is to run a powershell script (can also be found in the post) which looks for vmdk's which are not registered to a .vmx file. The script has been slightly modified by Danni Finne and can be found here:

Thursday, July 11, 2013

ESXi 5.0 host disconnects due to CPU crossdup error

If a host is disconnected in vCenter and you cannot reconnect the host and you see the following entries in the /var/log/vmkwarning file:

2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 13, /vmfs/devices/char/vob/External type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 14, /vmfs/devices/char/vob/iScsi type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 15, /vmfs/devices/char/vob/Migrate type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 16, /vmfs/devices/char/vob/PageReti type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 17, /vmfs/devices/char/vob/Visorfs type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 18, /vmfs/devices/char/vob/Hardware type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 19, /vmfs/devices/char/vob/Vfat type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 3232: Unimplemented operation on 0x410028907210/SOCKET_UNIX_SERVER


It can be fixed by following this KB:


So, SSH to the host and edit the file (/etc/vmware/vpxa/vpxa.cfg) using vi text editor.

The following line should be changed, the value should be increased to 256 from originally 128:

'ThreadStackSizeKb'256'/ThreadStackSizeKb'

Vpxa.cfg is read-only so you can change it with:

# chmod 744 vpxa.cfg

And when done – to get file back to read only:

# chmod 444 vpxa.cfg

Restart the vpxa service and reconnect the host.

# /etc/init.d/vpxa restart

Manually recreating a vmdk descriptor file

Yesterday, I had an svMotion that crashed as the vCenter server (vSphere 5.0 u1) was rebooted while the svMotion task was in progress. Afterwards, a copy of the VM named "servername (2)" had been created in vCenter and a number of empty vmdk files were created at the destination LUNs. There were a file lock on these files so they could not be deleted. and I wasn't able to vMotion, snapshot or continue the svMotion on the source VM.

I tried moving just one of the vmdks with svMotion which resulted in the VM crashing and after that it wouldn't power back on.

Below is the error:


The error states that it cannot find one of the vmdk's. I looked in the relevant folder and discovered that the vmdk descriptor file was missing, the folder only contained the vmdk-flat file (the data file).

Under Edit Settings for the VM, the disk was still there but its size was set to 0 MB, see below:



I found a KB from VMware on how to recreate this descriptor file. And it worked fine.

Loosely, the steps were the following (the KB explains it in details):

Identify scsi controller type, lsilogic in this case.

Identify the size of the vmdk-flat file

Create the vmdk as a thin vmdk

Delete the empty vmdk data file

Edit the vmdk descriptor file with the vi text editor. This included the vmdk flat file name, the vHW version, and deleting the line about thin provisioning.

Verify the consistency of the descriptor with vmkfstools -e:



Thursday, March 14, 2013

vMotion error at 63% due to CBT file lock

A number of times in the past couple of years, we've had issues with vMotion on ESX 4.1 which happened after storage/SAN breakdowns/issues. ESX doesn't handle losing its storage very well and this can create locks on the VMs that can only be fix by rebooting the host (and shutting the hung VMs down first).

However, the other day I experienced the same sort of error on a ESXi 5.0 cluster which had not had any storage issues. This is quite inconvenient when you can't put a host into maintenance mode.

When initiating a vMotion, the VM fails at 63% with the following error:

"The VM failed to resume on the destination during early power on. 
Reason: Could not open/create change tracking file.
Cannot open the disk '/vmfs/volumes/xxxxxx/vmname.vmdk' or one of the snapshot disks it depends on"

It should be mentioned that for this customer we use Symantec Netbackup 7.5 with agentless .vmdk backup. To speed up the backup process we have enabled Changed Block Tracking (CBT) on the VMs.

I found this KB article but it only related to ESX 4.0 and 4.1 and also the suggestion is to just disable CBT which is not an option.

After a talk with VMware Support, we found the error.

It turns out that there is a lock on one or more of the .ctk files which are the files that keep track of changes to the .vmdks. These ctk files are created automatically when CBT is enabled. If one or more of these files are deleted, they will be recreated automatically.
In a normal setup, the .ctk files will only be locked for a few seconds when the backup software accesses the file.

The error looks like this:



To fix it, do the following:

Putty to one of the ESX hosts (remember to enable SSH under security profiles first).
Cd to the directory of the .vmx file

List all the .ctk files:

#ls -al | grep ctk

For each ctk file, verify whether the file has a lock

#vmkfstools -D vmname-ctk.vmdk

look for "mode" in the output. If it is "mode 0" your fine. If "mode 1" there's a lock. For "mode 2" something is completely wrong...


If you find a lock on a file, create a tmp directory and move the ctk file there (do this for all ctk's with locks):

#mkdir tmp

#mv vmname-ctk tmp

This will also work when the VM is powered on.

And you're done. After this, the VM will vMotion without failing.

This has been tested and works both on a ESX 4.1 classic cluster (where I had the same issue) and ESXi 5.

The VMware engineer could not give me an exact root cause but he was fairly sure that it was related to the backup software and that something had gone wrong while this software has been accessing these files.

Tuesday, March 12, 2013

Locate WWN from console on ESXi 5.x

Sometimes for urgent cases it can be necessary to obtain the HBA's WWN's to get the storage zoned before the network configurations are done and the hosts are online. On Blade servers, this info can be found on the enclosure OA but for rack mounted servers you can only get it from the console.

From the console (press Alt-F1 at the console, remember to enable shell access first under troubleshooting) login as root and run the following command:

# esxcfg-scsidevs -a

look for the lines starting with vmhba1 and vmhba2 (vmhba0 is typically the scsi controller) and the fc.XXX:XXX. The last numbers after the ":" is the WWN (see screen dump below)


Monday, February 25, 2013

BL460c G6 automatically powers on when shut down

Multiple times I've experienced that an ESX host automatically boots when you shut it down. This is fairly annoying when you try to shut it down to, for exampel, have a memory stick replaced by the hardware guys - just to find out that it powers itself back on after a couple of minutes.

I've seen it before but haven't really spend much energy on it. However, this other day we had a very consistent example where the blade server powered itself everytime you shut it down - or powered it off - after a couple of minutes.

Any 'auto power on' features in the ILO and enclosure OA were disabled. Also ASR was disabled in the BIOS.

The culprit turned out to be Wake-On LAN in the BIOS. As soon as this feature was disabled, the blade server stayed powered off. As far as I know we don't have any devices on the network broadcasting magic packets, but it still happened. As long as you're not using the DPM setting, it should be safe to turn off the WOL feature.





Friday, January 18, 2013

Disabling cores in BIOS for BL460c Gen8

Due mainly to licensing rules imposed by Oracle and Microsoft, there is an increasing demand for either locking VMs to specific hosts (like with VM-host-affinity rules) or for decreasing the number of physical CPUs or logical cores in the ESX hosts.

For HP hardware it is possible to order Blade servers with 2, 4, 6, or 8 cores - at least for BL460c Gen8. But in my company, we like to keep things as standard as possible, not having too many different hardware models.

As per Gen8, it is possible to disable a given number of cores in the BIOS. It has to be increased/decreased in pairs from 1 to 8. So as a minimum you can have 1 core enabled on each CPU. It is not possible to deactivate one of the physical CPUs.