Thursday, March 14, 2013

vMotion error at 63% due to CBT file lock

A number of times in the past couple of years, we've had vMotion issues on ESX 4.1 following storage/SAN outages. ESX doesn't handle losing its storage very well, and this can create locks on the VMs that can only be fixed by shutting down the hung VMs and then rebooting the host.

However, the other day I experienced the same sort of error on an ESXi 5.0 cluster which had not had any storage issues. This is quite inconvenient, as it means you can't put a host into maintenance mode.

When initiating a vMotion, the VM fails at 63% with the following error:

"The VM failed to resume on the destination during early power on. 
Reason: Could not open/create change tracking file.
Cannot open the disk '/vmfs/volumes/xxxxxx/vmname.vmdk' or one of the snapshot disks it depends on"

It should be mentioned that for this customer we use Symantec Netbackup 7.5 with agentless .vmdk backup. To speed up the backup process we have enabled Changed Block Tracking (CBT) on the VMs.

I found this KB article, but it only relates to ESX 4.0 and 4.1, and its suggestion is simply to disable CBT, which is not an option for us.

After a talk with VMware Support, we found the error.

It turns out that there is a lock on one or more of the .ctk files which are the files that keep track of changes to the .vmdks. These ctk files are created automatically when CBT is enabled. If one or more of these files are deleted, they will be recreated automatically.
In a normal setup, the .ctk files will only be locked for a few seconds when the backup software accesses the file.

The error looks like this:



To fix it, do the following:

Putty to one of the ESX hosts (remember to enable SSH under security profiles first).
Cd to the directory of the .vmx file

List all the .ctk files:

#ls -al | grep ctk

For each ctk file, verify whether the file has a lock

#vmkfstools -D vmname-ctk.vmdk

Look for "mode" in the output. If it is "mode 0", you're fine. If it is "mode 1", there's a lock. For "mode 2", something is completely wrong...
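The per-file check above can be looped over all ctk files in the VM directory. This is only a rough sketch: it assumes that vmkfstools -D prints its lock information to the terminal and that the lock line contains a fragment such as "mode 1". The function names ctk_lock_mode and list_ctk_modes are made up for illustration.

```shell
#!/bin/sh
# Extract the lock mode number from `vmkfstools -D` output read on stdin.
# Assumption: the output contains a fragment like "mode 1" on the lock line.
ctk_lock_mode() {
    sed -n 's/.*mode \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print every *-ctk.vmdk file in the current directory with its lock mode.
# "unknown" is printed when no mode could be parsed from the output.
list_ctk_modes() {
    for f in *-ctk.vmdk; do
        [ -e "$f" ] || continue
        mode=$(vmkfstools -D "$f" 2>&1 | ctk_lock_mode)
        echo "$f mode=${mode:-unknown}"
    done
}
```

Run list_ctk_modes from the directory of the .vmx file; any file that does not report mode=0 is a candidate for the fix below.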


If you find a lock on a file, create a tmp directory and move the ctk file there (do this for all ctk files with locks):

#mkdir tmp

#mv vmname-ctk.vmdk tmp

This will also work when the VM is powered on.
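For a VM with many disks, the check and the move can be combined. The following is a sketch under the same assumption about the "mode N" fragment in the vmkfstools -D output; move_locked_ctk is a hypothetical helper name, and files whose mode cannot be parsed are deliberately left alone.

```shell
#!/bin/sh
# Move every *-ctk.vmdk file with a non-zero lock mode into a tmp
# subdirectory of the given VM directory (default: current directory).
# Assumption: `vmkfstools -D` output contains a fragment like "mode 1".
move_locked_ctk() {
    dir=${1:-.}
    mkdir -p "$dir/tmp"
    for f in "$dir"/*-ctk.vmdk; do
        [ -e "$f" ] || continue
        mode=$(vmkfstools -D "$f" 2>&1 |
            sed -n 's/.*mode \([0-9][0-9]*\).*/\1/p' | head -n 1)
        # Only move files that clearly report a lock; skip unknown results.
        if [ -n "$mode" ] && [ "$mode" != "0" ]; then
            echo "moving $f (mode $mode)"
            mv "$f" "$dir/tmp/"
        fi
    done
}
```

Call it as move_locked_ctk /vmfs/volumes/xxxxxx/vmname, or without an argument from inside the VM directory.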

And you're done. After this, the VM will vMotion without failing.

This has been tested and works both on an ESX 4.1 classic cluster (where I had the same issue) and on ESXi 5.

The VMware engineer could not give me an exact root cause, but he was fairly sure that it was related to the backup software and that something had gone wrong while the software was accessing these files.

Tuesday, March 12, 2013

Locate WWN from console on ESXi 5.x

Sometimes, in urgent cases, it can be necessary to obtain the HBAs' WWNs to get the storage zoned before the network configuration is done and the hosts are online. On blade servers, this info can be found on the enclosure OA, but for rack-mounted servers you can only get it from the console.

From the console (press Alt-F1 at the console, remember to enable shell access first under troubleshooting), log in as root and run the following command:

# esxcfg-scsidevs -a

Look for the lines starting with vmhba1 and vmhba2 (vmhba0 is typically the SCSI controller) and the fc.XXX:XXX entry. The number after the ":" is the WWN (see screen dump below).
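To avoid reading the numbers off the screen by hand, the output can be filtered. This is a minimal sketch that assumes each FC adapter line starts with the vmhba name and contains a whitespace-separated fc.XXX:XXX field; wwn_list is a hypothetical helper name.

```shell
#!/bin/sh
# Read `esxcfg-scsidevs -a` output on stdin and print "<adapter> <WWN>"
# for every line that has an fc.<hex>:<hex> field.
# Assumption: the adapter name is the first whitespace-separated column.
wwn_list() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^fc\.[0-9a-fA-F]+:[0-9a-fA-F]+$/) {
                split($i, parts, ":")   # parts[2] is the number after ":"
                print $1, parts[2]
            }
    }'
}
```

On the host it would be used as: esxcfg-scsidevs -a | wwn_list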


Monday, February 25, 2013

BL460c G6 automatically powers on when shut down

Multiple times I've experienced an ESX host automatically booting again after being shut down. This is fairly annoying when you shut it down to, for example, have a memory stick replaced by the hardware guys, only to find that it powers itself back on after a couple of minutes.

I've seen it before but haven't really spent much energy on it. However, the other day we had a very consistent example where the blade server powered itself back on after a couple of minutes every time it was shut down or powered off.

Any 'auto power on' features in the ILO and enclosure OA were disabled. Also ASR was disabled in the BIOS.

The culprit turned out to be Wake-On LAN in the BIOS. As soon as this feature was disabled, the blade server stayed powered off. As far as I know we don't have any devices on the network broadcasting magic packets, but it still happened. As long as you're not using the DPM setting, it should be safe to turn off the WOL feature.





Friday, January 18, 2013

Disabling cores in BIOS for BL460c Gen8

Due mainly to licensing rules imposed by Oracle and Microsoft, there is an increasing demand for either locking VMs to specific hosts (like with VM-host-affinity rules) or for decreasing the number of physical CPUs or logical cores in the ESX hosts.

For HP hardware it is possible to order Blade servers with 2, 4, 6, or 8 cores - at least for BL460c Gen8. But in my company, we like to keep things as standard as possible, not having too many different hardware models.

As of Gen8, it is possible to disable a given number of cores in the BIOS. It has to be increased/decreased in pairs from 1 to 8. So as a minimum you can have 1 core enabled on each CPU. It is not possible to deactivate one of the physical CPUs.


Thursday, October 4, 2012

Passed the VCP5 exam today!

I finally got around to taking the VCP5 exam today and passed with 472 out of 500 points (94%). That's one more for the collection, VCP3-4-5, not too shabby! I should go out and buy something...


Sunday, September 16, 2012

Most important new features in vSphere 5.1

I was going over the "What's new in vSphere 5.1" sheet and wanted to point out what are, from an operational standpoint, the most important changes.


  1. Improved vMotion which lets you vMotion even without having shared storage (vMotion+svMotion). This is described in this post. For customer transition projects, this can probably come in handy.
  2. vSphere web client: This is now the default interface for managing vSphere - it will probably take a little getting used to for the server admins.
  3. Zero-downtime upgrade for VMware tools: Not having to reboot the VMs after tools upgrade is a big step forward (as an IT service provider, it can be close to impossible getting a maintenance window for all your VMs)
  4. Larger VMs - up to 64 vCPUs (you will have to have sufficient underlying hardware though, so unfortunately it can't be simulated in the home lab :))
  5. Virtual hardware v9. Upgrading will require VM downtime. One can only hope that, in future releases, vHW upgrades can be done in-place.



Improved vMotion in vSphere 5.1 - data moving vMotion

I heard about the new and improved data moving vMotion in the VMworld keynote and wanted to try it out in the home lab. The improvement consists of vSphere being able to perform a simultaneous vMotion+svMotion so you can change both datastore and host at the same time.

I was expecting this feature to be available from the vSphere client by right clicking the VM and choosing 'migrate'. However, this is not the case. The option is there but it is greyed out stating that the VM has to be powered off to perform this action, see screenshot below:


I found an article on yellow-bricks pointing towards the vSphere web client. And for a deep dive, see this post by Frank Denneman.

From the vSphere web client the option is available by right-clicking the VM and choosing 'Migrate', see below.


One apparent limitation is that you cannot migrate between Datacenters, only between clusters within a given Datacenter.


Other than that, the feature works as expected. I did a vMotion plus datastore move from local storage to shared storage. This is the second feature (here's the first one) I've found that is only available in the vSphere web client and not in the vSphere client, which leads one to assume that VMware is actually serious about moving future administration away from the vSphere client.