Showing posts with label ESXi 5. Show all posts

Sunday, December 22, 2013

VMFS heap depletion

Over the past couple of days, we've had a VM that crashed a number of times. When you try to open the VM console, you get a black screen and a yellow MKS error at the top of the console. Strangely enough, the VM can still be pinged. After powering it off and on again it boots, but before long the same thing happens again. In addition, vMotion failed for a number of VMs with the following error:

"The operation is not allowed in the current state"

In the vmkwarning log there are the following entries:

"WARNING: HBX: 1889: Failed to initialize VMFS3 distributed locking on volume 50d136be-62d92875-869a-10604bace2cc: Out of memory"

"WARNING: Fil3: 2034: Failed to reserve volume f530 28 1 50d136be 62d92875 6010869a cce2ac4b 0 0 0 0 0 0 0"

"WARNING: Heap: 2525: Heap vmfs3 already at its maximum size. Cannot expand."

"WARNING: Heap: 2900: Heap_Align(vmfs3, 6160/6160 bytes, 8 align) failed.  caller: 0x41800d8e84e9"

I found a KB article and a post from Cormac Hogan that explain the issue.

In ESXi 5.0 U1 the default maximum VMFS heap size is 80 MB, which means that the maximum total size of open vmdk files is 8 TB. When that limit is reached, VMs can no longer access their disks.

There are two ways to fix this:

  • Upgrade to ESXi 5.1 U1 (or ESXi 5.0 patch 5)
  • Increase the VMFS3.MaxHeapSizeMB to 256 (default is 80) in Configuration -> Advanced Settings and reboot the host

Upgrading to ESXi 5.1 U1 increases the maximum total size of open vmdk's to 60 TB instead of 8 TB.

Increasing the heap size to 256 will increase the maximum to 25 TB.
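
A minimal sketch for checking a log file for the heap warning quoted above. The function name `vmfs_heap_exhausted` is my own, and the log path and the esxcli option path `/VMFS3/MaxHeapSizeMB` in the comments are assumptions about how the log and the VMFS3.MaxHeapSizeMB setting are addressed on your build, so verify them before relying on this:

```shell
# Returns success (0) if the given log file contains the vmfs3 heap
# exhaustion warning quoted in this post.
vmfs_heap_exhausted() {
    grep -q 'Heap vmfs3 already at its maximum size' "$1"
}

# Example use on a host (log path is an assumption):
#   vmfs_heap_exhausted /var/log/vmkwarning.log && echo "VMFS heap is exhausted"
#
# Assumed esxcli equivalents of the Advanced Settings change
# (the host still needs a reboot afterwards):
#   esxcli system settings advanced list -o /VMFS3/MaxHeapSizeMB
#   esxcli system settings advanced set -o /VMFS3/MaxHeapSizeMB -i 256
```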




Monday, July 22, 2013

Practical commands for the ESXi console

For troubleshooting on ESX(i), I always end up in the console even on ESXi 5.x. There are a number of practical commands that I can usually remember, but not always. So here they are for reference in random order:

# tail -n 50 'file name'
Shows last 50 lines of a file

# tail -f 'file name'
Outputs continuously what is being written to the file

# grep -C 10 'search string' 'file name'
Outputs every line that contains the search string, including 10 lines of context on each side of the match.

# less 'file name'
A good alternative to 'more' and 'cat'. Lets you navigate back and forth with the keyboard arrows. Use 'w' for page up and 'z' for page down

# find / -name 'search string'
Search for something. Further described in this post.

# find -iname "*-flat.vmdk" -mtime +7 | xargs ls -alh
Finds files that have not been modified for more than 7 days and lists them, including when they were last changed. See this post for more info on xargs.

Using the vi text editor. See this post.

Typing characters by ASCII code (hold the Alt key while entering the number on the numeric keypad). See more here.
@ - Alt+64
| - Alt+124

# esxcfg-scsidevs -a
# esxcli storage core adapter list
Both commands show info on the SCSI controller type, HBA type, WWNs. Here's more info.

# esxcfg-scsidevs -l
# esxcli storage core device list
Both commands show various info on LUNs including exact size

# esxcfg-mpath -l
# esxcli storage core path list
Shows info about the storage paths. Will show naa device id, LUN id, and state of the paths. You can grep for the word 'dead' for finding dead paths.
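
A small sketch of the grep for dead paths as a reusable function. The name `count_dead_paths` is hypothetical, and the match on `State: dead` is an assumption about how the path listing prints the state, so check it against real output first:

```shell
# Counts path entries reported as dead. Reads a path listing on stdin.
# Assumes each path is printed with a line containing "State: dead".
count_dead_paths() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is neutralised here.
    grep -c 'State: dead' || true
}

# Example use on a host:
#   esxcli storage core path list | count_dead_paths
```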

# dcui
Will show the yellow/grey ESXi console menu in a Putty session

# esxcli software vib list
# esxcli software vib update -v 'VIB file name'
The first command lists the installed VIB packages; the second updates a VIB package. See here for more info

# pwd
Show working directory

# passwd
Change password for current user (can be used for root as well)

# date
Show date and time.


Friday, July 19, 2013

False CIM warning on the SCSI controller - vSphere 5.

We have had a number of CIM warnings in the Hardware Status tab in vCenter on the ESXi 5 hosts related to the SCSI controller. We have seen it on HP servers with a P220i and a P410i SCSI controller.



However, there is nothing wrong with the SCSI controller.

It turns out to be a false alarm generated by the hp-smx-provider, and it can be fixed by applying a driver update (which effectively replaces the hp-smx-provider with the hp-smx-limited driver; both are CIM drivers). The difference between the two is, to my understanding, that with the limited driver you cannot update the iLO firmware via the ESXi console.

There are two ways of installing the four VIBs in the driver package. Either you can uninstall the existing four VIBs and then install the driver package, the zip file, which will install all four VIBs. This requires two reboots.

Or you can unpack the zip file and update only the VIBs that are actually newer than the ones you already have installed. In our case, we only needed to update two of the VIBs. This requires one reboot.

Method 1 (uninstall/install)

List the relevant VIBs

# esxcli software vib list | grep hp

Remove the installed VIBs

# esxcli software vib remove -n hp-smx-provider
# esxcli software vib remove -n char-hpcru
# esxcli software vib remove -n char-hpilo
# esxcli software vib remove -n hp-ams

Reboot

Download the driver package from the HP Support Center. The file is called "hp-ams-esxi5.0-bundle-9.3.5-3.zip".

Upload it to the ESXi host and place it in /tmp/.

# cd /tmp/


# esxcli software vib install -d /tmp/hp-ams-esxi5.0-bundle-9.3.5-3.zip


Reboot

Verify that the new VIBs have been installed

# esxcli software vib list | grep hp

Method 2 (update)

List the relevant VIBs

# esxcli software vib list | grep hp

Unpack the "hp-ams-esxi5.0-bundle-9.3.5-3.zip" file and copy the following VIB files to the /var/log/vmware folder:

  • hp-ams-esx-500.9.3.5-02.434156.vib
  • hp-smx-limited-500.03.02.10.3-434156.vib


If you are unsure whether the other two VIBs are newer than the ones already installed, then just copy all four VIBs.

Update the VIBs

# esxcli software vib update -v /var/log/vmware/hp-ams-esx-500.9.3.5-02.434156.vib
# esxcli software vib update -v /var/log/vmware/hp-smx-limited-500.03.02.10.3-434156.vib



Reboot

Verify that the new VIBs have been installed

# esxcli software vib list | grep hp

When the host comes back online after the reboot it will still have the warning. Leave it for about 10-15 minutes and the warning will disappear. Also, you can try to reset the sensors and refresh the page.

Thanks to Anders Mikkelsen for finding this fix and writing the guide to method 1 above.

Thursday, July 18, 2013

Updating the firmware and driver on an HBA on ESXi 5.0

When updating firmware on an HBA, it will update the driver at the same time as both parts are included in the installation VIB file.

First step is to verify the type of HBA in your ESXi host:

Run the following command:

# esxcfg-scsidevs -a

Or see this KB article for more info

In the example below, the HBA is a Qlogic ISP2532 PCI-Express



However, this name is not always the one used in the documentation (at least not in the case of HP) when you're looking for a specific model.

Another way to get info on the HBA type can be combined with locating the firmware version (see next step).

Next step is to identify the current firmware level of the HBA:

For a Qlogic HBA, use the following command:

# more /proc/scsi/qla2xxx/0
(The '0' can also be a '1' or a '2')
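
Since the adapter number varies, a small loop can print the relevant line from every entry instead of guessing between 0, 1 and 2. This is only a sketch: the function name `show_qla_firmware` is my own, and matching on the word "firmware" is an assumption about what the files contain.

```shell
# Prints the lines mentioning firmware from every adapter file in a directory.
# The directory defaults to /proc/scsi/qla2xxx.
show_qla_firmware() {
    dir="${1:-/proc/scsi/qla2xxx}"
    for f in "$dir"/*; do
        # Skip anything unreadable (also covers an unmatched glob).
        [ -r "$f" ] || continue
        echo "== $f"
        grep -i 'firmware' "$f" || echo "(no firmware line found)"
    done
}

# Example use on a host:
#   show_qla_firmware
```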

For other types of HBAs, see this KB article


In the screendump above, you can see firmware and driver version and also another name for the HBA, QMH2562.

Now that we know the model and firmware version, we can look for the latest recommended version.

Go to this KB article to find the latest version.

For HP, it takes you to the following site: http://vibsdepot.hp.com/hpq/recipes/ where you should choose the PDF called: "HP-VMware-Recipe.pdf" which is the latest one.

In the PDF, you can search for your model, in this case QMH2562:


Find your model and follow the link. It will let you download a zip file. In this example the file is called:

qla2xxx-934.5.6.0-887798.zip

Unpack the zip file and locate the .vib file which will be used for the update.

In this case the .vib file is called:

scsi-qla2xxx-934.5.6.0-1OEM.500.0.0.472560.x86_64.vib

Now, transfer the .vib file to the following folder on the ESXi host:

/var/log/vmware

Run the following command:

# esxcli software vib update -v /var/log/vmware/scsi-qla2xxx-934.5.6.0-1OEM.500.0.0.472560.x86_64.vib


Reboot to finish updating.

And then verify that the update has properly installed by re-running:

# more /proc/scsi/qla2xxx/0

You can also get detailed VIB package info by running:

# esxcli software vib list | grep scsi
# esxcli software vib get -n scsi-qla2xxx


Wednesday, July 17, 2013

Locating orphaned vmdk's

When you have had a VMware environment running for several years and have had many admins interacting with this environment, there is a fair chance that there will be a number of orphaned vmdk's on the LUNs which are taking up valuable space.

I found a post on yellow-bricks.com explaining two different ways of locating these vmdk's:

The first option is to simply look for all flat files that haven't been changed during the past seven days. This is easy and it works on ESXi 5.0. However, be careful before you start deleting files as a powered off VM, for example, also will show up in such a search (as well as probably also VMs with snapshots).

# find -iname "*-flat.vmdk" -mtime +7


Alternatively, you can pipe the output of the find command to ls to see when each file was last modified:

# find -iname "*-flat.vmdk" -mtime +7 | xargs ls -alh

And if you are very sure that all the files can be deleted, you can delete them all at once:

# find -iname "*-flat.vmdk" -mtime +7 | xargs rm -f
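
Datastore and VM folder names often contain spaces, which xargs splits on by default. As a hedged alternative for the listing step, the sketch below uses find's own -exec so each path is passed intact. The function name `list_stale_flat_vmdks` and its parameters are hypothetical.

```shell
# Lists *-flat.vmdk files under a directory that have not been modified
# for more than the given number of days (default 7). Uses -exec instead
# of xargs so that paths containing spaces are handled correctly.
list_stale_flat_vmdks() {
    dir="${1:-.}"
    days="${2:-7}"
    find "$dir" -iname '*-flat.vmdk' -mtime +"$days" -exec ls -alh {} \;
}

# Example use on a host:
#   list_stale_flat_vmdks /vmfs/volumes 7
```

It only lists files; reviewing that list by hand before removing anything keeps powered-off VMs out of harm's way.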

The second option which is a more thorough way of doing it is to run a powershell script (can also be found in the post) which looks for vmdk's which are not registered to a .vmx file. The script has been slightly modified by Danni Finne and can be found here:

Thursday, July 11, 2013

ESXi 5.0 host disconnects due to CPU crossdup error

If a host is disconnected in vCenter and you cannot reconnect the host and you see the following entries in the /var/log/vmkwarning file:

2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 13, /vmfs/devices/char/vob/External type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 14, /vmfs/devices/char/vob/iScsi type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 15, /vmfs/devices/char/vob/Migrate type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 16, /vmfs/devices/char/vob/PageReti type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 17, /vmfs/devices/char/vob/Visorfs type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 18, /vmfs/devices/char/vob/Hardware type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 675: Failed to crossdup fd 19, /vmfs/devices/char/vob/Vfat type CHAR: Busy
2013-07-05T06:54:02.084Z cpu7:7757089)WARNING: UserObj: 3232: Unimplemented operation on 0x410028907210/SOCKET_UNIX_SERVER


It can be fixed by following this KB:


So, SSH to the host and edit the file /etc/vmware/vpxa/vpxa.cfg using the vi text editor.

Change the following line, increasing the value from the original 128 to 256:

<ThreadStackSizeKb>256</ThreadStackSizeKb>

vpxa.cfg is read-only, so first make it writable with:

# chmod 744 vpxa.cfg

And when done, set the file back to read-only:

# chmod 444 vpxa.cfg

Restart the vpxa service and reconnect the host.

# /etc/init.d/vpxa restart
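
If you prefer not to edit the file by hand in vi, the change can be sketched with sed. This assumes the setting appears in the file as an XML element of the form `<ThreadStackSizeKb>128</ThreadStackSizeKb>`; the function name `raise_thread_stack_size` is my own, and the file still has to be made writable first.

```shell
# Rewrites the ThreadStackSizeKb value from 128 to 256 in the given file.
# A copy of the original is kept next to it with a .bak suffix.
# Only an exact <ThreadStackSizeKb>128</ThreadStackSizeKb> element is changed.
raise_thread_stack_size() {
    cfg="$1"
    cp "$cfg" "$cfg.bak" || return 1
    sed 's|<ThreadStackSizeKb>128</ThreadStackSizeKb>|<ThreadStackSizeKb>256</ThreadStackSizeKb>|' \
        "$cfg.bak" > "$cfg"
}

# Example use on a host:
#   raise_thread_stack_size /etc/vmware/vpxa/vpxa.cfg
```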

Manually recreating a vmdk descriptor file

Yesterday, I had an svMotion that crashed as the vCenter server (vSphere 5.0 U1) was rebooted while the svMotion task was in progress. Afterwards, a copy of the VM named "servername (2)" had been created in vCenter and a number of empty vmdk files had been created at the destination LUNs. There was a file lock on these files so they could not be deleted, and I wasn't able to vMotion, snapshot or continue the svMotion on the source VM.

I tried moving just one of the vmdks with svMotion which resulted in the VM crashing and after that it wouldn't power back on.

Below is the error:


The error states that it cannot find one of the vmdk's. I looked in the relevant folder and discovered that the vmdk descriptor file was missing, the folder only contained the vmdk-flat file (the data file).

Under Edit Settings for the VM, the disk was still there but its size was set to 0 MB, see below:



I found a KB from VMware on how to recreate this descriptor file. And it worked fine.

Loosely, the steps were the following (the KB explains them in detail):

Identify scsi controller type, lsilogic in this case.

Identify the size of the vmdk-flat file

Create the vmdk as a thin vmdk

Delete the empty vmdk data file

Edit the vmdk descriptor file with the vi text editor. This included the vmdk flat file name, the vHW version, and deleting the line about thin provisioning.

Verify the consistency of the descriptor with vmkfstools -e:
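
The steps above can be sketched roughly as follows. Treat this as my reading of the procedure, not a replacement for the KB: `vmname` and `temp` are placeholder names, the helper `file_size_bytes` is my own, and the vmkfstools lines are kept as comments because they only make sense on the host, inside the VM's folder.

```shell
# Prints the size in bytes of a file, as needed for the create step.
# Reads the size from the directory listing, so the (large) flat file
# itself is never read.
file_size_bytes() {
    ls -l "$1" | awk '{print $5}'
}

# Assumed sequence on the host:
#   size=$(file_size_bytes vmname-flat.vmdk)
#   vmkfstools -c "$size" -a lsilogic -d thin temp.vmdk   # new thin vmdk of the same size
#   rm temp-flat.vmdk                                     # drop the empty data file
#   mv temp.vmdk vmname.vmdk                              # keep only the descriptor
#   vi vmname.vmdk                                        # point it at vmname-flat.vmdk, fix vHW version,
#                                                         # remove the thin provisioning line
#   vmkfstools -e vmname.vmdk                             # consistency check
```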



Thursday, March 14, 2013

vMotion error at 63% due to CBT file lock

A number of times in the past couple of years, we've had issues with vMotion on ESX 4.1 which happened after storage/SAN breakdowns/issues. ESX doesn't handle losing its storage very well and this can create locks on the VMs that can only be fixed by rebooting the host (and shutting the hung VMs down first).

However, the other day I experienced the same sort of error on an ESXi 5.0 cluster which had not had any storage issues. This is quite inconvenient when you can't put a host into maintenance mode.

When initiating a vMotion, the VM fails at 63% with the following error:

"The VM failed to resume on the destination during early power on. 
Reason: Could not open/create change tracking file.
Cannot open the disk '/vmfs/volumes/xxxxxx/vmname.vmdk' or one of the snapshot disks it depends on"

It should be mentioned that for this customer we use Symantec Netbackup 7.5 with agentless .vmdk backup. To speed up the backup process we have enabled Changed Block Tracking (CBT) on the VMs.

I found this KB article but it only related to ESX 4.0 and 4.1 and also the suggestion is to just disable CBT which is not an option.

After a talk with VMware Support, we found the error.

It turns out that there is a lock on one or more of the .ctk files which are the files that keep track of changes to the .vmdks. These ctk files are created automatically when CBT is enabled. If one or more of these files are deleted, they will be recreated automatically.
In a normal setup, the .ctk files will only be locked for a few seconds when the backup software accesses the file.

The error looks like this:



To fix it, do the following:

Use PuTTY to connect to one of the ESX hosts (remember to enable SSH under Security Profile first).
Change to the directory of the .vmx file

List all the .ctk files:

# ls -al | grep ctk

For each ctk file, verify whether the file has a lock

# vmkfstools -D vmname-ctk.vmdk

Look for "mode" in the output. If it is "mode 0" you're fine. If it is "mode 1" there's a lock. For "mode 2" something is completely wrong...


If you find a lock on a file, create a tmp directory and move the ctk file there (do this for all ctk's with locks):

# mkdir tmp

# mv vmname-ctk.vmdk tmp

This will also work when the VM is powered on.
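
When a VM has many disks, checking each ctk file by hand gets tedious. The sketch below pulls the mode number out of the vmkfstools -D output so it can be printed per file. It assumes the output contains the text "mode N" as described above; the function name `lock_mode` is hypothetical.

```shell
# Extracts the lock mode number from "vmkfstools -D" output read on stdin.
# Assumes the output contains the text "mode N" on the lock line.
lock_mode() {
    sed -n 's/.*mode \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example use on a host, inside the VM's folder:
#   for f in *-ctk.vmdk; do
#       echo "$f: mode $(vmkfstools -D "$f" 2>&1 | lock_mode)"
#   done
```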

And you're done. After this, the VM will vMotion without failing.

This has been tested and works both on an ESX 4.1 classic cluster (where I had the same issue) and on ESXi 5.

The VMware engineer could not give me an exact root cause but he was fairly sure that it was related to the backup software and that something had gone wrong while this software was accessing these files.

Tuesday, March 12, 2013

Locate WWN from console on ESXi 5.x

Sometimes for urgent cases it can be necessary to obtain the HBA's WWN's to get the storage zoned before the network configurations are done and the hosts are online. On Blade servers, this info can be found on the enclosure OA but for rack mounted servers you can only get it from the console.

From the console (press Alt-F1 at the console, remember to enable shell access first under troubleshooting) login as root and run the following command:

# esxcfg-scsidevs -a

Look for the lines starting with vmhba1 and vmhba2 (vmhba0 is typically the SCSI controller) and the fc.XXX:XXX identifier. The numbers after the ":" are the WWN (see screen dump below)
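
To avoid copying the number by hand, the part after the colon can be cut out with sed. A minimal sketch, assuming the identifier is printed as `fc.` followed by two hexadecimal numbers separated by a colon; both function names are my own.

```shell
# Prints the part after the colon of an "fc.<first>:<second>" identifier.
wwn_from_fc_uid() {
    echo "$1" | sed 's/^fc\.[0-9A-Fa-f]*://'
}

# Prints the WWN of every fc.* identifier found in the text on stdin.
list_fc_wwns() {
    grep -o 'fc\.[0-9A-Fa-f]*:[0-9A-Fa-f]*' | while read -r uid; do
        wwn_from_fc_uid "$uid"
    done
}

# Example use on a host:
#   esxcfg-scsidevs -a | list_fc_wwns
```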


Saturday, August 11, 2012

Resetting the root password on ESXi 5 (and ESXi 4)

Yesterday, we had a fairly nasty situation at work where a standalone ESXi 4.1 host had to be rebooted. After reboot it did not automatically reconnect to vCenter and so a manual reconnect was done which prompted for root password. Unfortunately, we did not have the root password (don't ask). The host was joined to a domain but it could not be added to vCenter by using domain credentials and ssh to host with domain credentials did not work either. So, having 19 VMs down and no way to power them back on, I was basically screwed (all VMs were residing on DAS).

According to this VMware KB article there is no supported way to reset the root password on an ESXi v4 or v5 other than to reinstall it (or do a repair). I contacted VMware support and they sent me a guide for doing it in an unsupported way. This method is similar to what other guides on the web suggest. I went through the process but unfortunately it didn't work.

Finally, I mounted the ESXi 4.1 install ISO and did a repair. This resets most host configurations such as root password, network configuration, NTP settings, domain etc. After this I could set the password, reconnect to vCenter and then I had to reconfigure the host. Fortunately, the VMs were not completely gone from vCenter but were presented as greyed out orphaned VMs. So I could still see which LUNs the VMs were residing on. That way, the .vmx files could be located (except for one VM that had been renamed in vCenter without svMotioning or migrating it to another LUN afterwards...), the orphaned VM could be removed and the VM could be re-added. It was quite a boring process but at least it worked.

Today, I wanted to recreate the password reset method in my home lab to see if I had actually done it in the correct way. I can confirm that, at least, on a virtual ESXi 5 it works and it is possible to reset the password to blank.

These are the steps (alternatively, you can try this guide):

Download a Linux live bootable ISO. I used KNOPPIX. Mount the ISO and boot the host.

Once booted into KNOPPIX, open a shell.


Run the following set of commands:

# fdisk -l
# mkdir /mnt/disk
# mount /dev/sda5 /mnt/disk

(Mounting the correct device is the tricky part. To me, it was rather confusing which one to choose. For both the Fujitsu server that I dealt with and for the virtual ESXi, though, it was in sda5 that the state.tgz file was located. VMware suggested using the following command for HP servers: # mount /dev/cciss/c0d0p5 /mnt/disk - c0d0p5 is controller 0, disk 0, partition 5)

# cd /mnt/disk
# ls -al
# cp state.tgz state.tgz.bak
# cd /ramdisk
# mkdir temp
# cd temp
# tar zxf /mnt/disk/state.tgz
# ls -al
# tar zxf local.tgz
# cd etc
# nano shadow


Blank out the encrypted password. For example change root:$1$ywxtUqvn$9e1iXjGVd45T5IAgRxAuV.:13358:0:99999:7:::
to root::13358:0:99999:7:::

See below screendumps for before and after:





Save the shadow file.
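
As an alternative to editing the file by hand in nano, the same change can be sketched with sed. The function name `blank_root_hash` is my own; it only touches the second field of the line starting with "root:" and writes the result to standard output, so the original file is left alone until you redirect it.

```shell
# Prints the contents of a shadow file with root's password hash removed.
# Only the second field of the line starting with "root:" is changed.
blank_root_hash() {
    sed 's/^root:[^:]*:/root::/' "$1"
}

# Example use inside the extracted etc directory:
#   cp shadow shadow.orig
#   blank_root_hash shadow.orig > shadow
```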

Run the following commands to repackage everything:

# cd ..
# rm -rf local.tgz
# tar zcf local.tgz *
# chmod 755 local.tgz
# rm -rf /mnt/disk/state.tgz
# tar zcf /mnt/disk/state.tgz local.tgz
# ls -al /mnt/disk/
# umount /mnt/disk
# shutdown -r now

When ESXi boots up, it has no root password set (blank).

Friday, April 27, 2012

View HBA firmware version from service console

To view the HBA firmware version from the service console (ESX classic), go to /proc/scsi/qla or lpfc820.
Here you will typically find two text files, e.g. '2' and '3'. Run a 'cat' or 'more' on the files (see screendump below). See this post for more info


For ESXi v5.x, see this link.

Tuesday, October 11, 2011

How to run XenServer 6.0 on vSphere 5 - with nested Windows Server 2008 R2 VM

It is possible to install XenServer 6.0 in a virtual machine on vSphere ESXi 5 and then with a few tweaks you can even run a nested Windows Server 2008 R2 VM on the virtual XenServer 6.0.

To install XenServer 6.0 in a VM, first follow this guide to configure ESXi 5.0 (or watch this youtube video).

One important step is to execute the following command from the console:

echo 'vhv.allow = "TRUE"' >> /etc/vmware/config
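
Running that echo more than once appends the line again each time. A small sketch that only appends when no vhv.allow line exists yet; the function name `enable_nested_hv` is hypothetical, and the check simply looks for a line starting with vhv.allow.

```shell
# Appends vhv.allow = "TRUE" to the given config file unless a vhv.allow
# line is already present, so repeated runs do not add duplicates.
enable_nested_hv() {
    cfg="${1:-/etc/vmware/config}"
    grep -q '^vhv\.allow' "$cfg" 2>/dev/null || echo 'vhv.allow = "TRUE"' >> "$cfg"
}

# Example use on a host:
#   enable_nested_hv
```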

Otherwise, configure like the guide. Once the custom VM has been created, to be able to choose ESXi 5 as operating system, go to Edit Settings -> Options -> Guest Operating System choose 'Other' and then choose VMware ESXi 5.x. This will ensure that you won't receive the "HVM is required for this operation" error when trying to boot the win2k8R2 vm (it is possible to change this after the install of XenServer as well).


Download the install .iso from citrix.com 

Mount iso and install XenServer

When done, you will get the startup screen as below


Download XenCenter from citrix.com and install

Add the XenServer to XenCenter

Create a new VM, choose win2k8 R2 64-bit, mount ISO, install.

Done.