Thursday, July 21, 2011

ESXTOP to the rescue - VM latency

Until now I have mostly used ESXTOP for basic troubleshooting such as CPU ready and the like. Last weekend we had a major incident caused by a power outage that affected a whole server room. After the power was back on, a number of VMs showed very poor performance - as in it took about one hour to log in to Windows. Which VMs were affected seemed quite random. The ESX hosts looked fine. After a bit of troubleshooting, the only common denominator was that the slow VMs all resided on the same LUN. When I contacted the storage team's night duty, the response was that there was no issue on the storage system.

I was quite sure that the issue was storage related, but I needed more data to back it up. The hosts were running v3.5, so troubleshooting towards storage is not easy.

I started ESXTOP to see if I could find some latency numbers. I found this excellent VMware KB article which pointed me in the right direction.

  • For VM latency, start ESXTOP and press 'v' for VM storage related performance counters.
  • Then press 'f' to modify the counters shown, and press 'h', 'i', and 'j' to toggle the relevant counters (see screendump 2) - which in this case are the latency stats (remember to stretch the window to see all counters).
  • What I found was that all affected VMs had massive latency towards the storage system: DAVG/cmd (see screendump 1) was about 700 ms (rule of thumb is that max latency should be about 20 ms). Another important counter is KAVG/cmd, which is the time commands spend in the VMkernel, i.e. in the ESX host itself (see screendump 3). KAVG/cmd showed no latency, so the delay was not in the ESX host but on the way to the storage system.
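
The reasoning in the last bullet can be written down as a small shell helper. This is only a sketch: the function name classify_latency is made up for illustration, the 20 ms DAVG limit is the rule of thumb mentioned above, and the 2 ms KAVG limit is my own assumption, not a number from the KB article.

```shell
#!/bin/sh
# Rough classification of where storage latency comes from.
# $1 = DAVG/cmd in ms (device/storage side), $2 = KAVG/cmd in ms (VMkernel side).
# DAVG_MAX defaults to the 20 ms rule of thumb; KAVG_MAX=2 is an assumed value.
classify_latency() {
    awk -v d="$1" -v k="$2" \
        -v dmax="${DAVG_MAX:-20}" -v kmax="${KAVG_MAX:-2}" 'BEGIN {
        # "+ 0" forces a numeric comparison
        d = d + 0; k = k + 0; dmax = dmax + 0; kmax = kmax + 0
        if (d > dmax && k > kmax) print "both"
        else if (d > dmax)        print "storage"
        else if (k > kmax)        print "host"
        else                      print "ok"
    }'
}
```

With roughly the numbers from this incident, `classify_latency 700 0.1` prints `storage`, which matches the conclusion above: nothing wrong in the host, high latency towards the array.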

After pressing the storage guys for a while, they had HP come and take a look at it, and it turned out that there was a defective fiber port in the storage system. After this was replaced, everything worked fine and latency went back to nearly zero.

In this case, it was only because I had proper latency data from ESXTOP that I could be almost certain that the issue was storage related.


Screendump 1
Screendump 2
Screendump 3

Sunday, July 17, 2011

Changing IP and VLAN on host - no VM downtime

It is possible to change the service console (COS) IP and VLAN id for hosts in a cluster without having VM downtime (see this post for changing hostname). The trick is to change the COS IP on all hosts first and hold off changing the vMotion IP until all COS IPs have been changed. This way, you will be able to put the hosts into maintenance mode one by one, and vMotion will still work with the old IP even though the COS IPs will differ in range and VLAN id.

NB. It may be necessary to disable HA for the cluster before you begin, as the HA agent will not be able to configure itself on the hosts when the IPs don't match for all hosts.

  1. Enter maintenance mode
  2. Update the DNS entry on the DNS server
  3. Log on to the vCenter server and flush the DNS: ipconfig /flushdns
  4. Go to ILO, DRAC or something similar for the host (you will lose the remote network connection when changing the IP) and change the IP (use this KB article for inspiration): [root@server root]# esxcfg-vswif -i a.b.c.d -n w.x.y.z vswif0 , where a.b.c.d is the IP address and w.x.y.z is the subnet mask.
  5. Change the VLAN id (in this case VLAN 12): esxcfg-vswitch -v 12 -p 'Service Console' vSwitch0
  6. Change gateway: nano /etc/sysconfig/network
  7. Change DNS servers: nano /etc/resolv.conf
  8. Restart network: service network restart
  9. Ensure that the gateway can be pinged
  10. Update the NTP server from the vSphere client if needed.
  11. Continue the process with the next host in the cluster
When all COS IPs have been changed, go to the vSphere client and change all vMotion IP addresses and VLAN ids. This will not require any downtime. Then test that vMotion works.
Done.
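
Since steps 4 to 8 are typed into an ILO/DRAC console, it can help to generate the exact commands beforehand and review them. The following is a minimal sketch of that idea; print_cos_commands is a made-up helper name, and it only prints the commands from the steps above, it does not run anything on the host.

```shell
#!/bin/sh
# Print (not run) the service console commands for steps 4 to 8.
# $1 = new COS IP, $2 = subnet mask, $3 = VLAN id.
# vswif0, vSwitch0 and the 'Service Console' port group are the names used
# in the steps above; adjust them if your host differs.
print_cos_commands() {
    ip="$1"; mask="$2"; vlan="$3"
    if [ -z "$ip" ] || [ -z "$mask" ] || [ -z "$vlan" ]; then
        echo "usage: print_cos_commands IP NETMASK VLAN" >&2
        return 1
    fi
    echo "esxcfg-vswif -i $ip -n $mask vswif0"
    echo "esxcfg-vswitch -v $vlan -p 'Service Console' vSwitch0"
    echo "# edit GATEWAY in /etc/sysconfig/network"
    echo "# edit nameserver entries in /etc/resolv.conf"
    echo "service network restart"
}
```

For example, `print_cos_commands 10.0.12.21 255.255.255.0 12` (example values) gives the list for one host, which can then be typed or pasted into the remote console while the host is in maintenance mode.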


Changing hostname from the service console

The easiest way to change the hostname is via the vSphere client (see this post for changing IP address and VLAN id). If, however, this is not an option for some reason, the hostname can be changed from the service console in the following way:

This KB article actually explains most of the process, which includes:

-----------------

1. Open the /etc/hosts file with a text editor and modify it so that it reflects the correct hostname.

2. To change the default gateway address and the hostname, edit the /etc/sysconfig/network file and change the GATEWAY and HOSTNAME parameters to the proper values.

3. For the changes to take place, reboot the host or restart the network service with the command:

[root@server root]# service network restart
Note: This command breaks any current network connections to the Service Console, but virtual machines continue to have network connection.

------------------------------

In my experience, the changes are reset after a reboot and the hostname reverts to the original one. To avoid this, there is one more step to be performed (before the reboot):

Change the /adv/Misc/HostName parameter in the /etc/vmware/esx.conf file (see screendump)
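
If you prefer not to edit the files by hand, the HOSTNAME and /adv/Misc/HostName changes could be scripted roughly like this. It is a sketch under assumptions: set_hostname_in_files is a made-up name, and I assume the esx.conf line has the form /adv/Misc/HostName = "name" and the network file has a HOSTNAME=name line. The /etc/hosts entry from step 1 still has to be updated separately. Try it on copies of the files first.

```shell
#!/bin/sh
# Rewrite the hostname in a sysconfig network file and in esx.conf.
# $1 = new hostname, $2 = path to the network file, $3 = path to esx.conf.
# Assumed line formats: HOSTNAME=name and /adv/Misc/HostName = "name".
set_hostname_in_files() {
    new="$1"; netfile="$2"; esxconf="$3"
    tmp="${TMPDIR:-/tmp}/hostname-edit.$$"
    # Write to a temp file and copy back, so file ownership and mode are kept.
    sed "s|^HOSTNAME=.*|HOSTNAME=$new|" "$netfile" > "$tmp" \
        && cat "$tmp" > "$netfile" || { rm -f "$tmp"; return 1; }
    sed "s|^/adv/Misc/HostName = .*|/adv/Misc/HostName = \"$new\"|" "$esxconf" > "$tmp" \
        && cat "$tmp" > "$esxconf" || { rm -f "$tmp"; return 1; }
    rm -f "$tmp"
}
```

On a host this would be called with the real paths, e.g. `set_hostname_in_files esx01.example.com /etc/sysconfig/network /etc/vmware/esx.conf` (the hostname is an example value), followed by the network restart or reboot from step 3.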