
Tuesday, September 28, 2010

Restart of ESX management agents

This is just a post to remember the commands for restarting the management agents on a VMware ESX server:

#service mgmt-vmware restart

#service vmware-vpxa restart (the VirtualCenter agent)

Both of these agents can be restarted without affecting VM operation. Restarting them can be a useful step in troubleshooting if vCenter has trouble connecting to a host or if you experience HA errors.
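If you just want to check whether the agents are running before restarting anything, the init scripts also take a status argument. A quick sketch (the exact output varies between ESX versions):

#service mgmt-vmware status

#service vmware-vpxa status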


For restarting the management agents on ESXi, this can be done via the console menu interface, see the link above.
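On ESXi it can also be done from the (unsupported) Tech Support Mode console. A sketch, assuming an ESXi 3.5/4.x host where the services.sh script is present:

#/sbin/services.sh restart

This restarts all the management agents on the host, including hostd and vpxa.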

Thursday, February 11, 2010

Example of an HA error - and a fix

The other day, I got an HA error when trying to add a new host into a cluster. It was weird, as the host was identical to the others - same model, same installation procedure, and everything. In VirtualCenter, the error looked like this:

This piece of information did not help much with the troubleshooting.

The only thing that was different with the new host was that it was configured from the service console (COS), as its NICs were DOA. I had used my own guide for this, so I thought I was in good shape ;-).

A more descriptive error was to be found in the VirtualCenter agent log file on the host (/var/log/vmware/vpx/vpxa.log). Grepping for the word "error" gave the following output:

errorcat = "hostipaddrsdiffer",
errortext = "cmd addnode failed for primary node: Host misconfigured. IP address of ... not found on local interface"
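
For reference, the grep was along these lines (a sketch; the log path is the one mentioned above):

#grep -i error /var/log/vmware/vpx/vpxa.log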

Earlier on, I had changed the IP address, as the first one assigned was already in use, but I'd forgotten to update it in the /etc/hosts file. After doing that and restarting the network (service network restart), everything worked fine.
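For illustration, the fix amounted to something like this (the hostname and IP here are made up):

127.0.0.1      localhost
192.168.1.12   esx01.example.local esx01

#service network restart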

As a side note, I can mention that it can be pretty confusing maneuvering through the various log files. Check this post by Eric Siebert for a further explanation of the VMware log files in VI3.

Thursday, November 5, 2009

Configuration notes for HA

A while back, we experienced a number of inconvenient HA failover false positives where several hundred VMs were powered down even though there was nothing wrong with the hosts. The cause of these incidents was apparently a hiccup in the network lasting more than 15 seconds. To avoid such issues, we decided to disable HA until we were absolutely sure that we had a proper HA configuration.

The following is a quick guide to the HA settings that we use. They correspond to current best practice.

For reference, we have used the HA deepdive article from Yellow-bricks and an article by Scott Lowe with HA configuration notes.

Das.failuredetectiontime
The default failure detection timeout for HA is 15 seconds. Best practice is to increase this to 60 seconds (60,000 milliseconds). To do this, add the following entry under VMware HA -> Advanced options:

Option: das.failuredetectiontime
Value: 60000

The input is validated, so if you spell it wrong you will be prompted with an error.

Das.isolationaddress
The default isolation address is the default gateway, which a host pings when it no longer receives heartbeats from the other hosts, to determine whether it is isolated. However, the default gateway can sit at some arbitrary place in the network, so it can be useful to add one or more extra isolation addresses. It makes sense to add an IP as close to the host as possible, e.g. a virtual IP on a switch.

Option: das.isolationaddressX (X=1,2,3,...9)
Value: IP address
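
For example, with two extra isolation addresses the advanced options could look like this (the IP addresses here are made up). If the default gateway should not be used for isolation detection at all, das.usedefaultisolationaddress can additionally be set to false:

Option: das.isolationaddress1
Value: 10.1.1.1

Option: das.isolationaddress2
Value: 10.1.1.2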

Host isolation response
For fibre channel storage, we choose "Leave powered on". In an HA failover situation, the active primary node in the cluster will try to restart the VMs from the failed host. However, if the host is not actually down, there will still be VMFS file locks on the VMs' files, so they can't be restarted elsewhere. HA will try to restart the VMs five times. The worst case scenario is thus that the VMs on an isolated host lose network connectivity... (in vSphere, the default response has been changed to "Shut down").
For iSCSI and other storage over IP, the best practice isolation response is "Power off" to avoid split-brain situations (two hosts having write access to the same vmdk at the same time).

Cisco switches and port fast
In a Cisco network environment, make sure that 'spanning-tree portfast trunk' is configured on all physical switch ports connected to the ESX hosts. This ensures that the ports skip the 'listening' and 'learning' states and go straight to 'forwarding'. So if, e.g., one of the uplinks to the COS goes down, you don't risk an isolation response just because the delay before the other port/uplink goes into forwarding state is longer than the isolation timeout.

Example of a configured interface on a Catalyst IOS-based switch (insert the allowed VLANs where indicated):

interface GigabitEthernet0/1
description #VMWare ESX trunk port#
no ip address
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan <vlan-list>
switchport mode trunk
switchport nonegotiate
spanning-tree portfast trunk

HP Blade enclosures - primary and secondary nodes
Because there can be no more than five primary nodes in a cluster, a basic design rule is that there should be no more than four hosts from the same cluster in a Blade enclosure. If five or more hosts are located in one enclosure, they could all happen to be primary nodes, and if the enclosure then fails (which happens...), no VMs will be restarted. This matter is explained well in the Yellow-bricks article mentioned above. Furthermore, clusters should be spread over a minimum of two enclosures.
Due to the fact that there can be no more than five primary nodes in a cluster, a basic design rule is that there should be no more than a maximum of four hosts in a Blade enclosure per cluster. If five or more hosts (and they all happen to be primary nodes) are located in an enclosure and it fails (which happens...), then no VMs will be started. This matter is explained well in the Yellow-bricks article mentioned above. Furthermore, clusters should be spread over a minimum of two enclosures.