Thursday, July 8, 2010

Disaster recovery: Procedure in case of site failure


Here's a short example of a procedure for recovering a VMware cluster from a site failure. The example scenario consists of two ESX4 hosts on replicated storage, split across separate locations. There's no automatic storage failover between sites, so the mirror has to be broken manually.

1.

Log into vCenter and verify whether or not storage is available for the cluster. If storage is unavailable, create an incident ticket for the storage group with priority urgent and with a request to:

“Manually break the mirror for the "Customer X" replicated storage group used by ESXA and ESXB”

The ticket should be followed by a phone call to the storage day/night duty to notify them of the situation.

2.

When the mirror has been broken, rescan the remaining hosts in the cluster. The rescan may time out; if it does, reboot the hosts.

After the rescan/reboot, all shared LUNs will be missing on the hosts. They have to be added/mounted manually from the console (step 3). In ESX4u1 there's a bug in the "Add Storage" wizard, so this doesn't work from the vSphere client; see this post for more info.

3.

Use PuTTY (SSH) to log into each of the hosts and run the following commands:

# esxcfg-volume -l

This will list available volumes. For each volume, run the following command:

# esxcfg-volume -M <label or UUID>

For example:

# esxcfg-volume -M PSAM_REPL_001

See the screenshot below for a further example:



4.

From the vSphere client, for each of the available hosts, go to Configuration -> Storage and click “Refresh”. Verify that all LUNs appear as they did before the site failure.

5.

Power on all VMs

6.

Done. In this situation, storage runs from the secondary site. The storage group will be able to reverse the replication seamlessly at a later stage, when the failed site is operational again. This does not require involvement from the VMware group.
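If there are many replicated LUNs, the list-then-mount part of step 3 can be scripted. The following is only a rough sketch, not something taken from the procedure above. It assumes that `esxcfg-volume -l` prints one line per volume of the form `VMFS3 UUID/label: <uuid>/<label>`, so check the output on your own hosts first. The function names and the `DRY_RUN` variable are made up for this example.

```shell
#!/bin/sh
# Sketch: persistently mount every volume reported by `esxcfg-volume -l`.
# ASSUMPTION: each volume shows up on a line like
#   VMFS3 UUID/label: <uuid>/<label>
# Verify this against the real output on your host before relying on it.

# Read the listing on stdin and print one volume UUID per line.
list_volume_uuids() {
    sed -n 's|^VMFS3 UUID/label: \([^/]*\)/.*$|\1|p'
}

# Read UUIDs on stdin and mount each one persistently.
# With DRY_RUN=1 the commands are only printed, not executed.
mount_all_volumes() {
    while read -r uuid; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "esxcfg-volume -M $uuid"
        else
            esxcfg-volume -M "$uuid"
        fi
    done
}

# Usage on the host console (hypothetical):
#   esxcfg-volume -l | list_volume_uuids | mount_all_volumes
```

Running it once with `DRY_RUN=1` prints the mount commands without touching anything, which is a sensible check before doing it for real in a disaster situation.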

Site redundancy with manual breaking of storage mirror

We have just installed a site redundant cluster for a customer. The cluster consists of two ESX4 hosts on replicated EMC CLARiiON storage. The ESX servers as well as the storage reside at different locations (preferably we would have done it with storage virtualisation and seamless storage failover à la Datacore or SVC, but this was not an option).

The site redundancy is enabled by using replicated storage. Should the site with the active LUNs fail, then the storage mirror can be broken manually and operation can be resumed on the remaining site.

One thing we discovered was that resignaturing of the LUNs is no longer necessary after a mirror has been broken, as it was in previous versions. This means that LUNs can be remounted directly without modifications, see Fibre Channel SAN Configuration guide pp. 74-76.
Earlier, you had to first break the mirror, then resignature your LUNs with the advanced setting LVM.EnableResignature and then add the LUNs. This changed the UUID (and the label on the LUNs, for that matter), which meant that all VMs had to be manually reregistered in VirtualCenter. This is a bit time consuming and not something you want to spend your time on in a disaster scenario.

In vCenter, you should be able to use the "Add Storage" wizard to remount the LUNs. However, there's a known bug in the software, so it does not work. Instead, it has to be done from the command line with the following commands (rescan the HBAs first; if the rescan hangs, reboot):

# esxcfg-volume -l (to list available volumes)
# esxcfg-volume -M <label or UUID> (to persistently mount a volume)
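The HBA rescan that has to happen before these commands can also be done from the console. Here is a minimal sketch, assuming that `esxcfg-scsidevs -a` lists one adapter per line with the adapter name (`vmhbaN`) in the first column, and that `esxcfg-rescan <vmhbaN>` rescans a single adapter. The function names and the `RESCAN_CMD` override are invented for this example.

```shell
#!/bin/sh
# Sketch: rescan every HBA on the host before listing volumes.
# ASSUMPTION: `esxcfg-scsidevs -a` prints one adapter per line with the
# adapter name (vmhbaN) in the first column.

# Read the adapter listing on stdin and print the vmhba names.
list_hbas() {
    awk '$1 ~ /^vmhba[0-9]+$/ { print $1 }'
}

# Read adapter names on stdin and rescan each one in turn.
# RESCAN_CMD can be set to another command (e.g. echo) for a trial run.
rescan_all_hbas() {
    while read -r hba; do
        echo "Rescanning $hba"
        ${RESCAN_CMD:-esxcfg-rescan} "$hba"
    done
}

# Usage on the host console (hypothetical):
#   esxcfg-scsidevs -a | list_hbas | rescan_all_hbas
```

The sketch has no timeout handling, so if a rescan hangs you still have to fall back to a reboot as described above.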

See this post for an example site recovery procedure.