Thursday, July 8, 2010

Disaster recovery: Procedure in case of site failure


Here's a short example of a procedure for recovering a VMware cluster from a site failure. The example scenario consists of two ESX4 hosts on replicated storage, split across separate locations. There is no automatic storage failover between sites; the mirror must be broken manually.

1.

Log into vCenter and verify whether or not storage is available for the cluster. If storage is unavailable, create an incident ticket for the storage group with priority urgent and with a request to:

“Manually break the mirror for the "Customer X" replicated storage group used by ESXA and ESXB”

Follow up the ticket with a phone call to the storage group's day/night duty contact to notify them of the situation.

2.

When the mirror has been broken, rescan the remaining hosts in the cluster. The rescan may time out; if it does, reboot the hosts.

After the rescan/reboot, all shared LUNs will be missing on the hosts. They must be added/mounted manually from the console (step 3). In ESX4u1 there's a bug in the "Add Storage" wizard, so this doesn't work from the vSphere client; see this post for more info.
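The rescan in step 2 can also be triggered from the service console. The following is a minimal sketch, not part of the original procedure: it assumes that `esxcfg-scsidevs -a` lists one adapter per line with the adapter name (vmhbaN) in the first column, and the function names `hba_names` and `rescan_all` are made up for illustration.

```shell
#!/bin/bash
# Sketch only: rescan every storage adapter on this host.
# Assumption: `esxcfg-scsidevs -a` prints one adapter per line,
# with the adapter name (e.g. vmhba1) in the first column.

# Read `esxcfg-scsidevs -a` output on stdin; print the adapter names.
hba_names() {
    awk '$1 ~ /^vmhba[0-9]+$/ { print $1 }'
}

# Rescan each adapter in turn and report any adapter whose rescan fails.
rescan_all() {
    esxcfg-scsidevs -a | hba_names | while read -r hba; do
        echo "Rescanning $hba"
        esxcfg-rescan "$hba" </dev/null || echo "Rescan of $hba failed" >&2
    done
}
```

Run `rescan_all` on each remaining host. If a rescan hangs or fails, fall back to rebooting the host as described above.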

3.

Connect to each of the hosts over SSH (e.g. with PuTTY) and run the following commands:

#esxcfg-volume -l

This will list available volumes. For each volume, run the following command:

#esxcfg-volume -M <label or UUID>

For example:

#esxcfg-volume -M PSAM_REPL_001
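With many LUNs, the list-then-mount steps can be scripted. The sketch below is one possible approach, not the author's method. It assumes that `esxcfg-volume -l` prints a `VMFS3 UUID/label: <uuid>/<label>` line followed by a `Can mount: Yes` or `Can mount: No ...` line for each volume; verify that layout on your own host first. The function names and the `DRY_RUN` variable are hypothetical.

```shell
#!/bin/bash
# Sketch only: persistently mount every volume the host reports as mountable.
# Assumed layout of `esxcfg-volume -l` output, one record per volume:
#   VMFS3 UUID/label: <uuid>/<label>
#   Can mount: Yes

# Read `esxcfg-volume -l` output on stdin; print the UUID of each
# volume whose record says "Can mount: Yes".
mountable_uuids() {
    awk '
        $2 == "UUID/label:" { split($3, id, "/"); uuid = id[1] }
        $1 == "Can" && $2 == "mount:" && $3 == "Yes" && uuid != "" {
            print uuid
            uuid = ""
        }
    '
}

# Mount each mountable volume by UUID (UUIDs avoid trouble with
# labels that contain spaces). Set DRY_RUN=1 to only print the commands.
mount_all() {
    esxcfg-volume -l | mountable_uuids | while read -r uuid; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "esxcfg-volume -M $uuid"
        else
            esxcfg-volume -M "$uuid" </dev/null
        fi
    done
}
```

Running `DRY_RUN=1 mount_all` first shows which volumes would be mounted without changing anything on the host.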

See the screenshot below for a further example:



4.

From the vSphere client, for each of the available hosts, go to Configuration -> Storage and click “Refresh”. Verify that all LUNs appear as they did before the site failure.

5.

Power on all VMs
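If the vSphere client is slow or unavailable, the VMs can also be powered on from the service console of each host. This is a rough sketch under assumptions, not part of the original procedure: it relies on `vmware-cmd -l` printing one registered .vmx path per line and on `vmware-cmd <vmx> start` powering on that VM. The `VMWARE_CMD` variable and the `power_on_all` function are hypothetical names.

```shell
#!/bin/bash
# Sketch only: power on every VM registered on this host.
# Assumption: `vmware-cmd -l` prints one .vmx path per line.

# Command used to control VMs; override it to try the loop
# without touching real VMs.
VMWARE_CMD="${VMWARE_CMD:-vmware-cmd}"

# Start each registered VM and report any VM that fails to start.
power_on_all() {
    "$VMWARE_CMD" -l | while IFS= read -r vmx; do
        [ -n "$vmx" ] || continue
        echo "Starting $vmx"
        "$VMWARE_CMD" "$vmx" start </dev/null || echo "FAILED: $vmx" >&2
    done
}
```

Run `power_on_all` on each remaining host. VMs that were already running are left to `vmware-cmd` to reject, and any failures are reported on stderr.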

6.

Done. In this situation, storage runs from the secondary site. The storage group will be able to reverse the replication seamlessly at a later stage, once the failed site is operational again. This does not require involvement from the VMware group.

3 comments:

  1. This comment has been removed by the author.

  2. You could also speed up step 3 with a one and a half liner like:

    for i in `esxcfg-volume -l | awk '$1 == "VMFS3" {split($3, a, "/"); print a[2]}'`; do esxcfg-volume -M "$i"; done

  3. What disaster recovery procedures need to be followed in case of the loss of a database from the server?

