Thursday, July 8, 2010

Disaster recovery: Procedure in case of site failure


Here's a short example of a procedure for recovering a VMware cluster from a site failure. The example scenario consists of two ESX4 hosts on replicated storage, split across separate locations. There's no automatic failover for storage between sites, so the mirror has to be broken manually.

1.

Log into vCenter and verify whether or not storage is available for the cluster. If storage is unavailable, create an incident ticket for the storage group with priority urgent and with a request to:

“Manually break the mirror for the "Customer X" replicated storage group used by ESXA and ESXB”

The ticket should be followed by a phone call to the storage day/night duty to notify of the situation.

2.

When the mirror has been broken, rescan the remaining hosts in the cluster. This rescan may time out. If this happens, reboot the hosts.

After the rescan/reboot, all shared LUNs will be missing on the hosts. These should be added/mounted manually from the console (step 3). In ESX4u1 there's a bug in the "Add Storage" wizard, so it doesn't work from the vSphere client; see this post for more info.
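If you prefer to trigger the rescan from the service console rather than from the client, a minimal sketch could look like this. It assumes the esxcfg-rescan command is available on the host; the helper name rescan_hbas, the DRY_RUN switch and the adapter names are my own placeholders. It does not guard against a rescan that hangs, so the reboot advice above still applies.

```shell
# Sketch only: rescan_hbas and DRY_RUN are illustrative names, not VMware tooling.
# Rescans each adapter given as an argument with esxcfg-rescan.
# With DRY_RUN=1 the commands are only printed, not executed.
rescan_hbas() {
    for hba in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "esxcfg-rescan $hba"
        else
            esxcfg-rescan "$hba" ||
                echo "rescan of $hba failed; consider rebooting the host" >&2
        fi
    done
}
```

For example, `DRY_RUN=1 rescan_hbas vmhba1 vmhba2` prints the two commands without running them (vmhba1 and vmhba2 are placeholder adapter names).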

3.

Use PuTTY to log in to each of the hosts and run the following command:

# esxcfg-volume -l

This will list available volumes. For each volume, run the following command:

# esxcfg-volume -M <label or UUID>

For example:

# esxcfg-volume -M PSAM_REPL_001
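With more than a handful of volumes, the list and mount steps can be combined in a small loop. The following is a sketch under assumptions, not a tested recovery script: it assumes that the listing prints one line of the form "VMFS3 UUID/label: <uuid>/<label>" per volume, and the function names and the DRY_RUN switch are made up for illustration.

```shell
# Sketch only: volume_labels, mount_listed_volumes and DRY_RUN are illustrative names.

# Print the label of every volume found in an `esxcfg-volume -l` listing on stdin.
# Assumption: each volume is reported on a line of the form
#   VMFS3 UUID/label: <uuid>/<label>
volume_labels() {
    sed -n 's|^VMFS[0-9]* UUID/label: [^/]*/\(.*\)$|\1|p'
}

# Persistently mount every volume in the listing read from stdin.
# With DRY_RUN=1 the mount commands are only printed.
mount_listed_volumes() {
    volume_labels | while read -r label; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "esxcfg-volume -M $label"
        else
            # /dev/null keeps the command from consuming the loop's input
            esxcfg-volume -M "$label" < /dev/null
        fi
    done
}
```

Running `esxcfg-volume -l | DRY_RUN=1 mount_listed_volumes` first shows what would be mounted; drop DRY_RUN=1 to actually mount.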

See screendump below for further exemplification:



4.

From the vSphere client, for each of the available hosts, go to Configuration -> Storage and click “Refresh”. Verify that all LUNs appear as they did before the site failure.
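The verification is easier if there is something to compare against. As a sketch, assuming a list of datastore names was saved to a file before the failure (for instance one name per line from a listing of /vmfs/volumes), the missing ones can be printed like this. The function name and both file names are hypothetical.

```shell
# Sketch only: missing_datastores and the file names are illustrative.
# Prints every name that is in the baseline list ($1) but not in the current list ($2).
missing_datastores() {
    baseline_file=$1
    current_file=$2
    # -F fixed strings, -x whole-line match, -v invert: keep baseline lines with no match
    grep -F -x -v -f "$current_file" "$baseline_file"
    return 0   # grep exits non-zero when nothing is missing; that is not an error here
}
```

Usage could be `ls /vmfs/volumes > /tmp/after.txt` followed by `missing_datastores before.txt /tmp/after.txt`; no output means every datastore from the baseline is back.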

5.

Power on all VMs
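Powering on can be done from the vSphere client, but with many VMs it may be quicker from the service console. A minimal sketch, assuming vmware-cmd is available on the host and that `vmware-cmd -l` lists the .vmx paths of the VMs registered on it; power_on_listed_vms and DRY_RUN are names I made up.

```shell
# Sketch only: power_on_listed_vms and DRY_RUN are illustrative names.
# Reads .vmx paths on stdin (for example from `vmware-cmd -l`) and starts each VM.
# With DRY_RUN=1 the commands are only printed.
power_on_listed_vms() {
    while read -r vmx; do
        [ -n "$vmx" ] || continue   # skip empty lines
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "vmware-cmd $vmx start"
        else
            # /dev/null keeps the command from consuming the loop's input
            vmware-cmd "$vmx" start < /dev/null
        fi
    done
}
```

Try `vmware-cmd -l | DRY_RUN=1 power_on_listed_vms` on each host to review the list before starting anything.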

6.

Done. In this situation, storage will run from the secondary site. The storage group will be able to reverse the replication seamlessly at a later stage, when the failed site is operational again. This does not require involvement from the VMware group.

Site redundancy with manual breaking of storage mirror

We have just installed a site-redundant cluster for a customer. The cluster consists of two ESX4 hosts on replicated EMC CLARiiON storage. The ESX servers as well as the storage reside at different locations (preferably, we would have done it with storage virtualisation and seamless storage failover à la DataCore or SVC, but this was not an option).

The site redundancy is enabled by using replicated storage. Should the site with the active LUNs fail, then the storage mirror can be broken manually and operation can be resumed on the remaining site.

One thing we discovered was that resignaturing of the LUNs is no longer necessary after a mirror has been broken, as it was in previous versions. This means that LUNs can be remounted directly without modifications, see Fibre Channel SAN Configuration guide pp. 74-76.
Earlier, you had to first break the mirror, then resignature your LUNs with the advanced feature LVM.resignature, and then add the LUNs. This changed the UUID (and the label on the LUNs, for that matter), which meant that all VMs had to be manually re-registered in VirtualCenter. This is a bit time-consuming and not something you want to spend your time on in a disaster scenario.

In vCenter, you can use the "Add Storage" wizard to remount the LUNs. However, there's a known bug in the software, so it does not work. Instead, it has to be done from the command line with the following commands (rescan the HBAs first; if the rescan hangs, reboot):

# esxcfg-volume -l (to list available volumes)
# esxcfg-volume -M <label or UUID> (to persistently mount a volume)

See this post for an example site recovery procedure.

Thursday, May 20, 2010

My VMworld session ready for public voting

Update: Unfortunately, my session was not among the lucky winners. Apparently, the world is not ready for exciting service descriptions ;-) Instead, I'll be going to VMworld in CPH as an attendee.

My session has passed the internal review and is now ready for public voting. It is placed under 'Private Cloud - Management' and the title is:

Defining your services and offerings on vSphere

Description:

As virtual infrastructures (VI) comprise a complex set of technologies, varying perceptions of virtual infrastructures and virtual servers tend to exist. Ask a VI admin, a sales person, and a customer, and you will likely get three different answers. As organizations grow, the degree of specialization typically increases, which increases the number of departments that contribute to the service delivery model. A lack of definitions for input, output and responsibility areas between these interfaces can have negative impacts such as prolonged delivery times and an unclear delivery and pricing model. Another consequence of not defining your services is that someone else will do it for you. This could be the sales department or a solution architect who sells a custom solution due to a lack of existing building blocks. These solutions typically do not scale well, and the technical design tends to be less than optimal. Services, whether an ‘ESX operations service’ or a ‘virtual Windows server service’, need to be defined, standardized, and published in a service catalogue. Furthermore, there should be a clear distinction between an internal service and an external customer offering. These matters will be addressed in this session, along with different examples of how a virtual infrastructure service and a virtual server service can be defined. This session builds on the theoretical framework of the updated ITIL v3, with a specific focus on Service Design and the Service Catalogue.

Sunday, April 18, 2010

Dilbert strip - Owned! ;-)

Ahh, that guy's funny ;-)




Wednesday, March 31, 2010

Identifying your WWN IDs via ILO

For the storage department to be able to zone up one or more LUNs to a given ESX host, they need three pieces of information:

  • ESX host name (FQDN)
  • WWN IDs of the HBAs
  • If it is a new LUN, the size of the LUN. If you're zoning existing LUNs, they need to know the storage group that the host should be added to (this can be done by providing the hostname of one or two existing hosts that already have that zoning).

The WWN IDs can be identified both from the VI client (Configuration -> Storage Adapters) and from the service console. But this can only be done after ESX has been installed.
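For the service console route, here is a rough sketch of how the port names could be pulled out of the adapter listing. It assumes that `esxcfg-scsidevs -a` is available and that each Fibre Channel adapter line contains a field of the form fc.<node name>:<port name> with 16 hex digits each; check the output on your own host before relying on it. The helper name is hypothetical.

```shell
# Sketch only: wwpns_from_adapter_list is an illustrative helper.
# Reads an adapter listing on stdin and prints each FC port name (WWPN)
# as colon-separated byte pairs.
# Assumption: FC adapters carry a field of the form
#   fc.<16 hex digits node name>:<16 hex digits port name>
wwpns_from_adapter_list() {
    sed -n 's/.*fc\.[0-9a-fA-F]\{16\}:\([0-9a-fA-F]\{16\}\).*/\1/p' |
        sed 's/\(..\)/\1:/g; s/:$//'   # insert a colon after every byte, drop the last one
}
```

Usage would be `esxcfg-scsidevs -a | wwpns_from_adapter_list`, which gives the storage department the port names in the notation they usually ask for.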

Sometimes, it can be useful to be able to fetch WWN info before the host has been installed. This way, the storage department can begin zoning right away.

To identify WWN IDs from ILO

  • Log into ILO either directly or via the blade enclosure
  • Go to the Information tab of your server
  • The WWN IDs can be found under the info box for your HBA (see screendump below)





Monday, February 15, 2010

Howto: Installing VMware tools in a Linux VM

Installing VMware tools in a Linux VM takes a few more steps than on a Windows VM. This is done the following way (tested on VMware Workstation 7 and an Ubuntu Desktop 9.04 VM appliance).
  • install the guest OS (click here to see if guest OS is supported)
  • to exit the gui to simulate no X server: sudo service gdm stop and then alt+f1 to get a console
  • right click the VM and choose install/update VMware tools. This will connect the cdrom with the VMware tools ISO file (if the files are not already available, they will be downloaded) but you still need to mount the cdrom manually: sudo mount /dev/scd0 /media/cdrom (if the folder doesn't exist, create it first)
  • copy the tar file to the /tmp folder and untar it: tar -xvf VMware-tools-vXX.tar.gz
  • cd to the untarred folder and run vmware-install.pl: sudo ./vmware-install.pl
  • start the gui: sudo service gdm start or simply startx
  • verify that VMware tools are running: sudo ps -auxwww | grep vm (look for /usr/bin/vmtoolsd and you will also find the balloon driver vmmemctl). You can also check if the vmtools startup script has been put into the startup folder /etc/rc0.d/
link to VMware KB article on installing VMtools
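The mount, copy, untar and install steps above can be strung together roughly as follows. This is only a sketch: the helper names and the DRY_RUN switch are mine, the tarball path is passed in as an argument because the file name varies by version, and I am assuming that the archive unpacks into a folder called vmware-tools-distrib.

```shell
# Sketch only: run, install_vmware_tools and DRY_RUN are illustrative names.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: path of the VMware tools tarball as it will appear on the mounted cdrom
install_vmware_tools() {
    tarball=$1
    run sudo mkdir -p /media/cdrom                  # create the mount point if missing
    run sudo mount /dev/scd0 /media/cdrom
    run cp "$tarball" /tmp/
    run tar -xzf "/tmp/$(basename "$tarball")" -C /tmp
    run cd /tmp/vmware-tools-distrib                # assumed name of the unpacked folder
    run sudo ./vmware-install.pl
}
```

Calling it with DRY_RUN=1 and the tarball path first prints the commands so they can be checked; stopping and starting gdm is left out on purpose and is still done by hand.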

Thursday, February 11, 2010

Example of an HA error - and a fix

The other day, I got an HA error when trying to add a new host into a cluster. It was weird, as the host was identical to the others - same model, same installation procedure, and everything. In VirtualCenter, the error looked like this:

This piece of information did not help much in relation to troubleshooting.

The only thing that was different with the new host was that it was configured from the service console (COS), as its NICs were DOA. I had used my own guide for this, so I thought I was in good shape ;-).

A more descriptive error was to be found in the VirtualCenter agent log file on the host (/var/log/vmware/vpx/vpxa.log). Grepping for the word "error" gave the following output:

errorcat = "hostipaddrsdiffer",
errortext = "cmd addnode failed for primary node: Host misconfigured. IP address of ... not found on local interface"

Earlier on, I had changed the IP address, as the first one assigned was already in use, but I'd forgotten to change the IP address in the /etc/hosts file. After doing that and restarting the network (service network restart), everything worked fine.
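A quick way to catch this kind of mismatch is to look up what the hosts file says for the host name and compare it with the address on the interface. Below is a minimal sketch; hosts_file_ip is a made-up helper, and it only reads the file, it does not change anything.

```shell
# Sketch only: hosts_file_ip is an illustrative helper.
# Prints the first IP address that a hosts file maps to the given host name.
# $1: host name to look up, $2: hosts file (defaults to /etc/hosts)
hosts_file_ip() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        /^[ \t]*#/ { next }   # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$file"
}
```

For example, `hosts_file_ip "$(hostname)"` prints the address from /etc/hosts, which can then be compared by eye with the address shown by ifconfig for the service console interface.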

As a side note, I can mention that it can be pretty confusing manoeuvring through the various log files. Check this post by Eric Siebert for further explanation of VMware log files on VI3.