Monday, July 13, 2015

Permanent Device Loss (PDL) and HA on vSphere 5.5

At my current client we are doing a number of non-functional requirement (NFR) tests involving storage. One of them is about removing a LUN to see if HA kicks in.

The setup is an EMC VPLEX Metro stretched cluster (or cross-cluster) configured in Uniform mode: an active-active setup with storage replicated between the sites and 50% of the hosts on each site, all running vSphere 5.5.

What complicates things in a stretched Metro cluster in Uniform mode is that even though storage is replicated between sites, the ESXi hosts only see the storage on their own site. So if you kill a LUN on site A in VPLEX, the hosts in site A will not be able to see the LUNs on site B, and HA is therefore required to restart the affected VMs.

My initial thought was that cutting/killing a LUN on the VPLEX would make the VMs on that LUN freeze indefinitely until storage became available again. This is what happened with vSphere 4.x in an all-paths-down (APD) scenario, and it was a real pain for the VMware admins.

However, as of vSphere 5.0 U1, HA can handle a Permanent Device Loss (PDL), where a LUN becomes unavailable while the ESXi hosts are still running and the array is still able to communicate with the hosts. (If the array is down, you have an APD and HA will not kick in.)
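To make the PDL/APD distinction concrete, here is a minimal sketch in Python. It only models the behaviour as described in this post; the names `StorageEvent`, `classify` and `ha_restarts_vms` are made up for illustration and are not part of any VMware API.

```python
from dataclasses import dataclass


@dataclass
class StorageEvent:
    """Hypothetical model of what a host observes for one LUN."""
    array_reachable: bool  # can the array still communicate with the host?
    lun_available: bool    # is the LUN itself still presented?


def classify(event):
    """Return 'OK', 'PDL' or 'APD' following the description in this post."""
    if event.lun_available:
        return "OK"
    # LUN gone but array still talking to the host: permanent device loss.
    # Array not reachable at all: all paths down.
    return "PDL" if event.array_reachable else "APD"


def ha_restarts_vms(event, pdl_settings_applied):
    """Assumption from this post: on 5.5, HA only reacts to a PDL, and only
    when the non-default PDL advanced settings have been configured."""
    return classify(event) == "PDL" and pdl_settings_applied


if __name__ == "__main__":
    lun_removed = StorageEvent(array_reachable=True, lun_available=False)
    array_down = StorageEvent(array_reachable=False, lun_available=False)
    print(classify(lun_removed), ha_restarts_vms(lun_removed, True))
    print(classify(array_down), ha_restarts_vms(array_down, True))
```

The point of the sketch is simply that the same symptom (a LUN disappears) leads to two different outcomes depending on whether the array can still tell the host that the device is gone.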

In vSphere 5.5, HA will work automatically if you configure two advanced settings, both of which are non-default. Go to ESXi host -> Configuration -> Advanced Settings and set the following:

  • VMkernel.Boot.terminateVMOnPDL = yes
  • Disk.AutoremoveOnPDL = 0

This has been documented well by Duncan Epping on Yellow-Bricks and on Boche.net. And a bit more info here for 5.0 U1.

See screen dump below for settings:

A reboot of the ESXi host is required for the two changes to take effect.
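If you have many hosts, clicking through the client gets tedious. Below is a rough sketch of how the same two settings could be scripted with `esxcli`. The exact `esxcli` namespaces and option names are from memory for 5.5 and should be treated as assumptions; check them with `esxcli system settings kernel list` and `esxcli system settings advanced list -o /Disk/AutoremoveOnPDL` before using this. The helper names `pdl_commands` and `apply_pdl_settings` are made up for this example.

```python
import subprocess


def pdl_commands():
    """Build the two esxcli invocations (option names are assumptions,
    verify them on your own 5.5 hosts first)."""
    return [
        # Assumed equivalent of VMkernel.Boot.terminateVMOnPDL = yes
        ["esxcli", "system", "settings", "kernel", "set",
         "-s", "terminateVMOnPDL", "-v", "TRUE"],
        # Assumed equivalent of Disk.AutoremoveOnPDL = 0
        ["esxcli", "system", "settings", "advanced", "set",
         "-o", "/Disk/AutoremoveOnPDL", "-i", "0"],
    ]


def apply_pdl_settings(dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    lines = []
    for cmd in pdl_commands():
        line = " ".join(cmd)
        lines.append(line)
        print(line)
        if not dry_run:
            # Raises CalledProcessError if esxcli exits non-zero.
            subprocess.check_call(cmd)
    return lines


if __name__ == "__main__":
    # Dry run by default: nothing is changed on the host.
    apply_pdl_settings(dry_run=True)
```

The dry run is the default on purpose, so the script only prints what it would do. The host still needs a reboot afterwards, as noted above.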


