Sunday, November 22, 2009

VLAN testing in ESX 3.5

In larger organisations, the network department and the VMware group are typically separate teams. So as a VMware administrator you need to ask the network department to trunk VLANs to the physical switch ports that your ESX host is connected to. It happens that the network department misses a port or a VLAN, which means that a VM can end up losing its network connection after, e.g., a VMotion. Unfortunately, the responsibility can land on the VMware administrator for putting a host into production without testing VLAN connectivity. Unfair, but that's life.

But testing VLANs the manual way is rather time consuming, especially if you have multiple hosts with multiple NICs and multiple VLANs. The number of test cases quickly becomes unmanageable. If, for example, you have five hosts, five VLANs and four NICs in each host, that means 5 x 5 x 4 = 100 test cases.
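To see how quickly the manual approach grows, here is a minimal Python sketch of the arithmetic. The function name and parameters are my own, purely for illustration:

```python
def manual_test_cases(hosts: int, vlans: int, nics_per_host: int) -> int:
    """Manual testing checks every VLAN on every NIC of every host."""
    return hosts * vlans * nics_per_host


# The example from the text: five hosts, five VLANs, four NICs per host.
print(manual_test_cases(hosts=5, vlans=5, nics_per_host=4))  # prints 100
```

Adding a single VLAN to this setup adds 20 more manual test cases (5 hosts x 4 NICs).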

The traditional way of testing is to create a vSwitch with only one vmnic connected. Then connect a VM on that vSwitch with one of the VLANs. Configure an IP address in the address space of the VLAN and ping the gateway. Do this for all the VLANs, and then connect the next vmnic to the vSwitch and start over.

The following method speeds up VLAN testing significantly (in this case from 100 to 16 test cases). It is not totally automated, but I have found it very useful nonetheless.

The basic idea is that you configure a port group to listen on all available VLANs, then enable VLAN tagging inside the VM and do your testing from there:

1. Create a port group on the vSwitch with VLAN ID 4095. This will allow the VM to connect to all VLANs available to the host.

2. Enable VLAN tagging from inside the VM. This only works with the Intel E1000 driver, which only ships with 64-bit Windows. So if you have a 32-bit Windows server, you first need to modify the .vmx file and then download and install the Intel E1000 driver from within Windows (Update: Even on 64-bit Windows, you need to download and install the E1000 driver manually. The advanced VLAN option is not included in the default driver). This link describes how this is done. Note that when modifying the .vmx, add the following line:

Ethernet0.virtualDev = "e1000"

Note that if you use the default Flexible NIC to begin with, there's no existing entry for the NIC type in the .vmx, so just add the new entry.
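If you have several test VMs to prepare, the .vmx edit can be scripted. The following is only a sketch in Python: the function name is hypothetical, it works on the file contents as a string, and it assumes that an existing entry (if any) sits on its own line and can be matched without regard to case. Back up the .vmx before trying anything like this.

```python
import re


def set_virtual_dev(vmx_text: str, nic: str = "Ethernet0", dev: str = "e1000") -> str:
    """Return the .vmx contents with '<nic>.virtualDev = "<dev>"' set.

    An existing virtualDev line for the NIC is replaced; otherwise the
    line is appended, as is needed for the default Flexible NIC.
    """
    key = f"{nic}.virtualDev"
    new_line = f'{key} = "{dev}"'
    # Assumption: the key is matched case-insensitively.
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*=", re.IGNORECASE)

    lines = vmx_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Illustrative input only; a real .vmx contains many more entries.
print(set_virtual_dev('displayName = "vlantest"\n'), end="")
```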

Under Edit Settings for the VM, attach the NIC to the port group with VLAN ID 4095.

3. Now you can add VLANs in the VM. Go to the Device Manager and then Properties for the E1000 NIC. There's a tab that says VLANs (see screendump below). As you add VLANs, a separate NIC or "Local Area Connection" is created for each VLAN. It is set for DHCP, so if there's a DHCP server on that network it will receive an IP automatically. If not, you will need to configure an IP for that interface manually (e.g. by requesting a temporary IP from the network department). To configure the IP quickly, you can run the following command from CMD or a batch (.cmd) script:

netsh int ip set address "local area connection 1" static 192.168.1.100 255.255.255.0 192.168.1.254 1
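With several VLAN interfaces, typing one netsh line per interface gets tedious. As a rough sketch, the lines for a batch file could be generated in Python. The function name is my own, and the second interface (its connection name and IP address) is made up for illustration; substitute the values you get from the network department.

```python
def netsh_static_command(connection: str, ip: str, mask: str,
                         gateway: str, metric: int = 1) -> str:
    """Build the same netsh command shown above for one interface."""
    return (f'netsh int ip set address "{connection}" '
            f"static {ip} {mask} {gateway} {metric}")


# (connection name, IP, subnet mask, gateway); only the first row is
# taken from the post, the second is a hypothetical example.
interfaces = [
    ("local area connection 1", "192.168.1.100", "255.255.255.0", "192.168.1.254"),
    ("local area connection 2", "10.10.1.100", "255.255.255.0", "10.10.1.254"),
]

for row in interfaces:
    print(netsh_static_command(*row))
```

Redirect the output to a .cmd file and run that inside the test VM.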

4. Now we will use the Tracert (traceroute) command to test connectivity. The reason that we can't use Ping is the following: if you have multiple VLANs configured and you ping a gateway on a given VLAN, and the VLANs happen to be routable, then you can receive a response via one of the other VLANs even though the one you are testing is not necessarily working.

With Tracert, however, you can be sure that if the gateway is reached in the first hop, then the VLAN works. If the VLAN doesn't work, you will see Tracert taking multiple hops (via one of the other VLANs) before reaching the gateway (or it will fail if there's no connectivity at all). You can create a simple .cmd file with a list of gateways that you execute from the CMD prompt. Example file:

tracert 192.168.1.254
tracert 10.10.1.254
tracert 10.10.2.254

See below for example screendump.
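Reading through the Tracert output for many gateways by eye is error prone, so the "first hop" rule can also be checked by a script. The sketch below is Python, not part of the original method. The function names are hypothetical, and the two sample outputs are hand-written illustrations of what Tracert output typically looks like, not captured from a real run.

```python
import re
from typing import Optional

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")          # "  1   <1 ms ...  address"
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def gateway_hop(tracert_output: str, gateway: str) -> Optional[int]:
    """Return the hop number at which the gateway answers, or None."""
    for line in tracert_output.splitlines():
        match = HOP_LINE.match(line)
        if match and gateway in IPV4.findall(match.group(2)):
            return int(match.group(1))
    return None


def vlan_ok(tracert_output: str, gateway: str) -> bool:
    """The VLAN is considered working only if the gateway is hop 1."""
    return gateway_hop(tracert_output, gateway) == 1


# Illustrative samples (assumed format).
DIRECT = """
Tracing route to 192.168.1.254 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  192.168.1.254

Trace complete.
"""

ROUTED = """
Tracing route to 10.10.1.254 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  192.168.1.254
  2     1 ms     1 ms     1 ms  10.10.1.254

Trace complete.
"""

print(vlan_ok(DIRECT, "192.168.1.254"))  # prints True
print(vlan_ok(ROUTED, "10.10.1.254"))    # prints False
```

In the second sample the gateway is only reached at hop 2, via another VLAN's gateway, which is exactly the case that Ping would not reveal.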

Before running the batch script, make sure that only one physical NIC is connected to the vSwitch. You can do this in one of two ways: 1) create a separate vSwitch and connect only one vmnic at a time, controlling it from VC, or 2) unlink all vmnics but one from the service console (COS) with the following commands:

ssh to the ESX host
esxcfg-vswitch -l (to see current configuration)
esxcfg-vswitch -U vmnic1 vSwitch0 (this unlinks vmnic1 from vSwitch0)
esxcfg-vswitch -L vmnic0 vSwitch0 (this links vmnic0 to vSwitch0)

These commands work instantaneously so you don't have to restart the network or anything. Then you run through the test on one vmnic at a time. When done with a host, you VMotion the VM to the next host in the cluster and continue the test from there.
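For hosts with more than two vmnics, the link/unlink sequence for each test round can be written out in advance. Here is a small Python sketch that only builds the command strings (it does not run anything on the host). The function name is my own, and linking the kept vmnic first, so the vSwitch always has at least one uplink, is my assumption rather than something the original procedure requires.

```python
from typing import List


def isolate_vmnic_commands(vswitch: str, vmnics: List[str], keep: str) -> List[str]:
    """Commands that leave only `keep` linked to `vswitch`."""
    if keep not in vmnics:
        raise ValueError(f"{keep} is not one of {vmnics}")
    # Assumption: link the NIC under test first, then unlink the rest.
    commands = [f"esxcfg-vswitch -L {keep} {vswitch}"]
    commands += [f"esxcfg-vswitch -U {nic} {vswitch}"
                 for nic in vmnics if nic != keep]
    return commands


# One test round per vmnic on vSwitch0.
nics = ["vmnic0", "vmnic1"]
for nic_under_test in nics:
    print(f"# test round for {nic_under_test}")
    for command in isolate_vmnic_commands("vSwitch0", nics, nic_under_test):
        print(command)
```

Check the result with esxcfg-vswitch -l before starting each test round.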


6 comments:

  1. Great post ! i'll try ASAP :)
    BTW, the intel's VLAN screenshot is not clickable ;)

  2. Thanks ;-) The VLAN screenshot has been corrected, now clickable...

  3. Hi - just out of curiosity, if in step one you state
    "Create a port group on the vSwitch with ID 4095."

    and then later on in the guide you say
    "Before running the batch script you need to have only one physical nic connected to the vSwitch. You can do this in one of two ways. 1) create a seperate vSwitch and connect only one vmnic at a time. Then you control it from VC."

    Couldn't you instead just edit the VLAN 4095 Port Group settings to "override vSwitch failover order" and then mark one vmnic as active and the other as unused? That way all traffic for that Port Group is forced through the intended vmnic and then you can switch the vmnics around to test the other.

    With this approach, you can then test VLANs that are added over time after the initial build phase (for instance 6 months down the line when additional VLANs need to be added and tested) without worrying about forcing all vSwitch traffic of existing VLANs through a single vmnic just to verify a newly added VLAN.

    Just a thought at any rate.

    Otherwise, great article and appreciate the time you put into it. I'm keen to start using this method to save labour on our builds going forward.

  4. From a testing perspective, the result will be the same either way. If you have three or more vmnics on the vSwitch, then creating a separate vSwitch will typically not be a problem. However, if you have only two vmnics on the production vSwitch, then overriding the failover policy at the port group level is better, as it preserves redundancy and performance. Good point.

  5. Is something similar doable on ESXi? Can't create a 4095 vlan :-/

  6. The same applies to ESXi. VLAN 4095 can be created on both ESX and ESXi. Also on ESXi 5.0.x. See http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004074
