Tuesday, October 2, 2018

Using Iperf3 for bandwidth and throughput tests on Linux

At my current client we had to test the network speed between Azure and a local site.
Initially we used Rsync to copy files back and forth, and although that gives an OK indication, it does not show the full line speed, as Rsync encrypts the data during transfer (among other things).

Iperf3 is a simple and really easy-to-use tool for testing the bandwidth or line speed between two machines, which can be either Windows or Linux.

Below is how to install and run Iperf3:

The test was done on RHEL 7.5 VMs:

1) Install Iperf3 on both the "client" and the "server":

# sudo yum install iperf3

2) Ensure that TCP traffic is allowed inbound on the "server":

# sudo firewall-cmd --zone=public --add-port=5201/tcp --permanent

# sudo firewall-cmd --reload

If you want to run the test with UDP, the following commands should be run as well:

# sudo firewall-cmd --zone=public --add-port=5201/udp --permanent

# sudo firewall-cmd --reload
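
Optionally, you can verify that the ports were added to the zone (just a sanity check, not required):

# sudo firewall-cmd --zone=public --list-ports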

3) Start Iperf3 on the "server" and put it in listen mode:

# iperf3 -s

4) Start Iperf3 on the "client" with -c and specify the IP of the server:

# iperf3 -c 192.168.1.25
(replace the IP above with the IP of your server)
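
Iperf3 uses TCP port 5201 by default. If you need a different port, it can be changed on both sides with -p, for example (remember to open that port in the firewall instead):

# iperf3 -s -p 5202

# iperf3 -c 192.168.1.25 -p 5202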

That's it. The test completes within a minute or so and shows the result; see the screen dumps below.

When we ran the test, we could not max out the 1 Gbit line with TCP. So we changed to UDP and increased the packet size with the following command:

# iperf3 -c 192.168.1.26 --bandwidth 10G  --length 8900 --udp -R -t 180

-c runs Iperf3 as a client and connects to the given server IP
--bandwidth sets the target bandwidth to 10 Gbit/s (even though we only have a 1 Gbit line)
--length is the packet/buffer size in bytes
--udp runs the test over UDP instead of TCP
-R runs the test in reverse, so instead of sending data you are receiving data. This is useful because you can test both directions without changing the setup
-t is the test duration in seconds. We specified 180 seconds to let it run a bit longer.

Run iperf3 --help for more options.
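
As a side note, a common way to try to saturate a line with TCP (we did not use it here) is to run several parallel streams with -P, for example:

# iperf3 -c 192.168.1.25 -P 4 -t 60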

Below is a standard test between two VMs in Azure. Results are shown on both the client and the server.




Tuesday, September 25, 2018

Fixing a corrupt /etc/sudoers file in Linux VM in Azure

I was editing the /etc/sudoers file with nano on a Linux VM (RHEL 7.5) in Azure, trying to disable being prompted for a password every time I sudo.

I added the following to the file:

root        ALL=(ALL:ALL) ALL
myadminuser     ALL=(ALL:ALL) ALL     NOPASSWD: ALL

Apparently that does not follow the correct syntax, so immediately afterwards I was not able to sudo. Below is the error message:

[myadminuser@MYSERVER ~]$ sudo reboot
>>> /etc/sudoers: syntax error near line 93 <<<
sudo: parse error in /etc/sudoers near line 93
sudo: no valid sudoers sources found, quitting
sudo: unable to initialize policy plugin


Since you don't have the root password on Azure VMs, you're stuck: the regular user does not have permission to edit the sudoers file, and you can't sudo to root.

You could mount the VM disk to another VM and then edit the file that way, but that is cumbersome.

Fix:

From the Azure portal, start Cloud Shell and choose PowerShell

Run the following command to make /etc/sudoers writable by your regular user:

az vm run-command invoke --resource-group YOUR_RESOURCE_GROUP --name YOURVM --command-id RunShellScript --scripts "chmod 446 /etc/sudoers"

This gives the regular user permission to edit the file.

With nano or vi, undo the changes (I just deleted the NOPASSWD: ALL part):

nano /etc/sudoers (no sudo needed, since you now have write access)

After editing, run the command below to restore the default permissions on the file:

az vm run-command invoke --resource-group YOUR_RESOURCE_GROUP --name YOURVM --command-id RunShellScript --scripts "chmod 440 /etc/sudoers"

I got the fix from the following link. Note that the syntax has changed a bit.

The useful thing about this command is that you can execute any command as root on your VMs as long as you have access to the Azure portal.
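
For example, you could also have used it to check the sudoers syntax as root without changing any permissions; visudo -c only validates the file and reports syntax errors (same az syntax as above):

az vm run-command invoke --resource-group YOUR_RESOURCE_GROUP --name YOURVM --command-id RunShellScript --scripts "visudo -c"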

How to edit /etc/sudoers:

To ensure that you don't introduce wrong syntax into the file, use the following command to edit it:

visudo

This will open the file in the vi editor, and if you use wrong syntax you'll get a warning/error when you save.

See this link for a quick guide to using the vi editor.

Update 2018.11.07: On RHEL 7.5 and with visudo, the lines below work, meaning that with the command:
# sudo su -
you're not prompted for a password:

root    ALL=(ALL)       ALL
myadminuser    ALL=(ALL)       NOPASSWD: ALL


Saturday, June 17, 2017

Amazon AWS - first steps after creating an account

After creating an account in Amazon AWS, there are a couple of steps to be done before you start provisioning resources. This is all fairly well described in the AWS documentation, so the below info is just to summarize the steps:

What you want to do is first add some additional security to the root user and then create an IAM user with admin rights that will be used going forward. The root user should not be used for daily work.


  1. Log into https://console.aws.amazon.com 
  2. Go to Services -> IAM
  3. Under Security Status it will state that you have already deleted your root access keys. That is because you haven't created any (this is not the same as your account password; access keys are used to, for example, sign programmatic requests via the SDK or REST API).
  4. Before enabling multi-factor authentication (MFA), you need a software MFA app. Google Authenticator is a free app for both iPhone and Android. Download this app to your phone.
  5. To enable MFA under IAM, go to: Security Status -> Activate MFA on your root account ->  Manage MFA. This will open a simple wizard. Choose software MFA. A bar code will be presented that should be scanned from the phone. Open Google Authenticator, click the '+' sign and choose 'Scan barcode'. This will add an entry in the app. Type in two consecutive keys in the wizard and that's it. Next time you log in to the account, it will prompt for the six digit key after entering the password.
  6. To create a new user and group for daily use, go to Services -> IAM -> Users -> Add user. This will open a wizard. If you haven't done so already, you'll also be prompted to create a group to place the user in. This group should have full administrative access. Choose the first option in the list, 'AdministratorAccess', which grants full access
  7. Once the user is created, a direct link to the AWS console will be created that will look something like: https://1562xxxxxxxx.signin.aws.amazon.com/console
  8. To create access keys for the user, go to IAM -> Users -> choose the user -> Security credentials tab -> click Create Access Key. This will let you do a one-time download of the Access Key ID and the Secret Access Key (see the CLI example after this list)
  9. On the same Security credentials tab, MFA can be enabled for this user by clicking the pencil next to 'Assigned MFA device'. The wizard will be the same as for the root user. When scanning the bar code, a second entry will show up in Google Authenticator, see screen dump below (so one for root account and one for the user)
  10. As a last step you can apply a password policy to your IAM users to make all the check boxes green, see screen dump below.
  11. Done. Now you can log out from your root account and only use the admin user going forward (which should be used for creating further users and groups to do the actual work)
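
As a quick check that the new user's access keys work (an extra step, not part of the list above), you can configure the AWS CLI with them and ask AWS which identity you are calling with; aws configure and aws sts get-caller-identity are standard AWS CLI commands:

aws configure
(enter the Access Key ID, Secret Access Key, default region and output format when prompted)

aws sts get-caller-identity
(this should return the account ID and the ARN of the new IAM user, not the root account)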




Thursday, August 20, 2015

Nexus 1000v - will the network fail if the VSMs fail?

At my current client there has been some concern regarding the robustness of the network in relation to the Nexus 1000v. It's a vSphere 5.5 environment running on Vblock with Nexus 1000v switches (bundled with the Vblock).

The question was whether the network on the ESXi hosts is dependent on the two management 1KV VMs and if the network will fail entirely if these two VMs are down.

Furthermore, there was a question of whether all ESXi traffic flows through these two VMs, as the management VMs were being perceived as actual switches.

The answer is probably pretty straightforward for most, but I decided to verify anyway.

Two notes first:

1) By adding Nexus 1000v to your environment you may gain some benefits, but you also add complexity. Through the lens of vCenter, it is simply easier to understand and manage a virtual distributed switch (vDS). Some network admins may of course disagree.

2) From Googling a bit, and also from general experience, it doesn't seem like that many people are actually using the 1KVs. There is not much info to be found online, and what is there seems a bit outdated.

That said, let's get to it:

The Nexus 1000v infrastructure consists of two parts.

1) Virtual Supervisor Module (VSM). This is a small virtual appliance for management and configuration. You can have one or two. With two VMs, they run in active/passive mode

2) Virtual Ethernet Module (VEM). This module is installed/pushed to each of the ESXi hosts in the environment.

All configuration of networks/VLANs is done in the VSMs (NX-OS interface) and then pushed to the VEMs. From vCenter it looks like a regular vDS, but you can see in the description that it is a 1KV, see below:


Even if both VSMs should fail, the network will continue to work as before on all ESXi hosts. The network state and configuration is kept separately on each ESXi host in the VEM. However, control is lost and no changes can be made until the VSMs are up and synced again.
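
If you want to verify this from an ESXi host itself, the VEM comes with its own CLI on the host. A quick check could look like the following (assuming the VEM tooling is installed on the host):

# vem status
(shows whether the VEM module is loaded and which switch/ports it handles)

# vemcmd show card
(shows card details, including the connectivity to the VSMs)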

For the other question: no, the VM traffic does not flow through the VSMs, they are only for management. The VM Ethernet traffic flows through the pNICs in the ESXi hosts and on to the physical network infrastructure, the same as with standard virtual switches and vDSes. This means that the VSMs cannot become a bandwidth bottleneck or a single point of failure in that sense.

For documentation, see this video from 14:45 to 15:30 (about 45 seconds).

Below are two diagrams that show the overview of VSM and VEM:




Wednesday, July 22, 2015

Cold Migration of Shared Disks (Oracle RAC and NFS clusters)

Certain applications use shared disks (Oracle RAC and NFS clusters, due to their clustering features). These VMs can be vMotioned between hosts (for RAC you have to be careful with monster VMs and high loads, as the cluster timeout has a low threshold that can be reached during the cut-over), but Storage vMotion is not possible. Migration of the disks has to be done while both virtual machines (or all VMs in the cluster) are shut down (cold migration). The method involves shutting down both the primary and the secondary node, removing the shared disk that has to be migrated (without deleting it) from the secondary node, migrating the disk to the new LUN from the primary VM, and then re-adding the disk to the secondary node after the migration is completed (including configuration of the multi-writer flag for the disk). After this, both VMs can be booted.

Note: These are not RDM disks but regular VMDKs with the multi-writer flag set.
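
For reference, the multi-writer flag is stored as a per-disk row in the VM's configuration parameters, keyed by the disk's SCSI ID. A sketch for a disk on SCSI (1:0); adjust the SCSI ID to match your own disks:

scsi1:0.sharing = "multi-writer"

So for a disk on SCSI (1:1) the row would be scsi1:1.sharing, and so on. This is the same row that is verified and re-added in the steps below.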

Instruction steps

Steps to migrate shared disks (Oracle RAC and NFS)

  • Identify the two VMs that share disks, note the VM names
  • Identify the disk(s) that should be migrated to new LUN, note the scsi ID for each disk (e.g. SCSI (1:0))
  • Note (mostly for Oracle RAC) if disk is configured in Independent and persistent mode
  • Ensure that a maintenance/blackout window is in place
  • Shut down both VMs
  • For secondary VM, go to Edit Settings -> Options -> General -> Configuration Parameters (see screen dump below) and verify if the “multi-writer” flag is set for the disks to be moved
  • While both VMs are shut down, remove the disk(s) from the secondary VM (without deleting it)
  • From primary VM, right click and choose Migrate. Migrate the disk(s) to the new LUN
  • Wait for the process to finish
  • On secondary VM, go to Edit Settings -> Hardware -> Add. Select Hard disk and Use existing hard disk. Browse for the disk in the new location and click add. Make sure the same SCSI ID is used as before
  • For secondary VM, go to Edit Settings -> Options -> General -> Configuration Parameters -> Add row and add the multi-writer flag to each of the re-added disks.
  • (If disk is/was configured in Independent and persistent mode, go to Edit settings -> Hardware -> Mark the disk -> under Mode, check the Independent check-box and verify that the Persistent option is set)
  • Boot the primary VM, boot the secondary VM
  • Ensure that the application is functioning as expected. Done



Monday, July 13, 2015

Permanent Device Loss (PDL) and HA on vSphere 5.5

At my current client we are doing a number of non-functional requirement (NFR) tests involving storage. One of them is about removing a LUN to see if HA kicks in.

The setup is an EMC VPLEX Metro stretched cluster (or cross-cluster) configured in Uniform mode: an active-active setup with site-replicated storage, 50% of the hosts on each site, and vSphere 5.5.

What complicates things in a stretched Metro cluster in Uniform mode is that even though storage is replicated between the sites, the ESXi hosts only see the storage on their own site. So if you kill a LUN on site A in the VPLEX, the hosts in site A will not be able to see the LUNs on site B, and HA is therefore required.

My initial thought was that cutting/killing a LUN on the VPLEX would make the VMs on that LUN freeze indefinitely until the storage becomes available again. This is what happened earlier with vSphere 4.x, and it was a real pain for the VMware admins (an all-paths-down (APD) scenario).

However, as of vSphere 5.0 U1 and later, HA can handle a Permanent Device Loss (PDL), where a LUN becomes unavailable while the ESXi hosts are still running and the array is still able to communicate with the hosts (if the array is down, you have an APD and HA will not kick in).

In vSphere 5.5, HA will handle this automatically if you configure two advanced settings that are non-default. Go to the ESXi host -> Configuration -> Advanced Settings and set the following (there is also a command-line sketch after the list):

  • VMkernel.Boot.terminateVMOnPDL = yes
  • Disk.AutoremoveOnPDL = 0
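
If you prefer the command line, the same settings should be settable from the ESXi shell. This is a sketch (I used the GUI); the assumption is that the VMkernel.Boot.terminateVMOnPDL option corresponds to the kernel setting terminateVMOnPDL:

# esxcli system settings advanced set -o "/Disk/AutoremoveOnPDL" -i 0

# esxcli system settings kernel set -s terminateVMOnPDL -v TRUE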

This has been documented well by Duncan Epping on Yellow-Bricks and on Boche.net, and there is a bit more info here for 5.0 U1.

See screen dump below for settings:

A reboot of the ESXi host is required for the two changes to take effect.



Thursday, April 2, 2015

Dead paths in ESXi 5.5 on LUN 0

At a client recently, while going over the ESXi logs, I found that a certain entry was spamming the /var/log/vmkwarning log. This was not just on one host but on all of them. The entry was:

Warning: NMP: nmpPathClaimEnd:1192: Device, seen through path vmhba1:C0:T1:L0 is not registered (no active paths)


As it was on all hosts, the indication was that the error or misconfiguration was not in the ESXi hosts but probably at the storage layer.

In vCenter, two dead paths for LUN 0 were shown on each host under Storage Adapters. However, it didn't seem to affect any LUNs actually in use:


The environment is running Vblock with Cisco UCS hardware and VNX7500 storage. The ESXi hosts boot from LUN. UIM is used to deploy both LUNs and hosts. VPLEX is used for active-active between sites (Metro cluster).

The ESXi boot LUN has ID 0 and is provisioned directly via the VNX. The LUNs for virtual machines are provisioned via the VPLEX, and their IDs start from 1.

However, ESXi still expects a LUN with ID 0 from the VPLEX. If there isn't one, the above error will show.

Fix

To fix the issue, present a small "dummy" LUN with LUN ID 0 to all the hosts via the VPLEX. It can be a thin provisioned 100 MB LUN. Rescan the hosts, but don't create a datastore on it; just leave it presented to the hosts without being visible/usable in vCenter. This will make the error go away.
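
After presenting the dummy LUN, the rescan and a quick check for remaining dead paths can also be done from the ESXi shell (the grep is just illustrative; the paths can also be checked in vCenter under Storage Adapters):

# esxcli storage core adapter rescan --all

# esxcli storage core path list | grep -i dead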

When storage later has to be added, the dummy LUN will show up as an available 100 MB LUN, and operations will likely know not to add this particular LUN.

From a storage perspective the steps are the following:


  • Manually create a small thin LUN on the VNX array
  • Present it to the VPLEX storage group (SG) on the VNX
  • Claim the device on the VPLEX
  • Create a virtual volume
  • Present it to the storage views with LUN ID 0
  • Note: don't create a datastore on the LUN

Update 2015.07.21:
According to VCE, adding this LUN 0 is not supported with UIM/P (the provisioning tool for Vblock). We started seeing issues with the re-adapt function for UIM/P, and storage issues after that, so we had to remove the LUN 0. So far, there is no fix if using UIM/P.