Tuesday, April 15, 2025

Azure: Troubleshoot connectivity to a key vault with a private endpoint

If you have a key vault that you can't reach, there can be multiple reasons for this. Two of the most common are DNS issues and firewall blocks.

This post will go over those two issues and show a couple of ways to test for connectivity.

When working in hybrid setups, with an on-prem location connected to Azure via either VPN or ExpressRoute, it can happen that you can create and see the key vault (this also goes for e.g. storage accounts and other PaaS services) but get an error when trying to add a secret or other content to it. The error can mention e.g. "the connection to the data plane failed". In other words, the control plane (management) operations work, but the data plane traffic, which goes through the private endpoint, does not.

To troubleshoot, first ensure that the key vault name resolves to the private IP of the private endpoint.

nslookup myown-keyvault.vault.azure.net

On Linux you can use the dig command to get slightly better lookup details than with nslookup:

dig myown-keyvault.vault.azure.net

This should resolve to a local IP. If it doesn't resolve at all or if it returns a public IP, there is something wrong with the DNS setup.
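
As a rough illustration (the addresses below are made up, and the vault name is just the example name used above), a healthy private endpoint lookup resolves through the privatelink zone to a private IP, roughly like this:

Non-authoritative answer:
Name:    myown-keyvault.privatelink.vaultcore.azure.net
Address:  10.1.2.4
Aliases:  myown-keyvault.vault.azure.net

If it resolves to a public IP instead, the query is bypassing the privatelink.vaultcore.azure.net zone - typically because a conditional forwarder is missing on-prem or the private DNS zone is not linked to the VNet.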

More info on troubleshooting DNS can be found here.

If that works, you can check for connectivity from your source. This can be done in a couple of ways, either of which is fine.

From Windows run:

tnc myown-keyvault.vault.azure.net -port 443

From Linux run (netcat and nc are the same command):

nc -zv myown-keyvault.vault.azure.net 443

All traffic to a key vault goes over port 443, so if you have connectivity on that port and the name resolves to the private IP, you should be good.
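
To do both checks in one go from Windows, a minimal PowerShell sketch could look like this (using the same example vault name as above):

# DNS check - should return a private (e.g. 10.x.x.x) address via the privatelink zone
$vault = 'myown-keyvault.vault.azure.net'
Resolve-DnsName $vault | Select-Object Name, Type, IPAddress

# TCP check - TcpTestSucceeded should come back as True on port 443
Test-NetConnection -ComputerName $vault -Port 443 |
    Select-Object ComputerName, RemoteAddress, TcpTestSucceeded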

Alternatively try:

$(Invoke-WebRequest -UseBasicParsing -Uri https://myown-keyvault.vault.azure.net/healthstatus).Headers

Or from Linux:

curl -i https://myown-keyvault.vault.azure.net/healthstatus
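
If DNS and port 443 connectivity are fine, either call should come back with an HTTP 200 and a set of x-ms-keyvault-* response headers; the /healthstatus endpoint does not require authentication, so it works for pure connectivity tests. The response looks roughly like this (illustrative values):

HTTP/1.1 200 OK
x-ms-keyvault-region: westeurope
x-ms-keyvault-service-version: 1.9.1234.5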

There is more info on troubleshooting behind a firewall here.

Tuesday, April 17, 2012

Could not power on VM - lock was not free

The other day we experienced an incident on the SAN storage with high latency and even loss of connection to the SAN. This can generate a lot of really unpleasant errors on the ESX hosts. Even after the SAN is brought back to a stable state, we've seen hosts that won't boot, VMs that won't vMotion, and VMs that won't power on due to file locks.

If you receive a 'locked file error' (like the screendump below) and your VM won't boot, there are a couple of ways to go about it. This VMware KB article explains it quite well. Either you can cold migrate the VM to the other hosts in the cluster (to find the ESX host that holds the lock) and then try to boot it from there, or you can try to locate specifically which host has the lock.


If the vCenter log does not tell you specifically which files are locked, this can be seen in vmware.log, which is located in the VM's folder on the datastore. If you just tried to power on the VM, the relevant info should be at the end of the log file.
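
A quick way to check this from the ESX console over SSH (the datastore and folder names below are placeholders):

# show the last lines of the log and filter for lock messages
tail -n 50 /vmfs/volumes/<datastore>/<vm-folder>/vmware.log | grep -i lock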

In the example below, it is the swap file (.vswp) that is still locked.


This can be verified by running the touch command on the locked file (the touch will fail if the file is still locked).
With vmkfstools you can get the MAC address of the host that holds the lock:

# vmkfstools -D /vmfs/volumes/<datastore>/<vm-folder>/<locked-file>
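
The output looks roughly like the made-up example below (exact fields vary between ESX versions); the interesting part is the owner field, where the last twelve hex digits are the MAC address of the host holding the lock:

Lock [type 10c00001 offset 13058048 v 20, hb offset 3499520
gen 532, mode 1, owner 45feb537-9c52009b-e812-00137266e200 mtime 1174669462]

In this made-up output the owning host's MAC address would be 00:13:72:66:e2:00.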


In the screendump below, the MAC address has been highlighted.


The same info can be found in the /var/log/vmkernel log.


Once you have the MAC address you can find a match by, for example, logging in to vCenter or onto the blade enclosure. When you have a match, cold migrate the VM to the relevant ESX host and boot it.
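
If you would rather look up the MAC with PowerCLI than click through vCenter, a minimal sketch could look like this (the MAC value is a placeholder, and it assumes you are already connected with Connect-VIServer):

# vmkfstools prints the MAC without separators; PowerCLI uses colon-separated notation
$mac = '00:13:72:66:e2:00'

# check every host's physical and vmkernel/service console adapters for a matching MAC
Get-VMHost | ForEach-Object {
    $esx = $_
    Get-VMHostNetworkAdapter -VMHost $esx |
        Where-Object { $_.Mac -eq $mac } |
        Select-Object @{N='Host';E={$esx.Name}}, Name, Mac
}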

Thursday, July 21, 2011

ESXTOP to the rescue - VM latency

Earlier on I had mostly used ESXTOP for basic troubleshooting such as CPU ready and the like. Last weekend we had a major incident caused by a power outage that affected a whole server room. After the power was back on, we had a number of VMs showing very poor performance - as in it took about one hour to log in to Windows. It was quite random which VMs were affected. The ESX hosts looked fine. After a bit of troubleshooting, the only common denominator was that the slow VMs all resided on the same LUN. When I contacted the storage night duty, the response was that there was no issue on the storage system.

I was quite sure that the issue was storage-related, but I needed some more data. The hosts were running v3.5, so troubleshooting towards storage is not easy.

I started ESXTOP to see if I could find some latency numbers. I found this excellent VMware KB article which pointed me in the right direction.

  • For VM latency, start ESXTOP and press 'v' for VM storage related performance counters.
  • Then press 'f' to modify the counters shown, and press 'h', 'i', and 'j' to toggle the relevant counters (see screendump 2) - which in this case are the latency stats (remember to stretch the window to see all counters).
  • What I found was that all the affected VMs had massive latency towards the storage system - DAVG/cmd (see screendump 1) was around 700 ms, while the rule of thumb is that max latency should be about 20 ms. Another important counter is KAVG/cmd, which is the time commands spend in the VMkernel, i.e. in the ESX host itself (see screendump 3). That counter was low, so there was no latency in the ESX host, only long latency towards the storage system.
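
If you need to hand hard numbers over to the storage team, esxtop can also record the counters in batch mode for offline analysis - a minimal sketch, where the interval and sample count are just example values:

# capture all counters every 10 seconds, 60 samples (about 10 minutes),
# into a CSV that can be opened in Windows perfmon or esxplot afterwards
esxtop -b -d 10 -n 60 > /tmp/esxtop-capture.csv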

After pressing the storage guys for a while, they had HP come and take a look at it, and it turned out that there was a defective fiber port in the storage system. After this was replaced, everything worked fine and latency went back to nearly zero.

In this case, it was only because I had proper latency data from ESXTOP that I could be almost certain that the issue was storage related.


Screendump 1
Screendump 2
Screendump 3