Early notification of vmhba resets in a vSAN cluster

Over the past few years I have experienced several vSAN performance issues caused by faulty hardware. The goal was to know at an early stage whether there are hardware errors that can lead to performance degradation.

One problem I have seen a few times is hardware-related issues that lead to high latency, outstanding I/Os and congestion on the backend storage. I wondered whether it was possible to spot these kinds of issues earlier, so I started searching in vRealize Log Insight.

During my research I found the relevant events. In the period leading up to the performance issues, many "Power-on Reset on vmhba" messages had been written to vobd.log and vmkernel.log. At first it was a few events per day, but over time the frequency of the events increased until it finally resulted in very poor vSAN performance.

In the following steps I will explain how to define an email alert in vRealize Log Insight that helps detect this kind of issue at an early stage, so you can take action before it leads to real problems.

Step 1. Create a query that searches for "Power-on Reset occurred on vmhba" events

Step 2. Create an alert from the query

Step 3. Define the alert
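
If you want to check a specific host by hand for these events, you can also search the ESXi logs directly over SSH. This is just a quick manual check, not a replacement for the Log Insight alert; the paths are the default ESXi log locations mentioned above:

    # search the logs the reset events are written to
    grep -i "power-on reset" /var/log/vobd.log /var/log/vmkernel.log

A growing number of hits over time is a good reason to investigate the controller or disks behind that vmhba.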

Add Skyline to Customer Connect

Recently VMware rebranded its customer portal "My VMware" to "Customer Connect". One of the cool new features that I really appreciate is the app launcher. For example, you can add Skyline to the app launcher, so you can go directly to Skyline (SSO) after logging in to Customer Connect. Follow the next steps to add Skyline to VMware Apps.

Step 1. Log in to VMware Customer Connect and select the Customize button.

Step 2. Select VMware Apps and add VMware Skyline to My Apps.

Step 3. Done. You have added VMware Skyline to VMware Apps.

When you go back to the home page or log in again next time, you will see VMware Skyline in your apps and can go directly to Skyline.

Cannot install the vCenter Server agent (HA) service. Unknown installer error

It had been a while since I had installed a non-HCI VMware cluster. After installing the ESXi hosts, the updates and multipath software were installed and the storage team made the datastores available. Nothing special. After the installation, the host was taken out of maintenance mode. Then the following error appeared: "Cannot install the vCenter Server agent service. Unknown installer error". See VMware KB #2083945 and VMware KB #2056299.

I followed all the standard procedures to resolve HA errors:

  • Right-click the affected host and select Reconfigure for vSphere HA
  • Reconfigure HA at the cluster level: turn off vSphere HA and turn it back on
  • Disconnect and reconnect the affected host

After performing the options above, the issue was still unresolved. Next I wanted to know whether the HA (FDM) agent was installed at all, so I connected to the host over SSH and ran the following command:

esxcli software vib list | grep fdm

The output was empty, so the HA agent was not installed. VMware KB #2056299 mentions a VIB dependency, which made me realize that, besides the VMware updates, multipath software had also been installed: Dell EMC PowerPath/VE. This pointed me in the right direction to solve the problem.

Solution (a consolidated command sketch follows this list):

  • SSH to the affected host (in maintenance mode)
  • Run esxcli software vib list or esxcli software vib list | grep power. The result is three VIBs: powerpath.plugin.esx, powerpath.cim.esx and powerpath.lib.esx
  • Uninstall the three VIBs with the following command: esxcli software vib remove --vibname=powerpath.plugin.esx --vibname=powerpath.cim.esx --vibname=powerpath.lib.esx
  • Reboot the host
  • Run esxcli software vib list | grep power again. The output should be empty.
  • Exit maintenance mode. The HA agent now installs. After the HA agent is installed, enter maintenance mode again.
  • Run esxcli software vib list | grep fdm. The output should look similar to: vmware-fdm VMware VMwareCertified 2021-02-16
  • Reinstall Dell EMC PowerPath/VE. Installing the same version of PowerPath/VE gave a VUM error, even after restarting the host. To resolve this I installed a newer version of PowerPath/VE, which installed successfully.
  • Exit maintenance mode
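
Put together, the sequence on the host looks roughly like this. It is a sketch that assumes the standard PowerPath/VE VIB names shown above; always verify the names against the output of the list command first:

    # list the installed VIBs and filter for PowerPath
    esxcli software vib list | grep -i power

    # remove the three PowerPath/VE VIBs reported by the previous command
    esxcli software vib remove --vibname=powerpath.plugin.esx --vibname=powerpath.cim.esx --vibname=powerpath.lib.esx

    # reboot, then verify that the PowerPath VIBs are gone
    reboot
    esxcli software vib list | grep -i power

    # after the host has left maintenance mode and HA has reconfigured it,
    # this should show the vmware-fdm VIB
    esxcli software vib list | grep fdm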

In my case the PowerPath/VE VIB dependencies were causing the issue, but another dependency can cause this problem as well. I am aware that finding the right dependency can be a difficult job; I hope I have at least helped you start the search in the right direction.

November 2022. An update about this issue can be read here.

Disable ESXi host alerts in vROPS when a host is in maintenance mode

For many years I used the Veeam Management Pack for VMware to monitor VMware environments, and after the switch to vROPS I never really missed it. However, there was one option I could not find in vROPS whenever an ESXi host was in maintenance. It's such a small thing that makes you think "I still have to do something with that". What is it? If an ESXi host is in maintenance mode, I don't want any alerts from that host.

Recently, a colleague pointed me to an article that offers a solution to this problem. It is actually a very simple solution that only needs to be configured once. Because the article was already several years old, I have rewritten it based on the most recent version, vROPS 8.2.x.

Use Case – An administrator wants to disable alerts on an ESXi host that has been put into maintenance mode in vCenter, to avoid any alerts from that host inside vROPS, while continuing to collect metrics from it.
Goal – Do this automatically, without any manual changes in vROPS. As soon as a host is in maintenance mode in vCenter, vROPS should detect this and stop alerting on the host.
Solution – This can be achieved with a one-time configuration using a custom group and a policy.

1- Create a new policy in vROPS named “Policy ESXi Hosts in maintenance mode”. This policy can be created under the default policy.
Go to Administration -> Policies -> Add
2- Select the default policy and click on the Add symbol to add a new policy.
3- Give it a name and description as shown below.

4- Click on Alerts and Symptom Definitions and filter the list of alerts to show only Host System alerts. We want a filtered list so that we can disable them in one go.

5- Now press CTRL + A on the keyboard to select all of them; you can also click on Actions -> Select All.
6- Click on Actions -> State -> Disable

7- Click on Save, and now you can see the new policy under your default policy.

8- Create a new custom group named "Group ESXi hosts in maintenance mode". Use criteria along the lines of the sketch below to dynamically add members to this custom group, based on an ESXi host property that vROPS collects every few minutes.
Click on Environment -> Custom Groups -> Click on Add to add a new custom group.
Make sure to select the policy "Policy ESXi hosts in maintenance mode" which we created earlier.
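
The membership criteria look roughly like this. The exact property path and value differ slightly between vROPS versions, so treat the lines below as assumptions and pick the maintenance-related Host System property offered in the criteria drop-down:

    Object type : Host System
    Criteria    : Properties -> Runtime|Maintenance State   (assumed property path)
                  is "inMaintenance" / "true"                (assumed value, version dependent)
    Policy      : Policy ESXi hosts in maintenance mode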

9- Click on Preview to see if you are getting results. If there is any host in maintenance mode it will be displayed in the preview.

10- Finally, go to Administration -> Policies -> Active Policies and set the newly created policy to priority 1.

Now, as soon as you put an ESXi host into maintenance mode in vCenter, within a few minutes it will be discovered as a host in maintenance in vROPS and added to the newly created custom group "Group ESXi hosts in maintenance mode". All alerts from this ESXi host are then disabled, and you will not see any alerts as long as the host is in maintenance mode.

Once the ESXi host is out of maintenance mode, it will be moved out of the custom group.
Do note that if you add any new host-related alerts in the future, you need to make sure they are also disabled in this policy.

Reference: Vxpresss blogspot

VMware vCLS datastore selection

Recently I noticed that after updating VMware vCenter from 6.7 to 7.0 U1, the new VMware vCLS VMs were placed on datastores that are not meant for VMs.

Starting with vSphere 7.0 Update 1, vSphere Cluster Services (vCLS) is enabled by default and runs in all vSphere clusters.
vCLS ensures that if vCenter Server becomes unavailable, cluster services remain available to maintain the resources and health of the workloads that run in the clusters.

The datastore for vCLS VMs is automatically selected based on ranking all the datastores connected to the hosts inside the cluster. A datastore is more likely to be selected if there are hosts in the cluster with free reserved DRS slots connected to the datastore. The algorithm tries to place vCLS VMs in a shared datastore if possible before selecting a local datastore. A datastore with more free space is preferred and the algorithm tries not to place more than one vCLS VM on the same datastore. You can only change the datastore of vCLS VMs after they are deployed and powered on.

You can perform a storage vMotion to migrate vCLS VMs to a different datastore.
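
To quickly see which vCLS VMs are registered on a particular host, and on which datastore their files live, you can also check from the ESXi shell. A small sketch, assuming SSH access to the host:

    # list all VMs registered on this host and filter for the vCLS VMs;
    # the file column shows the datastore and path of each VM
    vim-cmd vmsvc/getallvms | grep -i vcls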

If you want to move vCLS VMs to a different datastore or attach a different storage policy, you can reconfigure vCLS VMs. A warning message is displayed when you perform this operation.

Conclusion: if datastores are in use that are intended for, for example, repository purposes, it is possible that the vCLS files end up on those datastores. You can tag vCLS VMs or attach custom attributes if you want to group them separately.

Reference: docs.vmware.com

VMware vSphere 7 first impression

Yesterday VMware released version 7 of vSphere. After downloading the necessary software, I built a nested vSAN 7 cluster in my lab. This is not a deep technical blog post, just my first impression.

vSphere logo 2020


I chose a fresh installation instead of an upgrade, which has to do with the available resources in my lab. The installation was as simple as usual.

  • Deploy 4 nested ESXi hosts
  • Install vCSA
  • Create a cluster
  • Configure networks
  • Create vSAN
  • Deploy vm’s
  • Setup Skyline
  • Setup Backup

Deploying nested ESXi

When creating the nested ESXi hosts, don't forget to check the CPU option "Expose hardware assisted virtualization to the guest OS". This is required for a working nested ESXi host.

CPU hardware assisted virtualization enabled
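
The checkbox corresponds to a single setting in the VM's configuration file. If you prefer to set it from the shell of the physical ESXi host, a hedged sketch (the datastore path and VM name are examples, and the nested ESXi VM must be powered off):

    # enable nested hardware-assisted virtualization for the nested ESXi VM
    echo 'vhv.enable = "TRUE"' >> /vmfs/volumes/datastore1/nested-esxi-01/nested-esxi-01.vmx
    # reload the VM configuration so the change is picked up
    # (use the VM id reported by vim-cmd vmsvc/getallvms)
    vim-cmd vmsvc/reload <vmid>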

After starting the ESXi installer and just before the deployment, the following warning appeared.

CPU Warning during ESXi setup

This message is due to the obsolete CPU type of the physical ESXi host. Because it’s a lab we ignore the warning and start the deployment. After a few minutes the installation is finished.

Hooray!

vCenter vCSA

The first thing you notice is the absence of the old Flash-based vSphere Web Client; hardly anybody used it anymore anyway. Only the native HTML5 client is available.

vSphere UI

vSAN cluster

I manually created a local vSAN cluster. I prefer this method because it gives more flexibility than the Cluster Quickstart wizard. There are a lot of new and enhanced features.

New:

  • Simplify Cluster Updates with vSphere Lifecycle Manager
  • Native File Services for vSAN

Enhancements:

  • Integrated DRS awareness of Stretched Cluster configurations
  • Immediate repair operation after a vSAN Witness Host is replaced
  • Stretched Cluster I/O redirect based on an imbalance of capacity across sites
  • Accurate VM level space reporting across vCenter UI for vSAN powered VMs
  • Improved Memory reporting for ongoing optimization
  • Visibility of vSphere Replication objects in vSAN capacity views
  • Support for larger capacity devices
  • Native support for planned and unplanned maintenance with NVMe hotplug
  • Removal of Eager Zero Thick (EZT) requirement for shared disk in vSAN

The complete information can be found here.

The vSAN capacity monitoring has also been greatly improved. It gives a good overview of the current and historical capacity usage.

Capacity Usage
Capacity History

Virtual Machines

Windows Server 2019 is now available as a guest OS.

Windows 2019 available as Guest OS

Skyline

Skyline gives a daily overview of security findings and recommendations for VMware environments, which is why I immediately added this cluster to Skyline. I wonder whether there will be any findings and recommendations after the first collection of data.

Update Skyline April 4, 2020

The vSphere 7 lab is connected to VMware Skyline, and there are already two recommendations. Good to see it works.

vSphere 7 connected to VMware Skyline

Backup

The VMs in this environment must also be backed up. I chose to use the backup solution from Veeam, V10. I don't know whether Veeam officially supports vSphere 7 yet, but it works in my lab.

Conclusion

VMware has released multiple enhancements and improvements with vSphere 7, which remains the strong engine of a modern SDDC. In addition to vSphere 7, VMware has also released VMware Cloud Foundation 4.0 and VMware Tanzu. There is a lot to read and learn about all the new and enhanced VMware products.

What’s new in vSAN 7.0

Yesterday, VMware announced the following new software.

  • VMware vSphere 7.0
  • VMware Cloud Foundation 4.0
  • VMware Tanzu

With the announcement of VMware vSphere 7.0, vSAN 7.0 has also become available.

An overview of new and enhanced functions.

New:

  • Simplify Cluster Updates with vSphere Lifecycle Manager
  • Native File Services for vSAN
  • Deploy More Modern Applications on vSAN with Enhanced Cloud Native Storage

Enhancements:

  • Integrated DRS awareness of Stretched Cluster configurations
  • Immediate repair operation after a vSAN Witness Host is replaced
  • Stretched Cluster I/O redirect based on an imbalance of capacity across sites
  • Accurate VM level space reporting across vCenter UI for vSAN powered VMs
  • Improved Memory reporting for ongoing optimization
  • Visibility of vSphere Replication objects in vSAN capacity views
  • Support for larger capacity devices
  • Native support for planned and unplanned maintenance with NVMe hotplug
  • Removal of Eager Zero Thick (EZT) requirement for shared disk in vSAN

The complete information can be found here:

https://blogs.vmware.com/virtualblocks/2020/03/10/announcing-vsan-7/

What’s new in vSAN 7.0

https://www.youtube.com/watch?v=a8q6dqBnPtw&feature=youtu.be

Not enough free space to upload VxRail update

As you probably know, I like the VxRail HCI concept. Yet there is one point that, in my opinion, can still be improved.
Sometimes a log bundle has to be generated for support purposes in a VxRail cluster. After creating a new log bundle it can be downloaded but not deleted, with the result that these logs remain on the VxRail Manager (VxRM). Not a problem in itself, but it has often happened to me that not enough free space was available when uploading new VxRail code. The example below shows that "/dev/sda3" is 80% full.

vxrm:~ # df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 4.0K 3.9G 1% /dev/shm
tmpfs 3.9G 393M 3.6G 10% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sda3 16G 7.6G 8.1G 80% /
/dev/sda1 124M 39M 80M 33% /boot
/dev/mapper/data_vg-store1 2.0G 3.1M 1.9G 1% /data/store1
/dev/mapper/data_vg-store2 14G 9.3G 3.8G 72% /data/store2
tmpfs 850M 0 850M 0% /run/user/123
tmpfs 850M 0 850M 0% /run/user/4000

The following command finds large temporary files that are usually left behind after an update or after generating a support log bundle. Always take a snapshot before making any changes.

find /tmp -type f -size +20000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
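
If you prefer to remove the reported files in one go, the same find command can delete them interactively instead of listing them. This is only a sketch and, as noted at the end of this post, at your own risk; take the snapshot first and review the listing above before running it:

    # prompt for confirmation before deleting each large file under /tmp
    find /tmp -type f -size +20000k -exec rm -i {} \;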

Check the output and delete the large files in "/tmp". As can be seen in the overview below, "/dev/sda3" is now only 52% full, which is more than enough space to upload the VxRail update.

vxrm:~ # df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 4.0K 3.9G 1% /dev/shm
tmpfs 3.9G 393M 3.6G 10% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sda3 16G 7.6G 8.1G 52% /
/dev/sda1 124M 39M 80M 33% /boot
/dev/mapper/data_vg-store1 2.0G 3.1M 1.9G 1% /data/store1
/dev/mapper/data_vg-store2 14G 9.3G 3.8G 72% /data/store2
tmpfs 850M 0 850M 0% /run/user/123
tmpfs 850M 0 850M 0% /run/user/4000

My conclusion is that I would prefer a direct download stream of the support log bundle instead of the file being placed on the VxRail Manager. Maybe in a future release?

The above is just an example; any changes you make are at your own risk. You can always log a support case with Dell Support if you encounter this issue.

WSFC on vSAN, backup & restore

After a week in Barcelona for VMworld Europe 2019, I came home with a lot of new information and ideas. This post is about Windows Server Failover Clustering (WSFC) on vSAN and how to back it up and restore it. WSFC is now fully supported on vSphere 6.7 Update 3 and, for Dell VxRail users, code 4.7.300.

I started by reading VMware KB 74786. It's a good starting point and describes the straightforward deployment.

First I deployed two Windows Server 2016 VMs in a vSAN cluster. After the initial deployment I added the failover cluster file server role on both VMs. Then it was time to power off both VMs and add a paravirtual SCSI controller with physical bus sharing to each of them.

The next step is to reconfigure VM1 and add two new disks: the first disk is 5 GB (quorum) and the second disk is 50 GB (file server data). After reconfiguring the VMs, it's time to power them on again.

On VM1 I brought the new disks online and formatted them as NTFS. The next step is crucial before the cluster can be created; if you forget it, the disks are not detected in the cluster configuration. Power off VM2 and add the two existing disks from VM1 to the paravirtual SCSI controller. Power VM2 back on after reconfiguring.
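
For reference, after this reconfiguration the relevant entries in the .vmx file of each VM look roughly like the sketch below. This is illustrative only; the controller and disks are added through the vSphere Client as described, and the disk file names are example values:

    scsi1.present = "TRUE"
    scsi1.virtualDev = "pvscsi"
    scsi1.sharedBus = "physical"
    scsi1:0.fileName = "vm1_quorum.vmdk"
    scsi1:1.fileName = "vm1_data.vmdk"

scsi1.virtualDev selects the paravirtual controller, scsi1.sharedBus enables physical bus sharing, and the two fileName entries point to the 5 GB quorum disk and the 50 GB data disk. On VM2 the same fileName entries reference the existing disks of VM1.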

Creating the cluster is now as straightforward as on physical hardware. You need a cluster-core FQDN and, for the file server role, a cluster access point (CAP) FQDN. There is a lot of documentation available about configuring a Windows failover cluster; otherwise, ask your favourite Windows admin :-).

After the deployment I did some failover and failback tests and was surprised by the speed of the failover. I know there were not many client connections, but I was really impressed.

Backup and restore

I was already convinced that WSFC on vSAN would work, but how do you back up and restore the cluster and the data on it? I thought about this because snapshots are unsupported with WSFC on vSAN; see VMware KB 74786.

I performed the backup and restore tests in my test lab with Veeam B&R 9.5 Update 4b.

The backup and restore test configuration:

First I excluded the two VMs from the snapshot-based backups. The next step is to create a new protection group for virtual failover clusters in the inventory view. In the Active Directory tab, search for and add the two nodes and the cluster core. In the exclusions tab of the new protection group I unchecked "Exclude all virtual machines"; this is important, because otherwise the cluster nodes can't be added to the protection group. Use a service account with sufficient permissions and keep the defaults in the options tab. After completing the new protection group wizard, the Veeam Agent for Windows is deployed on the cluster nodes; a reboot is needed.

Using the Veeam Agent for Windows is the trick in this test: I treated the cluster and nodes as if they were physical. The final step is to configure a backup job and run the backup. After this initial backup I created the recovery ISO for both nodes for a bare-metal restore (BMR).

I successfully performed the following restores from a Veeam Windows ReFS landing zone server:

  • File / folder
  • Volume restore
  • Bare metal restore

Everything went normally. Only a BMR restore with the recovery ISO is a bit different from a BMR of a physical server, and you have to keep the following in mind. Normally, when you create a recovery ISO, all the network drivers are included in it, but the VMware VMXNET3 driver is not. I asked Veeam support whether it is possible to add the VMXNET3 driver; it is not. There is, however, an option to load a driver during startup of the recovery ISO. During my test I was able to browse to the driver in the Windows folder C:\Windows\System32\DriverStore\FileRepository\vmxnet3.inf_amd64_583434891c6e8231 and load it successfully. Maybe in the future there will be other ways to achieve this.

During the BMR restore I was only able to recover the system volumes. This is by design, I guess, because normally the other cluster node, including the data volumes, is still online. Finally, I successfully tested a recovery of an entire cluster data volume.

Conclusion:

The test deployment of WSFC on vSAN helped me better understand how it works. I definitely see possibilities for WSFC on vSAN.

The backup and restore tests helped me find an answer to how to back up and restore a WSFC-on-vSAN cluster. The tested backup configuration is supported by Veeam; I logged a case to ask them and they confirmed it. Keep in mind that your guest OS must be supported; see the Veeam release notes document.

Cheers!

ESXi hosts not in maintenance in SCOM

Recently I upgraded VMware clusters from version 6.0 to 6.5. The upgrade went smoothly, but I noticed that the ESXi hosts were not shown as in maintenance mode (MM) in SCOM.

Setup: Microsoft SCOM with the Veeam Management Pack for System Center. Other ESXi hosts, managed by another vCSA, did not show this issue.

After some investigation by Veeam Support, the root cause was found: the different ESXi versions before and after the upgrade were causing the issue.

The solution was easy and straightforward: clear the SCOM agent cache on all Veeam Collectors and the Veeam Enterprise Server (VES), then run "Rebuild the full topology" in the Veeam VES management web page. How long the rebuild takes depends on the size of your environment. I waited a few hours, then put an ESXi host into maintenance mode and back out again; everything worked as usual.