Get notified when vSAN Force Provisioning is enabled and applied to a VM

Recently I was asked whether it is possible to receive an email notification when a vSAN storage policy with Force Provisioning enabled is applied to a VM. In this blog post I want to show that this is possible.

Use Case – An administrator wants to apply a vSAN storage policy with Force Provisioning enabled to virtual machines because of a possible shortage of vSAN storage capacity. In my opinion this is not a very good idea in a production environment!

Goal – Get an email notification when a vSAN storage policy with Force Provisioning enabled is applied to a VM. There is also the wish to make this visible in a dashboard.

Solution – With a bit of reverse engineering and vRealize Log Insight (vRLI) it’s possible to achieve this.

Lab setup – VMware vSphere 7.0 Update 3a vSAN cluster and vRLI 8.6.

The first step is to create a storage policy with Force Provisioning enabled. We name this policy “FP VM Storage Policy”.

We need to apply the new storage policy “FP VM Storage Policy” to our test VM “sbpm01”.

The policy is successfully applied to the VM “sbpm01”.

Now we need some reverse engineering because it’s not possible to grep the name of the storage policy in vRLI. We move to the sbpm01 events in vCenter.

Note the following two details:

  1. Event Type ID: com.vmware.pbm.profile.associate
  2. Associated storage policy: 98df0443-5244-49af-9069-ad9fdbfedb52

This is the information we need to create a new filter in vRLI Interactive Analytics.

Add the associated storage policy ID “98df0443-5244-49af-9069-ad9fdbfedb52” to the text field and add an extra filter (+ ADD FILTER). From the pull-down menu, choose “vc_event_type” contains com.vmware.pbm.profile.associate. Then choose a time window. In this example I chose “Latest 24 hours of data”. You can also choose the last hour or the last 5 minutes of data; it depends on when you applied the policy and when you search in vRLI.

The results above don’t show the name of the VM with the applied storage policy. In the last image above, the Field Table section is on the right of Events. Select Field Table and search for the row named vc_vm_name. Below it, the friendly name of the VM with the newly applied storage policy is displayed.
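The filter logic can also be written down as a small Python sketch. This is only an illustration of the matching rule, not how vRLI works internally: the dictionary layout of an event is an assumption, and only the field names vc_event_type and vc_vm_name and the two values come from the steps above.

```python
# Values taken from the vCenter event details noted above.
POLICY_ID = "98df0443-5244-49af-9069-ad9fdbfedb52"
EVENT_TYPE = "com.vmware.pbm.profile.associate"


def matches_filter(event, policy_id=POLICY_ID, event_type=EVENT_TYPE):
    """True when the event text contains the policy ID and
    vc_event_type contains the given event type."""
    return (policy_id in event.get("text", "")
            and event_type in event.get("vc_event_type", ""))


def affected_vms(events):
    """Collect vc_vm_name of every matching event, without duplicates."""
    names = []
    for event in events:
        if matches_filter(event):
            name = event.get("vc_vm_name")
            if name and name not in names:
                names.append(name)
    return names


# Hypothetical event records, invented for this illustration.
sample_events = [
    {"text": "Associated storage policy: " + POLICY_ID,
     "vc_event_type": "com.vmware.pbm.profile.associate",
     "vc_vm_name": "sbpm01"},
    {"text": "Associated storage policy: 00000000-0000-0000-0000-000000000000",
     "vc_event_type": "com.vmware.pbm.profile.associate",
     "vc_vm_name": "othervm"},
]

print(affected_vms(sample_events))
```

Both conditions have to match, which is why the second sample event (same event type, different policy ID) is ignored.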

Finally, you want an email notification and a dashboard. I am not going to explain here how to create them; this is done in vRLI the same way you normally create notifications and dashboards. Press icon (1) to create an email notification and icon (2) to create a dashboard.

If you also want to receive an email notification when the VM gets another storage policy applied, create another filter with the following two details.

  1. Event Type ID: com.vmware.pbm.profile.dissociated
  2. Associated storage policy: 98df0443-5244-49af-9069-ad9fdbfedb52

Conclusion:

In this blog post I wanted to demonstrate that it is possible with vRLI to receive an email notification when a VM has a storage policy applied where Force Provisioning is enabled. A disadvantage is that if the VM gets a different storage policy with different settings, this email notification is no longer valid, because the notification is based on this specific storage policy ID.

I have shown that it works, but in my opinion it is not a solution for a production environment.

VxRail 7.0.300 GA

What’s new in VxRail 7.0.300

VxRail software version 7.0.300 includes VMware ESXi 7.0 Update 3, VMware vSAN 7.0 Update 3 and VMware vCSA 7.0 Update 3a, with support for external storage and the introduction of satellite nodes.

New features

Operationalize the edge with VxRail satellite nodes:
You can deploy the E660, E660F, and V670F as single VMware vSphere nodes with no VMware vSAN to address VxRail edge deployments that require a smaller footprint. You can configure satellite nodes with an optional PowerEdge RAID controller to add resiliency for local disks. The satellite nodes are managed by a new or existing standard cluster with VMware vSAN running 7.0.300.

Control satellite nodes from a central location:
You can deploy a VxRail Manager VM that can control all satellite nodes from a centralized host management location in VMware vCenter. You can add, remove, and update satellite nodes from one access point using VxRail Manager.

Expanded storage option for VxRail dynamic nodes:
You can deploy VxRail dynamic nodes as part of a PowerFlex 2-layer architecture. Deploy a VxRail dynamic node cluster as compute-only nodes that leverage PowerFlex storage for hosting the workload VMs.

Protocol support for VxRail dynamic nodes:
NVMe-FC is supported with PowerStore and PowerMax storage arrays that are attached to dynamic nodes.

VMware ESXi 7.0 Update 3, VMware vSAN 7.0 U3, VMware vCSA 7.0 Update 3a support. The major changes for VxRail include:
Support upgrade of the VMware vSAN Witness Host (dedicated) in vLCM as part of the coordinated cluster remediation workflow for VMware vSAN 2-Node and Stretched Clusters.

  1. Stretched cluster enhancement that allows a stretched cluster deployment to tolerate planned or unplanned downtime of a site and the witness.
  2. Nested fault domains in a 2-node configuration
  3. Easy VMware vSAN cluster shutdown and start-up
  4. Upgrade note for VxRail with external storage

Source: https://dl.dell.com/content/docu98130

Extracting VxRail code 7.0.2xx failed at 50%

Sometimes you run into an issue that keeps you busy for hours, and afterwards the cause turns out to be easy to solve. Recently I ran into such an issue.

There was a minor update that needed to be done: a VxRail code upgrade from 7.0.x to 7.0.2xx.

The upgrade was basically like all other upgrades:

  1. Run VxVerify
  2. If there are findings in the results, solve them before starting the upgrade
  3. Upload the desired VxRail target code
  4. Start the upgrade
  5. Done

The VxVerify results were fine; no issues were detected.

While uploading the target VxRail code everything looked fine, but the extraction of the upgrade bundle failed at 50%. I started a retry, but the extraction failed again at 50%. At the cluster level we noticed the following error.

VXR1F4114 ALARM Upload of upgrade composite bundle unsuccessful VxRail Update ran into a problem… Error extracting upgrade bundle 7.0.2xx. Failed to upload bundle. Please refer to log for more details.

I opened a support request with Dell Support and in the meantime started to examine lcm-web.log in /var/log/mystic. I found some errors and failures, but they did not lead directly to the root cause. There were errors about upgrade bundles that couldn’t be uploaded, but those events were too general. I did notice which VxRail node was mentioned last in the log before the extraction failed.

Dell Support was now also working on the case. The support engineer also noted that the VxRail node I suspected was causing the problem.

I won’t go into too much detail, but at some point we checked the status of the “dcism-netmon-watchdog” service on that particular VxRail node.

[root@ESXi03:~] /etc/init.d/dcism-netmon-watchdog status
iSM is active (not running)

I had recently seen the same service status on other VxRail nodes running code 7.0.x. Restarting the service didn’t start it, so I restarted the VxRail node. After the restart it can take some minutes before the service is running again. I checked the service again.

[root@ESXi03:~] /etc/init.d/dcism-netmon-watchdog status
iSM is active (running)
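If you want to script this check, for example to wait until the service is running again after a reboot, the status text can be evaluated as in the sketch below. It only looks at the two status lines shown above; how the status text is fetched from the host (the get_status callable) is left open and is an assumption of this example.

```python
import time


def ism_is_running(status_output):
    """True only when the status line ends with '(running)'.
    'iSM is active (not running)' therefore returns False."""
    return status_output.strip().lower().endswith("(running)")


def wait_until_running(get_status, attempts=10, delay_seconds=30,
                       sleep=time.sleep):
    """Poll get_status() until the service reports running.

    get_status is a hypothetical callable that returns the output of
    '/etc/init.d/dcism-netmon-watchdog status' as a string.
    """
    for attempt in range(attempts):
        if ism_is_running(get_status()):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


print(ism_is_running("iSM is active (not running)"))
print(ism_is_running("iSM is active (running)"))
```

The check deliberately matches the full "(running)" suffix, because a plain search for the word "running" would also match "(not running)".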

Finally we retried the VxRail code extraction. Both the code extraction and the VxRail upgrade were successful.

vSAN detected an unrecoverable medium or checksum error

If there is a hardware issue that could cause problems within a vSAN cluster, you want to know as early as possible. Once you know this, you may have time to resolve the issue before business is compromised.

Cause:

I have seen the following error several times in the results of a VxRail VxVerify check, which is performed to identify issues in a VxRail cluster before an update.

Error:

++++++++++++++++++++++

2021-10-08 15:01:00.012 esxi01.vrmware.nl vcenter-server: vSAN detected an unrecoverable medium or checksum error for component AB1234 on disk group DG5678

++++++++++++++++++++++
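To pick the component and disk group out of such a line automatically, a regular expression is enough. The sketch below is based only on the single sample line above; other vSAN versions may word the message differently, so treat the pattern as an assumption.

```python
import re

# Pattern derived from the sample error line; the exact wording in other
# vSAN versions is not verified here.
ERROR_RE = re.compile(
    r"vSAN detected an unrecoverable medium or checksum error "
    r"for component (?P<component>\S+) on disk group (?P<disk_group>\S+)"
)


def parse_medium_error(line):
    """Return (component, disk_group), or None if the line does not match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    return match.group("component"), match.group("disk_group")


sample = ("2021-10-08 15:01:00.012 esxi01.vrmware.nl vcenter-server: "
          "vSAN detected an unrecoverable medium or checksum error "
          "for component AB1234 on disk group DG5678")

print(parse_medium_error(sample))
```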

An underlying hardware device (physical disk) could be causing this error. This is why you want to be informed as early as possible about an error that can cause a vSAN issue in the near future. It allows you to proactively carry out repair work without any downtime to business operations.

Resolution:

How do you find out which physical disk the component resides on? You need to identify the following information (first three bullets). The fourth bullet is the VM that may be affected by the issue.

  • VMware Host
  • Diskgroup
  • Disk
  • Virtual machine the component belongs to

Let’s start to identify the disk where the component resides:

  1. Write down the component and disk group from the error.
  2. SSH to an arbitrary ESXi host in the vSAN cluster. It doesn’t matter which host you choose. Type the following command:
    esxcli vsan debug object list --all > /tmp/objectlist.txt
  3. Transfer /tmp/objectlist.txt to your local PC.
  4. Open objectlist.txt and search for component AB1234.

Snippet from objectlist.txt:
++++++++++++++++++++++

Configuration:      

RAID_5

Component: AB1234

Component State: ACTIVE,  Address Space(B): 39369834496 (36.67GB),  Disk UUID: 52ec6170-5298-7f14-1069-d0d3872b742a,  Disk Name: naa.PD9012:1

Votes: 1,  Capacity Used(B): 39753613312 (37.02GB),  Physical Capacity Used(B): 39359348736 (36.66GB),  Host Name: esxi03.vrmware.nl

Type: vdisk

Path: /vmfs/volumes/vsan:1234567890/vm01.vmdk (Exists)

++++++++++++++++++++++
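Searching objectlist.txt by hand works, but for a large cluster a few lines of Python can do the lookup. The sketch below assumes the file looks like the snippet above ("Key: value" pairs separated by commas) and simply reads every line between the requested "Component:" line and the next one. In a full listing with several components per object, the Type and Path lines may sit further down and would need extra handling.

```python
import re


def find_component(text, component_id):
    """Return the key/value fields that follow 'Component: <id>'."""
    fields = {}
    in_component = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Component: "):
            if in_component:
                break  # the next component starts here
            in_component = line == "Component: " + component_id
            continue
        if not in_component or not line:
            continue
        # One line can hold several "Key: value" pairs separated by commas.
        for pair in re.split(r",\s+", line):
            key, sep, value = pair.partition(": ")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


# Shortened copy of the snippet above.
sample = """Configuration:
RAID_5
Component: AB1234
Component State: ACTIVE,  Address Space(B): 39369834496 (36.67GB),  Disk UUID: 52ec6170-5298-7f14-1069-d0d3872b742a,  Disk Name: naa.PD9012:1
Votes: 1,  Capacity Used(B): 39753613312 (37.02GB),  Physical Capacity Used(B): 39359348736 (36.66GB),  Host Name: esxi03.vrmware.nl
Type: vdisk
Path: /vmfs/volumes/vsan:1234567890/vm01.vmdk (Exists)
"""

info = find_component(sample, "AB1234")
print(info["Host Name"], info["Disk Name"].split(":")[0], info["Path"])
```

In practice you would pass the contents of /tmp/objectlist.txt as text instead of the sample string.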

Almost all the information you need to identify the disk is here: VMware host, disk group and VM. To identify the possibly affected disk, switch to the vCenter GUI.

Go to Cluster > Host (esxi03.vrmware.nl) > Monitor > Performance > Disks > Diskgroup (DG5678) > Whole Group (pull-down). Here you find the disk naa.PD9012.

Conclusion:

Now you know that component AB1234 resides on disk naa.PD9012 in diskgroup DG5678 and the component belongs to vm01.vmdk.

I advise always contacting VMware GS for support in any production environment, or Dell Support in the case of a VxRail cluster. They will provide further support and help you fix this error.

Hopefully this helps you.

VMware vCLS datastore selection part 2

Last year I wrote a blog post about the VMware vCLS datastore selection. It is one of the most-read articles on my website, which indicates that there is a need to be able to choose the datastore on which the vCLS VMs are placed.

Today VMware announced vSphere 7.0 Update 3. This update also improves the vCLS datastore selection: it’s now possible to choose the datastore on which the vCLS VMs should be located.

In the following video on the VMware vSphere YouTube channel, skip to the 20-minute mark to learn more about the vCLS datastore selection improvement.

Another improvement is that the vCLS VMs now have a unique identifier. This is useful when you have multiple clusters managed by the same vCenter.

It’s always good to see that a vendor is listening to the customers’ needs to further improve a product.

Update vCenter vCSA 7.0 Update 2 failed

Last night I spent several hours updating the vCenter in my lab from vCSA 7.0 Update 1d to vCSA 7.0 Update 2. The update kept going wrong. I staged the update package first. After staging was completed I started the installation. This resulted in the following error: Exception occurred in postInstallHook.

So I tried Resume.

This seems so far so good. Continue.

Continue the Installation.

Same error again! Let’s retry Resume and Cancel in the next step.

Cancel.

Now I was stuck, so I quit to get some sleep :-).

This morning I woke up and received an e-mail from William Lam saying that he has written a new blog post about an error during the upgrade to vCenter vCSA 7.0 Update 2, “Exception occurred in install precheck phase”. This is a different error than the one I experienced yesterday, but I have also seen this error during one of my attempts.

Here is an overview of the errors during my attempts:

  • Exception occurred in postInstallHook
    This error appears after staging the update and installing it later.
  • Exception occurred in install precheck phase
    This error appears when staging and installing at the same time.

Now let’s try the workaround from William Lam that should result in a working vCenter vCSA 7.0 Update 2.

  • Create a snapshot of the vCSA
  • Stage the update file
  • SSH to the vCSA
  • Move to folder /etc/applmgmt/appliance/
  • Remove the file software_update_state.conf
  • Move to folder /usr/lib/applmgmt/support/scripts
  • Run script ./software-packages.py install --url --acceptEulas

The update ended with a PostgreSQL error and vCenter was not working after the update. I rebooted the appliance one more time without any result.

Conclusion:

vCenter vCSA 7.0 Update 2 is in my opinion not ready for deployment at this moment. I rolled back to the snapshot and decided to wait for an updated version of vCenter vCSA 7.0 Update 2.

I will update this blog post later.

Introducing VMware vSAN 7.0 U2

Today Duncan Epping posted this video “Introducing VMware vSAN 7.0 U2”.

Since the introduction, I am a fan of VMware vSAN Native File Services. With the introduction of vSAN 7.0 Update 2, vSAN Native File Services is also available for stretched vSAN clusters. How cool is that!

vSphere 7.0 Update 2 is already available for download.

Complete vSAN 7.0 Update 2 release notes here.

You can find the vSphere 7.0 Update 2 release notes here.

Source: Yellow-bricks.com

VM Summary Customize View

Recently I updated the vCenter appliance in my lab to version vCSA 7.0 update 1d. After updating I was clicking a bit through the environment. By coincidence I saw the following button when I opened the summary page of a VM.

Curious as I am, I clicked the button. But first the regular view below. This view, which everyone knows, is now called the classic view.

After clicking the “Switch To New View” button, a customizable view appears.

What immediately stands out is the fresh widget view. It’s a small change, but I’m a fan of it right away. I have been wondering ever since when this view was introduced. I searched the VMware documentation but could not find it. It is certainly not available in versions prior to vCenter vCSA 7. Maybe it has been available for a while and I just haven’t noticed it before.

If you still prefer the classic view, you can just as easily switch back to your old trusted view.

You can easily adjust what you want to see and what not. If you know when this customize view was introduced, please leave a comment.

Early notification of vmhba resets in a vSAN cluster

In the past years, I have experienced some vSAN performance issues due to faulty hardware. The goal was to know at an early stage whether there are hardware errors that can lead to performance degradation.

One problem I’ve seen a few times is hardware-related issues that lead to high latency, outstanding I/Os and congestion at the backend storage. I wondered whether it is possible to spot these kinds of issues earlier, so I started searching in vRealize Log Insight.

During my research I found some events after the fact. In the period prior to the performance issues, many “Power-on Reset on vmhba” messages had been written to vobd.log and vmkernel.log. At first it was a few events per day, but as time passed the events occurred more and more frequently and finally led to very poor vSAN performance.

In the following steps I will explain how to define an email alert in vRealize Log Insight that helps to detect this kind of issue at an early stage, so that you can take early action to avoid potential problems.

Step 1. Create a query that searches for “Power-on Reset occurred on vmhba” events

Step 2. Create an alert from the query

Step 3. Define the alert
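The idea behind the alert, counting the reset events and reacting once they become frequent, can be sketched in a few lines of Python. This is not how vRLI evaluates an alert; it only illustrates the logic. The sample log lines are invented for this example, and the assumption that every line starts with a YYYY-MM-DD date should be checked against your own logs.

```python
from collections import Counter

RESET_MARKER = "Power-on Reset occurred on vmhba"


def resets_per_day(log_lines):
    """Count matching lines per day, assuming each line starts with
    a YYYY-MM-DD date."""
    counts = Counter()
    for line in log_lines:
        if RESET_MARKER in line:
            counts[line[:10]] += 1
    return counts


def days_over_threshold(counts, threshold):
    """Return the days on which the number of resets reached the threshold."""
    return sorted(day for day, count in counts.items() if count >= threshold)


# Invented sample lines, only to show the counting.
sample_lines = [
    "2021-10-06T08:00:00Z Power-on Reset occurred on vmhba0",
    "2021-10-07T09:00:00Z Power-on Reset occurred on vmhba0",
    "2021-10-07T09:05:00Z some other message",
    "2021-10-07T11:30:00Z Power-on Reset occurred on vmhba0",
    "2021-10-07T17:45:00Z Power-on Reset occurred on vmhba0",
]

counts = resets_per_day(sample_lines)
print(days_over_threshold(counts, threshold=3))
```

Which threshold makes sense depends on your environment; in my case a few events per day were already the first sign of trouble.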