UPDATE: Aria Operations for Logs persistent static route on second network interface on behalf of Log Forwarding

There is an update on the blog post “vRealize Log Insight persistent static route on second network interface on behalf of Log Forwarding”. Beginning with version Aria Operations for Logs 8.14, the VAME_CONF_NET is no longer available, it’s removed in this version.

The information in the orignal blog post can still be used. Only now you need to manually create and predefine the 10-eth1.network configuration. See example at the bottom of the original post.

There are two important things to consider:

  1. Keep in mind that multic-nic configurations in Aria Ops for Logs is not officially supported
  2. After every update the 2nd interface must be recreated and configured again

Orginal post:

Recently I wanted to test whether it is possible to configure vRealize Log Insight (vRLI) log forwarding to a second network interface to reach a log target in another network segment that could not be reached from the default vRLI appliance ip address.

The first step is adding a second network interface to the appliance. In this example we use the following network configuration.

  1. VMnic1
    Vlan10, IP 10.1.1.10, Subnetmask 255.255.255.0, Gateway 10.1.1.1
  2. VMnic2
    Vlan20, IP 20.2.2.20, Subnetmask 255.255.255.0, Gateway 20.2.2.1
  3. In this example the log forwarding target ip address is 30.3.3.233

To configure the second network interface open a SSH session to vRLI appliance. Move to /opt/vmware/share/vami/ and run the network configuration script. vami_config_net. Eth1 is now also available for configuration. First select ‘0’ for a configuration overview. In the results is an error on eth1 displayed. This error keeps us from being able to configure eth1.

After some ‘Trial & Error’ research I noticed the following error during reconfiguring eth1 “can’t open /etc/systemd/network/10-eth1.network

In the directory /etc/systemd/network is the file “10-eth1.network” not present. The name of the file could be different then here in this example. It depends on the number of network interfaces. I fixed this issue by creating this file manual.

  1. touch /etc/systemd/network/10-eth1.network
  2. chmod 644 /etc/systemd/network/10-eth1.network
  3. Config the second network interface. Go to the directory /opt/vmware/share/vami/ and run the network configution script. vami_config_net. Eth1 is now also available for configuation.
  4. Check the new configuration by selecting option 0. If Ok press 1 to exit
  5. Restart the network, systemctl restart systemd-networkd.service

Now this issue is fixed we can move on to configure the persistant static route for vRLI log forwarding.

Edit /etc/systemd/network/10-eth1.network

The file shlould look like this before editing:

[Match]
Name=eth1
[Network]

Gateway=10.1.1.1
Address=20.2.2.20/24
DHCP=no

[DHCP]
UseDNS=false

Now add route information at the end of the file:

[Match]
Name=eth1
[Network]

Gateway=10.1.1.1
Address=20.2.2.20/24
DHCP=no

[DHCP]
UseDNS=false
[Route]
Gateway=20.2.2.1
Destination=30.3.3.233/24

Save the file and restart the network service.

systemctl restart systemd-networkd.service

Check if the new route is present.

route -n

Test if you reach the destionation from CLI. I used Syslog over UDP port 514.

nc -v 30.3.3.233 514

Answer if configuration is working:

[30.3.3.233 514] open

The last step is to configure the vRLI Log Forwarding Destination.

Send a test event and check if the event is received by the target.

Keep in mind that multic-nic configurations in vRLI are not officially supported.

Also credits to this blog post that pushed me in the right direction.

PowerCLI script to get Syslog.Global.Host advanced setting

The following script may be useful if you are in the process of migrating vRealize Log Insight to a new appliance/cluster. You can use this script before, during and after migrating to check the settings of Syslog.Global.Loghost of all ESXi hosts in vCenter.

# Connect to vCenter Server
Connect-VIServer <vCenterServer>

# Get all ESXi hosts in the cluster
$hosts = get-vmhost

# Loop through each ESXi host and get the syslog.global.loghost advanced setting
foreach ($esxi in $hosts) {
    $setting = Get-AdvancedSetting -Entity $esxi -Name 'syslog.global.loghost'
    Write-Host "$($esxi.Name): $($setting.Value)"
}

# Disconnect from vCenter Server
Disconnect-VIServer <vCenterServer> -Confirm:$false

For example: You can use the script during vRealize Log Insight in the following way:

  • Before migration
    Check the current configured syslog endpoint
  • During migration
    Check the current and new syslog endpoints are configured
  • After migration
    Check the new configured syslog endpoint

vRealize Log Insight Admin Alert: SSL Certificate Error (Host =vrli.vrmware.nl)

This time a short post about a vRealize Log Insight (vrLI) configuration issue that took too long to solve. In the end, the solution is simple after I found the documentation. Finding the right documentation was the hardest part.

Just briefly the reason of this setup. I want ESXi hosts to use Syslog over SSL to send logging encrypted to vRLI.

While adding the vCenter I configured the hosts to use SSL.

After configuring, everything seemed to work fine, until I got a vRLI Admin mail with the following alert:

This alert is about your Log Insight installation on https://vrli.vrmware.nl/

SSL Certificate Error (Host = vrli.vrmware.nl) triggered at 2023-04-16T09:23:53.412Z

This notification was generated from Log Insight node (Host = vrli.vrmware.nl, Node Identifier = de568ad3-d4e3-7f8a-b543-cef17632af11).

Syslog client esx01.vrmware.nl disconnected due to a SSL handshake problem. This may be a problem with the SSL Certificate or with the Network Time Service. In order for Log Insight to accept syslog messages over SSL, a certificate that is validated by the client is required and the clocks of the systems must be in sync.

Log messages from esx01.vrmware.nl are not being accepted, reconfigure that system to not use SSL or see Online Help for instructions on how to install a new SSL certificate .

This message was generated by your Log Insight installation, visit the Documentation Center for more information.

Time couldn’t be the issue in my case. So it had to be a certificate issue. The issue was caused by the vRLI certificate that wasn’t in the ESXi host truststore.

Per ESXi host, the following steps should be taken to solve the issue. Step 3 is only a verification step.

  1. openssl s_client -connect [FDQN or IP vRLI]:1514 < /dev/null | openssl x509 -outform PEM >> /etc/vmware/ssl/castore.pem
    Example: openssl s_client -connect vrli.vmware.nl:1514 < /dev/null | openssl x509 -outform PEM >> /etc/vmware/ssl/castore.pem
  2. esxcli system syslog reload
  3. esxcli network ip connection list | grep 1514

If ESXi hosts have the vRLI certificate in their truststore, the vRLI Admin mail (1x per day per vRLI node) should no longer occur.

Here is the link to the VMware documentation. This documentation is actually for vRLI Cloud which is a different product than standard vRLI although they overlap in some areas. The documentation for vRLI will be updated according to VMware GSS.

So this is probably why the vRLI documentation on this topic was so hard to find. Hopefully this blog post will save you a lot of time.

vRealize Log Insight failed to start Journal Service

Recently, I started an ssh session on a vRealize Log Insight (vRLI) appliance. After entering name and password a message was prompted to change the password immediately. After I did this, the ssh session was dropped and I had to log in again. The password appeared to be unchanged. I repeated this step several times and then decided to restart the vRLI appliance. Hoping this would fix the problem.

After waiting for some time, the appliance does not appear to come online and appears to be in a loop during startup. The screenshot below shows that the Journal Service does not want to start.

I have seen similar issue more often and I suspect a partition has filled up. I couldn’t log in because the appliance won’t start. In the following action plan, I explain step by step how I solved this problem.

Action Plan:

1, Take a snapshot
2. Edit the VM options in the next step before restarting the vm
3. Change Boot Options, Change Boot Delay in 1000 milliseconds and Force Bios Setup

4. Restart the VM
5. Bios Exit Discarding Changes

6. Press ‘E’ direct during the start
7. Type “rw init=/bin/bash” at the end of the line and press “F10”

8. Type “df -h”. In our case we see that the root partition is full

9. Let’s clean up some old audit and authentication logs

10. Let’s check the root partition again. Type “df -h”.

11. Reboot the appliance and we’re done.

After cleaning up the root partition, the appliance started normally. I was also able to change the root password without any problems. Don’t forget change the Boot Options, Change Boot Delay back in 0 milliseconds and remove the snapshot if everything is working fine and you’re happy with this solution.

You can always open a VMware SR if you need help. 🙂

vRealize Log Insight 8.10 bad performance

For a customer, I deployed a new vRealize Log Insight 8.10 cluster as a test. After initial installation and configuration of the first vRLI node I experienced a performance that was particularly bad. It wasn’t always this bad, but it frequently occurred.

The bad performance initially revealed itself when choosing one of the options in the menu on the left side of the screen. It seemed like the whole screen was frozen. After 10 till 30 seconds the requested information appeared.

I decided to first complete the cluster setup. After adding some vCenters and importing Content Packs, technically everything worked as designed. The bad performance was still there even after temporarily adding additional memory to the nodes.

During one of the moments that the performance was decreased I saw something in the lower left-hand corner. There were the following messages displayed:

  • Read <vRLI FQDN>
  • Transferring data from <vRLI FQDN>
  • Connecting to cdn.pendo.io

The last bullet was the key to resolve this issue. I searched some information about “cdn.pendo.io” and found this article about “Join or Leave the VMware Customer Experience Improvement Program“. At the bottom of the article is a section “What to do next

After CEIP is enabled, when a user logs in to vRealize Log Insight, they see a banner at the top of their window that asks whether they want Pendo to collect data based on their interaction with the user interface.

  • If the user clicks Accept, Pendo collects their data and sends it to VMware
  • If the user clicks Decline, Pendo does not collect their data

In the General Configuration section of the vRLI configuration I deselected the Usage Reporting (Join the VMware Customer Experience Improvement Program). This resolved the issue for me.

Finally, in this case there was no need to join VMware CEIP. So, therefore it is an acceptable solution for me.

Get notificated when vSAN Force Provisioning is enabled and applied to a vm

Recently I was asked if it is possible to receive an email notification if a vSAN storage policy with Force Provisioning enabled is applied to a vm. In this blogpost I want to show that this is possible.

Use Case – An administrator wants apply an vSAN storage policy with Force Provisioning enabled to virtual machines beacause of a possible shortage of vSAN storage capacity. In my opinion not a very good idea in a production environment!

Goal – Get an email notification when a vSAN storage policy with Force Provisioning enabled is applied to an vm. There is also the wish to make this visable in a dashboard.

Solution – With a bit of reverse engineering and vRealize Log Insight (vRLI) it’s possible to achieve this.

Setup lab – VMware vSphere 7.0 Update 3a vSAN cluster and vRLI 8.6.

The first step is create a storage policy with Force Provisioning enabled. We name this policy “FP VM Storage Policy”

We need to apply the new storage policy “FP VM Storage Policy” to our test vm “sbpm01”

The policy is successful applied to the vm “sbpm01”.

Now we need some reverse engineering because it’s not possible to grep the name of the storage policy in vRLI. We move to the sbpm01 events in vCenter.

Note the following two details:

  1. Event Type ID: com.vmware.pbm.profile.associate
  2. Associated storage policy: 98df0443-5244-49af-9069-ad9fdbfedb52

This is the information we need to created a new filter in vRLI Interactive Analytics.

Add the associated storage policy id “98df0443-5244-49af-9069-ad9fdbfedb52″ to the text field and add an extra filter (+ ADD FILTER). Choose from the pull down menu “vc_event_type” contains com.vmware.pbm.profile.associate. Choose a time window. In our example I choose “Latest 24 hours of data”. You can also choose here the last hour or last 5 minutes of data. It depends on the time you applied the policy and when you search in vRLI.

In de results above you don’t see the name of the vm with the applied storage policy. In the last image above here you see on the right of Events the Field Table section. Select Field Table. Search for the row with name vc_vm_name. Below here is the vm friendly name displayed of the vm with new applied storage policy.

Finally you want a email notification and a dashboard. I am not going to explain here how to create an email notification and a dashboard. This can be done in vRLI at the same way you normally create notifications and dashboards. Press the icon (1) to create a email notification and press the icon (2) to create a dashboard.

If you want receive an email notification if the vm get another storage policy applied. Then you should create another filter including the following two details.

  1. Event Type ID: com.vmware.pbm.profile.dissociated
  2. Associated storage policy: 98df0443-5244-49af-9069-ad9fdbfedb52

Conclusion:

In this blog post I wanted to demonstrate that it is possible with vRLI to receive an email notification if a vm has a storage policy applied where Force Provisioning is enabled. A disadvantage is that if the vm gets a different storage policy with different settings, this email notification is no longer valid, because the notifications are based on this specific storage policy id.

I have shown that it is works but as far as I believe it is not a solution for a production environment.

vSAN detected an unrecoverable medium or checksum error

If there is a hardware issue that could cause problems within a vSAN cluster, you want to know as early as possible. Once you know this, you may have time to resolve the issue before business is compromised.

Cause:

I have seen the following error several times in the results of a VxRail VxVerify check, which is performed to identify issues in a VxRail cluster before an update.

Error:

++++++++++++++++++++++

2021-10-08 15:01:00.012 esxi01.vrmware.nl vcenter-server: vSAN detected an unrecoverable medium or checksum error for component AB1234 on disk group DG5678

++++++++++++++++++++++

It could be possible that an underlying hardware device (physical disk) is causing this error. This is why you want to be informed as early as possible if there is an error that can cause an vSAN issue in the near future. This allows you to proactively carry out repair work, without any downtime to business operations.

Resolution:

How do you find out on which physical disk the component resides on? You need to identify the following information (first 3 bullets). The 4th bullet is about the vm which can be possible affected by the issue.

  • VMware Host
  • Diskgroup
  • Disk
  • Virtual Machine where the component belongs to

Let’s start to identify the disk where the component resides:

  1. Write down the component and diskgroup from the error
  2. Ssh to an arbitrary ESXI server in the vSAN cluster. It doesn’t matter what server you choose. Type the following command:
    esxcli vsan debug object list –all > /tmp/objectlist.txt
  3. Transfer /tmp/objectlist.txt to local pc
  4. Open objectlist.txt and search for component AB1234.

Snippet from objectlist.txt:
++++++++++++++++++++++

Configuration:      

RAID_5

Component: AB1234

Component State: ACTIVE,  Address Space(B): 39369834496 (36.67GB),  Disk UUID: 52ec6170-5298-7f14-1069-d0d3872b742a,  Disk Name: naa.PD9012:1

Votes: 1,  Capacity Used(B): 39753613312 (37.02GB),  Physical Capacity Used(B): 39359348736 (36.66GB),  Host Name: esxi03.vrmware.nl

Type: vdisk

Path: /vmfs/volumes/vsan:1234567890/vm01.vmdk (Exists)

++++++++++++++++++++++

All the info you need to identify the disk is almost all here, VMware Host, Diskgroup and VM. To indentify the possible affected disk you need to switch to vCenter gui.

Move to Cluster > Host (esxi03.vrmware.local) > Monitor > Performance > Disks > Diskgroup (DG5678) > Whole Group (pull down). Here do you find the disk naa.PD9012

Conclusion:

Now you know that component AB1234 resides on disk naa.PD9012 in diskgroup DG5678 and the component belongs to vm01.vmdk.

I would advise always contact VMware GS for support in any production environment or Dell Support in case of a VxRail cluster. They will provide further support and help you to fix this error.

Hopefully this helps you.