Troubleshoot the Splunk Add-on for VMware
Data collection issues
Gaps in data collection
Gaps in data collection or slow data collection (for example, data arriving only every 20 minutes or more) sometimes require a restart of your scheduler. Any update to ta_vmware_collections.conf requires a restart of the scheduler to take effect. Collection configurations made through the UI do not require a restart of the scheduler.
vCenter connectivity issues
The Splunk Add-on for VMware cannot make read-only API calls to vCenter Server systems
If the add-on cannot make read-only API calls, you do not have the appropriate vCenter Server service login credentials for each vCenter Server. Obtain vCenter Server service login credentials for each vCenter Server.
The Splunk Add-on for VMware is not receiving data
If you have configured vCenter Server 5.0 but no data is coming in, note that vCenter Server 5.0 and 5.1 are missing WSDL files that the Splunk Add-on for VMware requires to make API calls to vCenter Server.
Resolve this issue by installing the missing VMware WSDL files as documented in the vSphere Web Services SDK WSDL workaround in the VMware documentation. Note that the programdata folder is typically a hidden folder.
The DCNs are forwarding data using index=_internal tests, but Splunk App for VMware is not collecting any API data
API data collection issues typically have one of two causes:
- Network connectivity issues from the Scheduler to the DCNs.
- You have not changed the DCN admin account password from its default value.
To resolve this issue:
- On the Splunk Add-on for VMware Collection Configuration page, verify that the settings are accurate.
- Verify that the admin password for each DCN is not set to its default value.
- Verify that each DCN has a fixed IP address. If Splunk App for VMware uses DCN host names instead of fixed IP addresses, verify that DNS lookups resolve to the correct IP addresses.
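The DNS check in the last step can be scripted. The following is a minimal sketch, not part of the add-on: the function names are hypothetical, and you supply the DCN host names and the addresses you expect. The resolver is passed in as a parameter so that the logic can be exercised without a live DNS server.

```python
import socket


def resolve_dcn_hosts(hosts, resolver=socket.gethostbyname):
    """Map each DCN host name to its resolved IPv4 address, or None on failure."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            results[host] = None
    return results


def find_mismatches(resolved, expected):
    """Return the hosts whose resolved address differs from the expected one."""
    return sorted(host for host, ip in expected.items() if resolved.get(host) != ip)
```

To use it, build a dictionary of expected addresses (for example, a hypothetical `{"dcn1.example.com": "10.0.0.11"}`), call `resolve_dcn_hosts` on its keys from the scheduler machine, and pass both dictionaries to `find_mismatches`. Any host it returns either failed to resolve or resolved to an unexpected address.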
Hydra scheduler proxy access error
If you attempt to use a proxy server to connect to Splunk Web, you might receive the following proxy error message:
URLError: <urlopen error Tunnel connection failed: 403 Proxy Error>
You will also see the following error message in your log files:
The hydra scheduler checks Splunk Web's proxy settings and tries to connect to a data collection node (DCN) through the proxy server. You cannot install a scheduler if you use a proxy server for Splunk Web.
Fix this problem by deploying and setting up your Splunk Enterprise instance inside the same network as your data collection nodes without the use of a proxy server.
Permissions in vSphere
Splunk Add-on for VMware must use valid vCenter Server service credentials to gain read-only access to vCenter Server systems using API calls. The account's vSphere role determines access privileges.
The following sections list the permissions for the vCenter server roles for all of the VMware versions that Splunk App for VMware supports.
Permissions to use your own syslog server
As a best practice, use your own syslog server and install a Splunk Enterprise forwarder on that server to forward syslog data. Use these permissions to collect data from the ESXi hosts using your own syslog server. These system-defined privileges are always present for user-defined roles.
Permissions to use an intermediate forwarder
Use these permissions if you configure your ESXi hosts to forward syslog data to one or more intermediate Splunk Enterprise forwarders. Use the vSphere client to enable the syslog firewall for the specific hosts. Note that in vSphere 5.x you do not need to add permissions beyond the default ones vSphere provides when creating a role.
Splunk Add-on for VMware sets SSL for WebUI as default
Disable WebUI SSL in the Splunk Add-on for VMware to prevent web.conf from overriding your deployment's SSL settings.
Go to $SPLUNK_HOME/etc/system/local/ and make the following change to web.conf:
[settings]
enableSplunkWebSSL = false
esxilogs are not received when forwarding to indexers in a cluster
You do not receive esxilogs, or you see the following ERROR message in splunkd.log on the indexers:
ERROR AggregatorMiningProcessor - Uncaught Exception in Aggregator, skipping an event: Can't open DateParser XML configuration file "/opt/splunk/etc/apps/Splunk_TA_esxilogs/default/syslog_datetime.xml": No such file or directory - data_source="/data/log_files/syslog/<hostname>.log", data_host="<hostname>", data_sourcetype="vmw-syslog"
When esxilogs are forwarded directly to indexers that are in a cluster, splunkd.log on the indexers shows the error above.
This error occurs because Splunk cannot find the custom datetime file (syslog_datetime.xml) that it uses to extract dates and timestamps from events.
The following parameter in props.conf sets the location of this file:
DATETIME_CONFIG = /etc/apps/Splunk_TA_esxilogs/default/syslog_datetime.xml
Because the indexers are in a cluster, Splunk_TA_esxilogs is installed on the indexers under slave-apps (/etc/slave-apps/), so the path above does not exist.
To resolve this issue:
- On the cluster master, create a local directory in the $SPLUNK_HOME/etc/master-apps/Splunk_TA_esxilogs directory if it is not present.
- If it is not present, create a props.conf file in the $SPLUNK_HOME/etc/master-apps/Splunk_TA_esxilogs/local directory and add the following stanza and configuration to it:
[vmw-syslog]
DATETIME_CONFIG = /etc/slave-apps/Splunk_TA_esxilogs/default/syslog_datetime.xml
- Push the bundle to the indexers.
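Before and after pushing the bundle, you can confirm on an indexer whether a given DATETIME_CONFIG value points at a file that exists. The following is a minimal sketch with a hypothetical helper name. It relies on DATETIME_CONFIG being interpreted relative to $SPLUNK_HOME, and you pass in the Splunk home directory yourself.

```python
import os


def datetime_config_exists(splunk_home, datetime_config):
    """Return True if the DATETIME_CONFIG file exists under splunk_home.

    The setting is treated as a path relative to $SPLUNK_HOME, so the
    leading slash is stripped before joining.
    """
    relative = datetime_config.lstrip("/")
    return os.path.isfile(os.path.join(splunk_home, relative))
```

On a clustered indexer, you would expect this to return False for the /etc/apps/ path and True for the /etc/slave-apps/ path once the bundle has been applied.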
Inventory data fields are not getting extracted using spath command
The Splunk Add-on for VMware collects the VMware infrastructure inventory data. Inventory data can contain JSON content that exceeds the default spath command character limit of 5000 characters.
If you're using the spath command to extract inventory data and the event contains more than 5000 characters, see Update the default character count limitations for the search commands.
Enable cluster DRS service error: lookup table "TimeClusterServicesAvailability" is empty on some dashboards
Use the following troubleshooting steps to enable the cluster DRS service if you see the error Lookup table "TimeClusterServicesAvailability" is empty on the following cluster compute resource dashboards:
- Capacity Planning for Clusters-CPU Headroom
- Capacity Planning for Clusters-Memory Headroom
- Capacity Planning (Clusters)
- Cluster details
If you do not want to enable cluster DRS service, ignore the error.
The add-on is not able to get the following required metrics, so the TimeClusterServicesAvailability lookup is empty:
Enable the cluster DRS service on the configured vCenter to get the required metrics:
- Log in to the configured vCenter using the vSphere client.
- Navigate to Home > Inventory > Hosts and Clusters.
- Right-click the cluster.
- Open the settings for the cluster.
- Go to Cluster features and click Turn on vSphere DRS.
Troubleshoot the error "ValueError: unsupported pickle protocol: 3" in hydra worker logs
The Splunk Add-on for VMware is unable to run the hydra worker script, and the following error appears in the hydra worker logs:
ERROR [ta_vmware_collection_worker://worker_process2:28696] Problem with hydra worker ta_vmware_collection_worker://worker_process2:28696: unsupported pickle protocol: 3
Traceback (most recent call last):
  File "/home/splunker/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_worker.py", line 622, in run
    self.establishMetadata()
  File "/home/splunker/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_worker.py", line 64, in establishMetadata
    metadata_stanza = HydraMetadataStanza.from_name("metadata", self.app, "nobody")
  File "/home/splunker/splunk/etc/apps/SA-Hydra/bin/hydra/models.py", line 610, in from_name
    host_path=host_path)
  File "/home/splunker/splunk/lib/python2.7/site-packages/splunk/models/base.py", line 533, in get
    return self._from_entity(entity)
  File "/home/splunker/splunk/etc/apps/SA-Hydra/bin/hydra/models.py", line 345, in _from_entity
    obj.from_entity(entity)
  File "/home/splunker/splunk/lib/python2.7/site-packages/splunk/models/base.py", line 903, in from_entity
    super(SplunkAppObjModel, self).from_entity(entity)
  File "/home/splunker/splunk/lib/python2.7/site-packages/splunk/models/base.py", line 661, in from_entity
    return self.set_entity_fields(entity)
  File "/home/splunker/splunk/etc/apps/SA-Hydra/bin/hydra/models.py", line 544, in set_entity_fields
    from_api_val = wildcard_field.field_class.from_apidata(entity, entity_attr)
  File "/home/splunker/splunk/etc/apps/SA-Hydra/bin/hydra/models.py", line 123, in from_apidata
    obj = cPickle.loads(b64decode(val))
ValueError: unsupported pickle protocol: 3
The add-on cannot deserialize a Python object that was serialized by a different Python version than the one the add-on is currently running on. This usually happens when an add-on that was running on Python 3 is now running on Python 2. Python 2 cannot deserialize an object that Python 3 serialized with pickle protocol 3.
To resolve this issue:
- Stop the scheduler from the Collection Configuration page.
- Stop Splunk on the DCN.
- On the DCN, go to $SPLUNK_HOME/etc/apps/Splunk_TA_vmware/local and remove the following files:
- Start Splunk on the DCN.
- Start the scheduler from the Collection Configuration page.
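To see why Python 2 rejects these values, it can help to inspect the protocol number that a stored pickle declares. The following is a minimal sketch, not code from SA-Hydra: the function names are hypothetical, and it assumes only what the traceback shows, namely that the stored value is a base64-encoded pickle. Run it with Python 3.

```python
import pickle
from base64 import b64decode, b64encode


def pickle_protocol(b64_value):
    """Return the protocol number declared by a base64-encoded pickle.

    Pickles written with protocol 2 or higher start with the PROTO opcode
    (byte 0x80) followed by the protocol number. Older protocols have no
    such header, so None is returned for them.
    """
    raw = b64decode(b64_value)
    if len(raw) >= 2 and raw[0] == 0x80:
        return raw[1]
    return None


def readable_by_python2(b64_value):
    """Python 2 understands pickle protocols 0 through 2 only."""
    protocol = pickle_protocol(b64_value)
    return protocol is None or protocol <= 2
```

A value produced by `b64encode(pickle.dumps(obj, protocol=3))` reports protocol 3 and is not readable by Python 2, which matches the ValueError in the log above.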
Troubleshoot the error "ImportError: bad magic number in 'uuid': b'\x03\xf3\r\n'" in hydra scheduler logs
The Splunk Add-on for VMware is unable to run the hydra scheduler script, so no jobs are assigned. The following error appears in the hydra scheduler logs:
Traceback (most recent call last):
  File "ta_vmware_collection_scheduler.py", line 20, in <module>
    from hydra.hydra_scheduler import HydraScheduler, HydraCollectionManifest, HydraConfigToken
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_scheduler.py", line 11, in <module>
    import uuid
ImportError: bad magic number in 'uuid': b'\x03\xf3\r\n'
This error occurs because the uuid.pyc file was compiled on Splunk 7.2.x, Splunk 7.3.x, or Splunk 8.x (Python version 2) and is being run on Splunk version 8.x (Python version 3).
To resolve this issue:
- Stop the scheduler from the Collection Configuration page.
- Stop Splunk on the scheduler and DCN machines.
- Remove all the .pyc files in the following directory on all the DCN machines and the scheduler.
- Start Splunk on the DCN machines.
- Start Splunk on the scheduler machine.
- Start the scheduler from the Collection Configuration page.
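Before removing files, you might want to confirm which .pyc files were compiled by a different Python version. The following is a minimal sketch with a hypothetical function name. It compares the first four bytes of each .pyc file with the magic number of the interpreter that runs the script, so run it with the same Python 3 interpreter that Splunk uses and pass in the directory you want to scan.

```python
import importlib.util
import os


def find_stale_pyc(root):
    """List .pyc files under root whose magic number does not match this interpreter."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".pyc"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                magic = handle.read(4)  # the magic number is the first 4 bytes
            if magic != importlib.util.MAGIC_NUMBER:
                stale.append(path)
    return sorted(stale)
```

The sketch only reports files and does not delete anything. A uuid.pyc file that starts with b'\x03\xf3\r\n', as in the error above, would appear in the returned list.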
Troubleshoot issue in cluster performance data collection caused by collection interval mismatch across configured vCenter
The add-on is unable to get cluster performance data. The following query doesn't return any results:
index="vmware-perf" source="VMPerf:ClusterComputeResource" | dedup sourcetype | table sourcetype
Also, you get the following error on the search head in
2020-04-23 16:12:27,883 ERROR [ta_vmware_collection_worker://worker_process20:19296] Server raised fault: 'A specified parameter was not correct: interval'
This error occurs when the collection interval is set to different values across the configured vCenters. For example, if the collection interval for VC1 is 5 minutes and the collection interval for VC2 is 3 minutes, the add-on might fetch cluster performance data for only one vCenter at a time.
This is because the add-on script caches the collection interval and uses it when fetching cluster performance data. If a vCenter has a different collection interval than this stored value, the DCN throws an error and isn't able to fetch cluster performance data.
Work around this error by setting the collection interval to the same value for all vCenters:
- Connect to the web client at https://<vcenter server ip/hostname>.
- Select vCenter Server.
- Select Configure > General > Statistic.
- Click Edit.
- Set the collection interval to the same value across your configured vCenters.
- Save the configuration.
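If you manage several vCenters, it can help to record the interval you see in each one and check the values for consistency before and after the change. The following is a minimal sketch with hypothetical function names. It does not query vCenter: you enter the intervals (in minutes) by hand.

```python
from collections import defaultdict


def group_by_interval(intervals):
    """Group vCenter names by their configured collection interval (minutes)."""
    groups = defaultdict(list)
    for vcenter, minutes in intervals.items():
        groups[minutes].append(vcenter)
    return {minutes: sorted(names) for minutes, names in groups.items()}


def intervals_consistent(intervals):
    """Return True when every vCenter uses the same collection interval."""
    return len(set(intervals.values())) <= 1
```

For the example above, `intervals_consistent({"VC1": 5, "VC2": 3})` returns False, and `group_by_interval` shows which vCenters share each value so that you know which ones to change.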
Virtual machine performance data is missing
The add-on is unable to get virtual machine performance data. The following query doesn't return any results:
index="vmware-perf" source="VMPerf:VirtualMachine" | dedup sourcetype | table sourcetype
And on the Scheduler machine, you see the following error message in splunkd.log:
03-30-2020 13:34:04.693 +0100 ERROR ExecProcessor - message from "python /opt/splunk/etc/apps/Splunk_TA_vmware/bin/ta_vmware_hierarchy_agent.py" splunk.AuthorizationFailed: [HTTP 403] Client is not authorized to perform requested action; https://127.0.0.1:8089/servicesNS/nobody/Splunk_TA_vmware/storage/passwords/
This error occurs when the admin user has been renamed and Splunk no longer has a user named "admin".
To collect virtual machine performance data, the ta_vmware_hierarchy_agent.py scripted input prepares the list of virtual machine MOIDs. If this list isn't created and shared with the data collection node (DCN), the DCN can't collect performance data for those virtual machines.
For this scripted input, the passAuth parameter is used to get the sessionKey for authentication. Its value is admin, which means that the "admin" user is required for authentication.
[script://$SPLUNK_HOME/etc/apps/Splunk_TA_vmware/bin/ta_vmware_hierarchy_agent.py]
passAuth = admin
There are two resolutions for this issue:
- On the scheduler machine, create a new user with the name "admin" and assign the "admin" and splunk_vmware_admin roles to that user.
- Change the passAuth attribute value to an existing user name on the scheduler machine:
- Add the passAuth = splunk-system-user parameter value to the following stanza:
[script://$SPLUNK_HOME/etc/apps/Splunk_TA_vmware/bin/ta_vmware_hierarchy_agent.py]
passAuth = splunk-system-user
- Restart Splunk.
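To confirm which user a scripted input stanza currently authenticates as, you can read the passAuth value from the configuration text. The following is a minimal sketch with a hypothetical function name. It uses Python's configparser, which handles simple stanzas like the one above but is not a full parser for Splunk .conf files, and it leaves the choice of which file to read to you.

```python
import configparser


def read_pass_auth(conf_text, stanza):
    """Return the passAuth value of a stanza in .conf-style text, or None."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep option names case-sensitive (passAuth)
    parser.read_string(conf_text)
    if not parser.has_section(stanza):
        return None
    return parser.get(stanza, "passAuth", fallback=None)
```

Pass the stanza name without the square brackets, for example `script://$SPLUNK_HOME/etc/apps/Splunk_TA_vmware/bin/ta_vmware_hierarchy_agent.py`. A result of admin on a system without an "admin" user indicates the condition described above.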
No data collection when DCN is configured with more than 8 worker processes on Splunk version 8.x
When there are more than 8 worker processes configured, the scheduler throws the following error and data is not collected.
2020-09-30 15:06:50,550 ERROR [ta_vmware_collection_scheduler_inframon://Global pool] [HydraWorkerNode] [establishGateway] could not connect to gateway=https://<DCN>:8008 for node=https://<DCN>:8089 due to a socket error, timeout, or other fundamental communication issue, marking node as dead
In the Splunk Add-on for VMware, the scheduler and the DCN communicate with each other through the hydra gateway server. When the add-on is installed on Splunk version 8.x and more than 8 worker processes are configured for the DCNs, the hydra gateway server takes longer to respond to requests. The scheduler isn't able to authenticate with the hydra gateway server, and no jobs are assigned to the DCNs.
On the scheduler machine, go to the Collection Configuration page and edit the configured DCNs to set the worker process count to 8 or fewer. If you need more worker processes, configure new DCN machines. See Prepare to deploy the DCN for the standard guidelines.
Error for unexpected keyword argument 'rewrite' on Scheduler
When Splunkd is restarted, the DCNs stop collecting data and the scheduler for the Splunk Add-on for VMware throws the following error:
2020-09-21 19:25:01,199 ERROR [ta_vmware_collection_scheduler://puff] Problem with hydra scheduler ta_vmware_collection_scheduler://puff: checkvCenterConnectivity() got an unexpected keyword argument 'rewrite'
Traceback (most recent call last):
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_scheduler.py", line 2126, in run
    self.checkvCenterConnectivity(rewrite=True)
TypeError: checkvCenterConnectivity() got an unexpected keyword argument 'rewrite'
In the add-on, the "checkvCenterConnectivity" function is defined to check the connectivity of the configured vCenter server every 30 minutes.
Because this function is defined in the Splunk_TA_vmware package and is called from the SA-Hydra scheduler module, it requires a supported SA-Hydra version installed with the Splunk_TA_vmware package on the scheduler instance.
Upgrade SA-Hydra or Splunk_TA_vmware to versions that are compatible with each other. Also, make sure that the scheduler, DCN, search head, and indexer have the same add-on version.
Here's the version compatibility matrix for Splunk_TA_vmware and supported SA-Hydra:
|Splunk_TA_vmware version|SA-Hydra version|
This documentation applies to the following versions of Splunk® Supported Add-ons: released