Hydra troubleshooting searches in the Splunk Add-on for VMware Metrics

Hydra troubleshooting searches are the search queries behind the Hydra troubleshooting dashboards. They help you identify issues related to jobs and data collection. The dashboards are included in the Splunk App for Infrastructure (SAI). If you're not using SAI, follow these steps to add the search-time extractions that these searches depend on.

Prerequisite

The machine on which you perform these steps must have the Data Collection Scheduler (DCS) and Data Collection Node (DCN) logs.

Steps

Select one of the following packages to add search-time extractions:
- Splunk_TA_vmware_inframon
- SA-Hydra-inframon
- Splunk_TA_vcenter
- Splunk_TA_esxilogs
- SA-VMWIndex-inframon

Add the following stanzas in the props.conf file present in the local directory of the selected package. If the props.conf file doesn't exist, create a new props.conf file.

[ta_vmware_hierarchy_agent]
REPORT-hydraloggerfields = hydra_logger_fields

## Original from SA-Hydra
[hydra_scheduler]
REPORT-schedulerfields = hydra_scheduler_log_fields

[hydra_worker]
REPORT-workerfields = hydra_worker_log_fields
REPORT-pool_name_field = pool_name_field_extraction

[source::.../var/log/splunk/*_configuration.log]
REPORT-pool_name_field = pool_name_field_extraction

[hydra_gateway]
REPORT-gatewayfields = hydra_gateway_log_fields

[hydra_access]
REPORT-gatewayfields = hydra_access_log_fields

Add the following search-time extractions to the transforms.conf file present in the local directory of the selected package. If the transforms.conf file doesn't exist, create a new transforms.conf file.

[hydra_logger_fields]
REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):\/\/([^\]]+)\] (\[[^\]]+\])?\s?(.+)$
FORMAT = level::$1 input::$2 scheduler::$3 component::$4 message::$5

[hydra_worker_log_fields]
REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):\/\/([^:]+):(\d+)\] (\[[^\]]+\])?\s?(.+)$
FORMAT = level::$1 input::$2 worker::$3 pid::$4 component::$5 message::$6

[pool_name_field_extraction]
REGEX = \[pool=([^\]]*)\]
FORMAT = pool::$1
MV_ADD = true

[hydra_scheduler_log_fields]
REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):\/\/([^\]]+)\] (\[[^\]]+\])?\s?(.+)$
FORMAT = level::$1 input::$2 scheduler::$3 component::$4 message::$5

[hydra_gateway_log_fields]
REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):([^\]]+)\] (\[[^\]]+\])?\s?(.+)$
FORMAT = level::$1 service::$2 pid::$3 component::$4 message::$5

[hydra_access_log_fields]
REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) ((\w+) ([^\s]+)) '((\d+) ([^']+))' - - - (\d+)ms$
FORMAT = level::$1 request::$2 method::$3 uri_path::$4 status_full::$5 status::$6 status_message::$7 spent::$8

Make sure the above search-time extractions are globally accessible to all the apps.
Restart your Splunk software.
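
To sanity-check the extractions outside Splunk, the hydra_worker_log_fields and pool_name_field_extraction patterns from the transforms.conf stanzas above can be tried against a sample line. The following is a minimal Python sketch, not part of the add-on. It assumes Python's re module behaves like Splunk's regex engine for these patterns, and the sample log line (input name, worker name, pid, and message) is hypothetical, made up only to fit the pattern.

```python
import re

# Patterns copied from the [hydra_worker_log_fields] and
# [pool_name_field_extraction] stanzas in transforms.conf.
WORKER_RE = re.compile(
    r"^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) "
    r"\[([\w_]+):\/\/([^:]+):(\d+)\] (\[[^\]]+\])?\s?(.+)$"
)
POOL_RE = re.compile(r"\[pool=([^\]]*)\]")

# Field names in the order used by the FORMAT line of the stanza.
WORKER_FIELDS = ("level", "input", "worker", "pid", "component", "message")


def extract_worker_fields(line):
    """Return the fields the two transforms would extract, or None if no match."""
    match = WORKER_RE.match(line)
    if match is None:
        return None
    fields = dict(zip(WORKER_FIELDS, match.groups()))
    # MV_ADD = true keeps every match, so pool is modeled as a list here.
    pools = POOL_RE.findall(line)
    if pools:
        fields["pool"] = pools
    return fields


# Hypothetical log line, invented for illustration only.
SAMPLE = (
    "2024-01-15 10:30:45,123 ERROR [ta_vmware_collection_worker://worker_1:4242] "
    "[HydraWorker] [pool=Global pool] Job expired"
)

if __name__ == "__main__":
    for name, value in extract_worker_fields(SAMPLE).items():
        print(f"{name} = {value}")
```

If a real line from your DCN logs returns None here, the sourcetype or log format probably differs from what the stanza expects, which is worth checking before you restart.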

Hydra framework status searches

Use the search queries of the Hydra Framework Status dashboard to identify issues related to jobs handled by DCNs. To enable data population for these search queries, add the search-time extractions to the package in etc/apps and make them globally available.

| Query name | Search query | Description |
| --- | --- | --- |
| Job Expiration and Failure Count Over Pool | `index=_internal sourcetype=hydra_worker error \| eval status=if(like(_raw, "%expired%"), "Expired Jobs", if(like(_raw, "%Failed to complete job%"), "Failed Jobs", "other")) \| search status!="other" \| stats count by pool, status \| xyseries pool, status, count` | Number of expired or failed jobs for each pool. DCN (Worker) logs are required to populate this panel. |
| Job Expirations by DCN | `index=_internal sourcetype=hydra_worker error expired pool="Global pool" \| timechart minspan=1m count by host` | Number of jobs that were assigned and then expired on each DCN over time. DCN (Worker) logs are required to populate this panel. |
| Jobs Handled by DCN | `index=_internal sourcetype=hydra_worker "Successfully completed job" pool="Global pool" \| eval head=host+":"+worker \| timechart minspan=1m useother=0 limit=18 count by host` | Number of jobs successfully completed by each DCN over time. DCN (Worker) logs are required to populate this panel. |
| Job Scheduling Duration Range (DEBUG level logs only) | `index=_internal source="ta_vmware_collection_scheduler_Global pool*" ("[HydraWorkerNodeManifest] checking health of node" OR "Sprayed all ready jobs onto active nodes") \| transaction startswith="[HydraWorkerNodeManifest] checking health of node" endswith="Sprayed all ready jobs onto active nodes" \| timechart minspan=1m max(duration) min(duration) avg(duration) by input` | Average, maximum, and minimum time the scheduler takes to assign jobs to DCNs in each iteration, over time. This panel populates only when DEBUG-level logging is enabled on your scheduler. Scheduler logs are required to populate this panel. |
| Collection Task Duration Range (Log Scale) | `index=_internal sourcetype=hydra_worker UpdateJobTime pool="Global pool" \| timechart min(time) as "Minimum Execution Time" median(time) as "Median Execution Time" max(time) as "Maximum Execution Time"` | Minimum, median, and maximum task execution time over time. DCN (Worker) logs are required to populate this panel. |
| Median Task Performance Over Targets | `index=_internal sourcetype=hydra_worker UpdateJobTime pool="Global pool" \| chart useother=0 median(time) over target by task` | Median job execution time by target (vCenter) and task, as reported by the worker on the DCN. DCN (Worker) logs are required to populate this panel. |
| Task Expiration Count Over DCN | `index=_internal sourcetype=hydra_worker error expired pool="Global pool" \| chart useother=0 count over host by task \| rename host as "DCN"` | Number of jobs that were assigned and then expired on each DCN, by task. DCN (Worker) logs are required to populate this panel. |
| Task Failure Count Over Target | `index=_internal sourcetype=hydra_worker error "failed to complete job" pool="Global pool" \| chart useother=0 count over host by task` | Number of jobs that were assigned and then failed on each DCN, by task. DCN (Worker) logs are required to populate this panel. |
| Last 100 Worker Errors - excluding expiration | `index=_internal sourcetype=hydra_worker error NOT expired \| head 100` | Last 100 errors that occurred in worker processes on all DCNs, excluding errors caused by job expiration. DCN (Worker) logs are required to populate this panel. |
| Last 100 Scheduler Errors | `index=_internal source="ta_vmware_collection_scheduler_Global pool*" error \| head 100` | Last 100 errors that occurred in the scheduler process. Scheduler logs are required to populate this panel. |
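
The Job Expiration and Failure Count Over Pool search classifies each error event, drops the unclassified ones, and pivots the counts into one row per pool. The following Python sketch mirrors that logic on plain data to make the steps easier to follow. It is an illustration under assumptions: like() is modeled as a case-sensitive substring test, events are given as (pool, raw) pairs, and the function names are hypothetical.

```python
from collections import Counter


def classify(raw):
    """Mirror the nested if(like(...)) in the eval: "expired" is checked first."""
    if "expired" in raw:
        return "Expired Jobs"
    if "Failed to complete job" in raw:
        return "Failed Jobs"
    return "other"


def pivot_by_pool(events):
    """events: iterable of (pool, raw). Returns {pool: {status: count}}."""
    counts = Counter()
    for pool, raw in events:
        status = classify(raw)
        if status != "other":  # same effect as: search status!="other"
            counts[(pool, status)] += 1
    # Equivalent of xyseries pool, status, count: one row per pool.
    table = {}
    for (pool, status), n in counts.items():
        table.setdefault(pool, {})[status] = n
    return table
```

Because "expired" is tested first, an event that contains both phrases is counted as an expired job, not a failed one.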

Hydra scheduler status

Use the Hydra Scheduler Status page to identify issues related to jobs assigned by your scheduler. To enable data population for these search queries, make sure you have added the search-time extractions to the package in etc/apps and made them globally available.

Some of the following queries require the DEBUG level logs of the Scheduler. To enable the DEBUG level logging for the scheduler, perform the following steps on the scheduler:

1. Go to Settings > Data inputs.
2. Select TA-VMware-inframon Collection Scheduler from the inputs.
3. Click Global pool from Scheduler Name.
4. Add DEBUG as the logging level.
5. Click on Save.

| Query name | Search query | Description |
| --- | --- | --- |
| Job Assignment by DCN | `index=_internal source="ta_vmware_collection_scheduler_Global pool*" number_new_jobs \| timechart minspan=5s max(number_new_jobs) by node` | Number of jobs assigned to each DCN over time. This panel populates only when DEBUG-level logging is enabled on the scheduler. Scheduler logs are required to populate this panel. |
| Max Unclaimed Queue Length by DCN | `index=_internal source="ta_vmware_collection_scheduler_Global pool*" "current unclaimed queue" \| timechart minspan=1m max(length) by node` | Number of unclaimed jobs that each DCN reports to the scheduler over time. This panel populates only when DEBUG-level logging is enabled on the scheduler. Scheduler logs are required to populate this panel. |
| Dead Nodes | `index=_internal source="ta_vmware_collection_scheduler_Global pool*" "is dead, failed to authenticate user" \| rex "HydraWorkerNode\((?<node>[^\s]+)\)" \| bucket _time span=5m \| stats dc(node) as "Dead Node Count" values(node) as "Dead Nodes" by _time` | List of dead nodes (DCNs) and their count at 5-minute intervals. Scheduler logs are required to populate this panel. |
| Activity Panel | To see logs of all successful and failed operations: `index="_internal" source="*_configuration.log" INFO OR ERROR pool="Global pool"` To see logs of successful operations: `index="_internal" source="*_configuration.log" INFO pool="Global pool"` To see logs of failed operations: `index="_internal" source="*_configuration.log" ERROR pool="Global pool"` | Shows the logs of configuration activities such as adding a DCN or adding a vCenter. The panel also has "Failure" and "Success" filters to show logs by operation status. |
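
The Dead Nodes search combines a rex extraction, 5-minute bucketing, and a distinct count per bucket. The following Python sketch reproduces that logic on plain data as an illustration only. It rests on several assumptions: events are given as (epoch_seconds, raw) pairs, the phrase match is case-sensitive here, events where no node can be extracted are skipped, and the function name and sample node URLs are hypothetical.

```python
import re
from collections import defaultdict

# Same pattern as the rex in the search; Python spells the named group (?P<node>...).
NODE_RE = re.compile(r"HydraWorkerNode\((?P<node>[^\s]+)\)")
DEAD_MARKER = "is dead, failed to authenticate user"
SPAN_SECONDS = 300  # bucket _time span=5m


def dead_nodes_by_bucket(events):
    """Return {bucket_start: (dead_node_count, sorted_node_list)}."""
    buckets = defaultdict(set)
    for ts, raw in events:
        if DEAD_MARKER not in raw:
            continue
        match = NODE_RE.search(raw)
        if match:
            # Snap the timestamp to the start of its 5-minute bucket.
            buckets[ts - ts % SPAN_SECONDS].add(match.group("node"))
    # dc(node) is the set size, values(node) is the sorted distinct list.
    return {
        start: (len(nodes), sorted(nodes))
        for start, nodes in sorted(buckets.items())
    }
```

A node that is reported dead several times within one bucket is counted once, which is why the search uses dc(node) and not count.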
