Splunk® Supported Add-ons

Splunk Add-on for VMware Metrics

Hydra troubleshooting searches in the Splunk Add-on for VMware Metrics

Hydra troubleshooting searches are the search queries behind the Hydra troubleshooting dashboards, which help you identify issues related to jobs and data collection. The Hydra troubleshooting dashboards are part of the Splunk App for Infrastructure (SAI). If you're not using SAI, follow these steps to set up the search-time extractions that these searches require.

Prerequisite

The machine on which you perform these steps must have the DCS and DCN logs.

Steps

  1. Select one of the following packages to add search-time extractions:
    • Splunk_TA_vmware_inframon
    • SA-Hydra-inframon
    • Splunk_TA_vcenter
    • Splunk_TA_esxilogs
    • SA-VMWIndex-inframon
  2. Add the following stanzas in the props.conf file present in the local directory of the selected package. If the props.conf file doesn't exist, create a new props.conf file.
    [ta_vmware_hierarchy_agent]
    REPORT-hydraloggerfields = hydra_logger_fields
    
    ## Original from SA-Hydra
    [hydra_scheduler]
    REPORT-schedulerfields = hydra_scheduler_log_fields
    
    [hydra_worker]
    REPORT-workerfields = hydra_worker_log_fields
    REPORT-pool_name_field = pool_name_field_extraction
    
    [source::.../var/log/splunk/*_configuration.log]
    REPORT-pool_name_field = pool_name_field_extraction
    
    [hydra_gateway]
    REPORT-gatewayfields = hydra_gateway_log_fields
    
    [hydra_access]
    REPORT-gatewayfields = hydra_access_log_fields
     
  3. Add the following search-time extractions to the transforms.conf file present in the local directory of the selected package. If the transforms.conf file doesn't exist, create a new transforms.conf file.
    [hydra_logger_fields]
    REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):\/\/([^\]]+)\] (\[[^\]]+\])?\s?(.+)$
    FORMAT = level::$1 input::$2 scheduler::$3 component::$4 message::$5
    
    [hydra_worker_log_fields]
    REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):\/\/([^:]+):(\d+)\] (\[[^\]]+\])?\s?(.+)$
    FORMAT = level::$1 input::$2 worker::$3 pid::$4 component::$5 message::$6
    
    [pool_name_field_extraction]
    REGEX = \[pool=([^\]]*)\]
    FORMAT = pool::$1
    MV_ADD = true
    
    [hydra_scheduler_log_fields]
    REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):\/\/([^\]]+)\] (\[[^\]]+\])?\s?(.+)$
    FORMAT = level::$1 input::$2 scheduler::$3 component::$4 message::$5
    
    [hydra_gateway_log_fields]
    REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) \[([\w_]+):([^\]]+)\] (\[[^\]]+\])?\s?(.+)$
    FORMAT = level::$1 service::$2 pid::$3 component::$4 message::$5
    
    [hydra_access_log_fields]
    REGEX = ^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d (\w+) ((\w+) ([^\s]+)) '((\d+) ([^']+))' - - - (\d+)ms$
    FORMAT = level::$1 request::$2 method::$3 uri_path::$4 status_full::$5 status::$6 status_message::$7 spent::$8
     
  4. Make sure the search-time extractions above are globally accessible to all apps.
  5. Restart your Splunk software.
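You can sanity-check the extraction stanzas above outside Splunk. The following is a minimal Python sketch that runs the [hydra_worker_log_fields] regex against a hypothetical hydra_worker log line; the sample line, host name, and PID are illustrative, not taken from real logs:

```python
import re

# Search-time extraction regex from the [hydra_worker_log_fields] stanza above.
WORKER_REGEX = re.compile(
    r"^\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d "
    r"(\w+) \[([\w_]+):\/\/([^:]+):(\d+)\] (\[[^\]]+\])?\s?(.+)$"
)

# Field names from the stanza's FORMAT line, in capture-group order.
FIELDS = ("level", "input", "worker", "pid", "component", "message")

def extract(line):
    """Return the fields the FORMAT line maps, or None if the line doesn't match."""
    m = WORKER_REGEX.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None

# Hypothetical hydra_worker log line for illustration only.
sample = ("2024-09-13 10:15:42,123 INFO "
          "[hydra_worker://dcn1.example.com:4321] "
          "[HydraWorkerNode] Successfully completed job")

print(extract(sample))
```

Running the sketch confirms that the level, input, worker, pid, component, and message fields all extract cleanly; the same approach works for the other stanzas if you swap in their REGEX and FORMAT values.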

Hydra framework status searches

Use the search queries of the Hydra Framework Status dashboard to identify issues related to jobs handled by DCNs. To enable data population for these search queries, add the search-time extractions to the package in etc/apps and make them globally available.

The following entries list each query name, its search query, and a description.
Job Expiration and Failure Count Over Pool

index=_internal sourcetype=hydra_worker error | eval status=if(like(_raw, "%expired%"), "Expired Jobs", if(like(_raw, "%Failed to complete job%"), "Failed Jobs", "other")) | search status!="other" | stats count by pool, status | xyseries pool, status, count

Number of jobs that expired or failed for each pool. DCN (Worker) logs are required to populate this panel.
Job Expirations by DCN

index=_internal sourcetype=hydra_worker error expired pool="Global pool" | timechart minspan=1m count by host

Number of jobs assigned and expired on each DCN versus time. DCN (Worker) logs are required to populate this panel.
Jobs Handled by DCN

index=_internal sourcetype=hydra_worker "Successfully completed job" pool="Global pool" | eval head=host+":"+worker | timechart minspan=1m useother=0 limit=18 count by host

Number of jobs successfully completed by each DCN versus time. DCN (Worker) logs are required to populate this panel.
Job Scheduling Duration Range (DEBUG level logs only)

index=_internal source="*ta_vmware_collection_scheduler_*Global pool*" ("[HydraWorkerNodeManifest] checking health of node" OR "Sprayed all ready jobs onto active nodes") | transaction startswith="[HydraWorkerNodeManifest] checking health of node" endswith="Sprayed all ready jobs onto active nodes" | timechart minspan=1m max(duration) min(duration) avg(duration) by input

Maximum, minimum, and average time taken by the Scheduler to assign jobs to DCNs at each iteration, versus time. It populates only when the DEBUG level is enabled on your scheduler. Scheduler logs are required to populate this panel.
Collection Task Duration Range (Log Scale)

index=_internal sourcetype=hydra_worker UpdateJobTime pool="Global pool" | timechart min(time) as "Minimum Execution Time" median(time) as "Median Execution Time" max(time) as "Maximum Execution Time"

Minimum, median, and maximum execution time across all tasks. DCN (Worker) logs are required to populate this panel.
Median Task Performance Over Targets

index=_internal sourcetype=hydra_worker UpdateJobTime pool="Global pool" | chart useother=0 median(time) over target by task

Median job execution time by target (vCenter) and task, as reported by the worker on each DCN. DCN (Worker) logs are required to populate this panel.
Task Expiration Count Over DCN

index=_internal sourcetype=hydra_worker error expired pool="Global pool" | chart useother=0 count over host by task | rename host as "DCN"

Number of jobs assigned and expired on each DCN, by task. DCN (Worker) logs are required to populate this panel.
Task Failure Count Over Target

index=_internal sourcetype=hydra_worker error "failed to complete job" pool="Global pool" | chart useother=0 count over host by task

Number of jobs assigned and failed on each DCN, by task. DCN (Worker) logs are required to populate this panel.
Last 100 Worker Errors - excluding expiration

index=_internal sourcetype=hydra_worker error NOT expired | head 100

The last 100 errors that occurred in worker processes across all DCNs, excluding errors caused by job expiration. DCN (Worker) logs are required to populate this panel.
Last 100 Scheduler Errors

index=_internal source="*ta_vmware_collection_scheduler_*Global pool*" error | head 100

The last 100 errors that occurred in the Scheduler process. Scheduler logs are required to populate this panel.
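The eval/like classification in the "Job Expiration and Failure Count Over Pool" query can be sketched in plain Python to show what the stats table counts. The event strings and pool names below are hypothetical:

```python
from collections import Counter

def classify(raw_event):
    # Mirrors: if(like(_raw, "%expired%"), "Expired Jobs",
    #            if(like(_raw, "%Failed to complete job%"), "Failed Jobs", "other"))
    if "expired" in raw_event:
        return "Expired Jobs"
    if "Failed to complete job" in raw_event:
        return "Failed Jobs"
    return "other"

# Hypothetical (pool, _raw) pairs standing in for hydra_worker error events.
events = [
    ("Global pool", "ERROR job 42 expired before completion"),
    ("Global pool", "ERROR Failed to complete job 43"),
    ("pool_b", "ERROR job 44 expired before completion"),
    ("pool_b", "ERROR unrelated worker error"),
]

# Equivalent of: | search status!="other" | stats count by pool, status
counts = Counter((pool, classify(raw)) for pool, raw in events
                 if classify(raw) != "other")
print(dict(counts))
```

The xyseries step in the real query then pivots these (pool, status) counts into one row per pool with a column per status.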

Hydra scheduler status

Use the Hydra Scheduler Status page to identify issues related to jobs assigned by your scheduler. To enable data population for these search queries, make sure you have added the search-time extractions to the package in etc/apps and made them globally available.

Some of the following queries require DEBUG-level logs from the Scheduler. To enable DEBUG-level logging, perform the following steps on the scheduler:

  1. Go to Settings > Data inputs.
  2. Select TA-VMware-inframon Collection Scheduler from the inputs.
  3. Click Global pool from Scheduler Name.
  4. Add DEBUG as the logging level.
  5. Click Save.
The following entries list each query name, its search query, and a description.
Job Assignment by DCN

index=_internal source="*ta_vmware_collection_scheduler_*Global pool*" number_new_jobs | timechart minspan=5s max(number_new_jobs) by node

Number of jobs assigned to each DCN versus time. It populates only when the DEBUG level is enabled on the scheduler. Scheduler logs are required to populate this panel.
Max Unclaimed Queue Length by DCN

index=_internal source="*ta_vmware_collection_scheduler_*Global pool*" "current unclaimed queue" | timechart minspan=1m max(length) by node

Number of unclaimed jobs reported by each DCN to the Scheduler versus time. It populates only when the DEBUG level is enabled on the scheduler. Scheduler logs are required to populate this panel.
Dead Nodes

index=_internal source="*ta_vmware_collection_scheduler_*Global pool*" "is dead, failed to authenticate user" | rex "HydraWorkerNode\((?<node>[^\s]+)\)" | bucket _time span=5m | stats dc(node) as "Dead Node Count" values(node) as "Dead Nodes" by _time

List of dead nodes (DCNs) and their count in each 5-minute interval. Scheduler logs are required to populate this panel.
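The Dead Nodes query's rex extraction and 5-minute bucketing can be approximated in Python. Note that SPL's `(?<node>...)` named group becomes `(?P<node>...)` in Python's re syntax. The log events and node URIs below are hypothetical:

```python
import re
from datetime import datetime, timedelta

# The Dead Nodes query's rex pattern, translated to Python named-group syntax.
NODE_RE = re.compile(r"HydraWorkerNode\((?P<node>[^\s]+)\)")

def bucket_5m(ts):
    """Floor a timestamp to its 5-minute bucket, like `bucket _time span=5m`."""
    return ts - timedelta(minutes=ts.minute % 5,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

# Hypothetical scheduler log events: (timestamp, message).
events = [
    (datetime(2024, 9, 13, 10, 1),
     "HydraWorkerNode(https://dcn1:8089) is dead, failed to authenticate user"),
    (datetime(2024, 9, 13, 10, 3),
     "HydraWorkerNode(https://dcn2:8089) is dead, failed to authenticate user"),
]

# Equivalent of: | stats dc(node) values(node) by _time
buckets = {}
for ts, msg in events:
    m = NODE_RE.search(msg)
    if m:
        buckets.setdefault(bucket_5m(ts), set()).add(m.group("node"))

for t, nodes in sorted(buckets.items()):
    print(t, len(nodes), sorted(nodes))
```

Both sample events land in the same 10:00 bucket, matching how the real query reports a distinct count and the list of dead node names per interval.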
Activity Panel

To see logs of all the success and failure operations:

index="_internal" source="*_configuration.log" INFO OR ERROR pool="Global pool"

To see logs of successful operations:

index="_internal" source="*_configuration.log" INFO pool="Global pool"

To see logs of failed operations:

index="_internal" source="*_configuration.log" ERROR pool="Global pool"

These searches show the logs of configuration activities, such as adding a DCN or adding a vCenter. The panel also provides "Success" and "Failure" filters to show logs by operation status.
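The three Activity Panel searches differ only in which log levels they keep. Here is a rough Python analogue of that filtering, using hypothetical configuration.log lines:

```python
# Hypothetical configuration.log lines; the hosts and messages are illustrative.
lines = [
    "2024-09-13 10:00:01,000 INFO [pool=Global pool] Added DCN dcn1",
    "2024-09-13 10:00:05,000 ERROR [pool=Global pool] Failed to add vCenter vc1",
    "2024-09-13 10:00:09,000 INFO [pool=pool_b] Added vCenter vc2",
]

def activity(lines, levels=("INFO", "ERROR"), pool="Global pool"):
    """Keep lines that match any of `levels` and belong to the given pool."""
    return [l for l in lines
            if any(lvl in l for lvl in levels) and f"[pool={pool}]" in l]

print(len(activity(lines)))                     # success and failure operations
print(len(activity(lines, levels=("ERROR",))))  # failed operations only
```

Narrowing `levels` to ("INFO",) or ("ERROR",) corresponds to the success-only and failure-only searches above.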
Last modified on 13 September, 2024

This documentation applies to the following versions of Splunk® Supported Add-ons: released

