Use the monitoring console to view search head cluster status and troubleshoot issues
You can use the monitoring console to monitor most aspects of your deployment. This topic discusses the console dashboards that provide insight into search head clusters.
The primary documentation for the monitoring console is located in Monitoring Splunk Enterprise.
Search head clustering dashboards in the monitoring console
There are several search head clustering dashboards under the Search menu:
- Search Head Clustering: Status and Configuration
- Search Head Clustering: Configuration Replication
- Search Head Clustering: Artifact Replication
- Search Head Clustering: Scheduler Delegation
- Search Head Clustering: App Deployment
These dashboards provide a wealth of information about your search head cluster, such as:
- Cluster member instance names and status
- Identification of current captain and captain election activity
- Configuration replication performance
- Artifact replication details
- Scheduler activity
- Deployer activity
View the dashboards themselves for more information. In addition, see Search head clustering dashboards in Monitoring Splunk Enterprise.
Note: You can also use the CLI to get basic information about the cluster. See Use the CLI to view information about a search head cluster.
Troubleshoot the search head cluster
As part of its continuous monitoring of the search head cluster, the monitoring console provides a variety of information useful for troubleshooting. For example:
- The Search Head Clustering: Status and Configuration dashboard shows:
  - Search concurrency for various types of searches, with details on running versus limit
  - Status, including captaincy and state
  - Heartbeat information (discussed elsewhere in this topic)
  - Configuration baseline consistency (discussed elsewhere in this topic)
  - Artifact count
  - Election activity
- The Search Head Clustering: Configuration Replication dashboard shows:
  - Warning and error patterns
  - Configuration replication activity
- The Search Head Clustering: Artifact Replication dashboard shows:
  - Warning and error patterns
  - Artifact replication activity
- The Search Head Clustering: Scheduler Delegation dashboard shows:
  - Scheduler delegation activity
- The Search Head Clustering: App Deployment dashboard shows:
  - Status of app deployments
Troubleshoot heartbeat issues
The Search Head Clustering: Status and Configuration dashboard provides insight into the heartbeats that the cluster members send to the captain. Specifically, it shows, for each member:
- The time that the member last sent a heartbeat to the captain
- The time that the captain last received a heartbeat from the member
These times should be the same or nearly the same. Significant differences in the sent and received times indicate likely problems.
You can also access heartbeat information through the REST API. See the REST API documentation for shcluster/captain/members/{name}.
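The sent-versus-received comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the "sent" and "received" keys, and the five-second threshold are hypothetical and are not taken from the dashboard or the REST API response. Choose a threshold that reflects what counts as a significant difference in your environment.

```python
from datetime import datetime, timedelta


def find_heartbeat_skew(members, max_skew=timedelta(seconds=5)):
    """Return the names of members whose sent and received heartbeat
    times differ by more than max_skew (an assumed threshold)."""
    flagged = []
    for name, times in members.items():
        # Absolute difference, so clock drift in either direction is caught.
        skew = abs(times["received"] - times["sent"])
        if skew > max_skew:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical sample data: one healthy member, one with a large gap.
members = {
    "sh1": {
        "sent": datetime(2024, 1, 1, 12, 0, 0),
        "received": datetime(2024, 1, 1, 12, 0, 1),
    },
    "sh2": {
        "sent": datetime(2024, 1, 1, 12, 0, 0),
        "received": datetime(2024, 1, 1, 12, 0, 45),
    },
}

print(find_heartbeat_skew(members))
```

With the sample data, only sh2 is reported, because its received time trails its sent time by 45 seconds.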
The role of the heartbeat
Members send a heartbeat to the captain on a regular basis. By default, the member sends a heartbeat every five seconds.
The frequency is defined by the heartbeat_period attribute in the [shclustering] stanza of server.conf on each member. All members must set this attribute to the same value.
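Because all members must use the same heartbeat_period, it can be useful to compare the setting across members. The following Python sketch assumes that you have already collected the text of each member's server.conf (the member names and file contents here are hypothetical) and that the files are simple enough for Python's configparser to read. A member that does not set the attribute is treated as using the five-second default.

```python
import configparser

DEFAULT_HEARTBEAT_PERIOD = "5"  # default of five seconds, per this topic


def heartbeat_periods(conf_texts):
    """Map each member name to its heartbeat_period in seconds."""
    periods = {}
    for member, text in conf_texts.items():
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(text)
        value = parser.get(
            "shclustering", "heartbeat_period",
            fallback=DEFAULT_HEARTBEAT_PERIOD,
        )
        periods[member] = int(value)
    return periods


def periods_match(conf_texts):
    """Return True when every member uses the same heartbeat_period."""
    return len(set(heartbeat_periods(conf_texts).values())) <= 1


# Hypothetical server.conf contents gathered from two members.
confs = {
    "sh1": "[shclustering]\nheartbeat_period = 5\n",
    "sh2": "[shclustering]\nheartbeat_period = 10\n",
}

print(heartbeat_periods(confs))
print(periods_match(confs))
```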
The heartbeat is the fundamental communication from the member to the captain. It indicates that the member is alive and part of the cluster. The heartbeat also contains a variety of information, such as:
- Search artifacts
- Dispatched searches
- Alerts and suppressions
- Completed summarization jobs
- Member load information
When the captain receives the heartbeat, it notes that the member is in the "up" state.
After the captain receives a heartbeat from every member, it consolidates all the transmitted information and, in turn, sends members information such as:
- Search artifact logs
- List of overall alerts and suppressions
- Dispatched searches
Impact of heartbeat failure
The captain expects to get a heartbeat from each member on a regular basis, as specified in the heartbeat_timeout attribute in the [shclustering] stanza of server.conf.
By default, the timeout is set to 60 seconds.
The captain only knows about the existence of a member through its heartbeat. If it never receives a heartbeat, it will not know that the member exists.
If, within the specified timeout period, the captain does not get a heartbeat from a member that has previously sent a heartbeat, the captain marks the member as "down". The captain does not dispatch new searches to members in the "down" state.
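The timeout rule can be modeled with a short Python sketch. This is a simplified illustration, not the captain's actual implementation: the function name and data layout are hypothetical, times are plain seconds, and treating a heartbeat that is exactly heartbeat_timeout old as still "up" is an assumption. Members that have never sent a heartbeat are absent from the input, which mirrors the fact that the captain does not know they exist.

```python
def member_states(last_heartbeat, now, heartbeat_timeout=60):
    """Classify each known member as "up" or "down".

    last_heartbeat maps a member name to the time (in seconds) at which
    the captain last received a heartbeat from that member.
    """
    states = {}
    for member, last_seen in last_heartbeat.items():
        elapsed = now - last_seen
        states[member] = "up" if elapsed <= heartbeat_timeout else "down"
    return states


# Hypothetical sample: sh1 reported 5 seconds ago, sh2 reported 100 seconds ago.
last_heartbeat = {"sh1": 995.0, "sh2": 900.0}
print(member_states(last_heartbeat, now=1000.0))
```

In this sample, sh2 exceeds the default 60-second timeout and is marked "down", so it would receive no newly dispatched searches.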
Causes of heartbeat failure
If the captain does not receive a heartbeat from a member, it usually indicates one of the following situations:
- Member is down or unavailable.
- Network partition between captain and member.
- HTTP request failures. These are visible in splunkd_access.log on the captain.
Note: By default, Splunk Enterprise logs only heartbeat failures in splunkd_access.log. To enable logging for heartbeat successes as well, configure access_logging_for_heartbeats=true in the [shclustering] stanza of server.conf on the captain. If you want this configuration change to persist across captaincy transfer, make the change on all members, not just the current captain.
Troubleshoot configuration baseline consistency
The Search Head Clustering: Status and Configuration dashboard includes information on the consistency of the configuration baseline. This information helps to determine whether configuration changes are being properly replicated across the set of cluster members.
To find this information, go to the Snapshots section of the dashboard and view the Status table. There is one row for each member. The table includes two columns that pertain to baseline consistency:
- Configuration Baseline Consistency. This column contains a ratio that compares the consistency of each member's baseline to the baselines for all other members. For more details, click the ratio. A table to the right then compares the member's baseline consistency against each individual member.
- Number of Unpublished Changes. This column indicates whether there are any sets of configuration changes on the member that have not yet been replicated to the captain. In particular, it notes whether a member is out-of-sync with the captain.
When a baseline mismatch is detected, at least one member requires manual intervention to regain baseline consistency. Examine the consistency comparison table to identify the member that is not in sync with a majority of the other members. To restore consistency, perform a manual resync on the member, using the splunk resync shcluster-replicated-config command. See Perform a manual resync.
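The majority check on the consistency comparison table can be expressed as a short Python sketch. The nested dictionary below is a hypothetical stand-in for the table, where consistent[a][b] is True when member a's baseline matches member b's. It is not a format that the monitoring console exports.

```python
def members_out_of_sync(consistent):
    """Return members whose baseline disagrees with a majority of the
    other members, based on a pairwise comparison table."""
    names = sorted(consistent)
    out_of_sync = []
    for member in names:
        others = [other for other in names if other != member]
        mismatches = sum(1 for other in others if not consistent[member][other])
        # Flag the member if it mismatches more than half of the others.
        if mismatches * 2 > len(others):
            out_of_sync.append(member)
    return out_of_sync


# Hypothetical three-member cluster: sh1 and sh2 agree, sh3 differs from both.
table = {
    "sh1": {"sh2": True, "sh3": False},
    "sh2": {"sh1": True, "sh3": False},
    "sh3": {"sh1": False, "sh2": False},
}

print(members_out_of_sync(table))
```

In this sample, sh3 is the only member flagged, which makes it the candidate for a manual resync.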
For a discussion of configuration replication, see Configuration updates that the cluster replicates.
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9, 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10, 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.1.13, 8.1.14, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 8.2.10, 8.2.11, 8.2.12, 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4, 9.0.5, 9.0.6, 9.0.7, 9.0.8, 9.0.9, 9.0.10, 9.1.0, 9.1.1, 9.1.2, 9.1.3, 9.1.4, 9.1.5, 9.1.6, 9.1.7, 9.2.0, 9.2.1, 9.2.2, 9.2.3, 9.2.4, 9.3.0, 9.3.1, 9.3.2