Intermittent authentication timeouts on search peers
Splunk Web users can experience intermittent timeouts from search peers when there are more concurrent searches attempting to run than the search peers can respond to.
A group of search heads can schedule more concurrent searches than some peers are capable of handling with their CPU core count.
On the search head, you might see yellow banners in quick succession warning that a peer or peers are 'Down' due to Authentication Failed and/or Replication Status Failed. Typically this can happen a few times a day, but the banners appear and disappear seemingly randomly.
On the search head, splunkd.log will have messages like:
WARN DistributedPeerManager - Unable to distribute to peer named xxxx at uri https://xxxx:8089 because peer has status = "Authentication Failed".
WARN DistributedPeerManager - Unable to distribute to peer named xxxx:8089 at uri https://xxxx:8089 because peer has status = "Down".
WARN DistributedPeerManager - Unable to distribute to peer named xxxx at uri https://xxxx:8089 because replication was unsuccessful. replicationStatus Failed
These symptoms can appear with or without other Splunk features such as search head clustering and indexer clustering being enabled. The symptoms are more common in environments with two or more search heads.
To diagnose this issue and proceed with its resolution, set up the monitoring console for your deployment. Next time the issue occurs, you can use the monitoring console to view performance data and validate the diagnosis. For monitoring console set up instructions, see Multi-instance deployment monitoring console setup steps.
1. Find an auth-token timeout to scope the time the issue occurred.
The authentication timeout is 10 seconds, so when the auth-tokens endpoint on the peer takes more than 10 seconds to respond, you'll see an auth or peer status banner on the search head.
To find an auth timeout on the peer named in the search head banner:
index=_internal source=*splunkd_access* splunk_server="search_peer_name" auth | timechart max(spent)
Or to find an auth timeout on any peer:
index=_internal source=*splunkd_access* auth spent > 10000 NOT streams | table splunk_server spent _time
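Once the second search returns rows, grouping the slow responses by peer helps scope the incident window. The following Python sketch assumes the results were exported as rows carrying the same fields the search tabulates (splunk_server, spent, _time). The helper name and the sample rows are hypothetical, and the 10,000 ms threshold simply mirrors the 10-second timeout described above.

```python
from collections import defaultdict

AUTH_TIMEOUT_MS = 10_000  # the 10-second auth timeout, expressed in milliseconds


def timeout_windows(rows, threshold_ms=AUTH_TIMEOUT_MS):
    """Return {peer: (first_time, last_time, count)} for rows slower than the threshold."""
    per_peer = defaultdict(list)
    for row in rows:
        if int(row["spent"]) > threshold_ms:
            per_peer[row["splunk_server"]].append(row["_time"])
    return {
        peer: (min(times), max(times), len(times))
        for peer, times in per_peer.items()
    }


# Hypothetical exported rows; real values come from your own search results.
# ISO 8601 timestamps are used so that string comparison matches time order.
sample_rows = [
    {"splunk_server": "peer01", "spent": "10450", "_time": "2024-01-01T10:00:05"},
    {"splunk_server": "peer01", "spent": "320", "_time": "2024-01-01T10:00:40"},
    {"splunk_server": "peer01", "spent": "12980", "_time": "2024-01-01T10:01:10"},
    {"splunk_server": "peer02", "spent": "11020", "_time": "2024-01-01T10:00:30"},
]

for peer, (start, end, count) in sorted(timeout_windows(sample_rows).items()):
    print(f"{peer}: {count} slow auth response(s) between {start} and {end}")
```

The earliest and latest timestamps per peer give the time frame to carry into the next step.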
2. Examine the load average just before the auth timeout and check for a dramatic increase.
Now that you've established the time frame in step 1, examine metrics.log's load average over the time frame to determine whether the load increased significantly just before the timeouts were triggered. Typically the total time frame is about 2 minutes.
To find the load average:
index=_internal source=*metrics.log* host="search_peer_name" group=thruput | timechart bins=200 max(load_average)
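One way to make "a dramatic increase" concrete is to compare the peak load in the roughly two minutes before the timeout with the average load earlier in the window. The Python sketch below assumes you exported the timechart as (epoch_seconds, load_average) pairs. The helper name, the sample data, and the 2x cutoff are illustrative assumptions, not Splunk-defined thresholds.

```python
def load_spike_ratio(samples, timeout_epoch, window_s=120):
    """Compare peak load in the window before the timeout with the earlier average.

    samples: iterable of (epoch_seconds, load_average) tuples.
    Returns peak_in_window / baseline_average, or None if either side has no data.
    """
    window_start = timeout_epoch - window_s
    baseline = [load for t, load in samples if t < window_start]
    recent = [load for t, load in samples if window_start <= t <= timeout_epoch]
    if not baseline or not recent:
        return None
    baseline_avg = sum(baseline) / len(baseline)
    if baseline_avg == 0:
        return None
    return max(recent) / baseline_avg


# Hypothetical samples at 30-second intervals; the timeout was observed at t=300.
samples = [(0, 2.0), (30, 2.0), (60, 2.0), (90, 2.0), (120, 2.0), (150, 2.0),
           (180, 3.0), (210, 6.0), (240, 9.0), (270, 12.0), (300, 11.0)]

ratio = load_spike_ratio(samples, timeout_epoch=300)
if ratio is not None and ratio >= 2.0:  # 2x is an arbitrary illustrative cutoff
    print(f"Load peaked at {ratio:.1f}x the earlier average before the timeout")
```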
3. Examine the CPU and memory usage on the search peers.
Use the monitoring console Resource Usage dashboard (Settings > Monitoring Console > Resource Usage: Instance) to review the peak resource usage on the search peer during the time scoped above. Look at the Average CPU Usage panel. If you have too many concurrent searches, you will see that the peer uses more than the available percentage of CPU per core.
For example: A healthy 8 core box will show no more than 100% x 8 cores = 800% average CPU usage. In contrast, a box overtaxed with searches typically shows 1000% or more average CPU usage during the time frame where the timeouts appear.
For more information, see Resource usage dashboards in the Monitoring Splunk Enterprise manual.
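The arithmetic in the 8-core example generalizes to any core count. This short Python sketch only restates the rule of thumb from the text (100% per core as the ceiling); the function names are hypothetical.

```python
def cpu_ceiling_percent(cores):
    """Nominal maximum average CPU usage: 100% per core."""
    return 100 * cores


def is_overtaxed(avg_cpu_percent, cores):
    """True when the reported average CPU usage exceeds the per-core ceiling."""
    return avg_cpu_percent > cpu_ceiling_percent(cores)


print(cpu_ceiling_percent(8))   # 800
print(is_overtaxed(750, 8))     # False
print(is_overtaxed(1000, 8))    # True
```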
4. Examine the concurrent search load. Typically, you will find searches that were scheduled poorly or that are scoped inefficiently.
Use the monitoring console Search Activity dashboards (Settings > Monitoring Console > Search > Activity) to learn about the dispatched searches, the app they were triggered from, search concurrency details, and so on. For more information, see Search activity dashboards.
Once you've identified the pileup of concurrent searches, work through the following steps. All of them are good practices.
- Most importantly, spread the searches out to reduce concurrency. For example: Use the advanced scheduling cron to balance the searches so that they don't all run on the same minute every hour. Read about scheduling reports and configuring scheduled report priority in the Reporting Manual.
- Limit real-time searches. For example, consider a scheduled real-time alert that looks back over the last 2 minutes and only triggers an email. Unless it triggers a script, configure this task as a scheduled search that runs 10 minutes in the past (to address potential source latency) over a 5-minute window, combined with a cron offset. This offers the same effect without tying down a CPU core across all peers, all the time. Read more about expected performance and known limitations of real-time searches and reports in the Search Manual.
- Re-scope the search time for actual information needs. For example: A scheduled search that runs every 15 minutes over a 4-hour time frame re-reads 3 hours and 45 minutes of data on every run. Unless you have a very good reason for the extra lookback (such as extreme forwarder latency), it's a waste of shared resources. Read more about alerts in the Alerting Manual.
- Additionally, there's the option to use limits.conf to lower the search concurrency of all the search heads. Note that if you do only this step, you will get a different set of banners (about reaching the max number of concurrent searches) and you will still not be able to run concurrent searches. But if you do some of the other steps, too, you might want to configure the search concurrency like this:
[search]
# Defaults to 6
base_max_searches = 2
# Defaults to 1
max_searches_per_cpu = 1
# Defaults to 1 in 6.0; defaults to 3 in 5.x
max_rt_search_multiplier = 1

[scheduler]
# Defaults to 50
max_searches_perc = 20
# Defaults to 50
auto_summary_perc = 10
- Add more search peers.
- Add more cores to the search peers.
- As a last resort, there's also the option of increasing the distsearch.conf timeouts. This is a workaround, and will slow down search results during peak load times. Increase the timeouts in distsearch.conf on the search head:
[distributedSearch]
# Defaults to 10
statusTimeout = 30
# Defaults to 5
authTokenConnectionTimeout = 30
# Defaults to 10
authTokenSendTimeout = 60
# Defaults to 10
authTokenReceiveTimeout = 60
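Spreading searches out, as the first item in the list recommends, amounts to giving each scheduled search its own minute offset. The Python sketch below generates standard five-field cron expressions with explicit minute lists for a set of searches. The search names, the function name, and the even-spacing strategy are assumptions for illustration; adapt the offsets to your own scheduling needs.

```python
def staggered_crons(search_names, interval_minutes=60):
    """Assign each search a minute offset so the searches do not start together.

    Returns {search_name: cron_expression} in standard five-field cron syntax.
    interval_minutes must divide 60 evenly (for example 5, 15, 30, or 60).
    """
    if interval_minutes <= 0 or 60 % interval_minutes != 0:
        raise ValueError("interval_minutes must divide 60 evenly")
    schedule = {}
    count = len(search_names)
    for i, name in enumerate(search_names):
        # Space the offsets evenly across the interval. Offsets repeat only
        # when there are more searches than minutes in the interval.
        offset = i * interval_minutes // count
        minutes = ",".join(str(m) for m in range(offset, 60, interval_minutes))
        schedule[name] = f"{minutes} * * * *"
    return schedule


# Hypothetical search names, each meant to run every 15 minutes.
for name, cron in staggered_crons(["errors", "logins", "latency"], 15).items():
    print(f"{name}: {cron}")
```

With three searches on a 15-minute interval, the offsets come out as 0, 5, and 10 minutes, so no two of these searches start in the same minute.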
This documentation applies to the following versions of Splunk® Enterprise: 6.5.7, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9, 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10, 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.1.13, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 8.2.10, 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4