Intermittent authentication timeouts on search peers
Splunk Web users can experience intermittent timeouts from search peers when there are more concurrent searches attempting to run than the search peers can respond to.
A group of search heads can schedule more concurrent searches than some peers are capable of handling with their CPU core count.
On the search head, you might see yellow banners in quick succession warning that a peer or peers are 'Down' due to Authentication Failed and/or Replication Status Failed. Typically this happens a few times a day, with the banners appearing and disappearing seemingly at random.
On the search head, splunkd.log will have messages like:
WARN DistributedPeerManager - Unable to distribute to peer named xxxx at uri https://xxxx:8089 because peer has status = "Authentication Failed".
WARN DistributedPeerManager - Unable to distribute to peer named xxxx:8089 at uri https://xxxx:8089 because peer has status = "Down".
WARN DistributedPeerManager - Unable to distribute to peer named xxxx at uri https://xxxx:8089 because replication was unsuccessful. replicationStatus Failed
The symptoms can appear with or without other Splunk features such as search head pooling and index replication being enabled. The symptoms are more common in environments with two or more search heads.
To properly diagnose this issue and proceed with its resolution, you must deploy and run the SoS technology add-on (TA) on all indexers/search peers. In addition, install the SoS app itself on a search head. Once the TA has been enabled and has begun collecting data, the next time the issue occurs, you will have performance data to validate the diagnosis.
1. Find an auth-token timeout to scope the time the issue occurred.
The authentication timeout is 10 seconds, so when the auth-tokens endpoint on the peer takes more than 10 seconds to respond, you'll see an auth or peer status banner on the search head.
To find an auth timeout on the peer named in the search head banner:
index=_internal source=*splunkd_access* splunk_server="search_peer_name" auth | timechart max(spent)
Or to find an auth timeout on any peer:
index=_internal source=*splunkd_access* auth spent>10000 NOT streams | table splunk_server spent _time
2. Examine the load average just before the auth timeout and check for a dramatic increase.
Now that you've established the time frame in step 1, examine metrics.log's load average over the time frame to determine whether the load increased significantly just before the timeouts were triggered. Typically the total time frame is about 2 minutes.
To find the load average:
index=_internal source=*metrics.log* host="search_peer_name" group=thruput | timechart bins=200 max(load_average)
3. Examine the CPU and memory usage on the search peers.
Use the SoS view for CPU/Memory Usage (SOS > Resource Usage > Splunk CPU/Memory Usage) to review the peak resource usage on the search peer during the time frame scoped above. Look at the Average CPU Usage panel. If you have too many concurrent searches, you will see that the peer uses more than the available percentage of CPU per core. For example, a healthy 8-core box will show no more than 100% x 8 cores = 800% average CPU usage. In contrast, a box overtaxed with searches typically shows 1000% or more average CPU usage during the time frame where the timeouts appear.
For more information about your CPU and memory usage, you can run the useful search described below.
4. Examine the concurrent search load. Typically, some of the searches involved were scheduled poorly and/or scoped inefficiently.
Use the SoS Dispatch Inspector view to learn about the dispatched search objects, the app they were triggered from, and their default schedule. Or you can find this information using the useful search provided below.
Once you've identified your pileup of concurrent searches, work through the following list. All of these are good practices.
- Most importantly, spread the searches out to reduce concurrency. For example: Use the advanced scheduling cron to balance the searches so that they don't all run on the same minute every hour. Read about scheduling reports and configuring scheduled report priority in the Reporting Manual.
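As an illustration of staggering with cron, three hourly searches that all fire at minute 0 can be offset in savedsearches.conf (the stanza names here are hypothetical):

```ini
# Before: all three searches fire at the top of every hour
# cron_schedule = 0 * * * *

# After: staggered so the searches never start in the same minute
[hypothetical_hourly_report_a]
cron_schedule = 5 * * * *

[hypothetical_hourly_report_b]
cron_schedule = 25 * * * *

[hypothetical_hourly_report_c]
cron_schedule = 45 * * * *
```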
- Limit real-time searches. For example, consider a scheduled real-time alert that looks back over the last 2 minutes and only triggers an email. Unless it triggers a script, this task should be configured as a scheduled search that runs 10 minutes in the past (to allow for potential source latency) over a 5-minute window, combined with a cron offset. This has the same effect without tying up a CPU core across all peers, all the time. Read more about expected performance and known limitations of real-time searches and reports in the Search Manual.
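A sketch of that conversion in savedsearches.conf, assuming a hypothetical alert name and cron offset:

```ini
[hypothetical_error_alert]
# Run every 5 minutes, offset from the top of the hour
cron_schedule = 3-58/5 * * * *
# Look 10 minutes into the past (to absorb source latency)
# over a 5-minute window
dispatch.earliest_time = -15m@m
dispatch.latest_time = -10m@m
```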
- Re-scope search time ranges to actual information needs. For example, scheduled searches that run every 15 minutes over a 4-hour time frame waste limited resources. Unless you have a very good reason for a search to look back an additional 3 hours and 45 minutes on every run (such as extreme forwarder latency), match the time range to the schedule. Read more about alerts in the Alerting Manual.
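For instance, a search scheduled every 15 minutes generally needs only a matching 15-minute window, which in savedsearches.conf looks like this (stanza name hypothetical):

```ini
[hypothetical_15min_report]
cron_schedule = */15 * * * *
# Match the time range to the schedule: 15 minutes, not 4 hours
dispatch.earliest_time = -15m@m
dispatch.latest_time = now
```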
- Additionally, you can use limits.conf to lower the search concurrency of all the search heads. Note that if you do only this step, you will get a different set of banners (about reaching the maximum number of concurrent searches) and you still won't be able to run more concurrent searches. But if you do some of the other steps too, you might want to configure search concurrency like this:
[search]
base_max_searches = 2        # Defaults to 6
max_searches_per_cpu = 1     # Defaults to 1
max_rt_search_multiplier = 1 # Defaults to 1 in 6.0; in 5.x, defaults to 3

[scheduler]
max_searches_perc = 20       # Defaults to 50
auto_summary_perc = 10       # Defaults to 50
- Add more search peers.
- Add more cores to the search peers.
- As a last resort, you can increase the distsearch.conf timeouts. This is a workaround, and it will delay search results during peak load times. Increase the timeouts in distsearch.conf on the search head:
[distributedSearch]
statusTimeout = 30              # Defaults to 10
authTokenConnectionTimeout = 30 # Defaults to 5
authTokenSendTimeout = 60       # Defaults to 10
authTokenReceiveTimeout = 60    # Defaults to 10
Search concurrency
If you have SoS installed on your search head, you can use this search to examine search concurrency.
`set_sos_index` sourcetype=ps host="search_peer_name"
| multikv
| `get_splunk_process_type`
| search type="searches"
| rex field=ARGS "_--user=(?<search_user>.*?)_--"
| rex field=ARGS "--id=(?<sid>.*?)_--"
| rex field=sid "remote_(?<search_head>[^_]*?)_"
| eval is_remote=if(like(sid,"%remote%"),"remote","local")
| eval is_scheduled=if(like(sid,"%scheduler_%"),"scheduled","ad-hoc")
| eval is_realtime=if(like(sid,"%rt_%"),"real-time","historical")
| eval is_subsearch=if(like(sid,"%subsearch_%"),"subsearch","generic")
| eval search_type=is_remote.", ".is_scheduled.", ".is_realtime
| timechart bins=200 dc(sid) AS "Search count" by is_scheduled
CPU and memory usage
If you have the SoS App installed on the search head, you can find CPU and memory usage for all search processes at one point based on the intersection of the "ps" run interval and maximum load:
index=sos sourcetype="ps" host="search_peer_name"
| head 1
| multikv
| `get_splunk_process_type`
| search type=searches
| rex field=ARGS "--id=(?<sid>.*?)_--"
| stats dc(sid) AS "Search count" sum(pctCPU) AS "Total %CPU" sum(eval(round(RSZ_KB/1024,2))) AS "Total physical memory used (MB)"
This documentation applies to the following versions of Splunk® Enterprise: 6.0 through 6.0.15, 6.1 through 6.1.14, and 6.2.0 through 6.2.15.