Troubleshoot distributed search and search head pooling
This topic describes issues to be aware of when configuring or using distributed search.
General configuration issues
Clock skew between search heads and search peers can affect search behavior
Keep the clocks on your search heads and search peers in sync, using NTP (Network Time Protocol) or a similar mechanism. If the clocks are out of sync by more than a few seconds, you can end up with search failures or premature expiration of search artifacts.
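As a rough illustration of the check described above, the following sketch flags hosts whose reported time differs from a reference clock by more than a threshold. The function name, the host names, and the 5-second default are assumptions chosen for illustration; the sketch does not collect timestamps from hosts, which you would need to do by whatever means your environment allows.

```python
from datetime import datetime, timedelta, timezone


def find_skewed_hosts(reference, host_times, max_skew=timedelta(seconds=5)):
    """Return {host: skew} for hosts whose clock is off by more than max_skew.

    reference  -- datetime taken from the clock you trust (hypothetical input)
    host_times -- dict of host name -> datetime reported by that host
    max_skew   -- assumed tolerance; "a few seconds" is interpreted here as 5
    """
    skewed = {}
    for host, reported in host_times.items():
        skew = abs(reported - reference)
        if skew > max_skew:
            skewed[host] = skew
    return skewed


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    reported = {
        "searchhead1": now + timedelta(seconds=1),   # within tolerance
        "peer1": now - timedelta(seconds=42),        # out of sync
    }
    for host, skew in find_skewed_hosts(now, reported).items():
        print(f"{host} is out of sync by {skew.total_seconds():.0f} seconds")
```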
Search head pooling configuration issues
When implementing search head pooling, there are a few potential issues you should be aware of, mainly having to do with coordination among search heads.
Authentication and authorization changes made through a search head's Splunk Web apply only to that search head and not to other search heads in that pool. Each member of the pool maintains its local configuration in $SPLUNK_HOME/etc/system/local. To share configurations across the pool, set them up in shared storage, as described in "Configure search head pooling".
Keep the clocks on your search heads and shared storage server in sync, using NTP (Network Time Protocol) or a similar mechanism. If the clocks are out of sync by more than a few seconds, you can end up with search failures or premature expiration of search artifacts.
On each search head, the user account Splunk runs as must have read/write permissions to the files on the shared storage server.
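One quick way to spot a permissions problem is to test access to the shared storage mount while logged in as the user that Splunk runs as. The sketch below is a minimal check under that assumption; the function name and the example mount point are hypothetical, and the check covers only the top-level directory, not every file beneath it.

```python
import os


def check_shared_storage_access(path):
    """Return a list of access problems for the current user at path.

    An empty list means the directory exists and is readable, traversable,
    and writable by the user running this script.
    """
    if not os.path.isdir(path):
        return [f"{path} is not a directory or is not mounted"]
    problems = []
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable by the current user")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


if __name__ == "__main__":
    # Hypothetical mount point; replace with your shared storage location.
    for problem in check_shared_storage_access("/mnt/search-head-pool"):
        print(problem)
```

Run the script on each search head as the Splunk user, because the result depends on who executes it.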
A large percentage of search head pooling issues boil down to insufficient performance.
When deploying or investigating a search head pooling environment, it's important to consider these factors:
- Storage: The storage backing the pool must be able to handle a very high number of IOPS. IOPS under 1000 will probably never work well.
- Network: The communication path between the backing store and the search heads must be high bandwidth and extremely low latency. This probably means your storage system should be on the same switch as your search heads. WAN links are not going to work.
- Server Parallelism: Because Splunk search results in a large number of processes requesting a large number of files, the parallelism in the system must be high. This can require tuning the NFS server to handle a larger number of requests in parallel.
- Client Parallelism: The client operating system must be able to handle a significant number of requests at the same time.
To validate an environment, a typical approach would be:
- Use a storage benchmarking tool, such as Bonnie++, while the file store is not in use to validate that the IOPS provided are robust.
- Use network testing methods to determine that the round-trip time between search heads and the storage system is on the order of 10 ms.
- Perform known simple tasks such as creating a million files and then deleting them.
- Assuming the above tests have not shown any weaknesses, perform some IO load generation or run the real Splunk load while gathering NFS stat data to see what's happening with the NFS requests.
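The "create many files, then delete them" task from the list above can be scripted. The following is a minimal sketch, not a replacement for a benchmarking tool such as Bonnie++: it creates empty files in a scratch subdirectory of the target directory, deletes them, and reports how long each phase took. The function name, the scratch directory prefix, and the example mount point are assumptions.

```python
import os
import tempfile
import time


def time_create_delete(directory, count=1000):
    """Create then delete `count` empty files; return (create_s, delete_s)."""
    # Work in a scratch subdirectory so existing files are never touched.
    workdir = tempfile.mkdtemp(prefix="pool_io_test_", dir=directory)
    paths = [os.path.join(workdir, f"f{i:07d}") for i in range(count)]

    start = time.perf_counter()
    for path in paths:
        with open(path, "w"):
            pass
    created = time.perf_counter()

    for path in paths:
        os.remove(path)
    deleted = time.perf_counter()

    os.rmdir(workdir)
    return created - start, deleted - created


if __name__ == "__main__":
    # Hypothetical mount point; start with a small count and raise it
    # toward one million once the smaller runs look healthy.
    create_s, delete_s = time_create_delete("/mnt/search-head-pool", count=10000)
    print(f"create: {create_s:.2f}s  delete: {delete_s:.2f}s")
```

Running the same script against local disk gives a baseline for comparison with the shared storage.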
NFS client concurrency limits can cause search timeouts or slow search behavior
The search performance in a search head pool is a function of the throughput of the shared storage and the search workload. The combined effect of concurrent search users and concurrent scheduled searches determines the total IOPS that the shared volume needs to support. IOPS requirements also vary by the kind of searches run. To adequately provision a device to be shared between search heads, you need to know the number of concurrent users submitting searches and the number of jobs and apps that will run simultaneously.
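The sizing reasoning above can be written as simple arithmetic. The sketch below is only a back-of-the-envelope model: it assumes that each ad hoc search and each scheduled search contributes a fixed number of IOPS, which is a simplification, and the per-search figures in the example are placeholders that you must replace with values measured in your own environment.

```python
def estimate_required_iops(concurrent_adhoc, concurrent_scheduled,
                           iops_per_adhoc, iops_per_scheduled):
    """Estimate the total IOPS the shared volume must sustain.

    All four inputs are assumptions supplied by the caller; the per-search
    IOPS values in particular should come from measurement, not guesswork.
    """
    return (concurrent_adhoc * iops_per_adhoc
            + concurrent_scheduled * iops_per_scheduled)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    total = estimate_required_iops(
        concurrent_adhoc=20,
        concurrent_scheduled=10,
        iops_per_adhoc=40,
        iops_per_scheduled=60,
    )
    print(f"Estimated peak requirement: {total} IOPS")
```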
If searches are timing out or running slowly, you might be exhausting the maximum number of concurrent requests supported by the NFS client. To solve this problem, increase your client concurrency limit. For example, on a Linux NFS client, adjust the
NFS latency with a large user count can cause Splunk configuration access latency or slow dispatch reaping
Splunk synchronizes the search head pool storage configuration state with the in-memory state when it detects changes. Essentially, it reads the configuration into memory when it detects updates. When dealing either with overloaded search pool storage or with large numbers of users, apps, and configuration files, this synchronization process can reduce performance. To mitigate this, you can increase the minimum interval between reads, as discussed in "Select timing for configuration refresh".
Warning about unique serverName attribute
Each search head in the pool must have a unique serverName attribute. Splunk validates this condition when each search head starts. If it finds a problem, it generates this error message:
serverName "<xxx>" has already been claimed by a member of this search head pool in <full path to pooling.ini on shared storage> There was an error validating your search head pooling configuration. For more information, run 'splunk pooling validate'
The most common cause of this error is that another search head in the pool is already using the current search head's serverName. To fix the problem, change the current search head's serverName attribute in .
There are a few other conditions that also can generate this error:
- The current search head's serverName has been changed.
- The current search head's GUID has been changed. This is usually due to
To fix these problems, run:
splunk pooling replace-member
This updates the pooling.ini file with the current search head's serverName->GUID mapping, overwriting any previous mapping.
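To make the serverName->GUID idea concrete, the sketch below models the mapping as a plain dictionary. It illustrates the concept only: it does not read or write pooling.ini, it is not Splunk's implementation, and the function names, the return values, and the decision to drop an older entry that carries the same GUID are all assumptions.

```python
def check_claim(claims, server_name, guid):
    """Classify a search head against recorded serverName -> GUID claims.

    Returns "ok" if the name is recorded with this GUID, "new" if the name
    is not recorded, and "conflict" if it is recorded with another GUID.
    """
    recorded = claims.get(server_name)
    if recorded is None:
        return "new"
    if recorded == guid:
        return "ok"
    return "conflict"


def replace_member(claims, server_name, guid):
    """Return a copy of claims with this head's mapping overwritten.

    Assumption: any older entry using the same name or the same GUID is
    discarded, so each name and each GUID appears at most once.
    """
    updated = {name: g for name, g in claims.items()
               if name != server_name and g != guid}
    updated[server_name] = guid
    return updated


if __name__ == "__main__":
    # Hypothetical names and GUIDs.
    claims = {"searchhead1": "GUID-A", "searchhead2": "GUID-B"}
    print(check_claim(claims, "searchhead1", "GUID-C"))
    print(replace_member(claims, "searchhead1", "GUID-C"))
```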
Artifacts and incorrectly displayed items in Splunk Web after upgrade
When upgrading pooled search heads, you must copy all updated apps, including those that ship with Splunk (such as the Search app and the data preview feature, which is implemented as an app), to the search head pool's shared storage after the upgrade is complete. If you do not, you might see artifacts or other incorrectly displayed items in Splunk Web.
To fix the problem, copy all updated apps from an upgraded search head to the shared storage for the search head pool, taking care to exclude the local sub-directory of each app.
Important: Excluding the local sub-directory of each app from the copy process prevents the overwriting of configuration files on the shared storage with local copies of configuration files.
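The copy step can be scripted so that the local sub-directory of each app is skipped consistently. The following is a minimal sketch using Python's shutil.copytree (the dirs_exist_ok argument requires Python 3.8 or later). The function name and both example paths are assumptions; substitute the apps directory of your upgraded search head and the apps directory on your shared storage.

```python
import os
import shutil


def copy_apps_excluding_local(src_apps, dst_apps):
    """Copy every app under src_apps into dst_apps, skipping <app>/local."""
    src_apps = os.path.abspath(src_apps)

    def ignore(directory, names):
        # Skip `local` only when it sits directly inside an app directory,
        # that is, when the directory being visited is a child of src_apps.
        parent = os.path.dirname(os.path.abspath(directory))
        if parent == src_apps and "local" in names:
            return {"local"}
        return set()

    # dirs_exist_ok lets the copy merge into apps already on shared storage.
    shutil.copytree(src_apps, dst_apps, ignore=ignore, dirs_exist_ok=True)


if __name__ == "__main__":
    # Hypothetical paths for illustration.
    copy_apps_excluding_local(
        "/opt/splunk/etc/apps",
        "/mnt/search-head-pool/etc/apps",
    )
```

Because the ignore function matches only a top-level local directory in each app, a nested directory that happens to be named local elsewhere in an app is still copied.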
Once the apps have been copied, restart Splunk on all search heads in the pool.
Distributed search error messages
This table lists some of the more common search-time error messages associated with distributed search:
||The specified remote peer is not available.|
||The specified remote peer is not a Splunk server.|
||The specified remote peer is using a duplicate license.|
||Authentication with the specified remote peer failed.|
This documentation applies to the following versions of Splunk® Enterprise: 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6, 5.0.7, 5.0.8, 5.0.9, 5.0.10, 5.0.11, 5.0.12, 5.0.13, 5.0.14, 5.0.15, 5.0.16, 5.0.17, 5.0.18