Handle failure of a search head cluster member
When a member fails, the cluster can usually absorb the failure and continue to function normally.
When a failed member restarts and rejoins the cluster, the cluster can frequently complete the process automatically. In some cases, however, your intervention is necessary.
When a member fails
If a search head cluster member fails for any reason and leaves the cluster unexpectedly, the cluster can usually continue to function without interruption:
- The cluster's high availability features ensure that the cluster can continue to function as long as a majority (more than half) of the configured members are still running. For example, if you have a cluster configured with seven members, the cluster will function as long as four or more members remain up. If the cluster loses its majority, it cannot elect a new captain, which results in failure of the entire cluster. See Search head cluster captain.
- All search artifacts resident on the failed member remain available through other search heads, as long as the number of members that fail is less than the replication factor. If the number of failed members equals or exceeds the replication factor, it is likely that some search artifacts will no longer be available to the remaining members.
- If the failed member was serving as captain, the remaining members elect another member as captain. Since members share configurations, the new captain is immediately fully functional.
- If you are employing a load balancer in front of the search heads, the load balancer should automatically reroute users on the failed member to an available search head.
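The majority and replication factor rules above come down to simple arithmetic. The following is a minimal sketch of that arithmetic only. The function names are hypothetical and are not part of any Splunk API, and the sketch assumes that the majority is counted against the total number of configured members, including failed ones.

```python
def majority_needed(total_members: int) -> int:
    """Smallest member count that is more than half of the configured cluster."""
    return total_members // 2 + 1


def can_elect_captain(total_members: int, failed_members: int) -> bool:
    """True if the members still running form a majority of all configured members."""
    running = total_members - failed_members
    return running >= majority_needed(total_members)


def artifacts_remain_available(replication_factor: int, failed_members: int) -> bool:
    """True if fewer members have failed than the replication factor.

    Below this threshold, every artifact should still have a copy
    on at least one running member.
    """
    return failed_members < replication_factor


if __name__ == "__main__":
    # Seven-member cluster from the example above: four members must remain up.
    print(majority_needed(7))
    print(can_elect_captain(total_members=7, failed_members=3))
    print(can_elect_captain(total_members=7, failed_members=4))
```

Note that a six-member cluster also needs four running members, so losing exactly half of an even-sized cluster is enough to lose the majority.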
When the member rejoins the cluster
A failed member automatically rejoins the cluster if its instance restarts successfully. When this occurs, its configurations must be updated immediately so that they match those of the other cluster members. The member needs updates for two sets of configurations:
- The replicated changes, which it gets from the captain. See Updating the replicated changes.
- The deployed changes, which it gets from the deployer. See Updating the deployed changes.
See How configuration changes propagate across the search head cluster for information on how configurations are shared among cluster members.
Updating the replicated changes
When the member rejoins the cluster, it contacts the captain to request the set of intervening replicated changes. In some cases, the recovering member can automatically resync with the captain. However, if the member has been disconnected from the cluster for a long time, the resync process might require manual intervention.
See Replication synchronization issues for details on the recovery synchronization process, including how to perform a manual resync.
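When automatic resync fails, the manual resync is run on the recovering member with the Splunk CLI command "splunk resync shcluster-replicated-config". The following is a minimal sketch of a wrapper that builds that command. The helper names and the /opt/splunk install path are assumptions for illustration, and the sketch only returns the command unless dry_run is set to False. Consult Replication synchronization issues before running it, because a manual resync replaces the member's replicated configurations with the captain's copy.

```python
import subprocess
from typing import List


def build_resync_command(splunk_home: str = "/opt/splunk") -> List[str]:
    """Return the argv for a manual resync of replicated configurations.

    splunk_home is an assumed install location; adjust it for your deployment.
    """
    return [f"{splunk_home}/bin/splunk", "resync", "shcluster-replicated-config"]


def run_manual_resync(splunk_home: str = "/opt/splunk", dry_run: bool = True) -> List[str]:
    """Build the resync command and, unless dry_run is True, execute it.

    Run this on the recovering member, not on the captain.
    """
    cmd = build_resync_command(splunk_home)
    if not dry_run:
        # check=True raises CalledProcessError if the CLI exits with a nonzero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: show the command that would be executed.
    print(" ".join(run_manual_resync()))
```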
Updating the deployed changes
When the member rejoins the cluster, it automatically contacts the deployer for the latest configuration bundle. The member then applies any changes or additions that have been made since it last downloaded the bundle.
This documentation applies to the following versions of Splunk® Enterprise: 6.5.1612 (Splunk Cloud only), 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 6.6.6, 6.6.7, 6.6.8, 6.6.9, 6.6.10, 6.6.11, 6.6.12, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.3.0, 7.3.1