Handle failure of a search head cluster member
When a member fails, the cluster can usually absorb the failure and continue to function normally.
When a failed member restarts and rejoins the cluster, the cluster can frequently complete the process automatically. In some cases, however, your intervention is necessary.
When a member fails
If a search head cluster member fails for any reason and leaves the cluster unexpectedly, the cluster can usually continue to function without interruption:
- The cluster's high availability features ensure that the cluster can continue to function as long as a majority (more than half) of the members are still running. For example, if you have a cluster configured with seven members, the cluster will function as long as four or more members remain up. If a majority of members fail, the cluster cannot successfully elect a new captain, which results in failure of the entire cluster. See "Search head cluster captain."
- All search artifacts resident on the failed member remain available through other search heads, as long as the number of members that fail is less than the replication factor. If the number of failed members equals or exceeds the replication factor, it is likely that some search artifacts will no longer be available to the remaining members.
- If the failed member was serving as captain, the remaining nodes elect another member as captain. Since members share configurations, the new captain is immediately fully functional.
- If you are employing a load balancer in front of the search heads, the load balancer should automatically reroute users on the failed member to an available search head.
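The two availability rules in the list above come down to simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical and are not part of any Splunk API. It assumes that "majority" means strictly more than half of the configured members.

```python
def has_majority(total_members: int, running_members: int) -> bool:
    """Return True if strictly more than half of the configured members are up.

    Assumption: the majority is counted against the configured cluster size,
    not against the members that happen to be reachable.
    """
    if total_members <= 0 or not 0 <= running_members <= total_members:
        raise ValueError("member counts out of range")
    return 2 * running_members > total_members


def artifacts_fully_available(failed_members: int, replication_factor: int) -> bool:
    """Return True if fewer members have failed than the replication factor."""
    if failed_members < 0 or replication_factor <= 0:
        raise ValueError("counts out of range")
    return failed_members < replication_factor
```

For the seven-member example, `has_majority(7, 4)` is `True` and `has_majority(7, 3)` is `False`. With a replication factor of 3, `artifacts_fully_available(2, 3)` is `True`, while three simultaneous failures put some artifacts at risk.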
When the member rejoins the cluster
A failed member automatically rejoins the cluster if its instance successfully restarts. When this occurs, its configurations must be updated immediately so that they match those of the other cluster members. The member needs updates for two sets of configurations:
- The replicated changes, which it gets from the captain. See "Updating the replicated changes."
- The deployed changes, which it gets from the deployer. See "Updating the deployed changes."
See "How configuration changes propagate across the search head cluster" for information on how configurations are shared among cluster members.
Updating the replicated changes
When the member rejoins the cluster, it contacts the captain to request the set of intervening replicated changes. What happens next depends on whether the member and the captain still share a common commit in their change histories:
- If the captain and the member still share a common commit, the member automatically downloads the intervening changes from the captain and applies them to its pre-offline configuration. The member also pushes its intervening changes, if any, to the captain, which replicates them to the other members.
- If the captain and the member do not share a common commit, they cannot properly sync without your intervention. To update the member's configuration, you must instruct the member to download the entire configuration tarball from the captain, as described in "How the update proceeds." The tarball overwrites the member's existing set of configurations, causing it to lose any local changes.
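The two cases above can be modeled as a small decision function. This is a conceptual sketch, not Splunk's implementation: it assumes that each change history is a list of commit IDs ordered oldest to newest, and the names `latest_common_commit` and `plan_sync` are hypothetical.

```python
from typing import Dict, List, Optional


def latest_common_commit(member: List[str], captain: List[str]) -> Optional[str]:
    """Return the newest commit ID present in both histories, or None."""
    captain_ids = set(captain)
    for commit in reversed(member):  # walk the member history newest first
        if commit in captain_ids:
            return commit
    return None


def plan_sync(member: List[str], captain: List[str]) -> Dict[str, object]:
    """Decide between an automatic delta sync and a manual resync."""
    common = latest_common_commit(member, captain)
    if common is None:
        # No shared commit: the member cannot sync without intervention.
        return {"action": "manual-resync", "pull": [], "push": []}
    # Changes after the common commit travel in each direction.
    pull = captain[captain.index(common) + 1:]
    push = member[member.index(common) + 1:]
    return {"action": "delta-sync", "pull": pull, "push": push}
```

For example, a member history of `["a", "b", "m1"]` against a captain history of `["a", "b", "c", "d"]` yields a delta sync that pulls `c` and `d` and pushes `m1`. If the captain has already purged `a` and `b`, the histories share nothing and the plan is a manual resync.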
Changes are purged from the change history over time, based on configurable purge limits.
The purging of the configuration change history is determined by these attributes in server.conf:
- conf_replication_purge.eligibile_count. Its default is 20,000 changes.
- conf_replication_purge.eligibile_age. Its default is one day.
When both limits have been exceeded on a member, the member begins to purge the change history, starting with the oldest changes.
For more information on purge limit attributes, see the server.conf specification file.
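One way to read the "both limits" rule is sketched below in Python. This is an interpretation, not the product's actual algorithm: it assumes that a change is purged only while the history holds more changes than the count limit and its oldest change is older than the age limit. The `Change` type and `purge_history` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List

ELIGIBLE_COUNT = 20_000            # default count limit stated above
ELIGIBLE_AGE_SECONDS = 24 * 3600   # default age limit stated above (one day)


@dataclass
class Change:
    commit_id: str
    timestamp: float  # seconds since the epoch


def purge_history(history: List[Change], now: float,
                  eligible_count: int = ELIGIBLE_COUNT,
                  eligible_age: float = ELIGIBLE_AGE_SECONDS) -> List[Change]:
    """Drop the oldest changes while BOTH limits are exceeded.

    `history` is assumed to be ordered oldest to newest. A new list is
    returned and the input is left untouched.
    """
    kept = list(history)
    while len(kept) > eligible_count and now - kept[0].timestamp > eligible_age:
        kept.pop(0)  # purge starts with the oldest change
    return kept
```

Under this reading, a history that is large but entirely recent is not purged, and neither is a small history that contains old changes. The longer a member stays offline, the more likely it is that the commit it last shared with the captain has been purged.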
How the update proceeds
Upon rejoining the cluster, the member attempts to apply the set of intervening replicated changes from the captain. If the set exceeds the purge limits and the member and captain no longer share a common commit, a banner message appears on the member's UI, with text similar to the following:
Error pulling configurations from the search head cluster captain; consider performing a destructive configuration resync on this search head cluster member.
If this message appears, it means that the member is unable to update its configuration through the configuration change delta and must apply the entire configuration tarball. It does not do this automatically. Instead, it waits for your intervention.
You must then initiate the process of downloading and applying the tarball by running this CLI command on the member:
splunk resync shcluster-replicated-config
You do not need to restart the member after running this command.
Caution: This command overwrites the member's entire set of search-related configurations. Any local changes on the member are lost.
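Because the resync is destructive, an operator script might guard it behind an explicit confirmation. The Python sketch below is one possible wrapper, not an official tool. It assumes that the `splunk` binary is on the PATH of the member (pass the full path otherwise), and it does not handle authentication or any interactive prompts. The helper names are hypothetical; only the CLI command and the banner text come from this topic.

```python
import subprocess
from typing import Callable, List

# Fragment of the banner text quoted above.
RESYNC_HINT = "consider performing a destructive configuration resync"


def needs_manual_resync(banner_text: str) -> bool:
    """Return True if a banner message asks for a manual resync."""
    return RESYNC_HINT in banner_text


def build_resync_command(splunk_bin: str = "splunk") -> List[str]:
    """Build the CLI command documented above as an argument list."""
    return [splunk_bin, "resync", "shcluster-replicated-config"]


def run_resync(confirm: bool, splunk_bin: str = "splunk",
               runner: Callable[..., object] = subprocess.run) -> object:
    """Run the resync on this member, but only when explicitly confirmed."""
    if not confirm:
        raise RuntimeError("refusing to overwrite local configurations "
                           "without confirm=True")
    # check=True makes subprocess.run raise if the command exits non-zero.
    return runner(build_resync_command(splunk_bin), check=True)
```

Calling `run_resync(confirm=True)` on the affected member runs the command. The `runner` parameter exists so that the wrapper can be exercised in a dry run without touching a real instance.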
Updating the deployed changes
When the member rejoins the cluster, it automatically contacts the deployer for the latest configuration bundle. The member then applies any changes or additions that have been made since it last downloaded the bundle.
This documentation applies to the following versions of Splunk® Enterprise: 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11