Splunk® Enterprise

Distributed Search

Use static captain to recover from loss of majority

A cluster normally uses a dynamic captain, which can change over time. The dynamic captain is chosen by periodic elections, in which a majority of all cluster members must agree on the captain. See "Captain election."

Therefore, if a cluster loses the majority of its members, it cannot elect a captain and cannot continue to function. You can work around this situation by reconfiguring the cluster to use a static captain in place of the dynamic captain.
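The majority requirement can be illustrated with a little arithmetic. The following is a minimal POSIX shell sketch, assuming that "majority" means strictly more than half of all configured members, including members that are down. The function names majority_needed and can_elect are hypothetical helpers for illustration and are not part of the Splunk CLI.

```shell
# majority_needed TOTAL
# Prints the smallest member count that is more than half of TOTAL.
majority_needed() {
    echo $(( $1 / 2 + 1 ))
}

# can_elect TOTAL RUNNING
# Succeeds if RUNNING members form a majority of TOTAL configured members.
can_elect() {
    [ "$2" -ge "$(majority_needed "$1")" ]
}

# Example: a five-member cluster with only two members still running.
if can_elect 5 2; then
    echo "election possible"
else
    echo "no majority: election not possible"
fi
```

Under this assumption, a five-member cluster needs three running members to elect a captain, so two surviving members cannot hold an election.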

A static captain does not change over time. Unlike a dynamic captain, the cluster does not conduct an election to select the static captain. Instead, you designate a member as the static captain, and that member remains the captain until you designate another member as captain.

Shortcomings of the static captain

The static captain has one fundamental shortcoming: It becomes a single point of failure for the cluster. If the captain fails, the cluster fails. The cluster cannot, on its own, replace a static captain. Rather, manual intervention is necessary.

Because of this shortcoming, Splunk recommends that you use the static captain capability only for disaster recovery. Specifically, you can employ the static captain to recover from a loss of majority, which renders the cluster incapable of electing a dynamic captain.

In addition, the static captain does not check whether enough members are running to meet the replication factor. This means that, under some conditions, you might not have a full complement of search artifact copies.

Note: Employ the static captain only when absolutely necessary. While the process of converting to a static captain is usually simple and fast, the process of later reverting to a dynamic captain is somewhat more involved.

Use cases for static captain

Here are some situations where it makes sense to switch to a static captain:

  • A single-site cluster loses the majority of its members. You can revive the cluster by designating one of its members as a static captain.
  • The cluster is deployed across two sites. The majority site fails. Without a majority, the members in the second, minority site cannot elect a captain. You can revive the cluster by designating one of the members on the minority site as a static captain.

In all cases, once the precipitating issue has been resolved, you should revert the cluster to use a dynamic captain.

Caution: Do not use the static captain to handle a network interruption that stops communication between two sites. During a network interruption, the site with a majority of members continues to function as usual, because it can elect a dynamic captain as necessary. However, the site with a minority of members cannot elect a captain and therefore will not function as a cluster. If you attempt to revive the minority site by configuring its members to use a static captain, you will then have two clusters, one with a dynamic captain and the other with a static captain. When the network heals, you will not be able to reconcile the configuration changes between the sites.

Switch to a static captain

To switch to a static captain, reconfigure each cluster member to use a static captain:

1. On the member that you want to designate as captain, run this CLI command:

splunk edit shcluster-config -mode captain -captain_uri <URI>:<management_port> -election false

2. On each non-captain member, run this CLI command:

splunk edit shcluster-config -mode member -captain_uri <URI>:<management_port> -election false

Note the following:

  • The -mode parameter specifies whether the instance should function as a captain or solely as a member. The captain always functions as both a captain and a member.
  • The -captain_uri parameter specifies the URI and management port of the captain instance.
  • The -election parameter indicates the type of captain that this cluster uses. By setting -election to "false", you indicate that the cluster uses a static captain.

You do not need to restart the captain or any other members after running these commands. The captain immediately takes control of the cluster.
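When several members survive, it can help to write out the exact command for each one before running anything. The following POSIX shell sketch prints the commands from the two steps above without running them. The function name print_static_captain_commands and the example.com host names are hypothetical. The sketch also assumes that each URI takes the form https://<host>:<management_port>.

```shell
# print_static_captain_commands CAPTAIN_URI MEMBER_URI...
# Prints (does not run) the command to issue on each listed member.
# The member whose URI equals CAPTAIN_URI gets -mode captain,
# every other member gets -mode member.
print_static_captain_commands() {
    captain_uri=$1
    shift
    for member in "$@"; do
        if [ "$member" = "$captain_uri" ]; then
            mode=captain
        else
            mode=member
        fi
        echo "on $member: splunk edit shcluster-config -mode $mode -captain_uri $captain_uri -election false"
    done
}

# Hypothetical surviving members; sh1 is chosen as the static captain.
print_static_captain_commands \
    https://sh1.example.com:8089 \
    https://sh1.example.com:8089 \
    https://sh2.example.com:8089
```

Printing rather than executing keeps the sketch safe to try. Each command still has to be run locally on the member it names.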

To confirm that the cluster is now operating with a static captain, run this CLI command from any member:

splunk show shcluster-status -auth <username>:<password>

The dynamic_election flag will be set to 0.
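To check the flag from a script, you can filter the status output. The following POSIX shell sketch assumes that the status output lists each flag on its own line in a "name : value" layout. That layout is an assumption and might differ between versions. The function name status_flag is a hypothetical helper and is not part of the Splunk CLI.

```shell
# status_flag NAME
# Reads status text on stdin and prints the first value found for NAME.
# Assumes lines of the form "NAME : value" (or "NAME = value"),
# optionally indented.
status_flag() {
    sed -n "s/^[[:space:]]*${1}[[:space:]]*[=:][[:space:]]*\([^[:space:]]*\).*\$/\1/p" | head -n 1
}

# Intended use (not run here):
#   splunk show shcluster-status -auth <username>:<password> | status_flag dynamic_election
```

If the helper prints nothing, the assumed layout does not match the output of your version, and you should inspect the output directly.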

Revert to the dynamic captain

When the precipitating situation has resolved, you should revert the cluster to control by a single, dynamic captain. To switch to dynamic captain, you reconfigure all the members that you previously configured for static captain. How exactly you do this depends on the type of scenario you are recovering from.

This topic provides reversion procedures for the two main scenarios:

  • Single-site cluster with loss of majority, where you converted the remaining members to use static captain. Once the cluster regains a majority, you should convert the members back to dynamic.
  • Two-site cluster, where the majority site went down and you converted the members on the minority site to use static captain. Once the majority site returns, you should convert all members to dynamic.

Return single-site cluster to dynamic captain

In the scenario of a single-site cluster with loss of majority, you should revert to dynamic mode once the cluster regains its majority:

1. As members come back online, convert them one by one to point to the static captain:

splunk edit shcluster-config -election false -mode member -captain_uri <URI>:<management_port> 

Note the following:

  • The -captain_uri parameter specifies the URI and management port of the static captain instance.

You do not need to restart the member after running this command.

As you point each rejoining member to the static captain, it attempts to download the replication delta. If the purge limit has been exceeded, the system will prompt you to perform a manual resync, as explained in "How the update proceeds."

Caution: During the time that it takes for the remaining steps of this procedure to complete, your users should not make any configuration changes.

2. Once the cluster has regained its majority, convert all members back to dynamic captain use. Convert the current, static captain last. To accomplish this, run this command on each member:

splunk edit shcluster-config -election true -mgmt_uri <URI>:<management_port>

Note the following:

  • The -election parameter indicates the type of captain that this cluster uses. By setting -election to "true", you indicate that the cluster uses a dynamic captain.
  • The -mgmt_uri parameter specifies the URI and management port for this member instance. You must use the fully qualified domain name. This is the same value that you specified when you first deployed the member with the splunk init command.

You do not need to restart the member after running this command.
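The ordering rule in this step (convert the static captain last) can be sketched as a small POSIX shell helper. The function name revert_order and the example.com host names are hypothetical. The sketch only prints the commands and assumes that each member's -mgmt_uri is the same URI that identifies it in the list.

```shell
# revert_order STATIC_CAPTAIN_URI MEMBER_URI...
# Prints the members one per line, with the static captain moved to the end.
revert_order() {
    captain=$1
    shift
    for member in "$@"; do
        [ "$member" = "$captain" ] || echo "$member"
    done
    echo "$captain"
}

# Hypothetical three-member cluster; sh1 is the current static captain.
revert_order https://sh1.example.com:8089 \
    https://sh1.example.com:8089 \
    https://sh2.example.com:8089 \
    https://sh3.example.com:8089 |
while read -r member; do
    echo "on $member: splunk edit shcluster-config -election true -mgmt_uri $member"
done
```

With these inputs, the commands for sh2 and sh3 are printed first and the command for sh1, the static captain, is printed last.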

3. Bootstrap one of the members. This member then becomes the first dynamic captain. It is recommended that you bootstrap the member that was previously serving as the static captain.

splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>,..." -auth <username>:<password>

For information on these parameters, see "Bring up the cluster captain."
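The -servers_list value is a single comma-separated string of member URIs. The following POSIX shell sketch builds that string from a list of members and prints the resulting bootstrap command. The function name join_servers and the example.com host names are hypothetical. The <username>:<password> placeholder is left as in the command above.

```shell
# join_servers MEMBER_URI...
# Joins its arguments into one comma-separated string.
join_servers() {
    list=""
    for uri in "$@"; do
        # Prefix a comma only when the list already has an entry.
        list="${list:+$list,}$uri"
    done
    echo "$list"
}

servers=$(join_servers \
    https://sh1.example.com:8089 \
    https://sh2.example.com:8089 \
    https://sh3.example.com:8089)

# Print (do not run) the bootstrap command for review.
echo "splunk bootstrap shcluster-captain -servers_list \"$servers\" -auth <username>:<password>"
```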

Return two-site cluster to dynamic captain

In the scenario of a two-site cluster with loss of the majority site, you should revert to dynamic mode once the majority site comes back online:

1. When the majority site comes back online, convert its members to use the static captain. Point each majority site member to the static captain:

splunk edit shcluster-config -election false -mode member -captain_uri <URI>:<management_port>

Note the following:

  • The -captain_uri parameter specifies the URI and management port of the static captain instance.

You do not need to restart the member after running this command.

As you point each rejoining member to the static captain, it attempts to download the replication delta. If the purge limit has been exceeded, the system will prompt you to perform a manual resync, as explained in "How the update proceeds."

2. Wait for all the majority-site members to get the replicated configs from the static captain. This typically takes a few minutes.

Caution: During the time that it takes for the remaining steps of this procedure to complete, your users should not make any configuration changes.

3. Convert all members back to dynamic captain use. Convert the current, static captain last. To accomplish this, run this command on each member:

splunk edit shcluster-config -election true -mgmt_uri <URI>:<management_port>

Note the following:

  • The -election parameter indicates the type of captain that this cluster uses. By setting -election to "true", you indicate that the cluster uses a dynamic captain.
  • The -mgmt_uri parameter specifies the URI and management port for this member instance. You must use the fully qualified domain name. This is the same value that you specified when you first deployed the member with the splunk init command.

You do not need to restart the member after running this command.

4. Bootstrap one of the members. This member then becomes the first dynamic captain. It is recommended that you bootstrap the member that was previously serving as the static captain.

splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>,..." -auth <username>:<password>

For information on these parameters, see "Bring up the cluster captain."


This documentation applies to the following versions of Splunk® Enterprise: 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11, 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.5.10, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 6.6.6, 6.6.7, 6.6.8, 6.6.9, 6.6.10, 6.6.11, 6.6.12, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.3.0


Comments

Thanks for this guide, it's been very helpful in recovering our instance. One note I thought I would add in case anyone else runs into it - I found that we did actually need to restart our members following pointing them to a static captain, or else they were not able to properly run the manual resync due to a network error. After restarting the members the resync was able to run afterward.

Briancronrath
August 14, 2017
