Use static captain to recover from loss of majority

A cluster normally uses a dynamic captain, which can change over time. The dynamic captain is chosen by periodic elections, in which a majority of all cluster members must agree on the captain. See "Captain election."

Consequently, if a cluster loses the majority of its members, it cannot elect a captain and cannot continue to function. You can work around this situation by reconfiguring the cluster to use a static captain in place of the dynamic captain.
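To make the majority requirement concrete, here is a minimal shell sketch of the arithmetic. It assumes that "majority" means strictly more than half of all configured members, not just of those currently running. The function names are illustrative and are not part of the Splunk CLI.

```shell
#!/bin/sh
# Illustrative helpers only; these are not Splunk commands.

# Smallest member count that is strictly more than half of the total.
majority_needed() {
    total=$1
    echo $(( total / 2 + 1 ))
}

# Prints "yes" if the running members form a majority of the whole cluster,
# and "no" otherwise.
can_elect() {
    total=$1
    running=$2
    if [ "$running" -ge "$(majority_needed "$total")" ]; then
        echo yes
    else
        echo no
    fi
}

can_elect 5 3   # three of five members running: prints yes
can_elect 5 2   # two of five members running: prints no
```

Under this assumption, a five-member cluster that drops to two running members is the kind of case where a static captain becomes relevant.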

A static captain does not change over time, and the cluster does not conduct an election to select it. Instead, you designate a member as the static captain, and that member remains the captain until you designate another member as captain.

Shortcomings of the static captain

The static captain has one fundamental shortcoming: It becomes a single point of failure for the cluster. If the captain fails, the cluster fails. The cluster cannot, on its own, replace a static captain. Rather, manual intervention is necessary.

Because of this shortcoming, Splunk recommends that you use the static captain capability only for disaster recovery. Specifically, you can employ the static captain to recover from a loss of majority, which renders the cluster incapable of electing a dynamic captain.

In addition, the static captain does not check whether enough members are running to meet the replication factor. This means that, under some conditions, you might not have a full complement of search artifact copies.

Note: Employ a static captain only when absolutely necessary. While converting to a static captain is usually simple and fast, reverting to a dynamic captain later is somewhat more involved.

Use cases for static captain

Here are some situations where it makes sense to switch to a static captain:

A single-site cluster loses the majority of its members. You can revive the cluster by designating one of its members as a static captain.

The cluster is deployed across two sites. The majority site fails. Without a majority, the members in the second, minority site cannot elect a captain. You can revive the cluster by designating one of the members on the minority site as a static captain.

In all cases, once the precipitating issue has been resolved, you should revert the cluster to use a dynamic captain.

Caution: Do not use the static captain to handle a network interruption that stops communication between two sites. During a network interruption, the site with a majority of members continues to function as usual, because it can elect a dynamic captain as necessary. However, the site with a minority of members cannot elect a captain and therefore will not function as a cluster. If you attempt to revive the minority site by configuring its members to use a static captain, you will then have two clusters, one with a dynamic captain and the other with a static captain. When the network heals, you will not be able to reconcile the configuration changes between the sites.
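The following shell sketch illustrates why only one side of a network interruption keeps working. It assumes a two-site cluster and that a site needs strictly more than half of all members to elect a captain. The function name and the site sizes are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reports whether each site of a two-site cluster holds
# a majority of ALL members while the sites cannot reach each other.
partition_report() {
    site1=$1
    site2=$2
    needed=$(( (site1 + site2) / 2 + 1 ))
    for pair in "site1:$site1" "site2:$site2"; do
        name=${pair%%:*}     # text before the colon
        count=${pair##*:}    # text after the colon
        if [ "$count" -ge "$needed" ]; then
            echo "$name: $count members, can elect a dynamic captain"
        else
            echo "$name: $count members, cannot elect a captain"
        fi
    done
}

# Example: three members on one site, two on the other.
partition_report 3 2
```

In this example the three-member site keeps its dynamic captain, so forcing a static captain onto the two-member site would produce the two divergent clusters described in the caution above.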

Switch to a static captain

To switch to a static captain, reconfigure each cluster member to use a static captain:

1. On the member that you want to designate as captain, run this CLI command:

splunk edit shcluster-config -mode captain -captain_uri <URI>:<management_port> -election false

2. On each non-captain member, run this CLI command:

splunk edit shcluster-config -mode member -captain_uri <URI>:<management_port> -election false

Note the following:

The -mode parameter specifies whether the instance should function as a captain or solely as a member. The captain always functions as both captain and a member.
The -captain_uri parameter specifies the URI and management port of the captain instance.
The -election parameter indicates the type of captain that this cluster uses. By setting -election to "false", you indicate that the cluster uses a static captain.

You do not need to restart the captain or any other members after running these commands. The captain immediately takes control of the cluster.
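As a planning aid, the sketch below prints the command to run on each member when switching to a static captain. It only echoes text and does not execute anything. The host names and the helper function are hypothetical placeholders; the splunk commands themselves are the ones shown in the steps above.

```shell
#!/bin/sh
# Hypothetical member URIs; replace with your own <URI>:<management_port> values.
CAPTAIN="https://sh1.example.com:8089"
MEMBERS="https://sh1.example.com:8089 https://sh2.example.com:8089 https://sh3.example.com:8089"

# Prints (does not run) the command for each member. The first argument is
# the member chosen as static captain; the rest are all surviving members.
print_static_commands() {
    captain=$1
    shift
    for member in "$@"; do
        if [ "$member" = "$captain" ]; then
            mode=captain
        else
            mode=member
        fi
        echo "on $member: splunk edit shcluster-config -mode $mode -captain_uri $captain -election false"
    done
}

# $MEMBERS is intentionally unquoted so that it splits into one argument per member.
print_static_commands "$CAPTAIN" $MEMBERS
```

Every member, including the captain itself, receives the same -captain_uri value; only the -mode value differs.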

To confirm that the cluster is now operating with a static captain, run this CLI command from any member:

splunk show shcluster-status -auth <username>:<password>

The election flag will be set to 0.

Revert to the dynamic captain

When the precipitating situation has resolved, you should revert the cluster to control by a single, dynamic captain. To switch to a dynamic captain, you reconfigure all the members that you previously configured for static captain. The exact procedure depends on the scenario you are recovering from.

This topic provides reversion procedures for the two main scenarios:

Single-site cluster with loss of majority, where you converted the remaining members to use static captain. Once the cluster regains a majority, you should convert the members back to dynamic.

Two-site cluster, where the majority site went down and you converted the members on the minority site to use static captain. Once the majority site returns, you should convert all members to dynamic.

Return single-site cluster to dynamic captain

In the scenario of a single-site cluster with loss of majority, you should revert to dynamic mode once the cluster regains its majority:

1. As members come back online, convert them one by one to point to the static captain:

splunk edit shcluster-config -election false -mode member -captain_uri <URI>:<management_port>

Note the following:

The -captain_uri parameter specifies the URI and management port of the static captain instance.

You do not need to restart the member after running this command.

As you point each rejoining member to the static captain, it attempts to download the replication delta. If the purge limit has been exceeded, the system will prompt you to perform a manual resync, as explained in "How the update proceeds."

Caution: Until the remaining steps of this procedure are complete, users should not make any configuration changes.

2. Once the cluster has regained its majority, convert all members back to dynamic captain use. Convert the current, static captain last. To accomplish this, run this command on each member:

splunk edit shcluster-config -election true -mgmt_uri <URI>:<management_port>

Note the following:

The -election parameter indicates the type of captain that this cluster uses. By setting -election to "true", you indicate that the cluster uses a dynamic captain.
The -mgmt_uri parameter specifies the URI and management port for this member instance. You must use the fully qualified domain name. This is the same value that you specified when you first deployed the member with the splunk init command.

You do not need to restart the member after running this command.

3. Bootstrap one of the members. This member then becomes the first dynamic captain. It is recommended that you bootstrap the member that was previously serving as the static captain.

splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>,..." -auth <username>:<password>

For information on these parameters, see "Bring up the cluster captain."
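The ordering in steps 2 and 3 can be sketched as a small shell helper that prints the commands in sequence, with the static captain converted last and then bootstrapped. It only echoes text. The host names and the function are hypothetical, and the sketch assumes that -servers_list should name every member, including the one being bootstrapped; check "Bring up the cluster captain" before relying on that.

```shell
#!/bin/sh
# Hypothetical member URIs (fully qualified domain names); replace with your own.
STATIC_CAPTAIN="https://sh1.example.com:8089"
MEMBERS="https://sh1.example.com:8089 https://sh2.example.com:8089 https://sh3.example.com:8089"

# Prints (does not run) the reversion commands in order:
# non-captain members first, the static captain last, then the bootstrap.
print_revert_commands() {
    captain=$1
    shift
    servers=""
    for member in "$@"; do
        # Build the comma-separated list of all members (assumption: captain included).
        servers="${servers:+$servers,}$member"
        [ "$member" = "$captain" ] && continue
        echo "on $member: splunk edit shcluster-config -election true -mgmt_uri $member"
    done
    echo "on $captain: splunk edit shcluster-config -election true -mgmt_uri $captain"
    echo "on $captain: splunk bootstrap shcluster-captain -servers_list \"$servers\" -auth <username>:<password>"
}

# $MEMBERS is intentionally unquoted so that it splits into one argument per member.
print_revert_commands "$STATIC_CAPTAIN" $MEMBERS
```

Note that each member passes its own URI to -mgmt_uri, unlike -captain_uri in the static configuration, which is the same on every member.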

Return two-site cluster to dynamic captain

In the scenario of a two-site cluster with loss of the majority site, you should revert to dynamic mode once the majority site comes back online:

1. When the majority site comes back online, convert its members to use the static captain. Point each majority site member to the static captain:

splunk edit shcluster-config -election false -mode member -captain_uri <URI>:<management_port>

Note the following:

The -captain_uri parameter specifies the URI and management port of the static captain instance.

You do not need to restart the member after running this command.

As you point each rejoining member to the static captain, it attempts to download the replication delta. If the purge limit has been exceeded, the system will prompt you to perform a manual resync, as explained in "How the update proceeds."

2. Wait for all the majority-site members to receive the replicated configurations from the static captain. This typically takes a few minutes.

Caution: Until the remaining steps of this procedure are complete, users should not make any configuration changes.

3. Convert all members back to dynamic captain use. Convert the current, static captain last. To accomplish this, run this command on each member:

splunk edit shcluster-config -election true -mgmt_uri <URI>:<management_port>

Note the following:

The -election parameter indicates the type of captain that this cluster uses. By setting -election to "true", you indicate that the cluster uses a dynamic captain.
The -mgmt_uri parameter specifies the URI and management port for this member instance. You must use the fully qualified domain name. This is the same value that you specified when you first deployed the member with the splunk init command.

You do not need to restart the member after running this command.

4. Bootstrap one of the members. This member then becomes the first dynamic captain. It is recommended that you bootstrap the member that was previously serving as the static captain.

splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>,..." -auth <username>:<password>

For information on these parameters, see "Bring up the cluster captain."
