Use rolling restart
The splunk rolling-restart command performs a phased restart of all the peer nodes, so that the cluster as a whole can continue to perform its functions during the restart process.
The rolling restart helps to ensure that load-balanced forwarders sending data to the cluster always have a peer available to receive the data.
A rolling restart occurs under these circumstances:
- You initiate a rolling restart by invoking the splunk rolling-restart command.
- The master initiates a rolling restart. The master automatically initiates a rolling restart, when necessary, after distributing a configuration bundle to the peer nodes. For details on this process, see Distribute the configuration bundle.
How rolling restart works
During a rolling restart, approximately 10% (by default) of the peer nodes undergo restart simultaneously, until all peers in the cluster have restarted. If there are fewer than 10 peers in the cluster, one peer at a time undergoes restart. The master node orchestrates the restart process, sending a message to each peer when it is that peer's turn to restart.
The restart percentage tells the master how many restart slots to keep open during the rolling-restart process. For example, if the cluster has 30 peers and the restart percentage is set to the default of 10%, the master keeps three slots open for peers to restart. When the rolling-restart process begins, the master issues a restart message to three peers. As soon as each peer completes its restart and contacts the master, the master issues a restart message to another peer, and so on, until all peers have restarted. Under normal circumstances, in this example, there will always be three peers undergoing restart, until the end of the process.
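The slot arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function name restart_slots is hypothetical, and the rounding rule (round down, with a minimum of one slot) is an assumption chosen to match the two cases described above, not documented master behavior.

```python
def restart_slots(peer_count: int, percent: int = 10) -> int:
    """Estimate how many peers the master restarts at the same time.

    Assumption: the percentage is rounded down, and at least one
    slot is always kept open. The valid range of 1-100 is also assumed.
    """
    if peer_count < 1:
        raise ValueError("peer_count must be at least 1")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return max(1, peer_count * percent // 100)


print(restart_slots(30))       # 30 peers at the default 10%
print(restart_slots(8))        # fewer than 10 peers
print(restart_slots(30, 100))  # all peers at once
```

Under these assumptions, the 30-peer cluster gets three slots, a cluster with fewer than 10 peers gets one slot, and a setting of 100 opens a slot for every peer.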
Caution: If the peers are restarting slowly, due to inadequately provisioned machines or other reasons, the number of peers simultaneously undergoing restart can exceed the restart percentage. See Handle slow restarts.
At the end of the rolling restart, the master rebalances the cluster primary buckets. See Rebalance the indexer cluster primary buckets.
Here are a few things to note about the behavior of a rolling restart:
- The master restarts the peers in random order.
- The cluster enters maintenance mode for the duration of the rolling restart. This prevents unnecessary bucket fixup while a peer undergoes restart.
- During a rolling restart, there is no guarantee that the cluster will be fully searchable.
Specify a rolling restart
You invoke the splunk rolling-restart command from the master:
splunk rolling-restart cluster-peers
Specify the percentage of peers to restart at a time
By default, 10% of the peers restart at a time. The restart percentage is configurable through the percent_peers_to_restart attribute in the [clustering] stanza of server.conf. For convenience, you can configure this attribute with the CLI splunk edit cluster-config command.
For example, to cause 20% of the peers to restart simultaneously, run this command:
splunk edit cluster-config -percent_peers_to_restart 20
To cause all peers to restart immediately, run the command with a value of 100:
splunk edit cluster-config -percent_peers_to_restart 100
An immediate restart of all peers can be useful under certain circumstances, such as when no users are actively searching and no forwarders are actively sending data to the cluster. It minimizes the time required to complete the restart.
After changing the percent_peers_to_restart attribute, you must run the splunk rolling-restart command to initiate the actual restart.
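The two-step sequence (change the percentage, then start the restart) can be captured in a small Python helper that only builds the command strings and does not run them. The helper name restart_commands is hypothetical, and the accepted range of 1 to 100 is an assumption; the commands themselves are the ones shown in this section.

```python
import shlex
from typing import List


def restart_commands(percent: int) -> List[str]:
    """Return the CLI commands to run on the master, in order.

    Assumption: valid percentages are 1 through 100.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return [
        # Step 1: change the restart percentage.
        shlex.join(["splunk", "edit", "cluster-config",
                    "-percent_peers_to_restart", str(percent)]),
        # Step 2: changing the attribute does not restart anything,
        # so the rolling restart must be started explicitly.
        shlex.join(["splunk", "rolling-restart", "cluster-peers"]),
    ]


for command in restart_commands(20):
    print(command)
```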
Rolling restart on a multisite cluster
With a multisite cluster, you can specify that the rolling restart proceed with site awareness. That is, the master restarts all peers on one site before proceeding to restart the peers on the next site, and so on. This ensures that the cluster is always fully searchable, assuming that each site has a full set of primaries.
By default, the rolling restart process is not site aware. The master restarts peers without taking into consideration where the peers reside.
Invoke rolling restart on a multisite cluster
When you invoke the splunk rolling-restart command for a multisite cluster, you can specify these behaviors:
- Whether the restart should proceed in a site-aware fashion, through the -site-by-site parameter.
- The site order, through the -site-order parameter.
Here is the multisite version of the command:
splunk rolling-restart cluster-peers [-site-by-site true|false] [-site-order site<n>,site<n>, ...]
Note the following:
- The -site-by-site parameter specifies whether the restart is site-aware.
- The default for -site-by-site is false; that is, the master selects each peer randomly from across the entire cluster, without taking into consideration which site it resides on.
- The -site-order parameter specifies the site restart order.
- You must list all available sites when using -site-order.
- The -site-order parameter only has meaning if the -site-by-site parameter is set to true.
- If -site-order is not specified, the master selects sites at random.
For example, if you have a three-site cluster, you can specify a site-aware rolling restart with this command:
splunk rolling-restart cluster-peers -site-by-site true -site-order site1,site3,site2
The value of "true" for -site-by-site means that the master completes a rolling restart of all peers on one site before proceeding to the next site. The -site-order parameter causes the master to initiate the restarts in this order: site1, site3, site2. So, the master initiates a rolling restart on site1 and waits until it completes, then initiates a rolling restart on site3 and waits until it completes, and then initiates a rolling restart on site2.
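The parameter rules above lend themselves to a small pre-flight check before running the command. The following Python sketch builds the argument list and rejects combinations that the rules rule out. The function name multisite_restart_args is hypothetical, and treating a duplicated site as an error is an assumption; the text only states that all available sites must be listed.

```python
from typing import List, Optional, Sequence


def multisite_restart_args(
    all_sites: Sequence[str],
    site_order: Optional[Sequence[str]] = None,
    site_by_site: bool = True,
) -> List[str]:
    """Build the argument list for a multisite rolling restart."""
    args = ["splunk", "rolling-restart", "cluster-peers",
            "-site-by-site", "true" if site_by_site else "false"]
    if site_order is not None:
        # -site-order only has meaning for a site-aware restart.
        if not site_by_site:
            raise ValueError("-site-order requires -site-by-site true")
        # All available sites must be listed (assumed: exactly once each).
        if sorted(site_order) != sorted(all_sites):
            raise ValueError("-site-order must list every available site")
        args += ["-site-order", ",".join(site_order)]
    return args


sites = ["site1", "site2", "site3"]
print(" ".join(multisite_restart_args(sites, ["site1", "site3", "site2"])))
```

With the three-site example, the helper reproduces the command shown above; omitting site2 from the order, or passing an order with site_by_site set to False, raises a ValueError instead.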
How the master determines the number of multisite peers to restart in each round
You can specify the percentage of peers that restart simultaneously by editing the percent_peers_to_restart attribute in server.conf, in the same way that you do for a single-site cluster. This percentage is always calculated globally, even for site-aware rolling restarts.
Assuming the default of 10%, in a two-site cluster with 10 peers on site1 and 20 peers on site2, for a total of 30 peers, the master restarts three peers at a time.
If the restart is not site-aware, the master selects the three peers randomly from across the cluster. At any point, the peers currently restarting could be on both sites or they could all be on a single site.
If the restart is site aware, the restart instead proceeds like this:
1. The master selects a site to restart first, for example, site2. (The site order is configurable.)
2. The master restarts three peers from site2.
3. The master continues to restart peers from site2 as slots become available, until it restarts all 20 peers on site2. It waits until all peers on site2 restart before proceeding to site1. The master does not split restart slots across multiple sites.
4. The master restarts three peers on site1.
5. The master continues to restart peers from site1 until it restarts all 10 peers on site1.
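The site-aware sequence in the steps above can be modeled with a short Python simulation. It is a simplified sketch, not the master's actual scheduler: the function name site_aware_batches and the peer names are hypothetical, the rounding of the percentage is an assumption, peers are taken in a fixed order rather than at random, and all peers are assumed to restart at the same speed so that slots refill in whole batches.

```python
from typing import Dict, List, Sequence


def site_aware_batches(
    peers_by_site: Dict[str, int],
    site_order: Sequence[str],
    percent: int = 10,
) -> List[List[str]]:
    """Return the groups of peers that restart together, in order."""
    # The percentage is applied to the whole cluster, not per site.
    total = sum(peers_by_site.values())
    slots = max(1, total * percent // 100)  # rounding rule is assumed

    batches: List[List[str]] = []
    for site in site_order:
        peers = [f"{site}-peer{i}" for i in range(1, peers_by_site[site] + 1)]
        # Slots are never split across sites: finish one site first.
        for start in range(0, len(peers), slots):
            batches.append(peers[start:start + slots])
    return batches


batches = site_aware_batches({"site1": 10, "site2": 20}, ["site2", "site1"])
for batch in batches:
    print(batch)
```

In this model, the final batch on each site can be smaller than three peers, because the master waits for the whole site to finish before handing any slot to the next site.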
Handle slow restarts
If the peer instances restart slowly, the peers in one group might still be undergoing restart when the master tells the next group to initiate restart. This can occur, for example, due to inadequate machine resources. To remedy this issue, you can increase the value of restart_timeout in the master's server.conf file. Its default value is 60 seconds.
This documentation applies to the following versions of Splunk® Enterprise: 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.3.14, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11, 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.5.10