Take a peer offline
Use the splunk offline command to take a peer offline.
Caution: Do not use splunk stop to take a peer offline. Instead, use splunk offline. It stops the peer in a way that minimizes disruption to your searches.
Depending on your needs, you can take a peer offline permanently or temporarily. In both cases, the cluster performs actions to regain its valid and complete states:
- A valid cluster has primary copies of all its buckets and therefore is able to handle search requests across the entire set of data. In the case of a multisite cluster, a valid cluster also has primary copies for every site with search affinity.
- A complete cluster has replication factor number of copies of all its buckets, with search factor number of searchable copies. It therefore meets the designated requirements for failure tolerance. A complete cluster is also a valid cluster.
Offline use cases
The splunk offline command has two versions, one that takes the peer offline temporarily and another that takes it offline permanently.
When you take a peer offline temporarily
When you take a peer offline temporarily, it is usually to perform a short-term maintenance task, such as a machine or operating system upgrade. You want the cluster to continue to process data and handle searches without interruption, but it is acceptable if the cluster does not meet its replication factor or search factor during the time that the peer is offline.
When a peer goes offline temporarily, the master kicks off processes to return the cluster to a valid state, but it does not ordinarily attempt to return the cluster to a complete state, because the cluster regains its complete state as soon as the peer comes back online.
When you take a peer offline permanently
When you take an indexer cluster peer offline permanently, you want to ensure that the cluster continues to process data and handle searches without interruption. You also want the cluster to replace any copies of buckets that are lost by the peer going offline. For example, if the offline peer was maintaining copies of 10 buckets (three searchable and seven non-searchable), the cluster must recreate those copies on other peers in the cluster, to fulfill its replication factor and search factor.
When a peer goes offline permanently, the master kicks off various bucket-fixing processes, so that the cluster returns to a valid and complete state.
The splunk offline command
The splunk offline command handles both types of peer shutdown: temporary and permanent. It takes the peer down gracefully, attempting to allow in-progress searches to complete, while also returning the cluster quickly to the valid state. In this way, it tries to eliminate disruption to existing or future searches.
The splunk offline command also initiates remedial bucket-fixing activities to return the cluster to a complete state. Depending on the version of the command that you run, it starts this process either immediately or after waiting a specified period of time, to give the peer time to come back online and avoid the need for bucket-fixing.
There are two versions of the splunk offline command that correspond to the typical use cases:
- splunk offline. Used to take a peer down temporarily for maintenance operations. Also known as the "fast offline" command.
- splunk offline --enforce-counts. Used to remove a peer permanently from the cluster. Also known as the "enforce-counts offline" command.
Take a peer down temporarily: the fast offline command
The fast version of the splunk offline command has the simple syntax: splunk offline.
The cluster attempts to regain its valid state before the peer goes down. It does not attempt to regain its complete state. You can use this version to bring the peer down briefly without kicking off any bucket-fixing activities.
You can also use this version in cases where you want the peer to go down permanently but quickly, with the bucket-fixing occurring after it goes down.
The fast offline process
The peer goes down after the cluster attempts, within certain constraints, to meet two conditions:
- Reallocation of primary copies on the peer, so that the cluster regains its valid state
- Completion of any searches that the peer is currently participating in
The peer goes down after its primary bucket copies have been reallocated to searchable copies on other peers, so that the cluster regains its valid state. The maximum time period allotted for the primary allocation activity is five minutes.
Note: If the cluster has a search factor of 1, the cluster does not attempt to reallocate primary copies before allowing the peer to go down. With a search factor of 1, the cluster cannot fix the primaries without first creating new searchable copies, which takes significant time and thus would defeat the goal of a fast shutdown.
The peer also waits for any ongoing searches to complete, up to a maximum time period, as determined by the decommission_search_jobs_wait_secs attribute in server.conf. The default for this attribute is three minutes.
Once these conditions have been met, or the maximum durations for the activities have been exceeded, the peer goes down.
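The shutdown rule described above can be sketched in Python as a simple check. This is an illustration only, not Splunk's implementation: the PeerState class, its fields, and the can_shut_down function are hypothetical names, and the sketch assumes that both time limits are measured from the moment the offline request is received, which the text does not state.

```python
from dataclasses import dataclass

# Limits taken from the text above; the constant names are illustrative.
PRIMARY_REALLOCATION_MAX_SECS = 5 * 60  # five-minute cap on primary reallocation
SEARCH_WAIT_MAX_SECS = 3 * 60           # default decommission_search_jobs_wait_secs


@dataclass
class PeerState:
    """Hypothetical snapshot of a peer that has received splunk offline."""
    elapsed_secs: float          # assumed: time since the offline request
    primaries_reallocated: bool  # cluster is valid without this peer
    searches_running: int        # searches the peer still participates in
    search_factor: int = 2


def can_shut_down(state: PeerState) -> bool:
    """Return True once both conditions are met or have timed out."""
    # With a search factor of 1, primary reallocation is not attempted.
    primaries_done = (
        state.search_factor == 1
        or state.primaries_reallocated
        or state.elapsed_secs >= PRIMARY_REALLOCATION_MAX_SECS
    )
    searches_done = (
        state.searches_running == 0
        or state.elapsed_secs >= SEARCH_WAIT_MAX_SECS
    )
    return primaries_done and searches_done
```

For example, a peer with running searches and unallocated primaries is held until the longer of the two limits expires, whereas a peer in a cluster with a search factor of 1 waits only on its searches.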
Syntax for the fast offline command
Here is the syntax for the fast version of the splunk offline command:
splunk offline
You run this command directly on the peer.
When you run this command, the peer shuts down after the cluster returns to a valid state and the peer completes any ongoing searches, as described in The fast offline process.
After the peer shuts down, you have 60 seconds (by default) to complete any maintenance work and bring the peer back online. If the peer does not return to the cluster within this time, the master initiates bucket-fixing activities to return the cluster to a complete state. If you need more time, you can extend the time that the master waits for the peer to come back online by configuring the restart_timeout attribute, as described in Extend the restart period.
Important: To minimize any bucket-fixing activities, you should ordinarily take down only one peer at a time. If you are performing an operation that involves taking many peers offline temporarily, consider invoking maintenance mode during the operation. See Use maintenance mode.
For detailed information on the processes that occur when a peer goes offline, read What happens when a peer node goes down.
Extend the restart period
If you need to perform maintenance on a peer and you expect the time required to exceed the master's restart_timeout period (set to 60 seconds by default), you can change the value of that setting. Run this CLI command on the master:
splunk edit cluster-config -restart_timeout <seconds>
For example, this command resets the timeout period to 900 seconds (15 minutes):
splunk edit cluster-config -restart_timeout 900
You can run this command on the fly. You do not need to restart the master after it runs.
You can also change this value in server.conf on the master.
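As a planning aid, the following Python sketch builds the CLI command shown earlier from an expected maintenance duration. It is a hypothetical helper, not part of Splunk: the function name, the current_timeout_secs parameter, and the two-minute safety margin are all assumptions made for illustration.

```python
DEFAULT_RESTART_TIMEOUT_SECS = 60  # default stated in the text


def restart_timeout_command(expected_downtime_secs,
                            current_timeout_secs=DEFAULT_RESTART_TIMEOUT_SECS,
                            margin_secs=120):
    """Return the command to run on the master, or None if no change is needed.

    margin_secs is an arbitrary safety margin chosen for this sketch.
    """
    if expected_downtime_secs <= current_timeout_secs:
        return None
    timeout = expected_downtime_secs + margin_secs
    return f"splunk edit cluster-config -restart_timeout {timeout}"


# A 13-minute maintenance window (780 seconds) plus the margin gives 900 seconds.
print(restart_timeout_command(780))
```

The helper only formats the command string. You still run the resulting command on the master yourself.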
Take a peer down permanently: the enforce-counts offline command
The enforce-counts version of the offline command is intended for use when you want to take a peer offline permanently, but only after the cluster has returned to its complete state.
In this version of the command, the cluster performs the necessary bucket-fixing activities to regain its valid and complete state before allowing the peer to go down.
The enforce-counts offline process
The peer goes down after the cluster meets two conditions:
- Completion of all bucket-fixing activities that are necessary for the cluster to regain its complete state
- Completion of ongoing searches that the peer is participating in, constrained by a time limit
The peer goes down only after its searchable and non-searchable bucket copies have been reallocated to other peers, causing the cluster to regain its complete state.
Because this version of splunk offline requires that the cluster return to a complete state before the peer can go down, certain preconditions are necessary before you can run this command:
- The cluster must have (replication factor + 1) number of peers, so that it can reallocate bucket copies to other peers as necessary and can continue to meet its replication factor after the peer goes down.
- In a multisite cluster, the peer's site must have enough peers so that the site continues to fulfill the requirements of its site replication factor, in terms of the number of origin or explicit peers.
- The cluster cannot be in maintenance mode, because bucket fixup does not occur during maintenance mode.
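The preconditions above lend themselves to a quick pre-flight check. The following Python sketch is one hypothetical way to express them for a single-site cluster. The function and parameter names are assumptions, the multisite condition is deliberately left out because it depends on the site replication factor configuration, and the sketch does not query a real cluster.

```python
def enforce_counts_blockers(peer_count, replication_factor, maintenance_mode):
    """Return the reasons why splunk offline --enforce-counts could not complete.

    peer_count is the number of peers currently in the cluster, including
    the peer you intend to take offline. An empty list means that the
    single-site preconditions from the text are satisfied.
    """
    blockers = []
    if peer_count < replication_factor + 1:
        blockers.append(
            f"need at least {replication_factor + 1} peers, found {peer_count}"
        )
    if maintenance_mode:
        blockers.append("cluster is in maintenance mode; bucket fixup is suspended")
    return blockers


# A four-peer cluster with a replication factor of 3 passes the check.
print(enforce_counts_blockers(4, 3, maintenance_mode=False))
```

Returning a list of reasons rather than a single boolean makes it easier to report every unmet precondition at once.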
The peer also waits for any ongoing searches to complete, up to a maximum time, as determined by the decommission_search_jobs_wait_secs attribute in server.conf. The default for this attribute is three minutes.
Syntax for the enforce-counts offline command
Here is the syntax for the enforce-counts version of the splunk offline command:
splunk offline --enforce-counts
You run this command directly on the peer.
This version of the command initiates an operation called decommissioning, during which the master coordinates a wide range of remedial processes. The peer does not shut down until those processes finish and the cluster returns to a complete state. This can take quite a while if the peer is maintaining a large set of bucket copies.
The actual time required to return to the complete state depends on the amount and type of data the peer was maintaining. See Estimate the cluster recovery time when a peer gets decommissioned for details.
If the cluster is unable to return to the complete state, the peer will not shut down. This can happen when the preconditions described in The enforce-counts offline process are not met. If you need to take a peer offline despite such issues, run the fast version of the splunk offline command instead.
For detailed information on the processes that occur when a peer gets decommissioned, read What happens when a peer node goes down.
After a peer goes down, it continues to appear on the list of peers on the master dashboard, although its status changes to "GracefulShutdown." To remove the peer from the master's list, see Remove a peer from the master's list.
Estimate the cluster recovery time when a peer gets decommissioned
When you decommission a peer, the master coordinates activities among the remaining peers to fix the buckets and return the cluster to a complete state. For example, if the peer going offline is storing copies of 10 buckets and five of those copies are searchable, the master instructs peers to:
- Stream copies of those 10 buckets to other peers, so that the cluster regains a full complement of bucket copies (to match the replication factor).
- Make five non-searchable bucket copies searchable, so that the cluster regains a full complement of searchable bucket copies (to match the search factor).
This activity can take some time to complete. Exactly how long depends on many factors, such as:
- System considerations, such as CPU specifications, storage type, interconnect type.
- Amount of other indexing currently being performed by the peers that are tasked with making buckets searchable.
- The size and number of buckets stored on the offline peer.
- The size of the index files on the searchable copies stored on the offline peer. (These index files can vary greatly in size relative to rawdata size, depending on factors such as amount of segmentation.) For information on the relative sizes of rawdata and index files, read Storage considerations.
- The search factor. This determines how quickly the cluster can convert non-searchable copies to searchable. If the search factor is at least 2, the cluster can convert non-searchable copies to searchable by copying index files to the non-searchable copies from the remaining set of searchable copies. If the search factor is 1, however, the cluster must convert non-searchable copies by rebuilding the index files, a much slower process. (For information on the types of files in a bucket, see Data files.)
Despite these variable factors, you can make a rough determination of how long the process will take. Assuming you are using Splunk Enterprise reference hardware, here are some basic estimates of how long the two main activities take:
- To stream 10GB (rawdata and/or index files) from one peer to another across a LAN takes about 5-10 minutes.
- The time required to rebuild the index files on a non-searchable bucket copy containing 4GB of rawdata depends on a number of factors such as the size of the resulting index files, but 30 minutes is a reasonable approximation to start with. Rebuilding index files is necessary if the search factor is 1, meaning that there are no copies of the index files available to stream. A non-searchable bucket copy consisting of 4GB rawdata can grow to a size approximating 10GB once the index files have been added. As described earlier, the actual size depends on numerous factors.
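To turn the two rough figures above into a back-of-the-envelope number, you can scale them by data volume. The Python sketch below does this under strong assumptions that the text does not make: time scales linearly with size, and streaming and rebuilding happen one after the other rather than in parallel across peers. The function and constant names are hypothetical. Treat the result as an order-of-magnitude guide only.

```python
# Rates derived from the estimates in the text (reference hardware, LAN).
STREAM_MIN_PER_GB_LOW = 5 / 10    # 10GB streamed in about 5 minutes
STREAM_MIN_PER_GB_HIGH = 10 / 10  # 10GB streamed in about 10 minutes
REBUILD_MIN_PER_GB_RAWDATA = 30 / 4  # 4GB of rawdata rebuilt in about 30 minutes


def estimate_recovery_minutes(stream_gb, rebuild_rawdata_gb=0.0):
    """Return a (low, high) estimate in minutes.

    stream_gb: total data (rawdata and/or index files) to stream to other peers.
    rebuild_rawdata_gb: rawdata whose index files must be rebuilt, which
    applies only when the search factor is 1.
    """
    rebuild = rebuild_rawdata_gb * REBUILD_MIN_PER_GB_RAWDATA
    low = stream_gb * STREAM_MIN_PER_GB_LOW + rebuild
    high = stream_gb * STREAM_MIN_PER_GB_HIGH + rebuild
    return low, high


# 100GB to stream, nothing to rebuild (search factor of at least 2).
print(estimate_recovery_minutes(100))
```

Because the remaining peers share the work, the actual elapsed time can be shorter than this serial estimate. Load from ongoing indexing can also make it longer.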
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10