Take a peer offline
Use the splunk offline CLI command to take a peer offline. By using the offline command, you minimize any disruption to your searches.
- A valid cluster has primary copies of all its buckets and therefore is able to handle search requests across the entire set of data. In the case of a multisite cluster, a valid cluster also has primary copies for every site with search affinity.
- A complete cluster has replication factor number of copies of all its buckets, with search factor number of searchable copies. It therefore meets the designated requirements for failure tolerance. A complete cluster is also a valid cluster.
The offline command
The offline command handles peer shutdown. It takes the peer down gracefully, allowing most in-progress searches to complete while also returning the cluster quickly to the valid state. In this way, it essentially eliminates disruption to existing or future searches. The offline command also initiates further remedial bucket-fixing activities to return the cluster to a complete state.
Note: The peer goes down after a maximum of 5-10 minutes, even if searches are still in progress.
For the offline command to work as intended, the search factor must be at least 2. This is why: With a search factor of 2, there is always a spare searchable copy of the data. If a peer with searchable data goes offline, new searches can immediately switch to the spare searchable copy of that data, thus minimizing the time that the cluster is not fully searchable. On the other hand, if the search factor is just 1, the master must make non-searchable copies searchable before searches can again run across the full set of data. That can potentially take quite a while, as described in "Estimate the cluster recovery time when a peer goes offline".
Important: When taking down a peer, use the offline command, not the stop command. The offline command stops the peer, but it does so in a way that minimizes any disruption to your searches.
Take a peer down with the offline command
Here is the syntax for the offline command:
splunk offline
You run this command directly on the peer.
When you run this command, the peer shuts down after the master finishes coordinating the necessary activities to return the cluster to a valid state (by reassigning primaries, as necessary), up to a maximum of 5-10 minutes. Until it shuts down, the peer continues to participate in any in-progress searches.
After the peer shuts down, you have 60 seconds (by default) to complete any maintenance and bring the peer back online. If the peer does not return to the cluster within this time, the master initiates bucket-fixing activities to return the cluster to a complete state. If you need more than 60 seconds, you can extend the time the master waits for the peer to come back online by configuring the restart_timeout attribute, as described in "Extend the restart period".
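The timing rule above can be sketched in a few lines of Python. This is an illustrative model only, not Splunk code: the function name master_starts_fixup and its parameters are hypothetical, and the only value taken from this topic is the 60-second default for restart_timeout.

```python
DEFAULT_RESTART_TIMEOUT = 60  # seconds; the default stated in this topic


def master_starts_fixup(downtime_seconds, restart_timeout=DEFAULT_RESTART_TIMEOUT):
    """Return True if the peer stayed down longer than restart_timeout.

    Hypothetical helper that models the rule described above: the master
    begins bucket fixing only when the peer has not returned in time.
    """
    return downtime_seconds > restart_timeout


print(master_starts_fixup(45))        # False: peer is back within 60 seconds
print(master_starts_fixup(600))       # True: 10 minutes exceeds the default
print(master_starts_fixup(600, 900))  # False: timeout was extended to 900 seconds
```

The third call shows why extending restart_timeout before a longer maintenance window avoids unnecessary bucket fixing.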
Important: To minimize bucket fixup activities, you should ordinarily take down only one peer at a time. If you are performing an operation that involves taking many peers offline temporarily, you should consider invoking maintenance mode during the operation. See "Use maintenance mode".
For detailed information on the processes that occur when a peer goes offline, read "What happens when a peer node goes down".
After a peer goes down, it continues to appear on the list of peers on the master dashboard, although its status changes to "Down." To remove the peer from the master's list, see "Remove a peer from the master's list."
Extend the restart period
If you need to perform maintenance on a peer and you expect the time required to exceed the master's restart_timeout period (set to 60 seconds by default), you can change the value of that setting. Run this CLI command on the master:
splunk edit cluster-config -restart_timeout <seconds>
For example, this command resets the timeout period to 900 seconds (15 minutes):
splunk edit cluster-config -restart_timeout 900
You can run this command on the fly. You do not need to restart the master after it runs.
You can also change this value in server.conf on the master.
Estimate the cluster recovery time when a peer goes offline
When a peer goes offline for a period that exceeds restart_timeout, the master coordinates activities among the remaining peers to fix the buckets and return the cluster to a complete state. For example, if the peer going offline is storing copies of 10 buckets and five of those copies are searchable, the master instructs peers to:
- Stream copies of those 10 buckets to other peers, so that the cluster regains a full complement of bucket copies (to match the replication factor).
- Make five non-searchable bucket copies searchable, so that the cluster regains a full complement of searchable bucket copies (to match the search factor).
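The example above can be tallied with a short Python sketch. The function and key names are hypothetical, and the model is deliberately simplified: it assumes each copy held by the offline peer is replaced one-for-one.

```python
def fixup_tasks(total_copies, searchable_copies):
    """Count the fix-up tasks scheduled when a peer stays down.

    Simplified model: every bucket copy on the offline peer is re-streamed,
    and every searchable copy it held is replaced by making a
    non-searchable copy elsewhere searchable.
    """
    if not 0 <= searchable_copies <= total_copies:
        raise ValueError("searchable_copies must be between 0 and total_copies")
    return {
        "copies_to_stream": total_copies,
        "copies_to_make_searchable": searchable_copies,
    }


print(fixup_tasks(10, 5))
# {'copies_to_stream': 10, 'copies_to_make_searchable': 5}
```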
This activity can take some time to complete. Exactly how long depends on many factors, such as:
- System considerations, such as CPU specifications, storage type, interconnect type.
- Amount of other indexing currently being performed by the peers that are tasked with making buckets searchable.
- The size and number of buckets stored on the offline peer.
- The size of the index files on the searchable copies stored on the offline peer. (These index files can vary greatly in size relative to rawdata size, depending on factors such as amount of segmentation.) For information on the relative sizes of rawdata and index files, read "Storage considerations".
- The search factor. This determines how quickly the cluster can convert non-searchable copies to searchable. If the search factor is at least 2, the cluster can convert non-searchable copies to searchable by copying index files to the non-searchable copies from the remaining set of searchable copies. If the search factor is 1, however, the cluster must convert non-searchable copies by rebuilding the index files, a much slower process. (For information on the types of files in a bucket, see "Data files".)
Despite these variable factors, you can make a rough determination of how long the process will take. Assuming you are using Splunk Enterprise reference hardware, here are some basic estimates of how long the two main activities take:
- To stream 10GB (rawdata and/or index files) from one peer to another across a LAN takes about 5-10 minutes.
- The time required to rebuild the index files on a non-searchable bucket copy containing 4GB of rawdata depends on a number of factors such as the size of the resulting index files, but 30 minutes is a reasonable approximation to start with. Rebuilding index files is necessary if the search factor is 1, meaning that there are no copies of the index files available to stream. A non-searchable bucket copy consisting of 4GB rawdata can grow to a size approximating 10GB once the index files have been added. As described earlier, the actual size depends on numerous factors.
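As a rough aid, the two estimates above can be turned into a back-of-the-envelope calculation. The following Python sketch is not a Splunk tool. The function name is hypothetical, and it assumes that time scales linearly with data size and that all work runs serially on reference hardware. A real cluster spreads fix-up work across peers, so treat the result as a coarse starting point only.

```python
# Per-GB rates derived from the estimates above (reference hardware assumed).
STREAM_MIN_PER_GB = (0.5, 1.0)  # 10 GB streams in about 5-10 minutes
REBUILD_MIN_PER_GB_RAW = 7.5    # 4 GB of rawdata rebuilds in about 30 minutes


def estimate_recovery_minutes(stream_gb, rebuild_rawdata_gb=0.0):
    """Return a (low, high) estimate in minutes, assuming serial work.

    stream_gb: total rawdata and/or index files to stream between peers.
    rebuild_rawdata_gb: rawdata in copies whose index files must be
        rebuilt (relevant only when the search factor is 1).
    """
    rebuild = rebuild_rawdata_gb * REBUILD_MIN_PER_GB_RAW
    low = stream_gb * STREAM_MIN_PER_GB[0] + rebuild
    high = stream_gb * STREAM_MIN_PER_GB[1] + rebuild
    return low, high


print(estimate_recovery_minutes(10))     # (5.0, 10.0)
print(estimate_recovery_minutes(10, 4))  # (35.0, 40.0)
```

The second call models a search factor of 1, where index files for 4GB of rawdata must be rebuilt in addition to streaming 10GB.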
This documentation applies to the following versions of Splunk® Enterprise: 6.2.0, 6.2.1, 6.2.2, 6.2.3, 6.2.4, 6.2.5, 6.2.6, 6.2.7, 6.2.8, 6.2.9, 6.2.10, 6.2.11, 6.2.12, 6.2.13, 6.2.14, 6.2.15, 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.3.14