Splunk® Enterprise

Managing Indexers and Clusters of Indexers


Take a peer offline

Use the CLI splunk offline command to take a peer offline. By using the offline command, you minimize any disruption to your searches.

During and after a peer goes offline, the cluster performs actions to regain its valid and complete states:

  • A valid cluster has primary copies of all its buckets and therefore is able to handle search requests across the entire set of data. In the case of a multisite cluster, a valid cluster also has primary copies for every site with search affinity.
  • A complete cluster has replication factor number of copies of all its buckets, with search factor number of searchable copies. It therefore meets the designated requirements for failure tolerance. A complete cluster is also a valid cluster.
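
One quick way to check whether the cluster currently meets these criteria is to view cluster status from the master. This is only a sketch; the command's availability and output format vary by version, and the master dashboard presents the same information graphically:

splunk show cluster-status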

The offline command

The CLI offline command handles peer shutdown. It takes the peer down gracefully, allowing most in-progress searches to complete while also returning the cluster quickly to the valid state. In this way, it essentially eliminates disruption to existing or future searches. The offline command also initiates further remedial bucket-fixing activities to return the cluster to a complete state.

Note: The peer goes down after a maximum of 5-10 minutes, even if searches are still in progress.

For the offline command to work as intended, the search factor must be at least 2. This is why: With a search factor of 2, there is always a spare searchable copy of the data. If a peer with searchable data goes offline, new searches can immediately switch to the spare searchable copy of that data, thus minimizing the time that the cluster is not fully searchable. On the other hand, if the search factor is just 1, the master must make non-searchable copies searchable before searches can again run across the full set of data. That can potentially take quite a while, as described in "Estimate the cluster recovery time when a peer goes offline".
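
To confirm the search factor configured on the master, one option is to list the effective [clustering] settings in server.conf with btool. This is only a sketch, assuming a standard clustering configuration; run it on the master:

splunk btool server list clustering | grep search_factor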

Important: When taking down a peer, use the offline command, not the stop command. The offline command stops the peer, but it does so in a way that minimizes any disruption to your searches.

Take a peer down with the offline command

Here is the syntax for the offline command:

splunk offline

You run this command directly on the peer.

When you run this command, the peer shuts down after the master finishes coordinating the necessary activities to return the cluster to a valid state (by reassigning primaries, as necessary), up to a maximum of 5-10 minutes. Until it shuts down, the peer continues to participate in any in-progress searches.

After the peer shuts down, you have 60 seconds (by default) to complete any maintenance and bring the peer back online. If the peer does not return to the cluster within this time, the master initiates bucket-fixing activities to return the cluster to a complete state. If you need more than 60 seconds, you can extend the time the master waits for the peer to come back online by configuring the restart_timeout attribute, as described in "Extend the restart period".
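
For example, a routine single-peer maintenance pass might look like the following sketch, run on the peer itself. It assumes the maintenance fits within the restart period (60 seconds by default, or the restart_timeout value you configure):

splunk offline
# ...perform maintenance on the peer...
splunk start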

Important: To minimize bucket fixup activities, you should ordinarily take down only one peer at a time. If you are performing an operation that involves taking many peers offline temporarily, you should consider invoking maintenance mode during the operation. See "Use maintenance mode".
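
As a rough sketch, maintenance mode is enabled and disabled on the master with these commands (confirm the exact syntax for your version in "Use maintenance mode"):

splunk enable maintenance-mode
# ...take peers offline and service them...
splunk disable maintenance-mode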

For detailed information on the processes that occur when a peer goes offline, read "What happens when a peer node goes down".

After a peer goes down, it continues to appear on the list of peers on the master dashboard, although its status changes to "Down." To remove the peer from the master's list, see "Remove a peer from the master's list."

Extend the restart period

If you need to perform maintenance on a peer and you expect the time required to exceed the master's restart_timeout period (set to 60 seconds by default), you can change the value of that setting. Run this CLI command on the master:

splunk edit cluster-config -restart_timeout <seconds>

For example, this command resets the timeout period to 900 seconds (15 minutes):

splunk edit cluster-config -restart_timeout 900

You can run this command on the fly. You do not need to restart the master after it runs.

You can also change this value in server.conf on the master.
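
For example, the equivalent setting in server.conf on the master looks approximately like this (a sketch assuming the standard [clustering] stanza; unlike the CLI command above, a direct edit of the file typically requires a restart of the master to take effect):

[clustering]
restart_timeout = 900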

Estimate the cluster recovery time when a peer goes offline

When a peer goes offline for a period that exceeds restart_timeout, the master coordinates activities among the remaining peers to fix the buckets and return the cluster to a complete state. For example, if the peer going offline is storing copies of 10 buckets and five of those copies are searchable, the master instructs peers to:

  • Stream copies of those 10 buckets to other peers, so that the cluster regains a full complement of bucket copies (to match the replication factor).
  • Make five non-searchable bucket copies searchable, so that the cluster regains a full complement of searchable bucket copies (to match the search factor).

This activity can take some time to complete. Exactly how long depends on many factors, such as:

  • System considerations, such as CPU specifications, storage type, and interconnect type.
  • The amount of other indexing currently being performed by the peers that are tasked with making buckets searchable.
  • The size and number of buckets stored on the offline peer.
  • The size of the index files on the searchable copies stored on the offline peer. (These index files can vary greatly in size relative to rawdata size, depending on factors such as amount of segmentation.) For information on the relative sizes of rawdata and index files, read "Storage considerations".
  • The search factor. This determines how quickly the cluster can convert non-searchable copies to searchable. If the search factor is at least 2, the cluster can convert non-searchable copies to searchable by copying index files to the non-searchable copies from the remaining set of searchable copies. If the search factor is 1, however, the cluster must convert non-searchable copies by rebuilding the index files, a much slower process. (For information on the types of files in a bucket, see "Data files".)

Despite these variable factors, you can make a rough determination of how long the process will take. Assuming you are using Splunk Enterprise reference hardware, here are some basic estimates of how long the two main activities take:

  • Streaming 10GB (rawdata and/or index files) from one peer to another across a LAN takes about 5-10 minutes.
  • The time required to rebuild the index files on a non-searchable bucket copy containing 4GB of rawdata depends on a number of factors, such as the size of the resulting index files, but 30 minutes is a reasonable starting approximation. Rebuilding index files is necessary only if the search factor is 1, meaning that there are no copies of the index files available to stream. A non-searchable bucket copy consisting of 4GB of rawdata can grow to roughly 10GB once the index files have been added; as described earlier, the actual size depends on numerous factors.
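
As a rough worked example that uses only the per-activity estimates above (the bucket counts themselves are hypothetical): suppose the offline peer held 50GB of bucket copies that must be re-streamed to other peers, and, because the search factor is 1, five of those copies were the only searchable copies of their buckets, so five non-searchable copies of about 4GB rawdata each must be rebuilt on the remaining peers.

  • Streaming: 50GB at roughly 10GB per 5-10 minutes comes to about 25-50 minutes of streaming time.
  • Rebuilding: 5 copies at roughly 30 minutes each comes to about 2.5 hours if done serially, potentially less if the master spreads the rebuild work across several peers.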



Comments

The flag "enforce-counts" is not documented here.

splunk help offline

Flags:
enforce-counts  If this flag is used, the cluster is completely fixed up before this peer is taken down; that is, the replication factor and search factor for the cluster are honored to the maximum possible extent. Without this flag, the master simply rearranges the primaries and times out after 5 minutes (by default). The amount of time the master waits for the peer is configurable using the "restart_timeout" parameter with the "./splunk edit cluster-config" command.

Kbecker
October 1, 2015

Kbecker - The enforce-counts version of this command is currently not supported, as there are some problems with how it operates. The online help is in error in exposing that flag. Thank you for bringing that to our attention. There are plans to support enforce-counts in a future release, however.

Sgoodman, Splunker
October 1, 2015
