Splunk® Enterprise

Managing Indexers and Clusters of Indexers

Take a peer offline

Use the splunk offline command to take a peer offline.

Use splunk offline, not splunk stop. The splunk offline command stops the peer, but it does so in a way that minimizes disruption to your searches.

Depending on your needs, you can take a peer offline permanently or temporarily. In both cases, the cluster performs actions to regain its valid and complete states:

  • A valid cluster has primary copies of all its buckets and therefore is able to handle search requests across the entire set of data. In the case of a multisite cluster, a valid cluster also has primary copies for every site with search affinity.
  • A complete cluster has replication factor number of copies of all its buckets, with search factor number of searchable copies. It therefore meets the designated requirements for failure tolerance. A complete cluster is also a valid cluster.
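The two definitions above can be expressed as simple checks. The following is a minimal sketch, not how the master actually tracks state: the Bucket class and its fields are hypothetical, multisite search affinity is ignored, and "replication factor number of copies" is read as "at least that many".

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical summary of one bucket's copies across the cluster."""
    has_primary: bool        # a primary copy exists to answer searches
    copies: int              # total copies of the bucket
    searchable_copies: int   # copies that are searchable


def is_valid(buckets):
    # Valid: every bucket has a primary copy.
    return all(b.has_primary for b in buckets)


def is_complete(buckets, replication_factor, search_factor):
    # Complete: valid, plus enough total and searchable copies of every bucket.
    return is_valid(buckets) and all(
        b.copies >= replication_factor and b.searchable_copies >= search_factor
        for b in buckets
    )


buckets = [Bucket(True, 3, 2), Bucket(True, 2, 1)]
print(is_valid(buckets))            # True
print(is_complete(buckets, 3, 2))   # False: the second bucket is short of copies
```

In this example the cluster can still answer searches across all its data (valid) but does not meet its failure-tolerance targets (not complete), which is the state a cluster is typically in while a peer is temporarily offline.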

When you take a peer offline temporarily

When you take a peer offline temporarily, it is usually to perform a short-term maintenance task, such as a machine or operating system upgrade. You want the cluster to continue to process data and handle searches without interruption, but it is acceptable if the cluster does not meet its replication factor or search factor during the time that the peer is offline.

When a peer goes offline temporarily, the master kicks off processes to return the cluster to a valid state. It does not ordinarily attempt to return the cluster to a complete state, because the cluster regains its complete state as soon as the peer comes back online.

When you take a peer offline permanently

When you take an indexer cluster peer offline permanently, you want to ensure that the cluster continues to process data and handle searches without interruption. You also want the cluster to replace any copies of buckets that are lost by the peer going offline. For example, if the offline peer was maintaining copies of 10 buckets (three searchable and seven non-searchable), the cluster must recreate those copies on other peers in the cluster, to fulfill its replication factor and search factor.

When a peer goes offline permanently, the master kicks off various bucket-fixing processes, so that the cluster returns to a valid and complete state.

The offline command

The splunk offline command handles both types of peer shutdown: temporary and permanent. It takes the peer down gracefully, allowing in-progress searches to complete while also returning the cluster quickly to the valid state. In this way, it essentially eliminates disruption to existing or future searches. The splunk offline command also initiates remedial bucket-fixing activities to return the cluster to a complete state. Depending on how you run the command, bucket-fixing starts either immediately or after a specified waiting period. The waiting period gives the peer time to come back online, which avoids the need for bucket-fixing.

Important: For the offline command to work as intended, the search factor must be at least 2. This is why: With a search factor of 2, there is always a spare searchable copy of the data. If a peer with searchable data goes offline, new searches can immediately switch to the spare searchable copy of that data, thus minimizing the time that the cluster is not fully searchable. On the other hand, if the search factor is just 1, the master must make non-searchable copies searchable before searches can again run across the full set of data. That can potentially take quite a while, as described in Estimate the cluster recovery time when a peer gets decommissioned.

There are two versions of the splunk offline command:

  • splunk offline: This is the fast version of the splunk offline command, intended mainly for taking a peer offline temporarily. The peer goes down after a maximum of 5-10 minutes, even if searches are still in progress. You can use this version of the command to bring the peer down briefly without kicking off any bucket-fixing activities. You can also use this version in cases where you want the peer to go down permanently but quickly, with the bucket-fixing occurring after it goes down.
  • splunk offline --enforce-counts: This is the enforce-counts version of the command, intended for use when you want to take a peer offline permanently but only after the cluster has returned to the complete state. If you invoke the enforce-counts flag, the peer does not shut down until all search and remedial activities have completed.

Take a peer down temporarily: the fast offline command

Here is the syntax for the fast version of the splunk offline command:

splunk offline

You run this command directly on the peer.

When you run this command, the peer shuts down after the master finishes coordinating the necessary activities to return the cluster to a valid state, up to a maximum of 5-10 minutes. Until it shuts down, the peer continues to participate in any in-progress searches.

After the peer shuts down, you have 60 seconds (by default) to complete any maintenance and bring the peer back online. If the peer does not return to the cluster within this time, the master initiates bucket-fixing activities to return the cluster to a complete state. If you need more time, you can extend the time the master waits for the peer to come back online by configuring the restart_timeout attribute, as described in Extend the restart period.

Important: To minimize bucket fixup activities, you should ordinarily take down only one peer at a time. If you are performing an operation that involves taking many peers offline temporarily, consider invoking maintenance mode during the operation. See Use maintenance mode.

For detailed information on the processes that occur when a peer goes offline, read What happens when a peer node goes down.

Extend the restart period

If you need to perform maintenance on a peer and you expect the time required to exceed the master's restart_timeout period (set to 60 seconds by default), you can change the value of that setting. Run this CLI command on the master:

splunk edit cluster-config -restart_timeout <seconds>

For example, this command resets the timeout period to 900 seconds (15 minutes):

splunk edit cluster-config -restart_timeout 900

You can run this command on the fly. You do not need to restart the master after it runs.

You can also change this value in server.conf on the master.
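As a rough illustration of the decision described above, the following sketch compares a planned maintenance window with the restart timeout and, if the window is longer, builds the arguments for the CLI command shown earlier. The function name and the margin_seconds parameter are hypothetical, and the sketch only constructs the command; it does not run it.

```python
DEFAULT_RESTART_TIMEOUT = 60  # seconds, the default stated above


def restart_timeout_command(maintenance_seconds,
                            current_timeout=DEFAULT_RESTART_TIMEOUT,
                            margin_seconds=0):
    """Return the CLI arguments to run on the master, or None if no change is needed."""
    needed = maintenance_seconds + margin_seconds
    if needed <= current_timeout:
        # The peer should be back before the master starts bucket-fixing.
        return None
    return ["splunk", "edit", "cluster-config", "-restart_timeout", str(needed)]


print(restart_timeout_command(45))             # None
print(" ".join(restart_timeout_command(900)))
# splunk edit cluster-config -restart_timeout 900
```

Adding a margin is a judgment call: a window that runs even slightly past the timeout causes the master to begin bucket-fixing.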

Take a peer down permanently: the enforce-counts offline command

Here is the syntax for the enforce-counts version of the splunk offline command:

splunk offline --enforce-counts

You run this command directly on the peer.

This version of the command initiates a process called decommissioning, during which the master coordinates a wide range of remedial processes. The peer does not shut down until those processes finish and the cluster returns to a complete state. This can take quite a while if the peer is maintaining a large set of bucket copies.

The actual time required to return to the complete state depends on the amount and type of data the peer was maintaining. See Estimate the cluster recovery time when a peer gets decommissioned for details.

Note: If the cluster is unable to return to the complete state, the peer will not shut down. This usually happens when the number of peers remaining in the cluster is smaller than the replication factor. For example, if the cluster has a replication factor of 3 and has just three peer nodes, it cannot return to the complete state if you try to take one of those nodes offline. If you need to take a peer offline in this circumstance, run the fast version of the splunk offline command instead.
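The guidance in this note can be sketched as a small helper that picks a version of the command. The function and parameter names are hypothetical, and the check covers only the peer-count condition described here; other conditions could also prevent the cluster from returning to the complete state.

```python
def choose_offline_command(permanent, peer_count, replication_factor):
    """Suggest which version of the offline command to run on the peer."""
    remaining = peer_count - 1  # peers left after this one goes offline
    if permanent and remaining >= replication_factor:
        # Enough peers remain for the cluster to return to the complete state.
        return "splunk offline --enforce-counts"
    # Temporary shutdown, or too few peers for enforce-counts to ever finish.
    return "splunk offline"


print(choose_offline_command(True, 3, 3))   # splunk offline
print(choose_offline_command(True, 4, 3))   # splunk offline --enforce-counts
```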

During decommissioning, the peer continues to participate in any in-progress searches.

If you restart the peer later, it adds itself back into the cluster.

For detailed information on the processes that occur when a peer gets decommissioned, read What happens when a peer node goes down.

After a peer goes down, it continues to appear on the list of peers on the master dashboard, although its status changes to "GracefulShutdown." To remove the peer from the master's list, see Remove a peer from the master's list.

Estimate the cluster recovery time when a peer gets decommissioned

When you decommission a peer, the master coordinates activities among the remaining peers to fix the buckets and return the cluster to a complete state. For example, if the peer going offline is storing copies of 10 buckets and five of those copies are searchable, the master instructs peers to:

  • Stream copies of those 10 buckets to other peers, so that the cluster regains a full complement of bucket copies (to match the replication factor).
  • Make five non-searchable bucket copies searchable, so that the cluster regains a full complement of searchable bucket copies (to match the search factor).

This activity can take some time to complete. Exactly how long depends on many factors, such as:

  • System considerations, such as CPU specifications, storage type, and interconnect type.
  • Amount of other indexing currently being performed by the peers that are tasked with making buckets searchable.
  • The size and number of buckets stored on the offline peer.
  • The size of the index files on the searchable copies stored on the offline peer. (These index files can vary greatly in size relative to rawdata size, depending on factors such as amount of segmentation.) For information on the relative sizes of rawdata and index files, read Storage considerations.
  • The search factor. This determines how quickly the cluster can convert non-searchable copies to searchable. If the search factor is at least 2, the cluster can convert non-searchable copies to searchable by copying index files to the non-searchable copies from the remaining set of searchable copies. If the search factor is 1, however, the cluster must convert non-searchable copies by rebuilding the index files, a much slower process. (For information on the types of files in a bucket, see Data files.)

Despite these variable factors, you can make a rough determination of how long the process will take. Assuming you are using Splunk Enterprise reference hardware, here are some basic estimates of how long the two main activities take:

  • To stream 10GB (rawdata and/or index files) from one peer to another across a LAN takes about 5-10 minutes.
  • The time required to rebuild the index files on a non-searchable bucket copy containing 4GB of rawdata depends on a number of factors such as the size of the resulting index files, but 30 minutes is a reasonable approximation to start with. Rebuilding index files is necessary if the search factor is 1, meaning that there are no copies of the index files available to stream. A non-searchable bucket copy consisting of 4GB rawdata can grow to a size approximating 10GB once the index files have been added. As described earlier, the actual size depends on numerous factors.
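To turn these two figures into a back-of-the-envelope estimate, you can scale them linearly. The sketch below assumes that times grow in proportion to data size and that the work runs serially; both are simplifying assumptions, since the remaining peers share the work and the factors listed earlier all affect real timings. The function name and the sample sizes are hypothetical.

```python
# Rates derived from the estimates above (reference hardware, LAN).
STREAM_MIN_PER_GB = (5 / 10, 10 / 10)   # 10GB streams in about 5-10 minutes
REBUILD_MIN_PER_GB = 30 / 4             # 4GB of rawdata rebuilds in about 30 minutes


def estimate_recovery_minutes(stream_gb, rebuild_rawdata_gb=0.0):
    """Return a (low, high) estimate in minutes, assuming serial, linear work."""
    # Rebuilding applies only when the search factor is 1.
    rebuild = rebuild_rawdata_gb * REBUILD_MIN_PER_GB
    low = stream_gb * STREAM_MIN_PER_GB[0] + rebuild
    high = stream_gb * STREAM_MIN_PER_GB[1] + rebuild
    return low, high


print(estimate_recovery_minutes(100))       # (50.0, 100.0)
print(estimate_recovery_minutes(100, 40))   # (350.0, 400.0)
```

The second call shows why a search factor of 1 is costly: rebuilding index files for 40GB of rawdata adds roughly 300 minutes on top of the streaming time.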

This documentation applies to the following versions of Splunk® Enterprise: 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11, 6.5.0, 6.5.1, 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.5.10


Comments

Can you please include what happens when you (1) Take a peer offline say for 10mins (2) the physical server is shutdown and on restart it starts Splunk . this happens in say 7 minutes.
Will the peer be still offline until the 10mins is over? or the restart at 7th minute will remove offline ?

Koshyk
April 10, 2017

Dfronck - There were a few issues that needed working through. Happy to say that it's now doing what it's supposed to!

Sgoodman, Splunker
April 21, 2016

It looks like "splunk offline --enforce-counts" was added back in v6.4. Do you know why it was removed around v6.3?

Dfronck
April 19, 2016
