What happens when a peer node comes back up

A peer node can go down intentionally (through the CLI offline command) or unintentionally (for example, when its server crashes). When the peer goes down, the cluster undertakes remedial activities, known as bucket fixing, as described in the topic What happens when a peer node goes down. This topic describes what happens if the peer later returns to the cluster.

When a peer comes back up, it starts sending heartbeats to the manager. The manager recognizes the peer and adds it back into the cluster. If the peer still has intact bucket copies from its earlier activities in the cluster, the manager adds those copies to the bucket counts that it maintains. The manager also rebalances the cluster, which can result in searchable bucket copies on the peer, if any, being assigned primary status. For information on rebalancing, see Rebalance the indexer cluster primary buckets.

Note: When the peer connects with the manager, the peer checks whether it already has the current version of the configuration bundle. If the bundle has changed since the peer went down, the peer downloads the latest configuration bundle, validates it locally, and restarts. The peer rejoins the cluster only if bundle validation succeeds.
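
The bundle check in the note can be sketched as a small decision function. This is an illustrative model only, not Splunk code: the function name, the bundle identifiers, and the validate callback are hypothetical, and the sketch assumes that the restart happens only after validation succeeds.

```python
def rejoin_steps(peer_bundle_id, manager_bundle_id, validate):
    """Return the ordered steps a returning peer takes (hypothetical model).

    peer_bundle_id / manager_bundle_id: identifiers of the bundle versions.
    validate: callable returning True if local bundle validation succeeds.
    """
    # The peer already has the current bundle, so it can rejoin directly.
    if peer_bundle_id == manager_bundle_id:
        return ["rejoin"]

    # The bundle changed while the peer was down: download and validate it.
    steps = ["download", "validate"]
    if validate():
        # Assumption: restart and rejoin follow only a successful validation.
        steps += ["restart", "rejoin"]
    return steps
```

For example, rejoin_steps("v1", "v2", lambda: False) stops after the validate step, which reflects the rule that the peer rejoins only if validation succeeds.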

How the manager counts buckets

To understand what happens when a peer returns to the cluster, you must first understand how the manager tracks bucket copies.

The manager maintains counts for each bucket in the cluster. For each bucket, it knows:

how many copies of the bucket exist on the cluster.
how many searchable copies of the bucket exist on the cluster.

The manager also ensures that there's always exactly one primary copy of a given bucket.

With multisite clusters, the manager keeps track of copies and searchable copies for each site, as well as for the cluster as a whole. It also ensures that each site with an explicit search factor has exactly one primary copy of each bucket.

These counts allow the manager to determine whether the cluster is valid and complete. For a single-site cluster, this means that the cluster has:

Exactly one primary copy of each bucket.
A full set of searchable copies for each bucket, matching the search factor.
A full set of copies (searchable and non-searchable) for each bucket, matching the replication factor.
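
The single-site conditions above can be expressed as a short check. This is a minimal sketch, not the manager's actual logic: the BucketCounts class and the function name are hypothetical, and the sketch assumes that excess copies still satisfy the replication and search factors.

```python
from dataclasses import dataclass


@dataclass
class BucketCounts:
    """Hypothetical per-bucket tallies, mirroring the counts described above."""
    copies: int      # all copies, searchable and non-searchable
    searchable: int  # searchable copies only
    primaries: int   # copies currently marked primary


def is_valid_and_complete(counts, replication_factor, search_factor):
    """Check one bucket against the three single-site conditions."""
    return (
        counts.primaries == 1                       # exactly one primary copy
        and counts.searchable >= search_factor      # enough searchable copies
        and counts.copies >= replication_factor     # enough copies overall
    )
```

Under this model, a cluster is valid and complete only when the check passes for every bucket.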

For a multisite cluster, a valid and complete cluster has:

Exactly one primary copy of each bucket for each site with an explicit search factor.
A full set of searchable copies for each bucket, matching the search factor for each site as well as for the cluster as a whole.
A full set of copies (searchable and non-searchable) for each bucket, matching the replication factor for each site as well as for the cluster as a whole.
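
The multisite conditions add a per-site pass before the cluster-wide totals are checked. The following sketch is a simplification under stated assumptions: the function and parameter names are hypothetical, every site listed in site_factors is treated as having both an explicit replication factor and an explicit search factor, and excess copies are assumed to satisfy the factors.

```python
def multisite_valid_and_complete(site_counts, site_factors, total_rf, total_sf):
    """Check one bucket in a multisite cluster (hypothetical model).

    site_counts:  {site: (copies, searchable, primaries)} for this bucket
    site_factors: {site: (replication_factor, search_factor)} for sites
                  with explicit factors
    total_rf / total_sf: factors for the cluster as a whole
    """
    # Per-site conditions: one primary, enough copies, enough searchable copies.
    for site, (rf, sf) in site_factors.items():
        copies, searchable, primaries = site_counts.get(site, (0, 0, 0))
        if primaries != 1 or copies < rf or searchable < sf:
            return False

    # Cluster-wide conditions: totals across all sites.
    total_copies = sum(c for c, _, _ in site_counts.values())
    total_searchable = sum(s for _, s, _ in site_counts.values())
    return total_copies >= total_rf and total_searchable >= total_sf
```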

Bucket-fixing and the copies on the peer

When a peer goes down, the manager directs the remaining peers in bucket-fixing activities. Eventually, if the bucket fixing is successful, the cluster returns to a complete state.

If the peer later returns to the cluster, the manager adds the peer's bucket copies to its counts (assuming that the copies were not destroyed by whatever problem caused the peer to go down in the first place). The consequences vary somewhat depending on whether bucket-fixing activity has completed by the time the peer comes back up.

If bucket-fixing is finished

If bucket-fixing has already completed and the cluster is in a complete state, the copies on the returned peer are extras. For example, assume that the replication factor is 3 and that the cluster has fixed all the buckets, so that there are again three copies of each bucket in the cluster, including the buckets that the downed peer was maintaining before it went down. When the downed peer then comes back up with its copies intact, the manager adds those copies to the count, so that some buckets have four copies instead of three. Similarly, there can be an excess of searchable bucket copies if the returned peer was maintaining some searchable bucket copies. These excess copies might be useful later, if another peer maintaining copies of some of those buckets goes down.
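
The arithmetic in this example can be worked through in a few lines. The bucket names and the tally structure below are hypothetical and serve only to illustrate how the counts change; they do not represent how the manager stores its counts.

```python
from collections import Counter

REPLICATION_FACTOR = 3

# Copy counts after bucket fixing has finished: every bucket is back at 3.
copies = Counter({"bucket_a": 3, "bucket_b": 3, "bucket_c": 3})

# The returned peer still holds intact copies of two of those buckets.
returned_peer_buckets = ["bucket_a", "bucket_c"]

# The manager adds the returned copies to its counts.
copies.update(returned_peer_buckets)

# Buckets that now exceed the replication factor, and by how much.
excess = {
    bucket: count - REPLICATION_FACTOR
    for bucket, count in copies.items()
    if count > REPLICATION_FACTOR
}
```

Here bucket_a and bucket_c end up with four copies each, one more than the replication factor requires, while bucket_b stays at three.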

If bucket-fixing is still underway

If the cluster is still replacing the copies that were lost when the peer went down, the return of the peer can curtail the bucket-fixing. Once the manager has added the copies on the returned peer to its counts, it knows that the cluster is complete and valid, and so it will no longer direct the other peers to make copies of those buckets. However, any peers that are currently in the middle of some bucket-fixing activity, such as copying buckets or making copies searchable, will complete their work on those copies. Since bucket-fixing is time-intensive, it is worthwhile to bring a downed peer back online as soon as possible, particularly if the peer was maintaining a large number of bucket copies.
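
The effect on outstanding work can be modeled as a filter over pending fix-up jobs. This is a sketch under assumptions: the FixupJob class, the function name, and the idea of a job queue are hypothetical, and only the replication factor is considered.

```python
from dataclasses import dataclass


@dataclass
class FixupJob:
    """Hypothetical record of one pending bucket-fixing task."""
    bucket: str
    in_progress: bool  # True if a peer is already working on this copy


def jobs_after_peer_return(jobs, copies, returned_buckets, replication_factor):
    """Return (jobs that continue, updated copy counts)."""
    # Add the returned peer's copies to the counts.
    counts = dict(copies)
    for bucket in returned_buckets:
        counts[bucket] = counts.get(bucket, 0) + 1

    kept = []
    for job in jobs:
        # Work already underway runs to completion; work that has not
        # started continues only if the bucket is still short of copies.
        if job.in_progress or counts.get(job.bucket, 0) < replication_factor:
            kept.append(job)
    return kept, counts
```

In this model, a job that was already in progress for a bucket that the returned peer also holds leaves that bucket with one copy more than the replication factor, which is how excess copies can arise even when bucket-fixing is cut short.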

Remove excess bucket copies

If the returning peer results in extra copies of some buckets, you can save disk space by removing the extra copies. See Remove excess bucket copies from the indexer cluster.
