Rebalance the indexer cluster

By rebalancing the indexer cluster, you balance the distribution of bucket copies across the set of peer nodes. A balanced set of bucket copies optimizes each peer's search load and, in the case of data rebalancing, each peer's disk storage.

Types of indexer cluster rebalancing

There are two types of indexer cluster rebalancing:

Primary rebalancing
Data rebalancing

Primary rebalancing

The goal of primary rebalancing is to balance the search load across the peer nodes.

Primary rebalancing redistributes the primary bucket copies across the set of peer nodes. It attempts, to the degree possible, to ensure that each peer has approximately the same number of primary copies.

Primary rebalancing simply reassigns primary markers across the set of existing searchable copies. It does not move searchable copies to different peer nodes. Because of this limitation, primary rebalancing is unlikely to achieve a perfect balance of primaries.

Because primary rebalancing only reassigns markers and does not cause any bucket copies to move between peers, it occurs quickly.

Data rebalancing

The goal of data rebalancing is to balance the storage distribution across the peer nodes.

Data rebalancing redistributes bucket copies, so that each peer has approximately the same number of copies. It balances searchable, non-searchable, and primary copies.

During data rebalancing, the cluster moves bucket copies from peers with more copies to peers with fewer copies. Since this type of rebalancing includes searchable copies, it overcomes the limitation inherent in primary rebalancing and achieves a significantly better balance of primaries.

Because data rebalancing involves significant fixup activity, such as moving bucket copies between peers, it can be a slow and lengthy process.

Rebalance indexer cluster primary bucket copies

When you start or restart a master or peer node, the master rebalances the set of primary bucket copies across the peers, in an attempt to spread the primary copies as equitably as possible. Ideally, if you have four peers and 300 buckets, each peer would hold 75 primary copies. The purpose of primary rebalancing is to even the search load across the set of peers.

How primary rebalancing works

To achieve primary rebalancing, the master reassigns the primary state from existing bucket copies to searchable copies of the same buckets on other peers, as necessary. This rebalancing is a best-effort attempt; there is no guarantee that full, perfect rebalance will result.

Primary rebalancing occurs automatically whenever a peer or master node joins or rejoins the cluster. In the case of a rolling restart, rebalancing occurs once, at the end of the process.

Note: Even though primary rebalancing occurs when a new peer joins the cluster, that peer won't participate in the rebalancing, because it does not yet have any bucket copies. The rebalancing takes place among any existing peers that have searchable bucket copies.

When performing primary rebalancing on a bucket, the master just reassigns the primary state from one searchable copy to another searchable copy of the same bucket, if there is one and if, by doing so, the balance of primaries across peers will improve. It does not cause peers to stream bucket copies, and it does not cause peers to make unsearchable copies searchable. If an existing peer does not have any searchable copies, it will not gain any primaries during rebalancing.

Initiate primary rebalancing manually

If you want to initiate the primary rebalancing process manually, you can either restart a peer or hit the /services/cluster/master/control/control/rebalance_primaries REST endpoint on the master. For example, run this command on the master node:

curl -k -u admin:pass --request POST \
  https://localhost:8089/services/cluster/master/control/control/rebalance_primaries

For more information, refer to the REST API documentation for cluster/master/control/control/rebalance_primaries.

Rebalance primaries on a multisite cluster

There are a few differences in how primary rebalancing works in a multisite cluster. In a multisite cluster, multiple sites typically have full sets of primary copies. When you rebalance the cluster, the rebalancing occurs independently for each site. For example, in a two-site cluster, the cluster separately rebalances the primaries in site1 and site2. It does not shift primaries between the two sites.

The start or restart of a peer on any site triggers primary rebalancing on all sites. For example, if you restart a peer on site1 in a two-site cluster, rebalancing occurs on both site1 and site2.

View the number of primaries on a peer

To gain insight into the primary load for any peer, you can use the cluster/master/peers endpoint to view the number of primaries that the peer currently holds. The primary_count shows the number of primaries that the peer holds for its local site. The primary_count_remote shows the number of primaries that the peer holds for non-local sites, including site0.

By using this endpoint on all your peers, you can determine whether the cluster could benefit from primary rebalancing.

See the REST API documentation for cluster/master/peers.

Summary of indexer cluster primary rebalancing

Primary rebalancing is the rebalancing of the primary assignments across existing searchable copies in the cluster.

It occurs under these circumstances:

A peer joins or rejoins a cluster.
At the end of a rolling restart.
A master rejoins the cluster.
You manually hit the rebalance_primaries REST endpoint on the master.

Rebalance indexer cluster data

To rebalance indexer cluster data, you rebalance the set of bucket copies so that each peer holds approximately the same number of copies. This helps ensure that each peer has a similar storage distribution.

The problem of imbalanced data

Imbalanced data increases the likelihood that one or more peers will run out of disk space and thus transition to the detention state. In the detention state, the peer no longer indexes new data, and so forwarders can no longer send data to that peer. When that happens, a load-balanced forwarder shifts its incoming data to other peers, increasing the indexing burden on those peers. In the worst-case scenario, if the forwarder is not configured for load-balancing, its data gets lost.

In addition, as its existing data ages, the peer in detention will contain relatively older data compared to other peers. Since most searches focus on newer data, this means that the peer's data will typically get searched less frequently, shifting the burden of the search load onto the peers that are not in detention.

Aside from detention considerations, imbalanced data can affect peer utilization during searches. If some peer nodes hold more bucket copies for a particular index compared to other peer nodes, they will have a greater part of the search workload for searches on that index. For this reason, data rebalancing is index-aware. When rebalancing completes, each peer will have approximately the same number of bucket copies for each index.

Conditions that cause imbalanced data

A number of factors can cause an imbalance of bucket copies. These include:

Newly added peer nodes. When you add a new peer, it initially has no bucket copies. Through data rebalancing, you can move copies onto that peer from other peers.
Uneven forwarding of data. If the forwarders are sending more data to some peer nodes, it is likely that those peers will hold more bucket copies. Rebalancing provides a way to move copies from those peers to peers with fewer copies.

What data rebalancing accomplishes

Data rebalancing attempts to achieve an equitable distribution of bucket copies across the set of peer nodes. It factors in several bucket characteristics:

Data rebalancing balances both non-searchable and searchable bucket copies.
Data rebalancing balances for each index, so that, in addition to each peer holding approximately the same number of bucket copies in total, each peer will hold the same number of bucket copies for each index. See Data rebalancing and indexes.
Data rebalancing operates on warm and cold buckets only. It does not rebalance hot buckets.
Data rebalancing operates only on buckets that meet their replication and search factors.

Data rebalancing attempts to achieve a best-effort balance, not a perfect balance. See Configure the data rebalancing threshold.

Limitations of data rebalancing

Data rebalancing has limitations regarding its effect on storage utilization and its effect on concurrent searches:

Storage utilization. Data rebalancing balances the number of bucket copies, not the actual data storage. In addition, data rebalancing attempts to achieve a practical balance, not a perfect balance. In most cases, the process achieves an optimal approximation of balanced storage. Specifically:
- The process assumes that all peers have the same amount of disk storage available for indexes. It is a best practice to use homogeneous instances for your peer nodes.
- The process assumes that all buckets are the same size, although bucket size does sometimes vary.
- The process stops when the number of copies on each peer are within a small range of a perfect balance. It does not ordinarily attempt to put precisely the same number of copies on each peer. See Configure the data rebalancing threshold.
Concurrent searches. Because data rebalancing can cause primary bucket copies to move to new peers, search results are not guaranteed to be complete while data rebalancing continues. For this reason, the best practice is to run data rebalancing during a maintenance window.

How data rebalancing works

The master node controls the data rebalancing process. To achieve the goal of equitable distribution of bucket copies across all peer nodes, the master moves bucket copies from peers with an above-average number of copies to peers with a below-average number of copies. It continues this process until the cluster is balanced, with each peer holding approximately the same number of bucket copies.

To achieve rebalancing, the cluster uses the fundamental processes of bucket fixup. It streams bucket copies from one peer to another. The process of "moving" a copy to another peer thus involves streaming the copy from a peer with an above-average number of copies to one with a below-average number of copies, followed by removal of the extra copy from a peer with an above-average number of copies.

The cluster processes buckets sequentially, one after another, until rebalancing is complete. It does not wait for one bucket to complete before starting rebalancing on the next, so there will typically be a number of buckets simultaneously undergoing rebalancing. See Control rebalancing load.

Data rebalancing removes any excess bucket copies, both at the start of the rebalancing process and during rebalancing, as extra bucket copies are generated.

The master performs primary rebalancing at the end of the data rebalancing process.

The rebalancing process can terminate prematurely, either due to manual intervention or because it times out. Termination conditions are discussed elsewhere in this topic.

Data rebalancing and indexes

The rebalancing process balances the bucket copies by index. When rebalancing completes, each peer holds approximately the same number of bucket copies in total, as well as the same number of copies split by index.

For example, say you have a cluster with four peers and two indexes, index1 and index2. Index1 has 100 bucket copies distributed across all peers. Index2 has 300 copies distributed across all peers, for a total of 400 copies of all buckets across all peers.

Before rebalancing, the bucket distribution across the set of peers might look like this:

Peer1: 110 total
- Index1: 10
- Index2: 100
Peer2: 100 total
- Index1: 50
- Index2: 50
Peer3: 50 total
- Index1: 20
- Index2: 30
Peer4: 140 total
- Index1: 20
- Index2: 120

After rebalancing , the bucket distribution will look approximately like this:

Peer1: 100 total
- Index1: 25
- Index2: 75
Peer2: 100 total
- Index1: 25
- Index2: 75
Peer3: 100 total
- Index1: 25
- Index2: 75
Peer4: 100 total
- Index1: 25
- Index2: 75

Data rebalancing in multisite indexer clusters

Multisite data rebalancing operates in essentially the same way as single-site rebalancing. However, in multisite clusters, the master first balances each bucket across sites to the degree that the site configuration allows. It then balances the bucket within each site.

The degree to which you can balance a multisite cluster across sites depends on the site replication and search factors. For example, if you have a site replication factor of origin:2,total:3, the cluster maintains two-thirds of the copies on their origin site. If one site originates more buckets than another, rebalancing cannot address the resulting imbalance without violating the site replication factor, and so the imbalance will remain. Similarly, the cluster will not rebalance in a way that violates explicit site requirements. Intersite balancing does, however, balance non-explicit copies across sites.

Initiate data rebalancing

You can rebalance the data for all indexes or for just a single index. In addition, you can set a timeout for the rebalancing.

To rebalance the data, run this CLI command on the master node:

splunk rebalance cluster-data -action start [-index index_name] [-max_runtime interval_in_minutes]

Note the following:

Use the optional -index parameter to rebalance just a single index. Otherwise, the command rebalances all indexes.
Use the optional -max_runtime parameter to limit the rebalance activity to a specified number of minutes. The rebalancing stops automatically if the timeout limit is reached, even if there are still more buckets to process. For details on what happens when data rebalancing stops prematurely, see Stop data rebalancing.

You can also initiate rebalancing from the master dashboard. See Use the master dashboard to initiate and configure rebalancing.

Note: It is best to perform data rebalancing during a maintenance window for these reasons:

Data rebalancing can cause primary bucket copies to move to new peers, so search results are not guaranteed to be complete while data rebalancing continues.
The fix-up activities associated with data rebalancing have a low priority compared to other bucket fix-up activities, such as maintaining the replication and search factors, so rebalancing will wait while other fix-up activity completes.

Stop data rebalancing

To stop data rebalancing prematurely, run this CLI command on the master node:

splunk rebalance cluster-data -action stop

For any bucket in the midst of rebalancing when this command occurs, the cluster finishes the current process. It does not initiate any additional processing on the bucket, however. For example, if the cluster is in the midst of copying a non-searchable copy of a bucket, it finishes making the copy. It does not check whether the bucket balance can improve by also processing a searchable copy. It also does not remove any excess copies of the bucket.

View status of data rebalancing

To see whether data rebalancing is running, run this CLI command on the master node:

splunk rebalance cluster-data -action status

You can also use the master dashboard to view rebalancing status. See Use the master dashboard to initiate and configure rebalancing.

Configure the data rebalancing threshold

The master attempts to achieve a reasonable, but not perfect, balance, in which the resulting number of copies on each peer lies within a narrow range to either side of the average number of copies for all peers.

This balance is configurable through the rebalance_threshold attribute in the master's server.conf. You can adjust the setting either directly in server.conf or by means of the CLI. For example:

splunk edit cluster-config -mode master -rebalance_threshold 0.95 -auth admin:your_password

You can also configure the rebalancing threshold through the master dashboard. See Use the master dashboard to initiate and configure rebalancing.

A rebalance_threshold value of 1.00 means that rebalancing will continue until the cluster is fully balanced, with each peer having the same number of copies. The default value is 0.90, which means that rebalancing will continue until each peer is within 90% of a perfect balance.

With the default setting of 0.90, rebalancing continues until the number of copies on each peer lies within a range of .90 to 1.10 of the average. For example, if you have three peers holding between them a total of 300 copies, meaning that there is an average of 100 copies per peer, the rebalancing process stops when every peer holds between 90 and 110 copies.

If you decide instead that a 95% balance is preferable, you can set rebalance_threshold to 0.95. The master will then perform any necessary rebalancing until the number of copies on each peer lies within a range of .95 to 1.05 of the average.

The cluster considers each index individually against the threshold. That is, the goal of the rebalancing is to ensure that each index is balanced to the tolerance set by the rebalance_threshold attribute.

Control rebalancing load

You can configure the rebalancing load to minimize the effect that the consequent fix-up activity has on peer indexing and search performance.

The maximum number of buckets that can be rebalancing at a time is subject to the same attributes that determine peer load for any fix-up activity: max_peer_rep_load and max_peer_build_load in the [clustering] stanza of server.conf.

Data rebalancing uses the value of these attributes reduced by 1, if the attribute value is greater than 1. For example, if you set max_peer_rep_load to 4, then the peer can participate in a maximum of three (not four) concurrent rebalancing replications as a target.

Use the master dashboard to initiate and configure rebalancing

You can initiate and configure rebalancing through the master dashboard. See Configure the master with the dashboard.

Click the Edit button on the upper right side of the dashboard.
Select the Data Rebalance option.
A pop-up window appears with several fields.
Fill out the fields as necessary:
- Threshold. Change the rebalancing threshold.
- Max Runtime. Stop the rebalancing process after a set period of time. If you leave this field empty, rebalancing continues until all peers are within the threshold limit.
- Index. Run rebalancing on a single index or across all indexes.
To start rebalancing, click the Start button.

The window also provides rebalance status information.

Related answers from Splunk Community

Rebalance the indexer cluster

Types of indexer cluster rebalancing

Primary rebalancing

Data rebalancing

Rebalance indexer cluster primary bucket copies

How primary rebalancing works

Initiate primary rebalancing manually

Rebalance primaries on a multisite cluster

View the number of primaries on a peer

Summary of indexer cluster primary rebalancing

Rebalance indexer cluster data

The problem of imbalanced data

Conditions that cause imbalanced data

What data rebalancing accomplishes

Limitations of data rebalancing

How data rebalancing works

Data rebalancing and indexes

Data rebalancing in multisite indexer clusters

Initiate data rebalancing

Stop data rebalancing

View status of data rebalancing

Configure the data rebalancing threshold

Control rebalancing load

Use the master dashboard to initiate and configure rebalancing

Comments

Rebalance the indexer cluster

Was this topic useful?