Bucket replication issues

Network issues impede bucket replication

If there are problems with the connection between peer nodes such that a source peer is unable to replicate a hot bucket to a target peer, the source peer rolls the hot bucket and start a new hot bucket. If it still has problems connecting with the target peer, it rolls the new hot bucket, and so on.

To prevent a situation from arising where a prolonged failure causes the source peer to generate a large quantity of small hot buckets, the source peer, after a configurable number of replication errors to a single target peer, stops rolling hot buckets in response to the connection problem with that target peer. The default is three replication errors. The following banner message then appears one or more times in the master node's dashboard, depending on the number of source peers encountering errors:

Search peer <search peer> has the following message: Too many streaming errors to target=<target 
peer>. Not rolling hot buckets on further errors to this target. (This condition might exist with 
other targets too. Please check the logs.)

While the network problem persists, there might not be replication factor number of copies available for the most recent hot buckets.

If a particular peer is consistently being reported by other peer nodes as being the cause of replication errors, you can temporarily put the peer in manual detention. Once the root problem has been resolved, remove the peer from manual detention. See Put a peer into detention.

Configure the allowable number of replication errors

To adjust the allowable number of replication errors, you can configure the max_replication_errors attribute in server.conf on the source peer. However, it is unlikely that you will need to change the attribute from its default of 3, because replication errors that can be attributed to a single network problem are bunched together and only count as one error. The "Too many streaming errors" banner message might still appear, but it can be ignored.

Note: The bunching of replication errors is a change introduced in release 6.0. With this change, the number of errors will be unlikely to exceed the default value of 3, except in unusual conditions.

Evidence of replication failure on the source peer

Evidence of replication failure appears in the source peer's splunkd.log, with a reference to the failed target peer(s). You can locate the relevant lines in the log by searching on "CMStreamingErrorJob". For example, this grep command finds that there have been 15 streaming errors to the peer with the GUID "B3D35EF4-4BC8-4D69-89F9-3FACEDC3F46E":

grep CMStreamingErrorJob ../var/log/splunk/splunkd.log* | cut -d' ' -f10 | sort |uniq -c | sort -nr
15 failingGuid=B3D35EF4-4BC8-4D69-89F9-3FACEDC3F46E

Unable to disable and re-enable a peer

When you disable an indexer as a peer, any hot buckets that were on the peer at the time it was disabled are rolled to warm and named using the standalone bucket convention. If you later re-enable the peer, a problem arises because the master remembers those buckets as clustered and expects them to be named according to the clustered bucket convention, but instead they are named using the convention for standalone buckets. Because of this naming discrepancy, the peer cannot rejoin the cluster.

To work around this issue, you must clean the buckets or otherwise remove the standalone buckets on the peer before re-enabling it.

Multisite cluster does not meet its replication or search factors

The symptom is a message that the multisite cluster does not meet its replication or search factors. This message can appear, for example, on the master dashboard. This condition occurs immediately after bringing up a multisite cluster.

Compare the values for the single-site replication_factor and search_factor attributes to the number of peers that you have on each site. (If you did not explicitly set the single-site replication and search factors, then they default to 3 and 2, respectively.) These attribute values cannot exceed the number of peers on any site. If either exceeds the number of peers on the smallest site, change it to the number of peers on the smallest site. For example, if the number peers on the smallest site is 2, and you are using the default values of replication_factor=3 and search_factor=2, you must explicitly change the replication_factor to 2.

This condition can occur after you convert a single-site cluster to multisite. If you configure the cluster as multisite from the very beginning, before you first start it up, the issue does not occur.

Related answers from Splunk Community

Bucket replication issues

Network issues impede bucket replication

Configure the allowable number of replication errors

Evidence of replication failure on the source peer

Unable to disable and re-enable a peer

Multisite cluster does not meet its replication or search factors

Comments

Bucket replication issues

Was this topic useful?