Configure the site replication factor

Read this first

Before attempting to configure the site replication factor, you must be familiar with:

The basic, single-site replication factor. See "The basics of indexer cluster architecture" and "Replication factor".
Multisite cluster configurations. See "Configure multisite indexer clusters with server.conf".

What is a site replication factor?

To implement multisite indexer clustering, you must configure a site replication factor. This replaces the standard replication factor, which is specific to single-site deployments. You specify the site replication factor on the manager node, as part of the basic configuration of the cluster.

The site replication factor provides site-level control over the location of bucket copies, in addition to providing control over the total number of copies across the entire cluster. For example, you can specify that a two-site cluster maintain a total of three copies of all buckets, with one site maintaining two copies and the second site maintaining one copy.

You can also specify a replication policy based on which site originates the bucket. That is, you can configure the replication factor so that a site receiving external data maintains a greater number of copies of buckets for that data than for data that it does not originate. For example, you can specify that each site maintains two copies of all data that it originates but only one copy of data originating on another site.

Syntax

You configure the site replication factor with the site_replication_factor attribute in the manager's server.conf file. The attribute resides in the [clustering] stanza, in place of the single-site replication_factor attribute. For example:

[clustering]
mode = manager
multisite=true
available_sites=site1,site2
site_replication_factor = origin:2,total:3
site_search_factor = origin:1,total:2

You can also use the CLI to configure the site replication factor. See "Configure multisite indexer clusters with the CLI".

You must configure the site_replication_factor attribute correctly. Otherwise, the manager will not start.

For disaster recovery purposes, be sure to configure the site replication factor so that the cluster maintains copies of each bucket on multiple sites.

Here is the formal syntax:

site_replication_factor = origin:<n>, [site1:<n>,] [site2:<n>,] ..., total:<n>

where:

<n> is a positive integer indicating the number of copies of a bucket.
origin:<n> specifies the minimum number of copies of a bucket that will be held on the site originating the data in that bucket (that is, the site where the data first entered the cluster). When a site is originating the data, it is known as the "origin" site.
site1:<n>, site2:<n>, ..., indicates the minimum number of copies that will be held at each specified site. The identifiers "site1", "site2", and so on, are the same as the site attribute values specified on the peer nodes.
total:<n> specifies the total number of copies of each bucket, across all sites in the cluster.

Note the following:

This attribute specifies the per-site replication policy. It is specified globally and applies to all buckets in all indexes.

This attribute is valid only if mode=manager and multisite=true. Under those conditions, it supersedes any replication_factor attribute.

The origin and total values are required.

Site values (site1:<n>, site2:<n>, ...) are optional. A site that is specified here is known as an "explicit" site. A site that is not specified is known as a "non-explicit" site.

Here is how the cluster determines the minimum number of copies a site gets:
- When a site is functioning as origin, the minimum number of copies the site gets is the greater of either its site value, if any, or origin.
- When a site is not functioning as origin, the site value, if any, determines the minimum number of copies the site gets.
- A non-explicit site is not guaranteed any copies except when it is functioning as the origin site.

For example, in a four-site cluster with "site_replication_factor = origin:2, site1:1, site2:2, site3:3, total:8", site1 gets two copies of any data that it originates and one copy of data that any other site originates. Site2 gets two copies of data, whether or not it originates it. Site3 gets three copies of data, whether or not it originates it. The non-explicit site4 gets two copies of data that it originates, two copies of data that site2 or site3 originates, and one copy of data that site1 originates. (Site4 gets the number of copies necessary to ensure that the number of copies of each bucket meets the total value of 8.) Here is how you calculate site4's number of copies, according to where the data originates:

originate at site1 -> 2 copies in site1, 2 copies in site2, 3 copies in site3, 1 copy in site4 => total=8
originate at site2 -> 1 copy in site1, 2 copies in site2, 3 copies in site3, 2 copies in site4 => total=8
originate at site3 -> 1 copy in site1, 2 copies in site2, 3 copies in site3, 2 copies in site4 => total=8
originate at site4 -> 1 copy in site1, 2 copies in site2, 3 copies in site3, 2 copies in site4 => total=8

When specifying the site_replication_factor, here is how you determine the minimum required value for total, based on the site and origin values:
- If there are some non-explicit sites, then the total value must be at least the sum of all explicit site and origin values.
  
  For example, with a three-site cluster and "site_replication_factor = origin:2, site1:1, site2:2", the total must be at least 5: 2+1+2=5. For another three-site cluster with "site_replication_factor = origin:2, site1:1, site3:3", the total must be at least 6: 2+1+3=6.
- If all sites are explicit, then the total must be at least the minimum value necessary to fulfill the dictates of the site and origin values.
  
  For example, with a three-site cluster and "site_replication_factor = origin:1, site1:1, site2:1, site3:1", the total must be at least 3, because that configuration never requires more than three copies. For a three-site cluster and "site_replication_factor = origin:2, site1:1, site2:1, site3:1", the total must be at least 4, because one of the sites will always be the origin and thus require two copies, while the other sites will each require only one. For a three-site cluster and "site_replication_factor = origin:3, site1:1, site2:2", site3:3", the total must be at least 8, to cover the case where site1 is the origin.
  
  The easiest way to determine this is to substitute the origin value for the smallest site value and then sum the site values (including the substituted origin value). So, in the last example ("site_replication_factor = origin:3, site1:1, site2:2", site3:3"), site1 has the smallest value, which is 1. You substitute the origin's 3 for that 1 and then add the site2 and site3 values: 3+2+3=8.

Because the total value can be greater than the total set of explicit values, the cluster needs a strategy to handle any "remainder" bucket copies. Here is the strategy:
- If copies remain to be assigned after all site and origin values have been satisfied, those remainder copies are distributed across all sites, with preference given to sites with less or no copies, so that the distribution is as even as possible. Assuming that there are enough remainder copies available, each site will have at least one copy of the bucket.
  
  For example, given a four-site cluster with "site_replication_factor = origin:1, site1:1, site4:2, total:4", if site1 is the origin, there will be one remainder copy. This copy gets distributed randomly to either site2 or site3. However, if site2 is the origin, it gets one copy, leaving no remainder copies to distribute to site3.
  
  In another example, given a four-site cluster where "site_replication_factor = origin:2, site1:2, site4:2, total:7", if site1 is the origin, there are three remainder copies to distribute. Because site2 and site3 have no explicitly assigned copies, the three copies are distributed between them, with each site getting at least one copy. If site2 is the origin, however, it gets two copies and site3 gets the one remainder copy.
  
  This entire process depends on the availability of a sufficient number of peers on each site. If a site does not have enough peers available to accept additional copies, the copies go to sites with available peers. In any case, at least one copy will be distributed or reserved for each site, assuming enough copies are available.
  
  Here are a few more examples:
  - A three-site cluster with "origin:1, total:3": The distribution guarantees one copy per site. Note: If you prefer, you can instead explicitly specify a copy for each site.
  - A three-site cluster with "origin:1, total:4": The distribution guarantees one copy per site, with one additional copy going to a site that has at least two peers.
  - A three-site cluster with "origin:1, total:9", where site1 has only one peer and site2 and site3 have 10 peers each: The distribution guarantees one copy per site, with the six remaining copies distributed evenly between site2 and site3.
- If all peers on one non-explicit site are down and there are still remainder copies after all other non-explicit sites have received a copy, the cluster will reserve one of the remainder copies for that site, pending the return of its peers. During that time, the site_replication_factor cannot be met, because the total number of copies distributed will be one less than the specified total value, due to the copy that is being held in reserve for the site with the downed peers.
  
  For example, given a four-site cluster with "site_replication_factor = origin:1, site1:1, site4:2, total:5", if site1 is the origin, there will be two remainder copies to distribute between site2 and site3. If all the peers on site2 are down, one remainder copy goes to site3 and the other copy will be held in reserve until one of the site2 peers rejoins the cluster. During that time, the site_replication_factor is not met. However, given a four-site cluster with "site_replication_factor = origin:1, site1:1, site4:2, total:4" (the only difference from the last example being that the total value is 4 instead of 5), if site1 is the origin, there will be only one remainder copy, which will go to either site2 or site3. If all the peers on site2 are down, the copy will go to site3 and no copy will be held in reserve for site2. The site_replication_factor is met in this example, because no copies are being held in reserve for site2.

Each site must deploy a set of peers at least as large as the greater of the origin value or its site value.

For example, given a three-site cluster with "site_replication_factor = origin:2, site1:1, site2:2, site3:3, total:7", the sites must have at least the following number of peers: site1: 2 peers; site2: 2 peers; site3: 3 peers.

The total number of peers deployed across all sites must be greater than or equal to the total value.

The total value must be at least as large as the replication_factor attribute, which has a default value of 3. Therefore, if the total value is 2, you should explicitly set the value of replication_factor to 2.

If you are migrating from a single-site cluster, the total value must be at least as large as the replication_factor for the single-site cluster. See "Migrate an indexer cluster from single-site to multisite".

The attribute defaults to: "origin:2, total:3."

Examples

A cluster of two sites (site1, site2), with the default "site_replication_factor = origin:2, total:3": For any given bucket, the site originating the data stores two copies. The remaining site stores one copy.

A cluster of three sites (site1, site2, site3), with the default "site_replication_factor = origin:2, total:3": For any given bucket, the site originating the data stores two copies. One of the two non-originating sites, selected at random, stores one copy, and the other doesn't store any copies.

A cluster of three sites (site1, site2, site3), with "site_replication_factor = origin:1, site1:1, site2:1, site3:2, total:5": For all buckets, site1 and site2 store a minimum of one copy each and site3 stores two copies. The fifth copy gets distributed to either site1 or site2, because those sites have fewer assigned copies than site3.

A cluster of three sites (site1, site2, site3), with "site_replication_factor = origin:2, site1:1, site2:1, total:4": Site1 stores two copies of any bucket it is originating and one copy of any other bucket. Site2 follows the same pattern. Site3, whose site value is not explicitly defined, follows the same pattern.

A cluster of three sites (site1, site2, site3), with "site_replication_factor = origin:2, site1:1, site2:2, total:5": Site1 stores two copies of any bucket it originates, one or two copies of any bucket site2 originates, and one copy of any bucket that site3 originates. Site2 stores two copies of any bucket, whether or not it originates it. Site3, whose site value is not explicitly defined, stores two copies of any bucket it originates, one copy of any bucket site1 originates, and one or two copies of any bucket site2 originates. (When site2 originates a bucket, one copy remains after initial assignments. The manager assigns this randomly to site1 or site3.)

A cluster of three sites with "site_replication_factor = origin:1, total:4": Four copies of each bucket are distributed randomly across all sites, with each site getting at least one copy.

Handle the persistence of the single-site replication_factor

Important: Although the site_replication_factor effectively replaces the single-site replication_factor, the single-site attribute continues to exist in the manager's configuration, with its default value of 3. This can cause problems on start-up if any site has fewer than three peer nodes. The symptom will be a message that the multisite cluster does not meet its replication factor. For example, if one of your sites has only two peers, the default single-site replication factor of 3 will cause the error to occur. To avoid or fix this problem, you must set the single-site replication factor to a value that is less than or equal to the smallest number of peers on any site. In the case where one site has only two peers, you must therefore explicitly set the replication_factor attribute to a value of 2. See "Multisite cluster does not meet its replication or search factors."

Related answers from Splunk Community

Configure the site replication factor

Read this first

What is a site replication factor?

Syntax

Examples

Handle the persistence of the single-site replication_factor

Comments

Configure the site replication factor

Was this topic useful?