Deploy multisite indexer clusters with SmartStore

You can deploy multisite indexer clusters with SmartStore to meet disaster recovery requirements. The clusters and their object stores can be hosted with a public cloud provider or in on-premises data centers.

The following SmartStore multisite deployment types are available:

Active-active multisite hosted in a public cloud provider (AWS), within a single region across multiple zones
Active-active multisite hosted across data centers, using S3 API-compliant on-premises object stores

SmartStore support for public cloud provider hosted solutions is limited to AWS only and does not apply to other public cloud providers, including GCP and Azure.

The multisite deployment must be hosted entirely within a single public cloud provider or entirely within a set of on-premises data centers. Sites cannot span a cloud provider and a data center.

Requirements for multisite indexer clusters with SmartStore

There are several layers of requirements to consider when deploying multisite indexer clusters with SmartStore:

Standard requirements for all multisite clusters and for all SmartStore-enabled clusters. See Standard requirements for multisite clusters and for SmartStore-enabled clusters.
Requirements for all multisite clusters with SmartStore, independent of deployment type. See Requirements for all multisite clusters with SmartStore.
Specific requirements for each deployment type.
- For public cloud provider hosted, see Public cloud provider hosted, within a single region.
- For on-premises hosted, see On premises hosted, across data centers.

Standard requirements for multisite clusters and for SmartStore-enabled clusters

The standard requirements and deployment methods for multisite clusters and for SmartStore-enabled clusters also apply to the combination of multisite clusters with SmartStore. See the relevant topics in this manual.

Do not proceed with the deployment until you thoroughly understand the requirements and deployment methods for multisite indexer clusters and SmartStore-enabled indexer clusters.

Requirements for all multisite clusters with SmartStore

Multisite clusters with SmartStore have several additional requirements on top of those standard requirements. The requirements listed in this section apply to all deployments of multisite clusters with SmartStore.

Deployment requirements

These requirements apply to all deployments of multisite clusters with SmartStore:

Designate one site as primary. This site hosts the active cluster master.
Each cluster site must have a master node. Exactly one master node in the cluster must be active at any time. The other master nodes must be maintained in standby mode.
Search heads are typically deployed on each site, but this is not a strict requirement.
Site locations host two object stores in an active-active replicated relationship. Depending on the deployment type, the set of cluster peer nodes can be sending data to one or both object stores.

Configuration requirements

These configurations are essential for all multisite clusters with SmartStore:

On each search head, in server.conf, set site=site0 to disable site affinity.
On the master node, in server.conf, configure site_replication_factor and site_search_factor to ensure that each site holds at least one searchable copy of each bucket.
To ensure that the cluster continues to ingest all incoming data if one or more peer nodes fail, configure the forwarders to load balance their data across all peer nodes on all sites. See Use forwarders to get data into the indexer cluster.
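
As a minimal sketch of how the search head requirement above could be checked, the following Python snippet parses a server.conf fragment and confirms that site is set to site0. The fragment, the function name, and the assumption that the setting lives in the [general] stanza are illustrative; verify the stanza against the server.conf specification for your version before relying on it.

```python
import configparser

# Hypothetical server.conf fragment for a search head; values are illustrative.
SEARCH_HEAD_SERVER_CONF = """
[general]
site = site0
"""


def site_affinity_disabled(conf_text: str) -> bool:
    """Return True if the [general] stanza sets site to site0."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    # fallback covers both a missing stanza and a missing setting
    return parser.get("general", "site", fallback="").strip() == "site0"


print(site_affinity_disabled(SEARCH_HEAD_SERVER_CONF))
```

A check like this could run as part of a pre-deployment review of each search head's configuration. It does not validate the master node's replication and search factors.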

Prepare the standby master nodes

Each site location must host a master node. A properly functioning cluster has exactly one active master node. The other master nodes must be maintained in standby mode, so that they are available if the site with the active master fails.

See Handle master site failure for details on how to prepare for and implement master node failover. These are the main preparatory steps described in that topic:

Set up a master node on each site.
Set the master node on the primary site to be the active master.
Copy over the active master's configurations to each standby master.
On an ongoing basis, copy any new or changed configurations from the active to the standby masters.
Develop a plan for explicit failover to a standby master upon loss of the active master.
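
The ongoing copy step above is easy to neglect. One way to detect drift between the active master and a standby is to compare file digests of their configuration directories. The following Python sketch illustrates the idea. The function names are hypothetical, and which directories to compare is left to the caller, because the set of configurations to copy is described in Handle master site failure, not here.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def drifted_files(active: dict[str, str], standby: dict[str, str]) -> list[str]:
    """Return files that are missing or different on the standby."""
    return sorted(
        name for name, digest in active.items() if standby.get(name) != digest
    )
```

For example, calling drifted_files(file_digests(active_dir), file_digests(standby_dir)) with two locally mounted copies of the configuration directories returns the files that still need to be copied to the standby.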

Public cloud provider hosted, within a single region

In this deployment type, each indexer cluster site is hosted on a separate AZ within a single region. One of the AZs also hosts a primary active remote store, and all peer nodes across all sites on the cluster point to that single, primary remote store. The public cloud provider itself handles object store failover needs by maintaining a secondary active store and performing automatic failover in case the primary store fails for any reason, including the loss of an AZ.

In addition, each AZ must host a cluster master node. Only one master is active at a time. The other masters are prepared as standby failovers.

Deployment topology

These are the essential elements of the deployment topology:

Each indexer cluster site resides on a separate AZ within a single region.
Network latency between AZs is less than 15ms.
A typical deployment has three sites (AZs) in total.
Each AZ hosts either an active or a standby cluster master node.
All peer nodes across all sites point to a single object store URI endpoint.
The public cloud provider handles synchronous replication between S3 object stores across AZs in the same region and performs failover when necessary.

Site failure and recovery

If a site fails, note the following:

If the failed site was hosting the active cluster master, you must immediately switch to a standby master node on one of the remaining sites. See Handle master site failure.
The public cloud provider automatically handles S3 failures within an AZ.
Forwarders configured for load balancing across the cluster automatically redirect to peers on the remaining sites.
Any searches in progress when a site fails might return incomplete results if the search heads were accessing peers on the failed site. Post-failure, new searches are automatically redirected to peers on the remaining sites.

When the failed site returns to the cluster, you must handle these cluster master-related issues:

Ensure that the peer nodes on the recovered site point to the new active master.
Because the recovered site's master now serves as a standby master, include it in future configuration updates, as described in Prepare the standby master nodes.

On premises hosted, across data centers

This deployment type is limited to two sites, with each site hosted in an on-premises data center. Both sites host active replicated object stores. One site hosts an active cluster master and the other site hosts a standby cluster master.

Deployment topology

These are the essential elements of the deployment topology:

Each indexer cluster site resides in an on-premises data center.
Network latency between sites is a maximum of 300ms. For best performance, the recommended maximum latency is 100ms.
This topology is limited to two sites.
Each site location hosts an active object store.
Each peer node points to the same remote object store URI through a third-party VIP or GSLB.
The VIP or GSLB routes traffic from each peer node to the object store hosted on its site's location. In the case of object store failure, the VIP or GSLB reroutes traffic as necessary to the remaining active object store.
One site location hosts an active cluster master, and the other site location hosts a standby cluster master node.
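
The latency limits in the list above can be expressed as a simple check. This is a sketch only: the function and constant names are hypothetical, and how you measure latency between sites is up to you.

```python
# Thresholds taken from the on-premises topology requirements above.
ON_PREM_MAX_MS = 300.0
ON_PREM_RECOMMENDED_MS = 100.0


def classify_intersite_latency(latency_ms: float) -> str:
    """Classify a measured inter-site latency against the stated limits."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms <= ON_PREM_RECOMMENDED_MS:
        return "recommended"
    if latency_ms <= ON_PREM_MAX_MS:
        return "supported"
    return "unsupported"
```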

Object store requirements

For the basic requirements for SmartStore remote object stores, see Configure the remote store for SmartStore.

Requirements for third-party object stores for on-premises multisite deployments include the following:

Must be compliant with the S3 API.
Supports bi-directional replication between physical object stores.
Retries replication of uploaded objects to the target store to ensure data is synchronized in the case of object store failure and recovery.
Supports object versioning similar to AWS S3. Versioning of objects must be based on object creation or modification timestamps, not replication timestamp.
Supports delete marker replication.

In addition, the object store vendor must provide assurance that maximum replication lag time between remote stores does not exceed your Recovery Point Objective (RPO) requirements.
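
To illustrate the RPO requirement, the following Python sketch computes the worst observed replication lag from a set of samples and compares it with an RPO. It assumes you can obtain, for each sampled object, the time it was uploaded to one store and the time it became available on the other. How those timestamps are collected is vendor-specific and not covered here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def max_replication_lag(samples: list[tuple[datetime, datetime]]) -> timedelta:
    """Return the largest lag among (uploaded_at, replicated_at) pairs."""
    if not samples:
        raise ValueError("at least one sample is required")
    return max(replicated - uploaded for uploaded, replicated in samples)


def meets_rpo(samples: list[tuple[datetime, datetime]], rpo: timedelta) -> bool:
    """True if the worst observed lag does not exceed the RPO."""
    return max_replication_lag(samples) <= rpo
```

Observed samples can show only that the RPO was met during the measurement window. The vendor's assurance about maximum lag is still required.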

Details of normal operation

Peer nodes in SmartStore-enabled multisite clusters upload and download data from the remote store no differently from peer nodes in SmartStore-enabled single-site clusters. The path setting on all peer nodes across the cluster must share an identical URI to identify the remote store. The difference is that the URI points to a VIP or GSLB that routes traffic to the active remote store on the peer's site location (data center). Thus, in normal operation, the peers always upload data to, and download data from, their local remote store.
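
The routing behavior of the VIP or GSLB, including rerouting when an object store fails, can be modeled roughly as follows. This Python sketch is a conceptual illustration, not the behavior of any particular VIP or GSLB product. The ObjectStore class and route function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObjectStore:
    site: str
    healthy: bool = True


def route(peer_site: str, stores: list[ObjectStore]) -> ObjectStore:
    """Prefer the healthy store on the peer's own site, else any healthy store."""
    healthy = [store for store in stores if store.healthy]
    if not healthy:
        raise RuntimeError("no object store is available")
    for store in healthy:
        if store.site == peer_site:
            return store
    # Local store is down: fall back to the remaining store (crosses the WAN).
    return healthy[0]
```

In normal operation, route("site1", stores) returns the store on site1. If that store is marked unhealthy, the same call returns the store on the other site, which corresponds to the increased WAN traffic that occurs after an object store failure.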

Data uploaded to one remote store is replicated to the other remote store, using the capability provided by the remote store vendor. Replication solutions usually rely on asynchronous replication, which introduces a lag that can result in temporary unavailability of the most recent data replicated to the non-local store. The amount of replication lag varies according to the upload traffic, available network bandwidth between locations, and vendor-specific replication latency.

Replication lag is an inherent characteristic of asynchronous replication. Splunk Enterprise provides no additional protections for data that has not yet been replicated to a second object store due to replication lag.

Because site affinity is disabled, search heads can access peer nodes on either site to fulfill search requests. This can result in WAN traffic across data centers.

Search requests that involve data not resident on local peer node caches can return incomplete data if the requested data is not yet available on the local object store but is in the process of replication to the local object store. This condition affects recent datasets within the replication lag time frame.

Site failure and recovery

If a site location fails, note the following:

If the failed site was hosting the active cluster master, you must immediately switch to the standby master node on the remaining site. See Handle master site failure.
Forwarders configured for load balancing across the cluster automatically redirect to peers on the remaining site.
Any searches in progress when a site fails might return incomplete results if the search heads were accessing peers on the failed site. Post-failure, new searches are automatically redirected to peers on the remaining site.
Data uploaded to the remaining object store is not replicated to the object store on the failed location.
Due to replication lag, the remaining object store might be missing data recently uploaded to the failed object store. This situation is temporary and will be rectified when the failed object store returns to service.

In the event of a catastrophic failure that results in permanent failure of an object store, data recently uploaded to that object store which was not replicated to the remaining object store might be permanently lost. In such a case, the cluster emits incomplete search results warnings and bucket fixup errors in response to attempts to access the lost buckets. To eliminate these warnings and errors, contact Splunk Customer Support for assistance in removing the corresponding bucket metadata.

When the failed site returns to the cluster, you must handle these cluster master-related issues:

Ensure that the peer nodes on the recovered site point to the new active master.
Because the recovered site's master now serves as the standby master, include it in future configuration updates, as described in Prepare the standby master nodes.

When the failed object store comes back online, object store replication restarts, with the object stores also replicating any data that was unsent at the time of the failure, as well as data uploaded to the remaining object store while the other object store was down. The amount of time required to resync the object stores depends on factors such as the length of time the failure lasted (and thus the amount of accumulated unreplicated data) and replication throughput.
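
As a rough illustration of how these factors interact, the following Python sketch estimates resync time from the accumulated backlog and the replication throughput. It assumes a deliberately simplified model with constant rates, in which new uploads during the resync reduce the throughput available for the backlog. The function and parameter names are hypothetical, and real replication behavior depends on the object store vendor.

```python
def estimate_resync_seconds(
    backlog_bytes: float,
    replication_bytes_per_sec: float,
    new_upload_bytes_per_sec: float = 0.0,
) -> float:
    """Estimate seconds to drain the replication backlog at constant rates."""
    if backlog_bytes < 0 or new_upload_bytes_per_sec < 0:
        raise ValueError("backlog and upload rate cannot be negative")
    net_rate = replication_bytes_per_sec - new_upload_bytes_per_sec
    if net_rate <= 0:
        # Replication cannot keep up with new uploads: backlog never drains.
        return float("inf")
    return backlog_bytes / net_rate
```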

While the object stores are resyncing, search requests might produce incomplete results. Since the recovering object store can be missing significant amounts of recent data, it takes some time for it to catch up through the resyncing process. Partial search results can occur from attempts to access data on the recovering object store that has not yet been synced to that store.

Object-store-only failure and recovery

In cases where an object store fails but its corresponding cluster site remains active, the VIP or GSLB redirects traffic from the peer nodes on the site to the remaining object store. This results in increased WAN traffic across data centers.

Due to replication lag, the remaining object store might be missing data recently uploaded to the failed object store, possibly resulting in incomplete search results. This situation is temporary and will be rectified when the failed object store returns to service.

In the event of a catastrophic failure that results in permanent failure of an object store, data recently uploaded to that object store that was not replicated to the remaining object store might be permanently lost. In such a case, the cluster emits incomplete search results warnings and bucket fixup errors in response to attempts to access the lost buckets. To eliminate these warnings and errors, contact Splunk Customer Support for assistance in removing the corresponding bucket metadata.

When the failed object store comes back online, object store replication restarts, with the object stores also replicating any data that was unsent at the time of the failure, as well as data uploaded to the remaining object store while the other object store was down. The amount of time required to resync the object stores depends on factors such as the length of time the failure lasted (and thus the amount of accumulated unreplicated data) and replication throughput.

While the object stores are resyncing, search requests might produce incomplete results. Since the recovering object store can be missing significant amounts of recent data, it takes some time for it to catch up through the resyncing process. Partial search results can occur from attempts to access data on the recovering object store that has not yet been synced to that store.
