Splunk® Enterprise

Managing Indexers and Clusters of Indexers

Deploy multisite indexer clusters with SmartStore

You can deploy multisite indexer clusters with SmartStore to meet disaster recovery requirements. The clusters and their object stores can be hosted with a public cloud provider or in on-premises data centers.

The following SmartStore multisite deployment types are available:

  • Active-active multisite hosted in a public cloud provider (AWS, GCP, or Azure), within a single region across multiple zones
  • Active-active multisite hosted across data centers, using S3 API-compliant on-premises object stores


The multisite deployment must be hosted within either a single public cloud provider or a set of on-prem data centers. Sites cannot span multiple cloud providers, nor can they span a cloud provider and a data center.

Requirements for multisite indexer clusters with SmartStore

There are several layers of requirements to consider when deploying multisite indexer clusters with SmartStore:

Standard requirements for multisite clusters and for SmartStore-enabled clusters

The standard requirements and deployment methods for multisite clusters and for SmartStore-enabled clusters also apply when you combine the two. See the relevant topics in this manual, in particular the topics covering multisite indexer cluster deployment and SmartStore deployment.

Do not proceed with the deployment until you thoroughly understand the requirements and deployment methods for multisite indexer clusters and SmartStore-enabled indexer clusters.

Requirements for all multisite clusters with SmartStore

Multisite clusters with SmartStore have several additional requirements on top of those standard requirements. The requirements listed in this section apply to all deployments of multisite clusters with SmartStore.

Deployment requirements

These requirements apply to all deployments of multisite clusters with SmartStore:

  • Designate one site as primary. This site hosts the active cluster manager node.
  • Each cluster site must have a manager node. Exactly one manager node in the cluster must be active at any time. The other manager nodes must be maintained in standby mode.
  • Search heads are typically deployed on each site, but this is not a strict requirement.
  • Site locations host two object stores in an active-active replicated relationship. Depending on the deployment type, the cluster's peer nodes send data to one or both object stores.

Configuration requirements

These configurations are essential for all multisite clusters with SmartStore:

  • On each search head, in server.conf, set site=site0 to disable search affinity.
  • On the manager node, in server.conf, configure the site_replication_factor and site_search_factor settings to ensure that each site holds at least one searchable copy of each bucket.
  • To ensure that the cluster continues to ingest all incoming data if one or more peer nodes fail, configure the forwarders to load balance their data across all peer nodes on all sites. See Use forwarders to get data into the indexer cluster.
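As an illustrative sketch of the settings above, the configuration might look like the following. All hostnames, site names, ports, and factor values here are hypothetical examples, not values prescribed by this topic; choose factors that match your site count and resiliency needs.

```ini
# server.conf on each search head -- site0 disables search affinity.
# (The manager hostname below is a hypothetical example.)
[general]
site = site0

[clustering]
mode = searchhead
manager_uri = https://manager-site1.example.com:8089
multisite = true

# server.conf on the active manager node. With two sites, these
# example factors place at least one searchable copy of each bucket
# on each site.
[general]
site = site1

[clustering]
mode = manager
multisite = true
available_sites = site1,site2
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2

# outputs.conf on each forwarder -- load balance across peer nodes
# on all sites so ingestion continues if peers fail.
[tcpout:cluster_peers]
server = peer1-site1.example.com:9997,peer1-site2.example.com:9997
```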

Prepare the standby manager nodes

Each site location must host a manager node. A properly functioning cluster has exactly one active manager node. The other manager nodes must be maintained in standby mode, so that they are available if the site with the active manager fails.

See Handle manager site failure for details on how to prepare for and implement manager node failover. These are the main preparatory steps described in that topic:

  1. Set up a manager node on each site.
  2. Set the manager node on the primary site to be the active manager.
  3. Copy over the active manager's configurations to each standby manager.
  4. On an ongoing basis, copy any new or changed configurations from the active to the standby managers.
  5. Develop a plan for explicit failover to a standby manager upon loss of the active manager.

Public cloud provider hosted, within a single region

This deployment type is available with AWS, GCP, or Microsoft Azure as cloud provider. AWS-hosted deployments use S3 as the remote store. GCP-hosted deployments use GCS as the remote store. Azure-hosted deployments use Azure Blob storage as the remote store.

In this deployment type, each indexer cluster site is hosted in a separate zone within a single region. Indexers in each site/zone are configured with the same remote store endpoint. The public cloud provider handles object store failover between zones within a region, in case of any zone failures, including the loss of an entire zone.

In addition, each zone must host a cluster manager node. Only one manager is active at a time. The other managers are prepared as standby failovers.

Deployment topology

These are the essential elements of the deployment topology:

  • Each indexer cluster site resides on a separate zone within a single region.
  • Network latency between zones is less than 15ms.
  • A typical deployment has three sites (zones) in total.
  • Each zone hosts either an active or a standby cluster manager node.
  • All peer nodes across all sites point to a single object store URI endpoint.
  • The public cloud provider handles synchronous replication between object stores across zones in the same region and performs failover when necessary.
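To illustrate the single shared endpoint, a SmartStore volume definition in indexes.conf for an AWS-hosted deployment might look like the following sketch. The bucket name, region, and endpoint are hypothetical; GCP and Azure deployments use the remote.gs and remote.azure setting families instead of remote.s3.

```ini
# indexes.conf -- identical on every peer node, in every site (zone).
# Bucket name and endpoint below are hypothetical examples.
[volume:remote_store]
storageType = remote
path = s3://smartstore-example-bucket
remote.s3.endpoint = https://s3.us-east-1.amazonaws.com

[main]
remotePath = volume:remote_store/$_index_name
```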

Site failure and recovery

If a site fails, note the following:

  • If the failed site was hosting the active cluster manager, you must immediately switch to a standby manager node on one of the remaining sites. See Handle manager site failure.
  • The public cloud provider automatically handles object store failures within a zone.
  • Forwarders configured for load balancing across the cluster automatically redirect to peers on the remaining sites.
  • Any searches in progress when a site fails might return incomplete results if the search heads were accessing peers on the failed site. Post-failure, new searches are automatically redirected to peers on the remaining sites.

When the failed site returns to the cluster, you must handle these cluster manager-related issues:

  1. Ensure that the peer nodes on the recovered site point to the new active manager.
  2. Because the recovered site's manager now serves as a standby manager, include it in future configuration updates, as described in Prepare the standby manager nodes.


On premises hosted, across data centers

This deployment type is available only for S3-API-compliant object stores.

This deployment type is limited to two sites, with each site hosted in an on-premises data center. Both sites host active replicated object stores. One site hosts an active cluster manager and the other site hosts a standby cluster manager.

Deployment topology

These are the essential elements of the deployment topology:

  • Each indexer cluster site resides in an on-premises data center.
  • Network latency between sites is a maximum of 300ms. For best performance, the recommended maximum latency is 100ms.
  • This topology is limited to two sites.
  • Each site location hosts an active object store.
  • Each peer node points to the same remote object store URI through a third-party VIP (virtual IP) or GSLB (global server load balancer).
  • The VIP or GSLB routes traffic from each peer node to the object store hosted on its site's location. In the case of object store failure, the VIP or GSLB reroutes traffic as necessary to the remaining active object store.
  • One site location hosts an active cluster manager, and the other site location hosts a standby cluster manager node.
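As a hypothetical sketch of the VIP/GSLB approach, every peer node in both data centers can reference the same URI, while the load balancer resolves that name to the object store local to the peer's site. The volume and hostname below are illustrative assumptions, not values from this topic.

```ini
# indexes.conf -- identical on every peer node in both data centers.
# objectstore-vip.example.com is a hypothetical VIP/GSLB name that
# resolves to the object store local to the peer's own site; on
# object store failure, it reroutes to the surviving object store.
[volume:remote_store]
storageType = remote
path = s3://smartstore-example-bucket
remote.s3.endpoint = https://objectstore-vip.example.com

[main]
remotePath = volume:remote_store/$_index_name
```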

Object store requirements

For the basic requirements for S3 remote object stores for SmartStore, see Configure the S3 remote store for SmartStore.

Requirements for third-party object stores for on-premises multisite deployments include the following:

  • Must be compliant with the S3 API.
  • Supports bi-directional replication between physical object stores.
  • Retries replication of uploaded objects to the target store to ensure data is synchronized in the case of object store failure and recovery.
  • Supports object versioning similar to AWS S3. Versioning of objects must be based on object creation or modification timestamps, not replication timestamp.
  • Supports delete marker replication.

In addition, the object store vendor must provide assurance that maximum replication lag time between remote stores does not exceed your Recovery Point Objective (RPO) requirements.

Details of normal operation

Peer nodes in SmartStore-enabled multisite clusters upload and download data from the remote store no differently from peer nodes in SmartStore-enabled single-site clusters. All peer nodes across the cluster must use an identical URI in the path setting to identify the remote store. The difference is that the URI points to a VIP or GSLB that routes traffic to the active remote store on the peer's site location (data center). Thus, in normal operation, the peers always upload data to, and download data from, their local remote store.

Data uploaded to one remote store is replicated to the other remote store, using the capability provided by the remote store vendor. Replication solutions usually rely on asynchronous replication, which introduces a lag that can result in temporary unavailability of the most recent data replicated to the non-local store. The amount of replication lag varies according to the upload traffic, available network bandwidth between locations, and vendor-specific replication latency.

Replication lag is an inherent characteristic of asynchronous replication. Splunk Enterprise provides no additional protections for data that has not yet been replicated to a second object store due to replication lag.

Because search affinity is disabled, search heads can access peer nodes on either site to fulfill search requests. This can result in WAN traffic across data centers.

Search requests that involve data not resident on local peer node caches can return incomplete data if the requested data is not yet available on the local object store but is in the process of replication to the local object store. This condition affects recent datasets within the replication lag time frame.

Site failure and recovery

If a site location fails, note the following:

  • If the failed site was hosting the active cluster manager, you must immediately switch to the standby manager node on the remaining site. See Handle manager site failure.
  • Forwarders configured for load balancing across the cluster automatically redirect to peers on the remaining site.
  • Any searches in progress when a site fails might return incomplete results if the search heads were accessing peers on the failed site. Post-failure, new searches are automatically redirected to peers on the remaining site.
  • Data uploaded to the remaining object store is not replicated to the object store on the failed location.
  • Due to replication lag, the remaining object store might be missing data recently uploaded to the failed object store. This situation is temporary and will be rectified when the failed object store returns to service.

In the event of a catastrophic failure that results in permanent failure of an object store, data recently uploaded to that object store which was not replicated to the remaining object store might be permanently lost. In such a case, the cluster emits incomplete search results warnings and bucket fixup errors in response to attempts to access the lost buckets. To eliminate these warnings and errors, contact Splunk Customer Support for assistance in removing the corresponding bucket metadata.

When the failed site returns to the cluster, you must handle these cluster manager-related issues:

  1. Ensure that the peer nodes on the recovered site point to the new active manager.
  2. Because the recovered site's manager now serves as the standby manager, include it in future configuration updates, as described in Prepare the standby manager nodes.

When the failed object store comes back online, object store replication restarts, with the object stores also replicating any data that was unsent at the time of the failure, as well as data uploaded to the remaining object store while the other object store was down. The amount of time required to resync the object stores depends on factors such as the length of time the failure lasted (and thus the amount of accumulated unreplicated data) and replication throughput.

While the object stores are resyncing, search requests might produce incomplete results. Since the recovering object store can be missing significant amounts of recent data, it takes some time for it to catch up through the resyncing process. Partial search results can occur from attempts to access data on the recovering object store that has not yet been synced to that store.

Object-store-only failure and recovery

In cases where an object store fails but its corresponding cluster site remains active, the VIP or GSLB redirects traffic from the peer nodes on the site to the remaining object store. This results in increased WAN traffic across data centers.

Due to replication lag, the remaining object store might be missing data recently uploaded to the failed object store, possibly resulting in incomplete search results. This situation is temporary and will be rectified when the failed object store returns to service.

In the event of a catastrophic failure that results in permanent failure of an object store, data recently uploaded to that object store which was not replicated to the remaining object store might be permanently lost. In such a case, the cluster emits incomplete search results warnings and bucket fixup errors in response to attempts to access the lost buckets. To eliminate these warnings and errors, contact Splunk Customer Support for assistance in removing the corresponding bucket metadata.

When the failed object store comes back online, object store replication restarts, with the object stores also replicating any data that was unsent at the time of the failure, as well as data uploaded to the remaining object store while the other object store was down. The amount of time required to resync the object stores depends on factors such as the length of time the failure lasted (and thus the amount of accumulated unreplicated data) and replication throughput.

While the object stores are resyncing, search requests might produce incomplete results. Since the recovering object store can be missing significant amounts of recent data, it takes some time for it to catch up through the resyncing process. Partial search results can occur from attempts to access data on the recovering object store that has not yet been synced to that store.

Last modified on 04 August, 2022

This documentation applies to the following versions of Splunk® Enterprise: 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4, 9.0.5, 9.0.6, 9.0.7, 9.0.8, 9.0.9, 9.0.10, 9.1.0, 9.1.1, 9.1.2, 9.1.3, 9.1.4, 9.1.5, 9.1.6, 9.1.7, 9.2.0, 9.2.1, 9.2.2, 9.2.3, 9.2.4, 9.3.0, 9.3.1, 9.3.2, 9.4.0

