SmartStore architecture overview

The architectural goal of the SmartStore feature is to minimize the amount of data on local storage, while maintaining the fast indexing and search capabilities characteristic of Splunk Enterprise deployments. Except in a few uncommon scenarios, indexers return search results for SmartStore-enabled indexes with speeds similar to those for non-SmartStore indexes.

SmartStore in a nutshell

A SmartStore-enabled index minimizes its use of local storage, with the bulk of its data residing on remote object stores such as S3 or GCS.

Indexing and searching of data occurs on the indexer, just as in a traditional deployment that stores all data locally.

The key difference with SmartStore is that the remote object store becomes the location for master copies of warm buckets, while the indexer's local storage is used to cache copies of warm buckets currently participating in a search or that have a high likelihood of participating in a future search.

A cache manager on the indexer fetches copies of warm buckets from the remote store and places them in the indexer's local cache when the buckets are needed for a search. The cache manager also evicts warm bucket copies from the cache once their likelihood of being searched again diminishes.

Buckets and SmartStore

With SmartStore indexes, as with non-SmartStore indexes, hot buckets are built in the indexer's local storage cache. However, with SmartStore indexes, when a bucket rolls from hot to warm, a copy of the bucket is uploaded to remote storage. The remote copy then becomes the master copy of the bucket.

Eventually, the cache manager evicts the local bucket copy from the cache. When the indexer needs to search a warm bucket for which it doesn't have a local copy, the cache manager fetches a copy from remote storage and places it in the local cache.

The remote storage has a copy of every warm bucket.

Each indexer's local cache contains several types of data:

Hot buckets. Hot buckets are created in local storage. They continue to reside solely on the indexer until they roll to warm.
Copies of warm buckets that are currently participating in searches.
Copies of recently created or recently searched warm buckets. The indexer maintains a cache of warm buckets, to minimize the need to fetch the same buckets from remote storage repeatedly.
Metadata for remote buckets. The indexer maintains a small amount of information about each bucket in remote storage.

The buckets of SmartStore indexes ordinarily have just two active states: hot and warm. The cold state, which is used with non-SmartStore indexes to distinguish older data eligible for moving to cheap storage, is not necessary with SmartStore because warm buckets already reside on inexpensive remote storage. Buckets roll directly from warm to frozen.

Cold buckets can, in fact, exist in a SmartStore-enabled index, but only under limited circumstances. Specifically, if you migrate an index from non-SmartStore to SmartStore, any migrated cold buckets use the existing cold path as their cache location, post-migration.

In all respects, cold buckets are functionally equivalent to warm buckets. The cache manager manages the migrated cold buckets in the same way that it manages warm buckets. The only difference is that the cold buckets will be fetched into the cold path location, rather than the home path location.

The cache manager

The indexer's cache manager manages the local cache. It fetches copies of warm buckets from remote storage when the buckets are needed for a search. It also evicts buckets from the cache, based on factors such as the bucket's search frequency, its data recency, and various other, configurable criteria.

Certain characteristics that are common to the great majority of Splunk platform searches drive the cache manager's strategy for optimizing bucket location. Specifically, most searches have these characteristics:

They occur over near-term data. 97% of searches look back 24 hours or less.
They have spatial and temporal locality. If a search finds an event at a specific time or in a specific log, it's likely that other searches will look for events within a closely similar time range or in that log.

The cache manager therefore favors recently created buckets and recently accessed buckets, ensuring that most of the data that you are likely to search is available in local cache. Those buckets that are likely to participate in searches only infrequently still exist on remote storage and can be fetched to local storage as needed.

Indexer clusters

With SmartStore indexes, indexer clusters maintain replication and search factor copies of hot buckets only. The remote storage is responsible for ensuring the high availability, data fidelity, and disaster recovery of the warm buckets.

Because the remote storage handles warm bucket high availability, peer nodes replicate only warm bucket metadata, not the buckets themselves. This means that any necessary bucket fixup for SmartStore indexes proceeds much more quickly than it does for non-SmartStore indexes.

If a group of peer nodes equaling or exceeding the replication factor goes down, the cluster does not lose any of its SmartStore warm data because copies of all warm buckets reside on the remote store.

SmartStore data flow

This diagram encapsulates the flow of data for a SmartStore-enabled index in an indexer cluster. Although it contains specific references to indexer clusters, the essential architecture is the same for non-clustered indexers.

Data streams to a source indexer, which indexes it and saves it locally in a hot bucket. The indexer also replicates the hot bucket data to target indexers. So far, the data flow is identical to the data flow for non-SmartStore indexes.

When the hot bucket rolls to warm, the data flow diverges, however. The source indexer copies the warm bucket to the remote object store, while leaving the existing copy in its cache, since searches tend to run across recently indexed data. The target indexers, however, delete their copies, because the remote store ensures high availability without the need to maintain multiple local copies.The master copy of the bucket now resides on the remote store.

The cache manager on the indexer is central to the SmartStore data flow. It fetches copies of buckets from the remote store, as necessary, to handle search requests. It also evicts older or less searched copies of buckets from the cache, as the likelihood of their participating in searches decreases over time. The job of the cache manager is to optimize use of the available cache, while ensuring that searches have immediate access to the buckets they need.

For further information

The chapter How SmartStore works offers a deeper understanding of the SmartStore architecture. It includes these topics:

Related answers from Splunk Community

SmartStore architecture overview

SmartStore in a nutshell

Buckets and SmartStore

The cache manager

Indexer clusters

SmartStore data flow

For further information

Comments

SmartStore architecture overview

Was this topic useful?