How clustered indexing works

When discussing how data and messages flow between nodes during indexing, it is useful to distinguish between the two roles that a peer node plays:

Source node. The source node ingests data from forwarders or other external sources.
Target node. The target node receives streams of replicated data from the source nodes.

In practice, a single peer functions as both a source and a target node, often simultaneously.

Important: In a typical indexer cluster deployment, all the peer nodes are source nodes; that is, each node has its own set of external inputs. This is not a requirement, but it is generally the best practice. There is no reason to reserve some peers for use just as target nodes. The processing cost of storing replicated data is minimal, and, in any case, you cannot currently specify which nodes will receive replicated data. The manager node determines that on a bucket-by-bucket basis, and the behavior is not configurable. You must assume that all the peer nodes will serve as targets.

Note: In addition to replicating external data, each peer replicates its internal indexes to other peers in the same way. To keep things simple, this discussion focuses on external data only.

How the target peers are chosen

Whenever the source peer starts a hot bucket, the manager node gives it a list of target peers to stream its replicated data to. The list is bucket-specific. If a source peer is writing to several hot buckets, it could be streaming the contents of each bucket to a different set of target peers.

The manager chooses the list of target peers randomly. In the case of multisite clustering, it respects site boundaries, as dictated by the replication factor, but chooses the target peers randomly within those constraints.
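As a rough illustration of the single-site case, the selection can be thought of as a random draw from the other peers. The following is a minimal sketch, not the manager's actual implementation: the function name `choose_targets`, the peer names, and the bucket names are all hypothetical, and multisite constraints are not modeled.

```python
import random

def choose_targets(source, peers, replication_factor, rng=None):
    """Pick replication_factor - 1 distinct target peers, excluding the source.

    Hypothetical helper: models only the single-site random choice.
    """
    rng = rng or random.Random()
    candidates = [p for p in peers if p != source]
    needed = replication_factor - 1  # the source holds one copy itself
    if needed > len(candidates):
        raise ValueError("not enough peers to satisfy the replication factor")
    return rng.sample(candidates, needed)

peers = ["peer1", "peer2", "peer3", "peer4"]

# The list is bucket-specific: each hot bucket gets its own independent draw,
# so two buckets on the same source can stream to different target sets.
targets_by_bucket = {
    bucket: choose_targets("peer1", peers, replication_factor=3)
    for bucket in ("bucket_a", "bucket_b")
}
```

With a replication factor of 3, each bucket ends up with two target peers, and the source peer never appears in its own target list.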

When a peer node starts

These events occur when a peer node starts up:

1. The peer node registers with the manager and receives the latest configuration bundle from the manager.

2. The manager rebalances the primary bucket copies across the cluster and starts a new generation.

3. The peer starts ingesting external data, in the same way as any indexer. It processes the data into events and then appends the data to a rawdata file. It also creates associated index files. It stores these files (both the rawdata and the index files) locally in a hot bucket. This is the primary copy of the bucket.

4. The manager gives the peer a list of target peers for its replicated data. For example, if the replication factor is 3, the manager gives the peer a list of two target peers.

5. If the search factor is greater than 1, the manager also tells the peer which of its target peers should make its copy of the data searchable. For example, if the search factor is 2, the manager picks one specific target peer that should make its copy searchable and communicates that information to the source peer.

6. The peer begins streaming the processed rawdata to the target peers specified by the manager. It does not wait until its rawdata file is complete to start streaming its contents; rather, it streams the rawdata in blocks, as it processes the incoming data. It also tells any target peers if they need to make their copies searchable, as communicated to it by the manager in step 5.

7. The target peers receive the rawdata from the source peer and store it in local copies of the bucket.

8. Any targets with designated searchable copies start creating the necessary index files.

9. The peer continues to stream data to the targets until it rolls its hot bucket.
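Steps 3 through 9 can be sketched as a small simulation. This is an illustrative model only, assuming a replication factor of 3 and a search factor of 2. The class `BucketCopy`, the function `stream_hot_bucket`, and the lowercasing used as a stand-in for index file creation are all hypothetical and do not reflect how the peers actually store or index data.

```python
from dataclasses import dataclass, field

@dataclass
class BucketCopy:
    """One peer's copy of a hot bucket (hypothetical model)."""
    searchable: bool
    rawdata: list = field(default_factory=list)
    index: list = field(default_factory=list)

    def append(self, block):
        # Every copy stores the rawdata.
        self.rawdata.append(block)
        if self.searchable:
            # Stand-in for creating index files; only searchable copies do this.
            self.index.append(block.lower())

def stream_hot_bucket(blocks, targets, searchable_targets):
    """Stream rawdata block by block to the targets as it is processed."""
    source_copy = BucketCopy(searchable=True)
    target_copies = {
        t: BucketCopy(searchable=(t in searchable_targets)) for t in targets
    }
    for block in blocks:
        # The source does not wait for the rawdata file to be complete:
        # each block is written locally and streamed on right away.
        source_copy.append(block)
        for copy in target_copies.values():
            copy.append(block)
    return source_copy, target_copies

# Replication factor 3: two targets. Search factor 2: one searchable target.
source_copy, target_copies = stream_hot_bucket(
    blocks=["event A", "event B"],
    targets=["peer2", "peer3"],
    searchable_targets={"peer2"},
)
```

In this run, `peer2` and `peer3` both hold the rawdata, but only `peer2` builds index entries, because it is the one target designated to keep a searchable copy.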

Note: The source and target peers rarely communicate with each other through their management ports. Usually, they just exchange data over their replication ports. The manager node manages the overall process.

This is just the breakdown for data flowing from a single peer. In a cluster, multiple peers will be both originating and receiving data at any time.

When a peer node rolls a hot bucket

When a source peer rolls a hot bucket to warm (for example, because the bucket has reached its maximum size), the following sequence of events occurs:

1. The source peer tells the manager and its target peers that it has rolled a bucket.

2. The target peers roll their copies of the bucket.

3. The source peer continues ingesting external data as this process is occurring. It indexes the data locally into a new hot bucket and streams the rawdata to a new set of target peers that it gets from the manager.

4. The new target peers receive the rawdata for the new hot bucket from the source peer and store it in local copies of the bucket. The targets with designated searchable copies also start creating the necessary index files.

5. The source peer continues to stream data to the targets until it rolls its next hot bucket. The cycle then repeats for each subsequent hot bucket.
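The roll cycle above can be sketched as a loop that starts a new hot bucket, with a fresh target list, each time the previous one rolls. This is a simplified model under stated assumptions: the function `index_with_rolls` is hypothetical, bucket size is counted in blocks rather than bytes, and the random draw stands in for the list the manager supplies.

```python
import random

def index_with_rolls(blocks, peers, source, replication_factor, max_blocks, rng=None):
    """Index blocks into hot buckets, rolling each to warm at max_blocks."""
    rng = rng or random.Random()
    candidates = [p for p in peers if p != source]
    buckets = []
    current = None
    for block in blocks:
        if current is None:
            # A new hot bucket gets a new set of target peers.
            targets = rng.sample(candidates, replication_factor - 1)
            current = {"state": "hot", "targets": targets, "blocks": []}
            buckets.append(current)
        current["blocks"].append(block)
        if len(current["blocks"]) >= max_blocks:
            # Roll to warm; in the cluster, the source tells the manager and
            # its targets, and the targets roll their copies as well.
            current["state"] = "warm"
            current = None
    return buckets

buckets = index_with_rolls(
    blocks=[f"block{i}" for i in range(5)],
    peers=["peer1", "peer2", "peer3", "peer4"],
    source="peer1",
    replication_factor=3,
    max_blocks=2,
)
```

With five blocks and a limit of two blocks per bucket, the run produces two warm buckets and one hot bucket that is still receiving data, each with its own pair of targets.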

How a peer node interacts with a forwarder

When a peer node gets its data from a forwarder, it processes it in the same way as any indexer getting data from a forwarder. However, in a clustering environment, you should ordinarily enable indexer acknowledgment for each forwarder sending data to a peer. This protects against loss of data between forwarder and peer and is the only way to ensure end-to-end data fidelity. If the forwarder does not get an acknowledgment for a block of data it has sent to a peer, it resends the block.
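The resend behavior can be illustrated with a short sketch. It is a conceptual model only, not the forwarder's actual protocol: the function `send_with_ack`, the `max_attempts` limit, and the `flaky_peer` stand-in are all hypothetical, and a real forwarder tracks acknowledgments asynchronously rather than one block at a time.

```python
def send_with_ack(blocks, send, max_attempts=3):
    """Send each block, resending until it is acknowledged.

    `send` is a callable that returns True when the peer acknowledges the
    block. Returns the number of attempts each block needed.
    """
    attempts = {}
    for block in blocks:
        for attempt in range(1, max_attempts + 1):
            attempts[block] = attempt
            if send(block):
                break  # acknowledged; move on to the next block
        else:
            # The attempt limit is an assumption made for this sketch.
            raise RuntimeError(f"no acknowledgment for {block!r}")
    return attempts

# Hypothetical peer that fails to acknowledge the first delivery of "block2".
_dropped_once = set()

def flaky_peer(block):
    if block == "block2" and block not in _dropped_once:
        _dropped_once.add(block)
        return False
    return True

attempts = send_with_ack(["block1", "block2", "block3"], flaky_peer)
```

Here `block2` is sent twice, because its first delivery is never acknowledged, while the other blocks are sent once each.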

For details on how to set up forwarders to send data to peers, read "Use forwarders to get your data into the indexer cluster". To understand how peers and forwarders process indexer acknowledgment, read the section "How indexer acknowledgment works" in that topic.
