How clustered indexing works
In discussing how indexing works in a cluster, it's useful to distinguish between two types of peer nodes:
- Source nodes. The source nodes ingest data from forwarders or other external sources.
- Target nodes. The target nodes receive streams of replicated data from the source nodes.
In practice, a single peer functions as both a source and a target node. However, for the purposes of understanding the flow of data and messages between cluster components, it's helpful to distinguish the two roles.
Important: In a typical cluster deployment, all the peer nodes are source nodes; that is, each node has its own set of external inputs. This is not a requirement, but it's generally best practice. There's no reason to reserve some peers for use just as target nodes. The processing cost of storing replicated data is minimal and, in any case, you cannot currently specify which nodes will receive replicated data. The master determines that on a bucket-by-bucket basis, and the behavior is not configurable. You must assume that all your peer nodes will serve as targets.
Note: In addition to replicating external data, each peer replicates its internal indexes to other peers in the same way. To keep things simple, however, this discussion focuses on what happens to external data.
When a peer node starts
These events occur when a peer node starts up:
1. The peer node registers with the master and receives the latest configuration bundle from the master.
2. The master starts a new generation.
3. The peer starts ingesting external data, in the same way as any indexer. It processes the data into events and then appends the data to a rawdata file. It also creates associated index files. It stores these files (both the rawdata and the index files) locally in a hot bucket. This is the primary copy of the bucket.
4. The master gives the peer a list of target peers for its replicated data. For example, if the replication factor is 3, the master gives the peer a list of two target peers, because the source peer itself holds the first of the three copies.
5. If the search factor is greater than 1, the master also tells the peer which of its target peers should make its copy of the data searchable. For example, if the search factor is 2, the master picks one specific target peer that should make its copy searchable and communicates that information to the source peer.
6. The peer begins streaming the processed rawdata to the target peers specified by the master. It does not wait until its rawdata file is complete to start streaming its contents; rather, it streams the rawdata in blocks, as it processes the incoming data. It also tells any target peer(s) if they need to make their copies searchable, as communicated to it by the master in step 5.
7. The target peers receive the rawdata from the source peer and store it in local copies of the bucket.
8. Any targets with designated searchable copies start creating the necessary index files.
9. The peer continues to stream data to the targets until it rolls its hot bucket.
Note: The source and target peers rarely communicate with each other through their management ports. Usually, they just exchange data with each other over their replication ports. The master node manages the overall process.
This is just the breakdown for data flowing from a single peer. In a cluster, multiple peers will be both originating and receiving data at any time.
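The relationship between the replication factor, the search factor, and the target list in steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration: `plan_replication` and `ReplicationPlan` are hypothetical names, and the "take the first available peers" choice is a placeholder, since the master decides targets on a bucket-by-bucket basis and that behavior is not configurable. The sketch also assumes the search factor never exceeds the replication factor.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPlan:
    source: str
    targets: list[str]             # peers that receive streamed rawdata
    searchable_targets: list[str]  # subset that also builds index files


def plan_replication(source, peers, replication_factor, search_factor):
    """Hypothetical stand-in for the master's per-bucket decision."""
    if not 1 <= search_factor <= replication_factor:
        raise ValueError("search factor must be between 1 and the replication factor")
    candidates = [p for p in peers if p != source]
    n_targets = replication_factor - 1  # the source holds one copy itself
    if n_targets > len(candidates):
        raise ValueError("not enough peers to satisfy the replication factor")
    # Placeholder choice: the real selection is made by the master.
    targets = candidates[:n_targets]
    # The source's own copy is searchable, so only search_factor - 1
    # targets need to build index files.
    searchable = targets[: search_factor - 1]
    return ReplicationPlan(source, targets, searchable)


plan = plan_replication("peer1", ["peer1", "peer2", "peer3", "peer4"], 3, 2)
print(plan.targets)             # ['peer2', 'peer3']
print(plan.searchable_targets)  # ['peer2']
```

With a replication factor of 3 and a search factor of 2, the source streams to two targets and asks one of them to make its copy searchable, matching the examples in steps 4 and 5.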
When a peer node rolls a hot bucket
When a source peer rolls a hot bucket to warm (for example, because the bucket has reached its maximum size), the following sequence of events occurs:
1. The source peer tells the master and its target peers that it has rolled a bucket.
2. The target peers roll their copies of the bucket.
3. The source peer continues ingesting external data as this process is occurring. It indexes the data locally into a new hot bucket and streams the rawdata to a new set of target peers that it gets from the master.
4. The new set of target peers receive the rawdata for the new hot bucket from the source peer and store it in local copies of the bucket. The targets with designated searchable copies also start creating the necessary index files.
5. The source peer continues to stream data to the targets until it rolls its next hot bucket, at which point the sequence repeats.
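The roll sequence can be modeled as a small in-memory simulation. The sketch below is a simplification under stated assumptions: `Peer`, `index_stream`, and `pick_targets` are hypothetical names, a fixed block count stands in for the real roll conditions (such as maximum bucket size), and the target rotation in the usage example is invented, because the actual choice belongs to the master.

```python
class Peer:
    """Hypothetical in-memory peer that stores rawdata blocks per bucket."""

    def __init__(self, name):
        self.name = name
        self.buckets = {}  # bucket id -> {"state": "hot" or "warm", "blocks": [...]}

    def receive(self, bucket_id, block):
        bucket = self.buckets.setdefault(bucket_id, {"state": "hot", "blocks": []})
        bucket["blocks"].append(block)

    def roll(self, bucket_id):
        self.buckets[bucket_id]["state"] = "warm"


def index_stream(source, blocks, pick_targets, max_blocks_per_bucket):
    """Stream blocks to targets, rolling the bucket when it fills up."""
    bucket_id = 0
    targets = pick_targets(bucket_id)
    for block in blocks:
        source.receive(bucket_id, block)   # primary copy on the source
        for target in targets:
            target.receive(bucket_id, block)  # streamed as the data arrives
        if len(source.buckets[bucket_id]["blocks"]) >= max_blocks_per_bucket:
            source.roll(bucket_id)         # source rolls hot to warm
            for target in targets:
                target.roll(bucket_id)     # targets roll their copies
            bucket_id += 1
            targets = pick_targets(bucket_id)  # new target set for the new hot bucket


peers = [Peer(f"peer{i}") for i in range(1, 5)]
source, others = peers[0], peers[1:]


def pick_targets(bucket_id):
    # Invented rotation: two targets per bucket, as for a replication factor of 3.
    return [others[(bucket_id + k) % len(others)] for k in range(2)]


index_stream(source, ["b0", "b1", "b2", "b3", "b4"], pick_targets, max_blocks_per_bucket=2)
for bucket_id, bucket in source.buckets.items():
    print(bucket_id, bucket["state"], bucket["blocks"])
```

After the run, the source holds two warm buckets and one hot bucket, and each bucket's copies sit on a different pair of targets.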
How a peer node interacts with a forwarder
When a peer node gets its data from a forwarder, it processes the data in the same way as any indexer getting data from a forwarder. However, in a clustering environment, you should in most cases enable indexer acknowledgment on each forwarder sending data to a peer. This protects against loss of data between forwarder and peer and is the only way to ensure end-to-end data fidelity. If the forwarder does not get an acknowledgment for a block of data it has sent to a peer, it resends the block.
For details on how to set up forwarders to send data to peers, including how to enable indexer acknowledgment, read "Use forwarders to get your data". To understand how peers and forwarders process indexer acknowledgment, read the section "How indexer acknowledgment works" in that topic.
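The resend-on-missing-acknowledgment behavior can be illustrated with a short Python sketch. It is a conceptual model only, not the forwarder's actual implementation: `forward_with_ack`, `FlakyPeer`, and the `max_attempts` limit are hypothetical, and a return value of `False` stands in for an acknowledgment that never arrives.

```python
def forward_with_ack(blocks, send, max_attempts=5):
    """Send each block, resending until send() reports an acknowledgment."""
    attempts = {}
    for seq, block in enumerate(blocks):
        for attempt in range(1, max_attempts + 1):
            if send(seq, block):
                attempts[seq] = attempt
                break
        else:
            raise RuntimeError(f"block {seq} was never acknowledged")
    return attempts


class FlakyPeer:
    """Hypothetical peer that silently drops some sends before accepting them."""

    def __init__(self, drops):
        self.drops = dict(drops)  # block sequence number -> sends to drop
        self.stored = {}

    def send(self, seq, block):
        if self.drops.get(seq, 0) > 0:
            self.drops[seq] -= 1
            return False  # no acknowledgment reaches the forwarder
        self.stored[seq] = block
        return True


peer = FlakyPeer({1: 2})
attempts = forward_with_ack(["a", "b", "c"], peer.send)
print(attempts)     # {0: 1, 1: 3, 2: 1}
print(peer.stored)  # {0: 'a', 1: 'b', 2: 'c'}
```

Block 1 is dropped twice and stored on the third send, so every block reaches the peer despite the losses.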
This documentation applies to the following versions of Splunk® Enterprise: 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6, 5.0.7, 5.0.8, 5.0.9, 5.0.10, 5.0.11, 5.0.12, 5.0.13, 5.0.14, 5.0.15, 5.0.16, 5.0.17, 5.0.18