Manage pipeline sets for index parallelization

Index parallelization is a feature that allows an indexer to maintain multiple pipeline sets. A pipeline set handles the processing of data from ingestion of raw data, through event processing, to writing the events to disk. A pipeline set is one instance of the processing pipeline described in How indexing works.

By default, an indexer runs just a single pipeline set. However, if the underlying machine is under-utilized, in terms of available cores and I/O both, you can configure the indexer to run two pipeline sets. By running two pipeline sets, you potentially double the indexer's indexing throughput capacity.

Note: The actual amount of increased throughput on your indexer depends on the nature of your data inputs and other factors.

In addition, if the indexer is having difficulty handling bursts of data, index parallelization can help it to accommodate the bursts, assuming again that the machine has the available capacity.

To summarize, these are some typical use cases for index parallelization, dependent on available machine resources:

Scale indexer throughput.
Handle bursts of data.

For a better understanding of the use cases and to determine whether your deployment can benefit from multiple pipeline sets, see Parallelization settings in the Capacity Planning Manual.

Note: You cannot use index parallelization with multiple pipeline sets for metrics data that is received from a UDP data input. If your system uses multiple pipeline sets, use a TCP or HTTP Event Collector data input for metrics data. For more about metrics, see the Metrics manual.

Configure the number of pipeline sets

Caution: Before you increase the number of pipeline sets from the default of one, be sure that your indexer can support multiple pipeline sets. Read Parallelization settings in the Capacity Planning Manual. In addition, consult with Professional Services, particularly if you want to increase the number of pipeline sets beyond two.

To set the number of pipeline sets to two, change the parallelIngestionPipelines attribute in the [general] stanza of server.conf:

 parallelIngestionPipelines = 2

You must restart the indexer for the change to take effect.

Unless Professional Services advises otherwise, limit the number of pipeline sets to a maximum of 2.

How the indexer handles multiple pipeline sets

When you implement two pipeline sets, you have two complete processing pipelines, from the point of data ingestion to the point of writing events to disk. The pipeline sets operate independently of each other, with no knowledge of each other's activities. The effect is essentially the same as if each pipeline set was running on its own, separate indexer.

Each data input goes to a single pipeline. For example, if you are directly ingesting a file, the entire file will get processed through a single pipeline. The pipelines do not share the file's data.

When a data input enters the indexer, it can enter either of the pipeline sets. The indexer uses round-robin load balancing to allocate new inputs across its pipeline sets.

Each pipeline writes to its own set of hot buckets.

The effect of multiple pipeline sets on indexing settings

Some indexing settings are scoped to pipeline sets. These include any settings that are related to a pipeline, processor or queue. Examples of these include max_fd and maxKBps in limits.conf and maxHotBuckets in indexes.conf.

If you have multiple pipeline sets, these limits apply to each pipeline set individually, not to the indexer as a whole. For example, each pipeline set is separately subject to the maxHotBuckets limit. If you set maxHotBuckets to 4, each pipeline set is allowed a maximum of four hot buckets at a time, for a total of eight on an indexer with two pipeline sets.

Forwarders and multiple pipeline sets

You can also configure forwarders to run multiple pipeline sets. Multiple pipeline sets increase forwarder throughput and allow the forwarder to process multiple inputs simultaneously.

This can be of particular value, for example, when a forwarder needs to process a large file that would occupy the pipeline for a long period of time. With just a single pipeline, no other files can be processed until the forwarder finishes the large file. With two pipeline sets, the second pipeline can ingest and forward smaller files quickly, while the first pipeline continues to process the large file.

Assuming that the forwarder has sufficient resources and depending on the nature of the incoming data, a forwarder with two pipelines can potentially forward twice as much data as a forwarder with one pipeline.

How forwarders use multiple pipeline sets

When you enable multiple pipeline sets on a forwarder, each pipeline handles both data input and output. In the case of a heavy forwarder, each pipeline also handles parsing.

The forwarder uses round-robin load balancing to allocate new inputs across its pipeline sets.

The forwarder forwards the output streams independently of each other. If the forwarder is configured for load balancing, it load balances each output stream separately. The receiving indexer handles each stream coming from the forwarder separately, as if each stream were coming from a different forwarder.

Note: The pipeline sets on forwarders and indexers are entirely independent of each other. For example, a forwarder with multiple pipeline sets can forward to any indexer, no matter whether the indexer has one pipeline set or two. The forwarder does not know the pipeline configuration on the indexer, and it does not need to know it. Similarly, an indexer with multiple pipeline sets can receive data from any forwarder, no matter how many pipeline sets the forwarder has.

Configure pipeline sets on a forwarder

You configure the number of pipeline sets for forwarders in the same way as for indexers, with the parallelIngestionPipelines attribute in the [general] stanza of server.conf.

For heavy forwarders, the indexer guidelines apply: The underlying machine must be significantly under-utilized. You should generally limit the number of pipeline sets to two and consult with Professional Services. See Parallelization settings in the Capacity Planning Manual.

For universal forwarders, a single pipeline set uses, on average, around 0.5 of a core, but utilization can reach a maximum of 1.5 cores. Therefore, two pipeline sets will use between 1.0 and 3.0 cores. If you want to configure more than two pipeline sets on a universal forwarder, consult with Professional Services first.

Related answers from Splunk Community

Manage pipeline sets for index parallelization

Configure the number of pipeline sets

How the indexer handles multiple pipeline sets

The effect of multiple pipeline sets on indexing settings

Forwarders and multiple pipeline sets

How forwarders use multiple pipeline sets

Configure pipeline sets on a forwarder

Comments

Manage pipeline sets for index parallelization

Was this topic useful?