Manage pipeline sets for index parallelization

Index parallelization allows an indexer to maintain multiple pipeline sets. A pipeline set handles data processing end to end: from ingestion of raw data, through event processing, to writing the events to disk. A pipeline set is one instance of the processing pipeline described in How indexing works. It is called a "pipeline set" because it comprises the individual pipelines, such as the parsing pipeline and the indexing pipeline, that together constitute the overall processing pipeline.

By default, an indexer runs a single pipeline set. However, if the underlying machine is under-utilized in terms of both available cores and I/O, you can configure the indexer to run two pipeline sets. By running two pipeline sets, you potentially double the indexer's indexing throughput capacity.

Note: The actual amount of increased throughput on your indexer depends on the nature of your data inputs and other factors.

In addition, if the indexer is having difficulty handling bursts of data, index parallelization can help it to accommodate the bursts, assuming again that the machine has the available capacity.

To summarize, these are some typical use cases for index parallelization, dependent on available machine resources:

Scale indexer throughput.
Handle bursts of data.

For a better understanding of the use cases and to determine whether your deployment can benefit from multiple pipeline sets, see Parallelization settings in the Capacity Planning Manual.

Note: You cannot use index parallelization with multiple pipeline sets for data that is streaming from a single data source, such as UDP or TCP. Instead, use HTTP Event Collector for streaming data input. This restriction affects both metrics and event data.

Configure the number of pipeline sets

Caution: Before you increase the number of pipeline sets from the default of one, be sure that your indexer can support multiple pipeline sets. Read Parallelization settings in the Capacity Planning Manual. In addition, consult with Professional Services, particularly if you want to increase the number of pipeline sets beyond two.

To set the number of pipeline sets to two, change the parallelIngestionPipelines attribute in the [general] stanza of server.conf:

 [general]
 parallelIngestionPipelines = 2

You must restart the indexer for the change to take effect.

Unless Professional Services advises otherwise, limit the number of pipeline sets to a maximum of 2.

How the indexer handles multiple pipeline sets

When you implement two pipeline sets, you have two complete processing pipelines, from the point of data ingestion to the point of writing events to disk. The pipeline sets operate independently of each other, with no knowledge of each other's activities. The effect is essentially the same as if each pipeline set were running on its own, separate indexer.

When an indexer receives a new input, it assigns that input to a pipeline set, which then processes the input's data through the entire pipeline process, to the point of writing the input's data to an index on disk.

For example, if the indexer is directly ingesting a file, a single pipeline set processes the entire file. The pipeline sets do not share the file's data. Similarly, when a forwarder begins sending data to an indexer, the indexer assigns the entire input from that forwarder to a single pipeline set. The pipeline set continues to process all data from that forwarder until the forwarder switches to another indexer, assuming that the forwarder is load balancing across multiple indexers.

Each pipeline set can handle multiple inputs simultaneously.

Each pipeline set writes to its own set of hot buckets.

How the indexer allocates inputs to pipeline sets

When an indexer receives a new input, for example, from a forwarder that has just connected with it, the indexer allocates the input to one of its pipeline sets. Depending on the configuration, the indexer uses one of these allocation methods:

round-robin selection
weighted-random selection

In most cases, weighted-random selection is the preferred method. The default method, and the only method available for indexers running a pre-7.3 release, is round-robin selection.

Round-robin selection

By default, the indexer uses round-robin selection to allocate new inputs across its pipeline sets. With round-robin selection, the indexer simply allocates new inputs to each pipeline set in turn.

The downside of this method is that it does not take into account the current loads on the pipeline sets. Pipeline sets can have widely varying loads, due to variation in the size and complexity of different inputs. With round-robin selection, the indexer might, for example, allocate the next input to a heavily loaded pipeline set while another pipeline set stands idle.
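The imbalance described above can be illustrated with a small simulation. This is a minimal sketch, not the indexer's actual code: the function name, the input names, and the relative cost figures are all hypothetical, chosen only to show how strict turn-taking ignores load.

```python
from itertools import cycle


def allocate_round_robin(inputs, num_pipeline_sets):
    """Assign each new input to the next pipeline set in turn."""
    next_set = cycle(range(num_pipeline_sets))
    return {name: next(next_set) for name, _cost in inputs}


# Hypothetical inputs as (name, relative processing cost) pairs.
inputs = [("big_file", 90), ("fwd_a", 5), ("huge_file", 80), ("fwd_b", 5)]
assignment = allocate_round_robin(inputs, num_pipeline_sets=2)

# Sum the cost that lands on each pipeline set.
load = [0, 0]
for name, cost in inputs:
    load[assignment[name]] += cost

print(assignment)  # {'big_file': 0, 'fwd_a': 1, 'huge_file': 0, 'fwd_b': 1}
print(load)        # [170, 10]
```

In this made-up arrival order, both expensive inputs land on pipeline set 0, even though pipeline set 1 is nearly idle.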

Weighted-random selection

The weighted-random selection method considers the pipeline sets' relative loads. The indexer monitors the weighted loads of its pipeline sets over time and uses that information to choose the next pipeline set for data ingestion, in an attempt to balance the relative loads.

With the weighted-random method, each pipeline set reports metrics on its current processing load to the indexer. The indexer looks at the relative loads over a configurable time period and allocates the next new input to the pipeline set with the least overall load during that time period.
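The general idea of load-aware selection can be sketched as follows. This is a toy model under stated assumptions, not the indexer's algorithm: the class and method names are hypothetical, and the weighting formula (the inverse of the load summed over the tracked periods) is an assumption made for illustration. The actual weighting is not described in this topic.

```python
import random
from collections import deque


class PipelineSetTracker:
    """Toy model: keeps the last `num_periods` load samples per pipeline set."""

    def __init__(self, num_sets, num_periods):
        self.history = [deque(maxlen=num_periods) for _ in range(num_sets)]

    def record_period(self, loads):
        """Store one load sample per pipeline set for the period that just ended."""
        for samples, load in zip(self.history, loads):
            samples.append(load)

    def weights(self):
        """Assumed formula: lightly loaded sets get a larger weight."""
        totals = [sum(samples) for samples in self.history]
        # Add 1 so that an idle set (total of 0) does not divide by zero.
        return [1.0 / (total + 1) for total in totals]

    def choose(self, rng):
        """Pick a pipeline set at random, biased toward the lighter ones."""
        indexes = list(range(len(self.history)))
        return rng.choices(indexes, weights=self.weights(), k=1)[0]


tracker = PipelineSetTracker(num_sets=2, num_periods=5)
for _ in range(5):
    tracker.record_period([90, 10])  # set 0 is busy, set 1 is mostly idle

rng = random.Random(0)  # fixed seed so that the run is repeatable
picks = [tracker.choose(rng) for _ in range(1000)]
print("set 0:", picks.count(0), "set 1:", picks.count(1))
```

In this model, the lightly loaded pipeline set receives most, but not all, of the new inputs. The two constructor arguments loosely correspond to the idea of an update period and a number of tracking periods.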

Configure the selection method

Specify the selection method with the pipelineSetSelectionPolicy setting in server.conf:

pipelineSetSelectionPolicy = round_robin | weighted_random

The default selection method is round_robin.

You can modify the behavior of the weighted-random selection policy through these settings in server.conf:

pipelineSetWeightsUpdatePeriod
pipelineSetNumTrackingPeriods

See the server.conf spec file for details.
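Before restarting an instance, it can be useful to sanity-check these settings. The following is a minimal sketch that uses Python's standard configparser module. The function name and the sample file content are hypothetical, and configparser handles only simple stanza and key = value layouts, so it may not parse every real server.conf file.

```python
import configparser

# Hypothetical file content, for illustration only.
SAMPLE_SERVER_CONF = """
[general]
parallelIngestionPipelines = 2
pipelineSetSelectionPolicy = weighted_random
"""

ALLOWED_POLICIES = {"round_robin", "weighted_random"}


def check_pipeline_settings(conf_text, max_sets=2):
    """Return a list of warnings for the [general] pipeline set settings."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep the camelCase setting names as written
    parser.read_string(conf_text)

    general = parser["general"] if parser.has_section("general") else {}
    warnings = []

    # Both settings fall back to the defaults named in this topic.
    sets = int(general.get("parallelIngestionPipelines", "1"))
    if sets > max_sets:
        warnings.append(
            f"{sets} pipeline sets exceeds the recommended maximum of {max_sets}"
        )

    policy = general.get("pipelineSetSelectionPolicy", "round_robin")
    if policy not in ALLOWED_POLICIES:
        warnings.append(f"unknown selection policy: {policy}")

    return warnings


print(check_pipeline_settings(SAMPLE_SERVER_CONF))  # []
```

An empty list means that the sample stays within the guidance in this topic: at most two pipeline sets and a recognized selection policy.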

View pipeline set activity

You can use the monitoring console to monitor most aspects of your deployment. This section discusses the console dashboard that provides insight into pipeline set performance.

The primary documentation for the monitoring console is located in Monitoring Splunk Enterprise.

For information on pipeline set performance, select the Indexing Performance: Advanced submenu under the Indexing menu.

The Indexing Performance: Advanced dashboard provides data on pipeline set performance on a per-indexer basis. You can use the dashboard to gain insight into the activity of the pipeline sets and their component pipelines.

This dashboard is mainly of use when troubleshooting performance issues in consultation with Splunk Support. Without expert-level knowledge of the underlying processes, it can be difficult to interpret the information well enough to determine how to remediate a performance issue.

The effect of multiple pipeline sets on indexing settings

Some indexing settings are scoped to pipeline sets. These include any settings that are related to a pipeline, processor or queue. Examples of these include max_fd and maxKBps in limits.conf and maxHotBuckets in indexes.conf.

If you have multiple pipeline sets, these limits apply to each pipeline set individually, not to the indexer as a whole. For example, each pipeline set is separately subject to the maxHotBuckets limit. If you set maxHotBuckets to 4, each pipeline set is allowed a maximum of four hot buckets at a time, for a total of eight on an indexer with two pipeline sets.
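The arithmetic in the example above can be written out as a short calculation. This is only a sketch; the function name is hypothetical.

```python
def total_hot_bucket_limit(max_hot_buckets, pipeline_sets):
    """Per-pipeline-set limit multiplied by the number of pipeline sets."""
    return max_hot_buckets * pipeline_sets


print(total_hot_bucket_limit(max_hot_buckets=4, pipeline_sets=2))  # 8
```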

Forwarders and multiple pipeline sets

You can also configure forwarders to run multiple pipeline sets. Multiple pipeline sets increase forwarder throughput and allow the forwarder to process multiple inputs simultaneously.

This can be of particular value, for example, when a forwarder needs to process a large file that would occupy the pipeline set for a long period of time. With just a single pipeline set, no other files can be processed until the forwarder finishes the large file. With two pipeline sets, the second pipeline set can ingest and forward smaller files quickly, while the first pipeline set continues to process the large file.

Assuming that the forwarder has sufficient resources and depending on the nature of the incoming data, a forwarder with two pipeline sets can potentially forward twice as much data as a forwarder with one pipeline set.

How forwarders use multiple pipeline sets

When you enable multiple pipeline sets on a forwarder, each pipeline set handles both data input and output. In the case of a heavy forwarder, each pipeline set also handles parsing.

The forwarder forwards the output streams independently of each other. If the forwarder is configured for load balancing, it load balances each output stream separately. The receiving indexer handles each stream coming from the forwarder separately, as if each stream were coming from a different forwarder.

The pipeline sets on forwarders and indexers are entirely independent of each other. For example, a forwarder with multiple pipeline sets can forward to any indexer, no matter whether the indexer has one pipeline set or two. The forwarder does not know the pipeline configuration on the indexer, and it does not need to know it. Similarly, an indexer with multiple pipeline sets can receive data from any forwarder, no matter how many pipeline sets the forwarder has.

Configure pipeline sets on a forwarder

You configure the number of pipeline sets for forwarders in the same way as for indexers, with the parallelIngestionPipelines attribute in the [general] stanza of server.conf.

For heavy forwarders, the indexer guidelines apply: The underlying machine must be significantly under-utilized. You should generally limit the number of pipeline sets to two and consult with Professional Services. See Parallelization settings in the Capacity Planning Manual.

For universal forwarders, a single pipeline set uses, on average, around 0.5 of a core, but utilization can reach a maximum of 1.5 cores. Therefore, two pipeline sets will use between 1.0 and 3.0 cores. If you want to configure more than two pipeline sets on a universal forwarder, consult with Professional Services first.
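The core estimate above follows from multiplying the per-set figures by the number of pipeline sets. This is a rough sketch with a hypothetical function name. Treating the scaling as linear beyond two pipeline sets is an assumption, not documented behavior.

```python
AVG_CORES_PER_SET = 0.5  # average utilization of one pipeline set, per this topic
MAX_CORES_PER_SET = 1.5  # maximum utilization of one pipeline set, per this topic


def universal_forwarder_core_range(pipeline_sets):
    """Return (typical, worst-case) core usage, assuming linear scaling."""
    return (pipeline_sets * AVG_CORES_PER_SET, pipeline_sets * MAX_CORES_PER_SET)


print(universal_forwarder_core_range(2))  # (1.0, 3.0)
```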

As with indexers, you can specify the selection method with the pipelineSetSelectionPolicy setting in server.conf.

In the case of monitored inputs, the selection method setting has no practical effect. Monitored input sources stick to the initially assigned pipeline set until the instance is restarted.
