Manage pipeline sets for index parallelization
Index parallelization is a feature that allows an indexer to maintain multiple pipeline sets. A pipeline set handles the processing of data from ingestion of raw data, through event processing, to writing the events to disk. A pipeline set is one instance of the processing pipeline described in How indexing works. It is called a "pipeline set" because it comprises the individual pipelines, such as the parsing pipeline and the indexing pipeline, that together constitute the overall processing pipeline
By default, an indexer runs just a single pipeline set. However, if the underlying machine is under-utilized, in terms of available cores and I/O both, you can configure the indexer to run two pipeline sets. By running two pipeline sets, you potentially double the indexer's indexing throughput capacity.
Note: The actual amount of increased throughput on your indexer depends on the nature of your data inputs and other factors.
In addition, if the indexer is having difficulty handling bursts of data, index parallelization can help it to accommodate the bursts, assuming again that the machine has the available capacity.
To summarize, these are some typical use cases for index parallelization, dependent on available machine resources:
- Scale indexer throughput.
- Handle bursts of data.
For a better understanding of the use cases and to determine whether your deployment can benefit from multiple pipeline sets, see Parallelization settings in the Capacity Planning Manual.
Note: You cannot use index parallelization with multiple pipeline sets for metrics data that is received from a UDP data input. If your system uses multiple pipeline sets, use a TCP or HTTP Event Collector data input for metrics data. For more about metrics, see the Metrics manual.
Configure the number of pipeline sets
Caution: Before you increase the number of pipeline sets from the default of one, be sure that your indexer can support multiple pipeline sets. Read Parallelization settings in the Capacity Planning Manual. In addition, consult with Professional Services, particularly if you want to increase the number of pipeline sets beyond two.
To set the number of pipeline sets to two, change the
parallelIngestionPipelines attribute in the
[general] stanza of
parallelIngestionPipelines = 2
You must restart the indexer for the change to take effect.
Unless Professional Services advises otherwise, limit the number of pipeline sets to a maximum of 2.
How the indexer handles multiple pipeline sets
When you implement two pipeline sets, you have two complete processing pipelines, from the point of data ingestion to the point of writing events to disk. The pipeline sets operate independently of each other, with no knowledge of each other's activities. The effect is essentially the same as if each pipeline set were running on its own, separate indexer.
When an indexer receives a new input, it assigns that input to a pipeline set, which then processes the input's data through the entire pipeline process, to the point of writing the input's data to an index on disk.
For example, if the indexer is directly ingesting a file, a single pipeline set processes the entire file. The pipeline sets do not share the file's data. Similarly, when a forwarder begins sending data to an indexer, the indexer assigns the entire input from that forwarder to a single pipeline set. The pipeline set continues to process all data from that forwarder until the forwarder switches to another indexer, assuming that the forwarder is load balancing across multiple indexers.
Each pipeline set can handle multiple inputs simultaneously.
Each pipeline set writes to its own set of hot buckets.
How the indexer allocates inputs to pipeline sets
When an indexer receives a new input, for example, from a forwarder that has just connected with it, the indexer allocates the input to one of its pipeline sets. Depending on the configuration, the indexer uses one of these allocation methods:
- round-robin selection
- weighted-random selection
In most cases, weighted-random selection is the preferred method. The default method, and the only method available for indexers running a pre-7.3 release, is round-robin selection.
By default, the indexer uses round-robin selection to allocate new inputs across its pipeline sets. With round-robin selection, the indexer simply allocates new inputs to each pipeline set in turn.
This method has the downside that it does not take into account the current loads on the pipeline sets. Pipeline sets can have widely varying loads, due to variation in size and complexity of different inputs. With round-robin selection, the indexer might, for example, allocate the next input to a heavily-loaded pipeline set while another pipeline set stands idly by.
The weighted-random selection method considers the pipeline sets' relative loads. The indexer monitors the weighted loads of its pipeline sets over time and uses that information to choose the next pipeline set for data ingestion, in an attempt to balance the relative loads.
With the weighted-random method, each pipeline set reports metrics on its current processing load to the indexer. The indexer looks at the relative loads over a configurable time period and allocates the next new input to the pipeline set with the least overall load during that time period.
Configure the selection method
Specify the selection method with the
pipelineSetSelectionPolicy setting in
pipelineSetSelectionPolicy = round_robin | weighted_random
The default selection method is
You can modify the behavior of the weighted-random selection policy through these settings in
See the server.conf spec file for details.
View pipeline set activity
You can use the monitoring console to monitor most aspects of your deployment. This section discusses the console dashboard that provides insight into pipeline set performance.
The primary documentation for the monitoring console is located in Monitoring Splunk Enterprise.
For information on pipeline set performance, select the Indexing Performance: Advanced submenu under the Indexing menu.
The Indexing Performance: Advanced dashboard provides data on pipeline set performance on a per-indexer basis. You can use the dashboard to gain insight into the activity of the pipeline sets and their component pipelines.
This dashboard is mainly of use when troubleshooting performance issues in consultation with Splunk Support. Without expert-level knowledge of the underlying processes, it can be difficult to interpret the information sufficiently to determine performance issue remediation.
The effect of multiple pipeline sets on indexing settings
Some indexing settings are scoped to pipeline sets. These include any settings that are related to a pipeline, processor or queue. Examples of these include
If you have multiple pipeline sets, these limits apply to each pipeline set individually, not to the indexer as a whole. For example, each pipeline set is separately subject to the
maxHotBuckets limit. If you set
maxHotBuckets to 4, each pipeline set is allowed a maximum of four hot buckets at a time, for a total of eight on an indexer with two pipeline sets.
Forwarders and multiple pipeline sets
You can also configure forwarders to run multiple pipeline sets. Multiple pipeline sets increase forwarder throughput and allow the forwarder to process multiple inputs simultaneously.
This can be of particular value, for example, when a forwarder needs to process a large file that would occupy the pipeline for a long period of time. With just a single pipeline, no other files can be processed until the forwarder finishes the large file. With two pipeline sets, the second pipeline can ingest and forward smaller files quickly, while the first pipeline continues to process the large file.
Assuming that the forwarder has sufficient resources and depending on the nature of the incoming data, a forwarder with two pipelines can potentially forward twice as much data as a forwarder with one pipeline.
How forwarders use multiple pipeline sets
When you enable multiple pipeline sets on a forwarder, each pipeline handles both data input and output. In the case of a heavy forwarder, each pipeline also handles parsing.
The forwarder always uses round-robin load balancing to allocate new inputs across its pipeline sets. Weighted-random selection is not available for forwarders.
The forwarder forwards the output streams independently of each other. If the forwarder is configured for load balancing, it load balances each output stream separately. The receiving indexer handles each stream coming from the forwarder separately, as if each stream were coming from a different forwarder.
Note: The pipeline sets on forwarders and indexers are entirely independent of each other. For example, a forwarder with multiple pipeline sets can forward to any indexer, no matter whether the indexer has one pipeline set or two. The forwarder does not know the pipeline configuration on the indexer, and it does not need to know it. Similarly, an indexer with multiple pipeline sets can receive data from any forwarder, no matter how many pipeline sets the forwarder has.
Configure pipeline sets on a forwarder
You configure the number of pipeline sets for forwarders in the same way as for indexers, with the
parallelIngestionPipelines attribute in the
[general] stanza of
For heavy forwarders, the indexer guidelines apply: The underlying machine must be significantly under-utilized. You should generally limit the number of pipeline sets to two and consult with Professional Services. See Parallelization settings in the Capacity Planning Manual.
For universal forwarders, a single pipeline set uses, on average, around 0.5 of a core, but utilization can reach a maximum of 1.5 cores. Therefore, two pipeline sets will use between 1.0 and 3.0 cores. If you want to configure more than two pipeline sets on a universal forwarder, consult with Professional Services first.
Remove indexes and indexed data
This documentation applies to the following versions of Splunk® Enterprise: 7.3.0, 7.3.1, 7.3.2, 7.3.3, 7.3.4, 8.0.0, 8.0.1