Protect against loss of in-flight data

If you use Splunk Enterprise, you can guard against loss of data when forwarding to an indexer by enabling the indexer acknowledgment capability. With indexer acknowledgment, the forwarder will resend any data that the indexer does not acknowledge as "received".

You enable indexer acknowledgment on the forwarder, in the outputs.conf file. The feature is disabled by default.

Indexer acknowledgment and indexer clusters

When you use forwarders to send data to peer nodes in an indexer cluster, you should ordinarily enable indexer acknowledgment. To learn more about forwarders and clusters, read "Use forwarders to get your data" in the Managing Indexers and Clusters of Indexers manual.

How indexer acknowledgment works when everything goes well

The forwarder sends data continuously to the indexer, in blocks of approximately 64kB. The forwarder maintains a copy of each block in memory, in its wait queue, until it gets an acknowledgment from the indexer. While waiting, it continues to send more data blocks.

If all goes well, the indexer:

Receives the block of data.
Parses the data.
Writes the data to the file system as events (raw data and index data).
Sends an acknowledgment to the forwarder.

The acknowledgment tells the forwarder that the indexer received the data and successfully wrote it to the file system. Upon receiving the acknowledgment, the forwarder releases the block from memory.

If the wait queue is of sufficient size, it doesn't fill up while waiting for acknowledgments to arrive. See the remainder of this topic for information on possible problems and ways to address them, including how to increase the wait queue size through the in-memory queue size setting.

In the case of indexer clustering, the process of indexer acknowledgment requires a few additional steps. See Use forwarders to get your data in the Managing Indexers and Clusters of Indexers manual for further details.

How indexer acknowledgment works when there's a failure

When there's a failure in the round-trip process, the forwarder does not receive an acknowledgment. It will then attempt to resend the block of data.

Why no acknowledgment?

These are the reasons that a forwarder might not receive acknowledgment:

Indexer goes down after receiving the data -- for instance, due to machine failure.
Indexer is unable to write to the file system -- for instance, because the disk is full.
Network goes down while acknowledgment is en route to the forwarder.

How the forwarder deals with failure

After sending a data block, the forwarder maintains a copy of the data in its wait queue until it receives an acknowledgment. In the meantime, it continues to send additional blocks as usual. If the forwarder doesn't get acknowledgment for a block within 300 seconds (by default), it closes the connection. You can change the wait time by setting the readTimeout attribute in outputs.conf.

If the forwarder is set up for auto load balancing, it then opens a connection to the next indexer in the group (if one is available) and sends the data to it. If the forwarder is not set up for auto load balancing, it attempts to open a connection to the same indexer as before and resend the data.

The forwarder maintains the data block in the wait queue until acknowledgment is received. Once the wait queue fills up, the forwarder stops sending additional blocks until it receives an acknowledgment for one of the blocks, at which point it can free up space in the queue.

Other reasons the forwarder might close a connection

There are actually three conditions that can cause the forwarder to close the network connection:

Read timeout. The forwarder doesn't receive acknowledgment within 300 (default) seconds. This is the condition described above.
Write timeout. The forwarder is not able to finish a network write within 300 (default) seconds. The value is configurable in outputs.conf by setting writeTimeout.
Read/write failure. Typical causes include the indexer's machine crashing or the network going down.

In all these cases, the forwarder will then attempt to open a connection to the next indexer in the load-balanced group, or to the same indexer again if load-balancing is not enabled.

The possibility of duplicates

It's possible for the indexer to index the same data block twice. This can happen if there's a network problem that prevents an acknowledgment from reaching the forwarder. For instance, assume the indexer receives a data block, parses it, and writes it to the file system. It then generates the acknowledgment. However, on the round-trip to the forwarder, the network goes down, so the forwarder never receives the acknowledgment. When the network comes back up, the forwarder then resends the data block, which the indexer will parse and write as if it were new data.

To deal with such a possibility, every time the forwarder resends a data block, it writes an event to its splunkd.log noting that it's a possible duplicate. The admin is responsible for using the log information to track down the duplicate data on the indexer.

Here's an example of a duplicate warning:

10-18-2010 17:32:36.941 WARN  TcpOutputProc - Possible duplication of events with 
channel=source::/home/jkerai/splunk/current-install/etc/apps/sample_app
/logs/maillog.1|host::MrT|sendmail|, streamId=5941229245963076846, offset=131072 
subOffset=219 on host=10.1.42.2:9992

Enable indexer acknowledgment

You configure indexer acknowledgment on the forwarder. Configure the useACK setting to true in the outputs.conf file:

[tcpout:<target_group>]
server=<server1>, <server2>, ...
useACK=true

By default, useACK is set to false.

You can set useACK either globally or by target group, at the [tcpout] or [tcpout:] stanza levels. You cannot set it for individual receiving indexers at the [tcpout-server: ...] stanza level.

For more information, see the outputs.conf spec file.

Indexer acknowledgment and forwarded data throughput

The forwarder uses a wait queue to manage the indexer acknowledgment process. This queue has a default maximum size of 21MB, which is generally sufficient. In rare cases, however, you might need to manually adjust the wait queue size.

If you want more information about the wait queue, read this section. It describes how the wait queue size is configured. It also provides detailed information on how the wait queue functions.

How the wait queue size is configured

You do not set the wait queue size directly. Instead, you set the size of the in-memory output queue, and the wait queue size is automatically set to three times the output queue size. To configure the output queue size, use the maxQueueSize attribute in outputs.conf.

The default for the maxQueueSize attribute is auto. Splunk recommends that you keep this setting. It optimizes the queue sizes, based on whether indexer acknowledgment is enabled:

When useACK=true, the output queue size is 7MB and the wait queue size is 21MB.
When useACK=false, the output queue size is 500KB.

You can set maxQueueSize to specific values if necessary. See the outputs.conf spec file for further details on maxQueueSize.

Note the following points regarding the maxQueueSize=auto recommendation:

When you turn on indexer acknowledgment, the increase in queue size takes effect only after you restart the forwarder.
The auto setting is only available for forwarders of version 5.0.4 and above. For earlier version forwarders running indexer acknowledgment, you need to explicitly set the maxQueueSize attribute to 7MB.

Why the wait queue matters

If you enable indexer acknowledgment, the forwarder uses a wait queue to manage the acknowledgment process. Because the forwarder sends data blocks continuously and does not wait for acknowledgment before sending the next block, its wait queue will typically maintain many blocks, each waiting for its acknowledgment. The forwarder will continue to send blocks until its wait queue is full, at which point it will stop forwarding. The forwarder then waits until it receives an acknowledgment, which allows it to release a block from its queue and thus resume forwarding.

A wait queue can fill up when something is wrong with the network or indexer; however, it can also fill up even when the indexer is functioning normally. This is because the indexer only sends the acknowledgment after it has written the data to the file system. Any delay in writing to the file system will slow the pace of acknowledgment, leading to a full wait queue.

There are a few reasons that a normal functioning indexer might delay writing data to the file system (and so delay its sending of acknowledgments):

The indexer is very busy. For example, at the time the data arrives, the indexer might be dealing with multiple search requests or with data coming from a large number of forwarders.
The indexer is receiving too little data. For efficiency, an indexer only writes to the file system periodically -- either when a write queue fills up or after a timeout of a few seconds. If a write queue is slow to fill up, the indexer will wait until the timeout to write. If data is coming from only a few forwarders, the indexer can end up in the timeout condition, even if each of those forwarders is sending a normal quantity of data. Since write queues exist on a per hot bucket basis, the condition occurs when some particular bucket is getting a small amount of data. Usually this means that a particular index is getting a small amount of data.

To ensure that throughput does not degrade because the forwarder is waiting on the indexer for acknowledgment, you should ordinarily retain the default setting of maxQueueSize=auto. In rare cases, you might need to increase the wait queue size, so that the forwarder has sufficient space to maintain all blocks in memory while waiting for acknowledgments to arrive. On the other hand, if you have many forwarders feeding a single indexer and a moderate number of data sources per forwarder, you might be able to conserve a few megabytes of memory by using a smaller size.

When the receiver is a forwarder, not an indexer

You can also use indexer acknowledgment when the data transmission occurs via an intermediate forwarder; that is, where an originating forwarder sends the data to an intermediate forwarder, which then forwards it to the indexer. For this scenario, if you want to use indexer acknowledgment, it is recommended that you enable it along all segments of the data transmission. That way, you can ensure that the data gets delivered along the entire path from originating forwarder to indexer.

Assume you have an originating forwarder that sends data to an intermediate forwarder, which in turn forwards that data to an indexer. To enable indexer acknowledgment along the entire line of transmission, you must enable it twice: first for the segment between originating forwarder and intermediate forwarder, and again for the segment between intermediate forwarder and indexer.

If you enable both segments of the transmission, the intermediate forwarder waits until it receives acknowledgment from the indexer and then it sends acknowledgment back to the originating forwarder.

However, if you enable just one of the segments, you only get indexer acknowledgment over that part of the transmission. For example, say indexer acknowledgment is enabled for the segment from originating forwarder to intermediate forwarder but not for the segment from intermediate forwarder to indexer. In this case, the intermediate forwarder sends acknowledgment back to the originating forwarder as soon as it sends the data on to the indexer. It then relies on TCP to safely deliver the data to the indexer. Because indexer acknowledgment is not enabled for this second segment, the intermediate forwarder cannot verify delivery of the data to the indexer. This second case has limited value and is not recommended.

Related answers from Splunk Community

Protect against loss of in-flight data

Indexer acknowledgment and indexer clusters

How indexer acknowledgment works when everything goes well

How indexer acknowledgment works when there's a failure

Why no acknowledgment?

How the forwarder deals with failure

Other reasons the forwarder might close a connection

The possibility of duplicates

Enable indexer acknowledgment

Indexer acknowledgment and forwarded data throughput

How the wait queue size is configured

Why the wait queue matters

When the receiver is a forwarder, not an indexer

Comments

Protect against loss of in-flight data

Was this topic useful?