Data checkpoints

When reading data for indexing, you can set checkpoints to mark a source as having been read and indexed. You can persist any state information that is appropriate for your input. Typically, you store (checkpoint) the progress of an input source so that, upon restart, the script knows where to resume reading data. This prevents you from reading and indexing the same data twice.
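
A checkpoint does not have to be a bare marker; it can hold whatever state the input needs to resume. The following is a minimal sketch, not part of the S3 example, that stores a small dictionary (here a byte offset) as JSON. The helper names save_state and load_state and the "offset" key are hypothetical, and os.replace assumes Python 3.3 or later.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state dict to path via a temp file and an atomic rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # Replace the old checkpoint in one step so a crash never leaves
    # a half-written file behind (Python 3.3+).
    os.replace(tmp_path, path)


def load_state(path, default=None):
    """Return the saved state, or default when no usable checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        # Missing, unreadable, or corrupt checkpoint: start from the default.
        return default
```

On startup, a script could call load_state(path, {"offset": 0}), resume reading from the returned offset, and call save_state after each batch it has finished indexing.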

Splunk software provides a default location for storing checkpoints for modular inputs:

$SPLUNK_DB/modinputs/<input_name>

For example, checkpoint data for the S3 example are stored here:

$SPLUNK_DB/modinputs/s3
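
The script learns its checkpoint directory at run time; the code samples in this topic read it from config["checkpoint_dir"]. As a rough sketch, assuming the input configuration arrives as XML containing a checkpoint_dir element, the value could be extracted as follows. The sample XML, its path, and the read_checkpoint_dir helper are illustrative assumptions, not code from the S3 example.

```python
import xml.etree.ElementTree as ET

# Illustrative configuration only; real values are supplied at run time.
SAMPLE_CONFIG = """<input>
  <server_host>myhost</server_host>
  <checkpoint_dir>/opt/splunk/var/lib/splunk/modinputs/s3</checkpoint_dir>
  <configuration>
    <stanza name="s3://example-bucket/logs">
      <param name="key_id">EXAMPLE</param>
    </stanza>
  </configuration>
</input>"""


def read_checkpoint_dir(xml_text):
    """Return the text of the top-level <checkpoint_dir> element."""
    root = ET.fromstring(xml_text)
    node = root.find("checkpoint_dir")
    if node is None or not node.text:
        raise ValueError("no checkpoint_dir element in input configuration")
    return node.text.strip()


config = {"checkpoint_dir": read_checkpoint_dir(SAMPLE_CONFIG)}
```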

Enable checkpoints in your modular input script

The following example shows how to enable checkpoints in a script. This code sample is from the Splunk S3 example.

Create checkpoint files

In this snippet, you write a function to create the checkpoint file. The checkpoint file is an empty file whose unique name identifies the source. This example derives the file name from the URL of an Amazon S3 source: non-alphanumeric characters are replaced with underscores, and an MD5 hash of the URL is appended. The script is cross-compatible with Python 2 and Python 3 through python-future.

. . .
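import hashlib
import os
from builtins import range

def get_encoded_file_path(config, url):
    # encode the URL (simply to make the file name recognizable)
    name = ""
    for i in range(len(url)):
        if url[i].isalnum():
            name += url[i]
        else:
            name += "_"

    # MD5 the URL; hashlib is available on both Python 2 and Python 3
    # (the old md5 module is Python 2 only) and needs bytes as input
    m = hashlib.md5()
    m.update(url.encode("utf-8"))
    name += "_" + m.hexdigest()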

    return os.path.join(config["checkpoint_dir"], name)
. . .
# simply creates a checkpoint file indicating that the URL was checkpointed
def save_checkpoint(config, url):
    chk_file = get_encoded_file_path(config, url)
    # just create an empty file
    logging.info("Checkpointing url=%s file=%s", url, chk_file)
    f = open(chk_file, "w")
    f.close()
. . .
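
The MD5 suffix matters because the readable part of the name is lossy: every non-alphanumeric character becomes an underscore, so two different URLs can share the same readable prefix. The following compact sketch of the same naming scheme illustrates this. The checkpoint_name helper and the two URLs are hypothetical.

```python
import hashlib


def checkpoint_name(url):
    """Readable, sanitized URL followed by the MD5 hex digest of the URL."""
    readable = "".join(ch if ch.isalnum() else "_" for ch in url)
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return readable + "_" + digest


# Both URLs sanitize to "s3___bucket_app_log", but the digests differ,
# so each source still gets its own checkpoint file name.
a = checkpoint_name("s3://bucket/app-log")
b = checkpoint_name("s3://bucket/app_log")
print(a != b)
```

Without the hash, checkpointing the first URL would make the second look as if it had already been processed.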

Test for checkpoint files

The following function tests whether a checkpoint file exists. Call it before reading from a source so that you don't read the same source twice.

. . .
# returns true if the checkpoint file exists
def load_checkpoint(config, url):
    chk_file = get_encoded_file_path(config, url)
    # try to open this file
    try:
        open(chk_file, "r").close()
    except IOError:
        # the file could not be opened; assume the checkpoint is not there
        return False
    return True
. . .

Read a file and set a checkpoint

After reading a source, set a checkpoint. Here is how you checkpoint an Amazon S3 source.

        . . .
        if not load_checkpoint(config, url):
            # there is no checkpoint for this URL: process
            init_stream()
            request_one_object(url, key_id, secret_key, bucket, obj)
            fini_stream()
            save_checkpoint(config, url)
        else:
            logging.info("URL %s already processed.  Skipping.", url)
       . . .
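
The order of operations above is deliberate: the checkpoint is written only after the read completes, so a crash partway through causes the source to be read again on the next run rather than skipped. The following self-contained sketch shows the same check, read, then checkpoint pattern with a stand-in reader. The names marker_path and process_once, the temporary directory, and the URL are assumptions for illustration.

```python
import os
import tempfile


def marker_path(checkpoint_dir, source_id):
    """Map a source identifier to a marker file inside checkpoint_dir."""
    safe = "".join(ch if ch.isalnum() else "_" for ch in source_id)
    return os.path.join(checkpoint_dir, safe)


def process_once(checkpoint_dir, source_id, read_source):
    """Read the source unless a marker exists; return True if it was read."""
    path = marker_path(checkpoint_dir, source_id)
    if os.path.isfile(path):
        return False  # already checkpointed: skip
    read_source(source_id)  # may raise; then no marker is written
    open(path, "w").close()  # checkpoint only after a successful read
    return True


checkpoint_dir = tempfile.mkdtemp()  # stand-in for the real checkpoint directory
seen = []
first = process_once(checkpoint_dir, "s3://bucket/object", seen.append)
second = process_once(checkpoint_dir, "s3://bucket/object", seen.append)
print(first, second, seen)  # True False ['s3://bucket/object']
```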

Remove checkpoints

You can remove checkpoints by running the Splunk clean utility.

Caution: Be careful when removing checkpoints. Running the clean command removes your indexed data. For example, clean all removes ALL your indexed data.

To remove checkpoints for a specific scheme, use the following syntax:

splunk clean inputdata [<scheme>]

For example, to remove all checkpoints for the S3 modular input example, run the following command:

splunk clean inputdata s3

You can remove checkpoints for all modular inputs by running the command without the optional <scheme> argument. Alternatively, use the all argument, which also removes all your indexed data.

// Be careful with these commands! See CAUTION above.

splunk clean inputdata
splunk clean all
