When reading data for indexing, you can set checkpoints to mark a source as having been read and indexed. You can persist any state information that is appropriate for your input. Typically, you store (checkpoint) the progress of an input source so that, upon restart, the script knows where to resume reading data. This prevents you from reading and indexing the same data twice.
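As a minimal sketch of storing progress rather than a simple "done" marker, the following assumes Python 3 and a hypothetical JSON checkpoint file that records a byte offset. The function names and the file layout are illustrative assumptions, not part of the Splunk S3 example.

```python
import json
import os
import tempfile


def read_offset(checkpoint_path):
    """Return the saved byte offset, or 0 if no usable checkpoint exists."""
    try:
        with open(checkpoint_path, "r") as f:
            return json.load(f)["offset"]
    except (OSError, ValueError, KeyError):
        return 0


def write_offset(checkpoint_path, offset):
    """Persist the offset by writing a temp file and renaming it into place."""
    directory = os.path.dirname(checkpoint_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp_path, checkpoint_path)


def read_new_data(source_path, checkpoint_path):
    """Read only the bytes added to the source since the last checkpoint."""
    offset = read_offset(checkpoint_path)
    with open(source_path, "rb") as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()
    write_offset(checkpoint_path, new_offset)
    return data
```

Calling `read_new_data` repeatedly on a growing file returns each chunk once, because every call resumes from the offset saved by the previous one.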
Splunk software provides a default location for storing checkpoints for modular inputs:
For example, checkpoint data for the S3 example are stored here:
Enable checkpoints in your modular input script
The following example shows how to enable checkpoints in a script. This code sample is from the Splunk S3 example.
Create checkpoint files
In this snippet, you write a function to create the checkpoint file. The checkpoint file is an empty file with a unique name that identifies the source. This example encodes the URL of an Amazon S3 source. The script has been made cross-compatible with Python 2 and Python 3 using python-future.
    . . .
    import hashlib
    from builtins import range

    def get_encoded_file_path(config, url):
        # encode the URL (simply to make the file name recognizable)
        name = ""
        for i in range(len(url)):
            if url[i].isalnum():
                name += url[i]
            else:
                name += "_"

        # MD5 the URL (hashlib replaces the Python 2-only md5 module;
        # the URL is encoded to bytes because Python 3 hashes bytes)
        m = hashlib.md5()
        m.update(url.encode("utf-8"))
        name += "_" + m.hexdigest()

        return os.path.join(config["checkpoint_dir"], name)
    . . .
    # simply creates a checkpoint file indicating that the URL was checkpointed
    def save_checkpoint(config, url):
        chk_file = get_encoded_file_path(config, url)
        # just create an empty file
        logging.info("Checkpointing url=%s file=%s", url, chk_file)
        f = open(chk_file, "w")
        f.close()
    . . .
Test for checkpoint files
This snippet defines a function that tests whether a checkpoint file exists. Call this function before reading from a source to make sure you don't read it twice.
    . . .
    # returns True if the checkpoint file exists
    def load_checkpoint(config, url):
        chk_file = get_encoded_file_path(config, url)
        # try to open this file
        try:
            open(chk_file, "r").close()
        except IOError:
            # assume that this means the checkpoint is not there
            return False
        return True
    . . .
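The helpers can be exercised together outside of Splunk. The following is a compact sketch under stated assumptions: Python 3, `hashlib.md5` for the digest, and a hand-built `config` dictionary whose `checkpoint_dir` points at a temporary directory. The URL is hypothetical.

```python
import hashlib
import os
import tempfile


def get_encoded_file_path(config, url):
    # Replace non-alphanumeric characters so the name stays recognizable
    name = "".join(c if c.isalnum() else "_" for c in url)
    # Append an MD5 digest of the URL so distinct URLs get distinct names
    name += "_" + hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(config["checkpoint_dir"], name)


def save_checkpoint(config, url):
    # An empty file is enough: its existence is the checkpoint
    open(get_encoded_file_path(config, url), "w").close()


def load_checkpoint(config, url):
    # True if a checkpoint file already exists for this URL
    return os.path.isfile(get_encoded_file_path(config, url))


# Hypothetical stand-in for the real checkpoint directory
config = {"checkpoint_dir": tempfile.mkdtemp()}
url = "s3://bucket/key.txt"  # hypothetical source URL

before = load_checkpoint(config, url)  # no checkpoint yet
save_checkpoint(config, url)
after = load_checkpoint(config, url)   # checkpoint now present
print(before, after)  # prints: False True
```

The alphanumeric prefix keeps the file name readable, while the digest suffix distinguishes URLs that differ only in punctuation.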
Read a file and set a checkpoint
After reading a source, set a checkpoint. Here is how you checkpoint an Amazon S3 source.
    . . .
    if not load_checkpoint(config, url):
        # there is no checkpoint for this URL: process
        init_stream()
        request_one_object(url, key_id, secret_key, bucket, obj)
        fini_stream()
        save_checkpoint(config, url)
    else:
        logging.info("URL %s already processed. Skipping.", url)
    . . .
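The ordering in this pattern matters: the checkpoint is written only after the source has been read. The sketch below isolates that ordering in a hypothetical helper, `process_once`, which takes the processing and checkpoint functions as arguments. An in-memory set stands in for checkpoint files so the example runs on its own.

```python
import logging


def process_once(url, process, is_checkpointed, checkpoint):
    """Run process(url) unless a checkpoint exists; checkpoint only on success.

    Returns True if the URL was processed, False if it was skipped.
    """
    if is_checkpointed(url):
        logging.info("URL %s already processed. Skipping.", url)
        return False
    # If process() raises, no checkpoint is written, so the URL is
    # retried on the next run instead of being marked as done.
    process(url)
    checkpoint(url)
    return True


# Usage with an in-memory set standing in for checkpoint files
done = set()
seen = []
first = process_once("s3://bucket/a", seen.append, done.__contains__, done.add)
second = process_once("s3://bucket/a", seen.append, done.__contains__, done.add)
print(first, second, seen)  # prints: True False ['s3://bucket/a']
```

Writing the checkpoint before processing would risk skipping a source whose read failed partway through.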
You can remove checkpoints by running the Splunk clean command.
- Caution: Be careful when removing checkpoints. Running the clean command removes your indexed data. For example, clean all removes ALL your indexed data.
For example, to remove checkpoints for a specific scheme:
splunk clean inputdata [<scheme>]
For example, to remove all checkpoints for the S3 modular input example, run the following command:
splunk clean inputdata s3
You can remove checkpoints for all modular inputs by running the command without the optional <scheme> argument. Alternatively, use the all argument, which also removes all your indexed data.
    // Be careful with these commands! See CAUTION above.
    splunk clean inputdata
    splunk clean all
This documentation applies to the following versions of Splunk Cloud™: 7.0.11, 7.0.13, 7.2.4, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 8.0.2001, 8.0.2003, 8.0.2004, 8.0.2006, 8.0.2007, 8.1.2008