Splunk® Supported Add-ons

Splunk Add-on for AWS

Configure Generic S3 inputs for the Splunk Add-on for AWS

Each time it runs, the Generic S3 input lists all the objects in the bucket and examines each file's modified date to pull any uncollected data from an S3 bucket. When the number of objects in a bucket is large, this can be a very time-consuming process with low throughput.

Before you begin configuring your Generic S3 inputs, note the following expected behaviors.

  1. You cannot edit the initial scan time parameter of an S3 input after you create it. If you need to adjust the start time of an S3 input, delete it and recreate it.
  2. The S3 data input is not intended to read frequently modified files. If a file is modified after it has been indexed, the Splunk platform indexes the file again, resulting in duplicated data. Use the key, blacklist, and whitelist options to instruct the add-on to index only those files that you know will not be modified later (see the example stanzas after this list).
  3. The S3 data input processes compressed files according to their suffixes. Use these suffixes only if the file is in the corresponding format, or data processing errors will occur. The data input supports the following compression types:
    • single file in zip, gzip, tar, or tar.gz formats
    • multiple files with or without folders in zip, tar, or tar.gz format

    Expanding compressed files requires significant operating system resources.

  4. The Splunk platform auto-detects the character set used in your files among these options: UTF-8 with or without BOM, UTF-16LE/BE with BOM, and UTF-32BE/LE with BOM. If your S3 key uses a different character set, you can specify it in inputs.conf using the character_set parameter and separate this collection job out into its own input (see the example stanzas after this list). Mixing non-autodetected character sets in a single input causes errors.
  5. If your S3 bucket contains a very large number of files, you can configure multiple S3 inputs for a single S3 bucket to improve performance. The Splunk platform dedicates one process for each data input, so provided that your system has sufficient processing power, performance will improve with multiple inputs. See Performance reference for the S3 input in the Splunk Add-on for AWS for details.

    To prevent indexing duplicate data, verify that multiple inputs do not collect the same S3 folder and file data.

  6. As a best practice, archive your S3 bucket contents when you no longer need to actively collect them. AWS charges for list key API calls that the input uses to scan your buckets for new and changed files, so you can reduce costs and improve performance by archiving older S3 keys to another bucket or storage type.
  7. After configuring an S3 input, you may need to wait a few minutes before new events are ingested and can be searched. The wait time depends on the number of files in the S3 buckets from which you are collecting data: the larger the quantity, the longer the delay. A more verbose logging level also increases data ingestion time. Be warned that debug mode is extremely verbose and is not recommended on production systems.
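
The following inputs.conf sketch illustrates items 2 and 4 above. The stanza names, account name, bucket, key prefixes, and encoding value are hypothetical placeholders, not settings from this documentation:

# Item 2: exclude everything by default, then use the whitelist
# override to re-include only completed, immutable .gz archives.
[aws_s3://archived_elb_logs]
aws_account = my_aws_account
bucket_name = example-log-bucket
key_name = elb/2018/
blacklist = .*
whitelist = .*\.gz$

# Item 4: keys under this prefix use a non-autodetected encoding,
# so this collection job gets its own dedicated input.
[aws_s3://shift_jis_app_logs]
aws_account = my_aws_account
bucket_name = example-log-bucket
key_name = app/sjis/
character_set = SHIFT-JIS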

Configure a Generic S3 input on the data collection node in one of the following ways:

Configure a Generic S3 input using Splunk Web

To configure inputs in Splunk Web, click on Splunk Add-on for AWS in the left navigation bar on Splunk Web home, then choose one of the following menu paths depending on which data type you want to collect:

  • Create New Input > CloudTrail > Generic S3
  • Create New Input > CloudFront Access Log > Generic S3
  • Create New Input > ELB Access Logs > Generic S3
  • Create New Input > S3 Access Logs > Generic S3
  • Create New Input > Others > Generic S3


Make sure you choose the menu path that corresponds to the data type you want to collect. The system automatically sets the appropriate source type, and may display slightly different field settings on the subsequent configuration page, based on the menu path.

Argument in configuration file | Field in Splunk Web | Description
aws_account | AWS Account | The AWS account or EC2 IAM role the Splunk platform uses to access the keys in your S3 buckets. In Splunk Web, select an account from the drop-down list. In inputs.conf, enter the friendly name of one of the AWS accounts that you configured on the Configuration page or the name of the autodiscovered EC2 IAM role.

Note: If the region of the AWS account you select is GovCloud, you may encounter errors such as "Failed to load options for S3 Bucket." You need to manually add the AWS GovCloud endpoint in the S3 Host Name field. See http://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html for more information.

aws_iam_role | Assume Role | The IAM role to assume. See Manage IAM roles.
bucket_name | S3 Bucket | The AWS bucket name.
log_file_prefix | Log File Prefix | The prefix of the log files. The add-on searches for log files under this prefix.
log_start_date | Start Date/Time | The start date of the log.
log_end_date | End Date/Time | The end date of the log.
sourcetype | Source Type | A source type for the events. Specify only if you want to override the default of aws:s3. You can select a source type from the drop-down list or type a custom source type yourself. To index access logs, enter aws:s3:accesslogs, aws:cloudfront:accesslogs, or aws:elb:accesslogs, depending on the log types in the bucket. To index CloudTrail events directly from an S3 bucket, change the source type to aws:cloudtrail.
index | Index | The index name where the Splunk platform puts the S3 data. The default is main.
ct_blacklist | CloudTrail Event Blacklist | Only valid if the source type is set to aws:cloudtrail. A PCRE regular expression that specifies event names to exclude. The default regex is ^(?:Describe|List|Get), which excludes read-only events that can produce a high volume of data. Set it to ^$ if you want all events to be indexed.
blacklist | Blacklist | A PCRE regular expression that specifies the S3 keys to ignore when using a folder key (for example, .*\.bin$).
polling_interval | Polling Interval | The number of seconds to wait before the Splunk platform runs the command again. The default is 1800 seconds.
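
As a concrete illustration of the table above, the following inputs.conf sketch collects CloudTrail events from an S3 bucket. The stanza name, account friendly name, and bucket are hypothetical placeholders; the ct_blacklist value shown matches the shipped default:

[aws_s3://cloudtrail_from_s3]
aws_account = my_aws_account
bucket_name = example-cloudtrail-bucket
# Override the aws:s3 default source type for CloudTrail data.
sourcetype = aws:cloudtrail
index = main
polling_interval = 1800
# Exclude high-volume read-only events (the shipped default).
ct_blacklist = ^(?:Describe|List|Get)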

Configure a Generic S3 input using configuration files

When you configure inputs manually in inputs.conf, create a stanza using the following template and add it to $SPLUNK_HOME/etc/apps/Splunk_TA_aws/local/inputs.conf. If the file or path does not exist, create it.

[aws_s3://<name>]
is_secure = <whether to use a secure connection to AWS>
host_name = <the host name of the S3 service>
aws_account = <AWS account used to connect to AWS>
bucket_name = <S3 bucket name>
polling_interval = <Polling interval for statistics>
key_name = <S3 key prefix>
recursion_depth = <For folder keys, -1 == unconstrained>
initial_scan_datetime = <Splunk relative time>
terminal_scan_datetime = <Only S3 keys which have been modified before this datetime will be considered. Using datetime format: %Y-%m-%dT%H:%M:%S%z (for example, 2011-07-06T21:54:23-0700).>
max_items = <Max trackable items.>
max_retries = <Max number of retry attempts to stream incomplete items.>
whitelist = <Override regex for blacklist when using a folder key.>
blacklist = <Keys to ignore when using a folder key.>
character_set = <The encoding used in your S3 files. Defaults to 'auto', meaning that the file encoding is detected automatically among UTF-8, UTF-8 without BOM, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE. Note that once a specific encoding is set, the data input only handles that encoding.>
ct_blacklist = <The blacklist to exclude CloudTrail events. Only valid when sourcetype is manually set to aws:cloudtrail.>
ct_excluded_events_index = <Name of the index to put excluded events into. The default is empty, which discards the events.>
aws_iam_role = <AWS IAM role to be assumed>

Note: Under one AWS account, to ingest logs from different prefixed locations in the bucket, you need to configure multiple AWS data inputs, one for each prefix name, as in the following example. Alternatively, you can configure one data input but use different AWS accounts to ingest logs from different prefixed locations in the bucket.
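
For example, to ingest logs from two prefixed locations in the same bucket under one AWS account, a minimal sketch (the stanza names, account, bucket, and prefixes are hypothetical) uses one stanza per prefix:

# First prefixed location.
[aws_s3://frontend_access_logs]
aws_account = my_aws_account
bucket_name = example-log-bucket
key_name = logs/frontend/

# Second prefixed location.
[aws_s3://backend_access_logs]
aws_account = my_aws_account
bucket_name = example-log-bucket
key_name = logs/backend/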

Some of these settings have default values that can be found in $SPLUNK_HOME/etc/apps/Splunk_TA_aws/default/inputs.conf:

[aws_s3]
aws_account =
sourcetype = aws:s3
initial_scan_datetime = default
max_items = 100000
max_retries = 3
polling_interval =
interval = 30
recursion_depth = -1
character_set = auto
is_secure = True
host_name = s3.amazonaws.com
ct_blacklist = ^(?:Describe|List|Get)
ct_excluded_events_index =
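
Because a local stanza inherits these defaults, a minimal working input only needs the settings that have no usable default. A sketch, assuming a hypothetical account name and bucket:

[aws_s3://minimal_s3_input]
# Must match the friendly name of an account configured on the
# Configuration page; the bucket name is likewise a placeholder.
aws_account = my_aws_account
bucket_name = example-log-bucket
# All other settings (sourcetype, polling, retries, host_name, and
# so on) fall back to the default/inputs.conf values shown above.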


Comments

Note that when using a mixed-content bucket (i.e., one containing both log and mp3 files), the whitelist/blacklist will not prevent the fetch phase from collecting the mp3 file and attempting to process it. This causes Splunk to crash as it tries to read the mp3 file as JSON. I had to modify _fetch_key in the aws_s3 data_loader.py to reapply the match_regex.

Joseft
September 18, 2018

Hello Pkeller,

I have reached out to the engineering team about your issue, and coordinated with the support team to link your service ticket to the engineering ticket. Your support engineer will be in touch with you regarding the status of the ticket.

Mglauser splunk, Splunker
June 14, 2018

I've opened up a ticket on this, but wanted to mention that initial_scan_datetime doesn't appear to work at all. The document says to use <Splunk relative time> ... but if you try something like: -7d@d in the UI it rejects the input and says:

Please enter a correct date format e.g. 2000-01-01T00:00:00Z

If you do that, and enter something like: 2018-06-01T00:00:00Z .. the value is accepted, but the collection still goes out and starts grabbing data from much, much earlier than the first of June 2018 ...

Pkeller
June 7, 2018

Hi Vsingla1 and Manny1time,
Blacklisting can be used on the "source = s3://xxx/34/xx/xx.gz" input by creating a regular expression to exclude the unwanted sources. For example, a regex to exclude .conf files and files with sources that end in ".bin" is ".*(\.conf$|\.bin$)". For your example, the regex would look like:
.*(\.34$).

Mglauser splunk, Splunker
June 5, 2018

I have the same question as Manny1time. Can someone answer this please?

Vsingla1
June 5, 2018

Hello Kumarv,
Thanks for providing feedback. If you change a configuration file manually then you would need to either restart Splunk or disable/enable the input for Splunk to pick up the change. The Splunk Web UI is the recommended method for doing this, as it does validation during the configuration process. If it's impossible to use the UI for this task, please consider restarting your deployment after any change. Hopefully this helps, but if not, please don't hesitate to reach out.

Mglauser splunk, Splunker
May 15, 2018

I am seeing an issue with S3 generic when configured via the configuration file. The input does NOT work until you disable and re-enable it via the Web UI. Is this a bug? Do we have a fix or workaround for this? I am trying to automate the S3 data ingestion without using the Web UI.

Kumarv
May 11, 2018

Can blacklisting be used on this type of input for reading from "source = s3://xxx/34/xx/xx.gz" ? Where I would want to blacklist anything that contained a "34". Thank you.

Manny1time
December 30, 2017

It was my understanding that "polling_interval" ("Interval" in the GUI) regulates how often data is collected from AWS. However, changing this value seems to have no effect on that.

Reedmohn
May 14, 2017
