Configure Generic S3 inputs for the Splunk Add-on for AWS
Each time it runs, the Generic S3 input lists all the objects in an S3 bucket and examines each file's modified date to identify data that has not yet been collected. When the number of objects in a bucket is large, this can be a very time-consuming process with low throughput.
Before you begin configuring your Generic S3 inputs, note the following expected behaviors:
- You cannot edit the initial scan time parameter of an S3 input after you create it. If you need to adjust the start time of an S3 input, delete it and recreate it.
- The S3 data input is not intended to read frequently modified files. If a file is modified after it has been indexed, the Splunk platform indexes the file again, resulting in duplicated data. Use the key, blacklist, and whitelist options to instruct the add-on to index only those files that you know will not be modified later (see the example stanza after this list).
- The S3 data input processes compressed files according to their suffixes. Use these suffixes only if the file is in the corresponding format, or data processing errors will occur.
The data input supports the following compression types:
- single file in ZIP, GZIP, TAR, or TAR.GZ formats
- multiple files with or without folders in ZIP, TAR, or TAR.GZ format
Expanding compressed files requires significant operating system resources.
- The Splunk platform auto-detects the character set used in your files among these options: UTF-8 with or without BOM, UTF-16LE/BE with BOM, and UTF-32LE/BE with BOM. If your S3 key uses a different character set, specify it in the character_set parameter and separate this collection job into its own input (see the example stanza after this list). Mixing non-autodetected character sets in a single input causes errors.
- If your S3 bucket contains a very large number of files, you can configure multiple S3 inputs for a single S3 bucket to improve performance. The Splunk platform dedicates one process for each data input, so provided that your system has sufficient processing power, performance will improve with multiple inputs. See Performance reference for the S3 input in the Splunk Add-on for AWS for details.
To prevent indexing duplicate data, verify that multiple inputs do not collect the same S3 folder and file data.
- As a best practice, archive your S3 bucket contents when you no longer need to actively collect them. AWS charges for list key API calls that the input uses to scan your buckets for new and changed files, so you can reduce costs and improve performance by archiving older S3 keys to another bucket or storage type.
- After configuring an S3 input, you may need to wait a few minutes before new events are ingested and can be searched. The wait time depends on the number of files in the S3 buckets from which you are collecting data: the larger the quantity, the longer the delay. A more verbose logging level also increases ingestion time. Debug mode is extremely verbose and is not recommended on production systems.
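For example, the following is a minimal sketch of an input stanza that applies the whitelist, blacklist, and character_set guidance above. The input name, account, bucket, prefix, and regex patterns are hypothetical placeholders rather than values from your environment, and the character_set value is only illustrative:

```
[aws_s3://example_app_logs]
aws_account = example-account
bucket_name = example-log-bucket
key_name = logs/app/
# Index only finalized .gz archives, which are never modified after upload.
whitelist = \.gz$
# Ignore temporary or in-progress files that may still change.
blacklist = \.(?:tmp|part)$
# All keys collected by this input share one non-autodetected encoding.
character_set = GB2312
```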
Configure a Generic S3 input on the data collection node in one of the following ways:
- Configure a Generic S3 input using Splunk Web (recommended)
- Configure a Generic S3 input using a configuration file
Configure a Generic S3 input using Splunk Web
To configure inputs in Splunk Web, click Splunk Add-on for AWS in the left navigation bar on Splunk Web home, then choose one of the following menu paths depending on which data type you want to collect:
- Create New Input > CloudTrail > Generic S3
- Create New Input > CloudFront Access Log > Generic S3
- Create New Input > ELB Access Logs > Generic S3
- Create New Input > S3 Access Logs > Generic S3
- Create New Input > Others > Generic S3
Make sure you choose the menu path that corresponds to the data type you want to collect. Based on the menu path, the system automatically sets the appropriate source type and may display slightly different field settings on the subsequent configuration page.
| Argument in configuration file | Field in Splunk Web | Description |
| --- | --- | --- |
| aws_account | AWS Account | The AWS account or EC2 IAM role the Splunk platform uses to access the keys in your S3 buckets. In Splunk Web, select an account from the drop-down list. Note: If the region of the AWS account you select is GovCloud, you may encounter errors such as "Failed to load options for S3 Bucket." You need to manually add the AWS GovCloud endpoint in the S3 Host Name field. See http://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html for more information. |
| aws_iam_role | Assume Role | The IAM role to assume. See Manage IAM roles. |
| bucket_name | S3 Bucket | The AWS bucket name. |
| key_name | Log File Prefix/S3 Key Prefix | The prefix of the log file path. The add-on searches for log files under this prefix. This field is titled Log File Prefix for incremental S3 inputs and S3 Key Prefix for generic S3 inputs. |
| initial_scan_datetime | Start Date/Time | The start date of the log. |
| terminal_scan_datetime | End Date/Time | The end date of the log. |
| sourcetype | Source Type | A source type for the events. Specify only if you want to override the default of aws:s3. |
| index | Index | The index where the Splunk platform should put the S3 data. The default is main. |
| ct_blacklist | CloudTrail Event Blacklist | Only valid if the source type is set to aws:cloudtrail. A PCRE regular expression that specifies the CloudTrail event names to exclude. The default is ^(?:Describe\|List\|Get). |
| polling_interval | Polling Interval | The number of seconds to wait before the Splunk platform runs the command again. The default is 1800 seconds. |
Configure a Generic S3 input using a configuration file
When you configure inputs manually in inputs.conf, create a stanza using the following template and add it to $SPLUNK_HOME/etc/apps/Splunk_TA_aws/local/inputs.conf. If the file or path does not exist, create it.
```
[aws_s3://<name>]
is_secure = <whether to use a secure connection to AWS>
host_name = <the host name of the S3 service>
aws_account = <AWS account used to connect to AWS>
bucket_name = <S3 bucket name>
polling_interval = <polling interval for statistics>
key_name = <S3 key prefix>
recursion_depth = <for folder keys, -1 == unconstrained>
initial_scan_datetime = <Splunk relative time>
terminal_scan_datetime = <only S3 keys modified before this datetime are considered, in the format %Y-%m-%dT%H:%M:%S%z (for example, 2011-07-06T21:54:23-0700)>
max_items = <max trackable items>
max_retries = <max number of retry attempts to stream incomplete items>
whitelist = <override regex for blacklist when using a folder key>
blacklist = <keys to ignore when using a folder key>
character_set = <the encoding used in your S3 files; defaults to 'auto', which detects the encoding automatically among UTF-8 with or without BOM, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE; once a specific encoding is set, the data input only handles that encoding>
ct_blacklist = <the blacklist used to exclude CloudTrail events; only valid when the source type is manually set to aws:cloudtrail>
ct_excluded_events_index = <name of the index to put excluded events into; the default is empty, which discards the events>
aws_iam_role = <AWS IAM role to be assumed>
```
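For instance, a filled-in stanza for collecting CloudTrail logs might look like the following sketch. The input name, account, bucket, and prefix are hypothetical, and the relative start time is only one possible choice:

```
[aws_s3://cloudtrail_prod]
aws_account = prod-account
bucket_name = example-cloudtrail-bucket
key_name = AWSLogs/
sourcetype = aws:cloudtrail
# Start collecting from keys modified within the last seven days.
initial_scan_datetime = -7d@d
polling_interval = 1800
```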
Note: Under one AWS account, to ingest logs from different prefixed locations in the bucket, you need to configure multiple AWS data inputs, one for each prefix name. Alternatively, you can configure one data input but use different AWS accounts to ingest logs from the different prefixed locations in the bucket.
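A sketch of the first approach, assuming two hypothetical prefixes app1/ and app2/ in the same bucket. Keeping the prefixes non-overlapping also prevents the duplicate collection described earlier:

```
[aws_s3://example_bucket_app1]
aws_account = example-account
bucket_name = example-log-bucket
key_name = app1/

[aws_s3://example_bucket_app2]
aws_account = example-account
bucket_name = example-log-bucket
key_name = app2/
```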
Some of these settings have default values, which you can find in $SPLUNK_HOME/etc/apps/Splunk_TA_aws/default/inputs.conf:

```
[aws_s3]
aws_account =
sourcetype = aws:s3
initial_scan_datetime = default
max_items = 100000
max_retries = 3
polling_interval =
interval = 30
recursion_depth = -1
character_set = auto
is_secure = True
host_name = s3.amazonaws.com
ct_blacklist = ^(?:Describe|List|Get)
ct_excluded_events_index =
```
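Settings in your local stanza override these defaults. As one sketch, the default ct_blacklist excludes read-only CloudTrail events, which are then discarded because ct_excluded_events_index is empty; if you want to keep them, you might route them to a separate index instead. The stanza values and index name below are hypothetical, and the index must already exist:

```
[aws_s3://cloudtrail_keep_readonly]
aws_account = prod-account
bucket_name = example-cloudtrail-bucket
sourcetype = aws:cloudtrail
# By default ct_excluded_events_index is empty and excluded events are discarded.
# Naming an index here keeps the excluded (read-only) events searchable.
ct_excluded_events_index = cloudtrail_excluded
```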