Configure SQS-based S3 inputs for the Splunk Add-on for AWS

SQS-based S3 is the recommended input type for collecting a variety of pre-defined data types:

  • CloudFront Access Logs
  • Config
  • ELB Access Logs
  • CloudTrail
  • S3 Access Logs
  • Custom data types

Configure SQS-based S3 inputs to collect the data types they support.

Before you configure SQS-based S3 inputs, configure S3 to send notifications to SQS through SNS. This lets S3 notify the add-on that new events were written to the S3 bucket.

Keep the following in mind as you configure your inputs:

  • The SQS-based S3 input collects only AWS service logs that meet the following criteria:
    • near-real time
    • newly created
    • stored in S3 buckets
    • have event notifications sent to SQS

Events that occurred in the past or events with no notifications sent through SNS to SQS are not collected. To collect historical logs stored in S3 buckets, use the generic S3 input instead. The S3 input lets you set the initial scan time parameter (log start date) to collect data generated after a specified time in the past.

  • To collect the same types of logs from multiple S3 buckets, even across regions, set up one input to collect data from all the buckets. To do this, configure these buckets to send notifications to the same SQS queue from which the SQS-based S3 input polls messages.
  • To achieve high throughput data ingestion from an S3 bucket, configure multiple SQS-based S3 inputs for the S3 bucket to scale out data collection.
  • After configuring an SQS-based S3 input, you might need to wait a few minutes before new events are ingested and can be searched. Also, a more verbose logging level causes longer data ingestion time. Debug mode is extremely verbose and is not recommended on production systems.
  • The SQS-based input allows you to ingest data from S3 buckets by optimizing the API calls made by the add-on and relying on SQS/SNS to collect events upon receipt of notification.
  • The SQS-based S3 input is stateless, which means that when multiple inputs are collecting data from the same bucket, if one input goes down, the other inputs continue to collect data and take over the load from the failed input. This lets you enhance fault tolerance by configuring multiple inputs to collect data from the same bucket. For an example of multiple inputs sharing one queue, see the sketch after this list.
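
The following is a minimal inputs.conf sketch, using the configuration file syntax described later in this topic, of two SQS-based S3 inputs that poll the same SQS queue. The stanza names, account name, and queue URL are hypothetical placeholders; substitute your own values.

[aws_sqs_based_s3://cloudtrail_sqs_input_1]
aws_account = my_aws_account
interval = 300
s3_file_decoder = CloudTrail
sourcetype = aws:cloudtrail
sqs_batch_size = 10
sqs_queue_region = us-east-1
sqs_queue_url = https://sqs.us-east-1.amazonaws.com/123456789012/my-cloudtrail-queue

[aws_sqs_based_s3://cloudtrail_sqs_input_2]
aws_account = my_aws_account
interval = 300
s3_file_decoder = CloudTrail
sourcetype = aws:cloudtrail
sqs_batch_size = 10
sqs_queue_region = us-east-1
sqs_queue_url = https://sqs.us-east-1.amazonaws.com/123456789012/my-cloudtrail-queue

Because both inputs consume messages from the same queue, they share the collection load, and if one input goes down the other keeps collecting.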


Configure an SQS-based S3 input using Splunk Web

To configure inputs in Splunk Web, click Splunk Add-on for AWS in the left navigation bar on Splunk Web home, then choose one of the following menu paths depending on which data type you want to collect:

  • Create New Input > CloudTrail > SQS-based S3
  • Create New Input > CloudFront Access Log > SQS-based S3
  • Create New Input > Config > SQS-based S3
  • Create New Input > ELB Access Logs > SQS-based S3
  • Create New Input > S3 Access Logs > SQS-based S3
  • Create New Input > Others > SQS-based S3

You must have the admin_all_objects capability to add new inputs.

Choose the menu path that corresponds to the data type you want to collect. The system automatically sets the source type and displays the relevant field settings on the configuration page that follows. Use the following table to complete the fields for the new input:

Argument | Corresponding Field in Splunk Web | Description
aws_account | AWS Account | The AWS account or EC2 IAM role the Splunk platform uses to access the keys in your S3 buckets. In Splunk Web, select an account from the drop-down list. In inputs.conf, enter the friendly name of one of the AWS accounts that you configured on the Configuration page, or the name of the autodiscovered EC2 IAM role. Note: If the region of the AWS account you select is GovCloud, you might encounter errors such as "Failed to load options for S3 Bucket." Manually add the AWS GovCloud endpoint in the S3 Host Name field. See http://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html for more information.
aws_iam_role | Assume Role | The IAM role to assume. See Manage IAM roles.
sqs_queue_region | AWS Region | The AWS region that the SQS queue is in, for example, us-east-1.
sqs_queue_url | SQS Queue | The SQS queue URL.
sqs_batch_size | SQS Batch Size | The maximum number of messages to pull from the SQS queue in one batch. Enter an integer between 1 and 10, inclusive. Splunk recommends a larger value for small files and a smaller value for large files. The default SQS batch size is 10. If you are dealing with large files and your system memory is limited, set this to a smaller value.
s3_file_decoder | S3 File Decoder | The decoder to use to parse the corresponding log files. The decoder is set according to the Data Type you select. If you select a custom data type, choose one of CloudTrail, Config, ELB Access Logs, S3 Access Logs, or CloudFront Access Logs.
sourcetype | Source Type | The source type for the events to collect, automatically filled in based on the decoder chosen for the input.
interval | Interval | The length of time in seconds between two data collection runs. The default is 300 seconds.
index | Index | The index name where the Splunk platform puts the SQS-based S3 data. The default is main.


Configure an SQS-based S3 input using the configuration file

When you configure inputs manually in inputs.conf, create a stanza using the following template and add it to $SPLUNK_HOME/etc/apps/Splunk_TA_aws/local/inputs.conf. If the file or path does not exist, create it.

You can configure the parameters below.

[aws_sqs_based_s3://<stanza_name>]
aws_account = <value>
interval = <value>
s3_file_decoder = <value>
sourcetype = <value>
sqs_batch_size = <value>
sqs_queue_region = <value>
sqs_queue_url = <value>

Valid values for s3_file_decoder are: CloudTrail, Config, S3 Access Logs, ELB Access Logs, CloudFront Access Logs, and CustomLogs.
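
For example, a filled-in stanza that collects ELB access logs might look like the following sketch. The stanza name, account name, and queue URL are hypothetical placeholders.

[aws_sqs_based_s3://elb_accesslogs_sqs]
aws_account = my_aws_account
interval = 300
s3_file_decoder = ELB Access Logs
sourcetype = aws:elb:accesslogs
sqs_batch_size = 10
sqs_queue_region = us-east-1
sqs_queue_url = https://sqs.us-east-1.amazonaws.com/123456789012/my-elb-access-logs-queue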

If you want to ingest custom logs other than the natively supported AWS log types, set s3_file_decoder = CustomLogs. This lets you ingest custom logs into Splunk but does not parse the data. To process custom logs into meaningful events, perform additional configuration in props.conf and transforms.conf to parse the collected data to meet your specific requirements, as shown in the sketch below.
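
For example, the following is a minimal props.conf and transforms.conf sketch for a hypothetical space-delimited custom log whose events begin with an ISO 8601 timestamp. The sourcetype name my:custom:logs, the transform name, and the field names are placeholders; match the sourcetype to the one you set on the input and adjust the fields to your data.

# props.conf
[my:custom:logs]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%dT%H:%M:%S%z
MAX_TIMESTAMP_LOOKAHEAD = 32
REPORT-my_custom_fields = my_custom_log_fields

# transforms.conf
[my_custom_log_fields]
DELIMS = " "
FIELDS = "timestamp","level","component","message"

Place index-time settings such as LINE_BREAKER and the TIME_* settings on the heavy forwarder or indexer that parses the data, and the search-time REPORT extraction on your search heads.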

For more information on the SQS-based S3 input settings, see /README/inputs.conf.spec under your add-on directory.


Migrate from the Generic S3 input to the SQS-based S3 input

SQS-based S3 is the recommended input type for real-time data collection from S3 buckets because it is scalable and provides better ingestion performance than the other S3 input types.

If you are already using a generic S3 input to collect data, use the following steps to switch to the SQS-based S3 input.

  1. Perform prerequisite configurations of AWS services:
    • Set up an SQS queue with a dead-letter queue and proper visibility timeout configured. See Configure SQS.
    • Set up the S3 bucket (with S3 key prefix, if specified) from which you are collecting data to send notifications to the SQS queue. See Configure SNS.
  2. Add an SQS-based S3 input using the SQS queue you just configured. See Configure an SQS-based S3 input. After the setup, make sure the new input is enabled and starts collecting data from the bucket.
  3. Edit your old generic S3 input and set the End Date/Time field to now (the current system time) to phase it out.
  4. Wait until all the task executions of the old input are complete. As a best practice, wait at least double your polling frequency.
  5. Disable the old generic S3 input.
  6. Run the following searches to delete any duplicate events collected during the transition:

    For CloudTrail events:

    index=xxx sourcetype=aws:cloudtrail | streamstats count by source, eventID | search count > 1 | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.eventID.indexed_time | table dup_id | outputcsv dupes.csv

    index=xxx sourcetype=aws:cloudtrail | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.eventID.indexed_time | search [|inputcsv dupes.csv | format "(" "" "" "" "OR" ")"] | delete

    For S3 access logs:

    index=xxx sourcetype=aws:s3:accesslogs | streamstats count by source, request_id | search count > 1 | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.request_id.indexed_time | table dup_id | outputcsv dupes.csv

    index=xxx sourcetype=aws:s3:accesslogs | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.request_id.indexed_time | search [|inputcsv dupes.csv | format "(" "" "" "" "OR" ")"] | delete

    For CloudFront access logs:

    index=xxx sourcetype=aws:cloudfront:accesslogs | streamstats count by source, x_edge_request_id | search count > 1 | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.x_edge_request_id.indexed_time | table dup_id | outputcsv dupes.csv

    index=xxx sourcetype=aws:cloudfront:accesslogs | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.x_edge_request_id.indexed_time | search [|inputcsv dupes.csv | format "(" "" "" "" "OR" ")"] | delete

    For Classic Load Balancer (ELB) access logs:

    Because these events do not have unique IDs, use a hash of the raw event to identify and remove duplicates.

    index=xxx sourcetype=aws:elb:accesslogs | eval hash=sha256(_raw) | streamstats count by source, hash | search count > 1 | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.hash.indexed_time | table dup_id | outputcsv dupes.csv

    index=xxx sourcetype=aws:elb:accesslogs | eval hash=sha256(_raw) | eval indexed_time=strftime(_indextime, "%+") | eval dup_id=source.hash.indexed_time | search [|inputcsv dupes.csv | format "(" "" "" "" "OR" ")"] | delete

  7. Optionally, delete the old generic S3 input.

Auto-scale data collection with SQS-based S3 inputs

With the SQS-based S3 input type, you can take full advantage of the auto-scaling capability of the AWS infrastructure to scale out data collection by configuring multiple inputs to ingest logs from the same S3 bucket without creating duplicate events. This is particularly useful if you are ingesting logs from a very large S3 bucket and hit a bottleneck in your data collection inputs.

  1. Create an AWS auto scaling group for your heavy forwarder instances where the SQS-based S3 inputs will be running.
    To create an auto scaling group, you can either specify a launch configuration or create an AMI to provision new EC2 instances that host heavy forwarders, and use a bootstrap script to install the Splunk Add-on for AWS and configure SQS-based S3 inputs. For detailed information about auto scaling groups and how to create them, refer to the AWS documentation: http://docs.aws.amazon.com/autoscaling/latest/userguide/AutoScalingGroup.html.
  2. Set CloudWatch alarms for one of the following Amazon SQS metrics:
    • ApproximateNumberOfMessagesVisible (recommended): The number of messages available for retrieval from the queue.
    • ApproximateAgeOfOldestMessage: The approximate age (in seconds) of the oldest non-deleted message in the queue.
    For instructions on setting CloudWatch alarms for Amazon SQS metrics, refer to the AWS documentation: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/SQS_AlarmMetrics.html.
  3. Use the CloudWatch alarm as a trigger to provision new heavy forwarder instances with SQS-based S3 inputs configured to consume messages from the same SQS queue to improve ingestion performance.

