Use the Amazon S3 Connector with Splunk DSP
Use the Amazon S3 Connector to collect data from Amazon S3 buckets. The connector relies on Amazon Simple Queue Service (SQS) notifications to detect new data in those buckets.
To use the Amazon S3 Connector, start by creating a connection that allows it to access data from Amazon S3. Then, add the Amazon S3 Connector to the start of your data pipeline and configure it to use the connection that you created.
The Amazon S3 Connector can't be used to send data to Amazon S3 buckets. If you want to send data to S3 as a destination, you must use a "Write to S3-compatible storage" connection and sink function. See Send data from Splunk DSP to Amazon S3 for more information.
Behavior of the Amazon S3 Connector
The Amazon S3 Connector collects newly created files stored in S3 buckets in response to notifications received from SQS. The connector doesn't collect an event if the event occurred before the SQS queue was set up, or if notifications about the event are not received through SQS.
When new data is added to an Amazon S3 bucket, S3 sends a notification to SQS with information about the new content, such as bucket_name, key, and size. The Amazon S3 Connector uses this information to download the new files from the Amazon S3 bucket, read and parse the file content, wrap the results into events, and then send the events into the DSP pipeline using the Collect service.
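The following Python sketch illustrates this SQS-driven collection pattern in general terms. It is not the connector's implementation; the queue URL is a placeholder, and it assumes S3 publishes notifications directly to SQS (if SNS sits in between, the S3 notification is nested inside the SNS message body).

```python
# Minimal sketch of SQS-driven S3 collection, for illustration only.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-events"  # placeholder

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def poll_once():
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # An S3 event notification lists the affected objects under "Records",
        # including the bucket name, object key, and object size.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            payload = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # A real collector would parse the payload into events and send
            # them downstream before acknowledging the message.
            print(f"downloaded {len(payload)} bytes from s3://{bucket}/{key}")
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```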
The connector handles data differently depending on the type of file that it reads the data from:
- CSV files: The connector ingests each line as an event. The first line in the file must be the header.
- Plain text files: The connector ingests each line as an event.
- JSON files: By default, the connector ingests each JSON file as an event. However, you can configure the connector to ingest each element from a specific array field in the file as an event.
- Parquet files: The connector ingests each row as an event. The connector supports row group sizes up to 265 MB, and supports uncompressed files as well as files that are compressed with the Snappy, GZIP, or zstd compression algorithms.
For use cases where a mix of file types might be added to the Amazon S3 bucket, the connector can automatically detect the CSV, JSON, and plain text file types and then handle the data accordingly. Parquet files can't be auto-detected, so if the Amazon S3 bucket also contains Parquet files, you must create a new job and explicitly set the File Type parameter to Parquet to ingest the files.
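As a rough illustration of how these file types map onto events, the following Python sketch parses a downloaded payload according to an already-known file type. It is not the connector's code: the function name, the array_field parameter, and the UTF-8 assumption are all illustrative.

```python
# Sketch of turning a downloaded file payload into events, assuming the
# file type is already known. Illustrative only.
import csv
import io
import json

def to_events(payload, file_type, array_field=None):
    text = payload.decode("utf-8")  # assumes UTF-8 encoded content
    if file_type == "csv":
        # The first line is the header; each remaining line becomes one event.
        return list(csv.DictReader(io.StringIO(text)))
    if file_type == "json":
        doc = json.loads(text)
        if array_field:
            # Each element of the named first-level array becomes one event.
            # An invalid field name yields no events, mirroring the behavior
            # described for the Field parameter later in this topic.
            elements = doc.get(array_field) if isinstance(doc, dict) else None
            return elements if isinstance(elements, list) else []
        return [doc]  # by default, the whole JSON file is a single event
    # Plain text: each line becomes one event.
    return text.splitlines()
```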
The connector detects CSV, JSON, and plain text file types based on the following rules:
- If the S3 object is an uncompressed file, the Content-Type header attribute determines the file type:
  - "text/csv": A CSV file.
  - "application/json": A JSON file.
  - "text/plain": A plain text file.
- If the S3 object is a compressed file that has Content-Type set to application/x-gzip, the connector treats it as a GZIP file. The connector downloads and decompresses the GZIP archive, and then reads all the files in the archive. The connector uses the file extensions of the files in the archive to determine their file types:
  - .csv or .csv.gz: A CSV file.
  - .json or .json.gz: A JSON file.
  - All other file extensions: A plain text file.
- If the S3 object is a compressed file that has Content-Type set to application/zip, the connector treats it as a ZIP file. The connector downloads and decompresses the ZIP archive, and then only reads the first file in the archive. The connector uses the file extension of this first file to determine its type:
  - .csv: A CSV file.
  - .json: A JSON file.
  - All other file extensions: A plain text file.
- If the S3 object has an empty Content-Type, the connector treats it as a plain text file.
- If the S3 object has any other Content-Type, the connector does not ingest data from the file.
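The following Python sketch is a simplified approximation of these detection rules, for illustration only; the helper names are not part of the connector, and it assumes the object's Content-Type, key, and bytes have already been retrieved.

```python
# Approximate sketch of the file-type detection rules described above.
import gzip
import io
import zipfile

MIME_TYPES = {"text/csv": "csv", "application/json": "json", "text/plain": "text"}

def type_from_extension(name):
    if name.endswith((".csv", ".csv.gz")):
        return "csv"
    if name.endswith((".json", ".json.gz")):
        return "json"
    return "text"

def detect_and_decompress(content_type, key, payload):
    """Return (file_type, data) for one S3 object, or None if unsupported."""
    if content_type in MIME_TYPES:
        return MIME_TYPES[content_type], payload
    if content_type == "application/x-gzip":
        # GZIP: decompress, then use the file extension to determine the type.
        return type_from_extension(key), gzip.decompress(payload)
    if content_type == "application/zip":
        # ZIP: only the first file in the archive is read.
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            first = archive.namelist()[0]
            return type_from_extension(first), archive.read(first)
    if not content_type:
        return "text", payload  # empty Content-Type: treated as plain text
    return None  # any other Content-Type: not ingested
```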
Performance of the Amazon S3 Connector
The data ingestion rate of each worker on a scheduled job is influenced by many factors. Some of these factors are:
- The average file size
- File compression type and compression rates
- The number of new files added to the S3 bucket between each job execution
- The file formats
- Event sizes
- Download speed and bandwidth
Generally, ingesting a small number of large files is faster than ingesting a large number of small files, ingesting compressed files is faster than ingesting uncompressed files, and ingesting CSV files is slower than ingesting plain text or JSON files.
The following tables show the average ingestion rates per worker for some common scenarios.
The average ingestion rates are just examples and don't take into account external influences on your ingestion rate, such as download speeds, event sizes, and so on.
File count and file size:

File type | Number of files | File size before compression | Compression | Average ingestion rate per worker
---|---|---|---|---
txt | 10,000 | 1 MiB | none | 5 MiB/s
txt | 200 | 50 MiB | none | 30 MiB/s

File compression:

File type | Number of files | File size before compression | Compression | Average ingestion rate per worker
---|---|---|---|---
txt | 200 | 50 MiB | none | 30 MiB/s
txt | 200 | 50 MiB | gz | 70 MiB/s

File format:

File type | Number of files | File size before compression | Compression | Average ingestion rate per worker
---|---|---|---|---
txt | 200 | 50 MiB | gz | 70 MiB/s
json | 200 | 50 MiB | gz | 70 MiB/s
csv | 200 | 50 MiB | gz | 40 MiB/s
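As a rough back-of-the-envelope example using the figures above: in the uncompressed txt scenario (200 files of 50 MiB each, about 10,000 MiB in total), a single worker ingesting at 30 MiB/s would need roughly 10,000 / 30 ≈ 333 seconds, or about 5.5 minutes, per job run. Actual times depend on the external factors listed earlier, such as download speed and event sizes.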
Limitations of the Amazon S3 Connector
The Amazon S3 Connector has the following limitations:
- If a ZIP file in the Amazon S3 bucket contains more than one file, the connector only reads the first file in the archive.
- The connector uses the file extension and the Content-Type header attribute to determine whether a file is supported. To avoid having files that the connector can't read, don't customize the file extensions when uploading files to Amazon S3, and don't customize the Content-Type header attribute in text/csv, text/json, or text/plain formatted files.
- Parquet files are not auto-detected. You must explicitly set the File Type parameter to Parquet if your S3 bucket contains Parquet files.
Create a connection using the Amazon S3 Connector
Create a connection so that the Amazon S3 Connector can access data from Amazon S3 buckets and send the data into a DSP pipeline.
If you are editing a connection that's being used by an active pipeline, you must reactivate that pipeline after making your changes.
Prerequisites
Before you can use the Amazon S3 Connector, you must have an AWS account, and you must set up notifications to be sent to SQS whenever new events are written to the Amazon S3 bucket.
Make sure to do the following:
- If SQS notifications aren't set up, ask your AWS administrator to configure Amazon S3 to notify SQS whenever new events are written to the Amazon S3 bucket, or couple the Amazon Simple Notification Service (SNS) to SQS to send the notifications. For more information, see the following:
- Configure SQS in the Splunk Add-on for AWS manual
- Configure SNS in the Splunk Add-on for AWS manual
- "Getting Started with Amazon SQS" in the AWS documentation
- If you don't have an AWS account, ask your AWS administrator to create an account and provide the access key ID and secret access key.
Make sure your AWS account has at least read and write permissions for the queue, as well as read permissions for the related Amazon S3 bucket. The following example IAM policy includes these permissions:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "sqs:GetQueueUrl", "sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes", "sqs:ListQueues", "s3:GetObject" ], "Resource": "*" } ] }
If any of the files being added to the Amazon S3 bucket are encrypted with a custom AWS KMS customer master key (CMK), make sure that your AWS account also has the kms:Decrypt permission. This permission allows the connector to decrypt and download the files.
You do not need any additional permissions for files that are encrypted with AES-256 or the default AWS KMS CMK (aws/s3).
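If you want to see what the S3-to-SQS wiring looks like programmatically, the following boto3 sketch enables ObjectCreated notifications on a bucket. The bucket name and queue ARN are placeholders, your AWS administrator may prefer to configure this in the console or through SNS, and the queue's access policy must also allow S3 to send messages to it.

```python
# Illustrative only: send S3 ObjectCreated notifications to an SQS queue.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-dsp-source-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                # Placeholder queue ARN; the queue policy must allow
                # s3.amazonaws.com to call sqs:SendMessage on it.
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-s3-events",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```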
Steps
- From the Data Management page, click the Connections tab.
- Click Create New Connection.
- Select Amazon S3 Connector and then click Next.
- Complete the following fields:
  - Connection Name: A unique name for your connection.
  - Access Key ID: Your AWS access key ID.
  - Secret Access Key: Your AWS secret access key.
  - SQS Queue URL: The full URL for the AWS SQS queue.
  - File Type: The type of files that the connector collects data from. Select one of the following options:
    - Auto (default): The connector automatically detects the file type based on a set of rules, and handles the data from each file based on its detected type. For more information, see Behavior of the Amazon S3 Connector in this topic.
    - CSV: The connector collects data from CSV files, and ingests each line as an event. The first line of the file must be the header.
    - Plain Text: The connector collects data from plain text files, and ingests each line as an event.
    - JSON: The connector collects data from JSON files, and ingests each JSON file as an event by default. You can configure the connector to ingest each element from a specific array field in the JSON file as an event. See the Field parameter for more information.
    - Parquet: The connector collects data from Parquet files, and ingests each row as an event. The connector supports row group sizes up to 265 MB, and supports uncompressed files as well as files that are compressed with the Snappy, GZIP, or zstd compression algorithms. Parquet files can't be auto-detected, so you must explicitly set the File Type parameter to Parquet if your S3 bucket contains Parquet files.
  - Field: (Optional) The name of an array field in the JSON files that the connector is collecting data from. The connector ingests each element from this array as an event. If the JSON files are generated by AWS, set this parameter to Records. Otherwise, type the name of a first-level array field. If you leave this field empty, the connector ingests each JSON file as an event. If you type an invalid value, such as the name of an object or a second-level field, the connector does not ingest any data.
  - Scheduled: This parameter is on by default, indicating that jobs run automatically. Toggle this parameter off to stop the scheduled job from running automatically. Jobs that are currently running are not affected.
  - Schedule: The time-based job schedule that determines when the connector runs data collection jobs. Select a predefined value or write a custom CRON schedule. All CRON schedules are based on UTC.
  - Workers: The number of workers you want to use to collect data.

  If your data fails to get into DSP, check these fields again to make sure you have the correct connection name, AWS access key ID, and AWS secret access key for your Amazon S3 connection. DSP doesn't check whether the credentials you enter are valid.
- Click Save.
You can now use your connection in a data pipeline.