Connecting Amazon S3 to your DSP pipeline as a data source
The Amazon S3 connector is planned for deprecation. See the Release Notes for more information.
When creating a data pipeline in Splunk Data Stream Processor, you can connect to Amazon S3 and use it as a data source. You can get data from Amazon S3 into a pipeline, transform the data as needed, and then send the transformed data out from the pipeline to a destination of your choosing.
To connect to Amazon S3 as a data source, you must complete the following tasks:
- Create a connection that allows DSP to access your Amazon S3 data. See Create a DSP connection to get data from Amazon S3.
- Create a pipeline that starts with the Amazon S3 source function. See the Building a pipeline chapter in the Use the Data Stream Processor manual for instructions on how to build a data pipeline.
- Configure the Amazon S3 source function to use your Amazon S3 connection. See Get data from Amazon S3 in the Function Reference manual.
When you activate the pipeline, the source function starts collecting event data from Amazon S3 in response to notifications received from Amazon Simple Queue Service (SQS). Each event is received into the pipeline as a record.
If your data fails to get into DSP, check the connection settings to make sure you have the correct access key ID and secret access key for your Identity and Access Management (IAM) user, as well as the correct URL for the Amazon SQS queue where event notifications are being transmitted. DSP doesn't run a check to see if you enter valid credentials or specify a valid Amazon SQS queue.
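Because DSP doesn't validate these values for you, you can check them yourself before creating the connection. The following sketch, using the AWS SDK for Python (boto3), confirms that a given access key pair can read attributes of a given SQS queue; the credentials, region, and queue URL shown are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

# Placeholder credentials: use the same IAM access key pair that you plan to
# enter in the DSP connection settings.
session = boto3.Session(
    aws_access_key_id="AKIAEXAMPLE",
    aws_secret_access_key="EXAMPLE-SECRET",
)
sqs = session.client("sqs", region_name="us-east-1")

try:
    # GetQueueAttributes fails if the queue URL is wrong or the credentials
    # don't have permission to read from the queue.
    attrs = sqs.get_queue_attributes(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-events",
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    print("Queue reachable. Messages waiting:",
          attrs["Attributes"]["ApproximateNumberOfMessages"])
except ClientError as err:
    print("Check the credentials or the queue URL:", err)
```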
How Amazon S3 data is collected
The source function collects data according to the job schedule that you specified in the connection settings. See Scheduled data collection jobs for more information, including a list of the limitations that apply to all scheduled data collection jobs.
The Amazon S3 connector relies on event notifications received from SQS to identify new data in the S3 bucket. If an event occurred before the SQS queue was set up, or if notifications about the event are not received through SQS, then the event is not collected into the DSP pipeline.
When new data is added to an Amazon S3 bucket, S3 sends a notification to SQS with information about the new content, such as `bucket_name`, `key`, `size`, and so on. The Amazon S3 connector uses this information to download the new files from the Amazon S3 bucket, read and parse the file content, wrap the results into events, and then send the events into the DSP pipeline using the Collect service.
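Setting up the bucket notifications to SQS happens on the AWS side rather than in DSP. As an illustration only, the following boto3 sketch enables object-created notifications from a hypothetical bucket to a hypothetical queue; the bucket name and queue ARN are placeholders, and the SQS queue policy must already allow Amazon S3 to send messages to the queue.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and queue ARN: both must already exist.
s3.put_bucket_notification_configuration(
    Bucket="my-dsp-source-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-s3-events",
                # Notify SQS whenever a new object is added to the bucket.
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```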
The connector handles data differently depending on the type of file it reads the data from (see the example after this list):
- CSV files: The connector ingests each line as an event. The first line in the file must be the header.
- Plain text files: The connector ingests each line as an event.
- JSON files: By default, the connector ingests each JSON file as an event. However, you can configure the connector to ingest each element from a specific array field in the file as an event.
- Parquet files: The connector ingests each row as an event. The connector supports row group sizes up to 265 MB, and supports uncompressed files as well as files that are compressed with the Snappy, GZIP, or zstd compression algorithms.
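For example, given a small CSV object like the following hypothetical one, the connector would produce one event per line after the required header row. This is a sketch of the idea rather than the connector's exact output.

```python
# Illustration only: a hypothetical CSV object in the S3 bucket.
csv_text = """host,status_code
web-01,200
web-02,500
"""
# The first line is the required header. Under the CSV rule above, each
# remaining line ("web-01,200" and "web-02,500") is ingested as its own event.
```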
For use cases where a mix of file types might be added to the Amazon S3 bucket, the connector can automatically detect the CSV, JSON, and plain text file types and then handle the data accordingly. Parquet files can't be auto-detected, so if the Amazon S3 bucket also contains Parquet files, you must create a new data collection job and explicitly set the File Type parameter in the connection settings to Parquet to ingest the files.
The connector detects CSV, JSON, and plain text file types based on the following rules (a sketch of these rules in code follows the list):
- If the S3 object is an uncompressed file, the `Content-Type` header attribute determines the file type:
  - `text/csv`: A CSV file.
  - `application/json`: A JSON file.
  - `text/plain`: A plain text file.
- If the S3 object is a compressed file that has `Content-Type` set to `application/x-gzip`, the connector treats it as a GZIP file. The connector downloads and decompresses the GZIP archive, and then reads all the files in the archive. The connector uses the file extensions of the files in the archive to determine their file types:
  - .csv or .csv.gz: A CSV file.
  - .json or .json.gz: A JSON file.
  - All other file extensions: A plain text file.
- If the S3 object is a compressed file that has `Content-Type` set to `application/zip`, the connector treats it as a ZIP file. The connector downloads and decompresses the ZIP archive, and then only reads the first file in the archive. The connector uses the file extension of this first file to determine its type:
  - .csv: A CSV file.
  - .json: A JSON file.
  - All other file extensions: A plain text file.
- If the S3 object has an empty `Content-Type`, the connector treats it as a plain text file.
- If the S3 object has any other `Content-Type`, the connector does not ingest data from the file.
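The following Python sketch restates these detection rules for illustration only; it is not the connector's implementation, and the function names are invented.

```python
def type_from_extension(filename: str) -> str:
    """Extension-based typing used for files inside GZIP and ZIP archives."""
    if filename.endswith((".csv", ".csv.gz")):
        return "csv"
    if filename.endswith((".json", ".json.gz")):
        return "json"
    return "plain text"


def classify_s3_object(content_type: str) -> str:
    """Classify an S3 object by its Content-Type header attribute."""
    uncompressed = {
        "text/csv": "csv",
        "application/json": "json",
        "text/plain": "plain text",
    }
    if content_type in uncompressed:
        return uncompressed[content_type]
    if not content_type:
        return "plain text"      # empty Content-Type is treated as plain text
    if content_type == "application/x-gzip":
        return "gzip archive"    # every file inside is typed by type_from_extension()
    if content_type == "application/zip":
        return "zip archive"     # only the first file is read, typed by type_from_extension()
    return "not ingested"        # any other Content-Type is skipped
```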
Performance of the Amazon S3 connector
The data ingestion rate of each worker on a scheduled job is influenced by many factors. Some of these factors are:
- The average file size
- File compression type and compression rates
- The number of new files added to the S3 bucket between each job execution
- The file formats
- Event sizes
- Download speed and bandwidth
Generally, a small number of large files is ingested faster than a large number of small files, compressed files are ingested faster than uncompressed files, and CSV files are ingested more slowly than plain text and JSON files.
The following tables show the average ingestion rates per worker for some common scenarios.
The average ingestion rates are just examples and don't take into account external influences on your ingestion rate, such as download speeds, event sizes, and so on.
| File type | Number of files | File size before compression | Compression | Average ingestion rate per worker |
|---|---|---|---|---|
| txt | 10,000 | 1 MiB | none | 5 MiB/s |
| txt | 200 | 50 MiB | none | 30 MiB/s |

| File type | Number of files | File size before compression | Compression | Average ingestion rate per worker |
|---|---|---|---|---|
| txt | 200 | 50 MiB | none | 30 MiB/s |
| txt | 200 | 50 MiB | gz | 70 MiB/s |

| File type | Number of files | File size before compression | Compression | Average ingestion rate per worker |
|---|---|---|---|---|
| txt | 200 | 50 MiB | gz | 70 MiB/s |
| json | 200 | 50 MiB | gz | 70 MiB/s |
| csv | 200 | 50 MiB | gz | 40 MiB/s |
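As a rough illustration of how to use these figures, the following sketch estimates how long one worker needs for the uncompressed txt scenario in the first table; actual times depend on the factors listed earlier.

```python
# Back-of-the-envelope estimate: 200 uncompressed txt files of 50 MiB each,
# at an average rate of about 30 MiB/s per worker.
files = 200
file_size_mib = 50
rate_mib_per_s = 30
print(f"~{files * file_size_mib / rate_mib_per_s:.0f} seconds per worker")  # ~333 seconds
```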
Limitations of the Amazon S3 connector
The Amazon S3 connector has the following limitations:
- If a ZIP file in the Amazon S3 bucket contains more than one file, the connector only reads the first file in the archive.
- The connector uses the file extension and the `Content-Type` header attribute to determine whether a file is supported. To avoid having files that the connector can't read, don't customize the file extensions when uploading files to Amazon S3, and don't customize the `Content-Type` header attribute for `text/csv`, `text/json`, or `text/plain` formatted files. See the upload sketch after this list for an example.
- Parquet files are not auto-detected. You must explicitly set the File Type parameter to Parquet if your S3 bucket contains Parquet files.
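As an illustration of keeping the defaults described above, the following boto3 sketch uploads a CSV file with its standard .csv extension and text/csv MIME type; the bucket name, key, and file name are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Keep the standard extension and MIME type so the connector can read the file.
s3.upload_file(
    Filename="events.csv",
    Bucket="my-dsp-source-bucket",
    Key="incoming/events.csv",              # keep the .csv extension
    ExtraArgs={"ContentType": "text/csv"},  # standard MIME type for CSV
)
```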