Splunk® Data Stream Processor

Connect to Data Sources and Destinations with DSP

Create a DSP connection to get data from Amazon S3

To get data from Amazon S3 into a data pipeline in Splunk Data Stream Processor (DSP), you must first create a connection using the Amazon S3 connector. In the connection settings, provide your Identity and Access Management (IAM) user credentials so that DSP can access your data, provide the URL of the Amazon Simple Queue Service (SQS) queue that receives the S3 notifications, and schedule a data collection job to specify how frequently DSP retrieves the data. You can then use the connection in the Amazon S3 source function to get data from Amazon S3 into a DSP pipeline.

The Amazon S3 connector can't be used to send pipeline data to S3 buckets. If you want to send data to Amazon S3, you must use the Write Connector for Amazon S3. See Create a DSP connection to send data to Amazon S3 for more information.

Prerequisites

Before you can create the Amazon S3 connection, you must have the following:

  • Notifications set up to be sent to SQS whenever new events are written to your Amazon S3 bucket. See the Setting up SQS notifications section on this page for more information.
  • An IAM user with at least read and write permissions for the SQS queue, as well as read permissions for your Amazon S3 bucket. Permissions for decrypting KMS-encrypted files might also be required. See the Setting up an IAM user section on this page for more information.
  • The access key ID and secret access key for the IAM user.
  • An understanding of the types of files that you plan to get data from. If you plan to get data from Parquet files in your Amazon S3 bucket, you must set the File Type parameter in the connection to Parquet, because Parquet files can't be auto-detected.

Setting up SQS notifications

If SQS notifications aren't set up, ask your Amazon Web Services (AWS) administrator to configure Amazon S3 to notify SQS whenever new events are written to the Amazon S3 bucket, or couple the Amazon Simple Notification Service (SNS) to SQS to send the notifications. For more information, see the following:

  • Configure SQS in the Splunk Add-on for AWS manual
  • Configure SNS in the Splunk Add-on for AWS manual
  • "Getting Started with Amazon SQS" in the AWS documentation
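
As a rough illustration of the first option (Amazon S3 notifying SQS directly), the following Python sketch builds a bucket notification configuration that targets a queue. The function name, the example queue ARN, and the choice of the s3:ObjectCreated:* event type are assumptions made for this example. The resulting dictionary follows the shape of the S3 notification configuration (QueueConfigurations, QueueArn, Events), which your AWS administrator would apply to the bucket with their own tooling. The SQS queue's access policy must also allow Amazon S3 to send messages, which this sketch does not cover.

```python
import json


def build_sqs_notification_config(queue_arn, events=("s3:ObjectCreated:*",)):
    """Build an S3 notification configuration that targets one SQS queue.

    Hypothetical helper for illustration only. It does not call AWS.
    """
    # Basic sanity check: SQS ARNs look like arn:<partition>:sqs:<region>:<account>:<name>
    parts = queue_arn.split(":")
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sqs":
        raise ValueError(f"Not an SQS queue ARN: {queue_arn!r}")
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": list(events),
            }
        ]
    }


# Example with a placeholder account ID and queue name (assumptions).
config = build_sqs_notification_config(
    "arn:aws:sqs:us-east-1:123456789012:example-dsp-queue"
)
print(json.dumps(config, indent=2))
```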

Setting up an IAM user

If you don't have an IAM user, ask your AWS administrator to create it and provide the associated access key ID and secret access key.

Make sure your IAM user has at least read and write permissions for the queue, as well as read permissions for the related Amazon S3 bucket. See the following list of permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:GetQueueUrl",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes",
        "sqs:ListQueues",
        "s3:GetObject"
      ],
      "Resource": "*"
    }
  ]
}

If any of the files being added to the Amazon S3 bucket are encrypted using the SSE-KMS algorithm with a custom Customer Master Key (CMK), make sure that your IAM user also has the kms:Decrypt permission. This permission allows the connector to decrypt and download the files.

You do not need any additional permissions for files that are encrypted with AES-256 or the default AWS KMS CMK (aws/s3).
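
The permission rules above can be sketched as a small Python helper that assembles the policy document. This is a minimal sketch, not an official tool: the function name and the uses_custom_kms_key flag are hypothetical, and the policy keeps the broad "Resource": "*" from the example above, which your AWS administrator might want to narrow to specific queue, bucket, and key ARNs.

```python
import json

# Actions taken from the permissions list shown above.
BASE_ACTIONS = [
    "sqs:GetQueueUrl",
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage",
    "sqs:GetQueueAttributes",
    "sqs:ListQueues",
    "s3:GetObject",
]


def build_connector_policy(uses_custom_kms_key=False):
    """Return an IAM policy document for the IAM user (hypothetical helper).

    Set uses_custom_kms_key=True only when files are encrypted with
    SSE-KMS using a custom CMK; AES-256 and the default aws/s3 key
    need no additional permission.
    """
    actions = list(BASE_ACTIONS)
    if uses_custom_kms_key:
        actions.append("kms:Decrypt")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": "*"}
        ],
    }


print(json.dumps(build_connector_policy(uses_custom_kms_key=True), indent=2))
```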

Steps

  1. From the Data Stream Processor home page, click Data Management and then select the Connections tab.
  2. Click Create New Connection.
  3. Select Amazon S3 and then click Next.
  4. Complete the following fields:
    Connection Name: A unique name for your connection.
    Access Key ID: The access key ID for your IAM user.
    Secret Access Key: The secret access key for your IAM user.
    SQS Queue URL: The full URL for the AWS SQS queue.
    File Type: The type of files that the connector collects data from. Select one of the following options:
    • Auto (default): The connector automatically detects the file type based on a set of rules, and handles the data from each file based on its detected type. For more information, see How Amazon S3 data is collected.
      Parquet files can't be auto-detected, so you must explicitly set this File Type parameter to Parquet if your S3 bucket contains Parquet files.
    • CSV: The connector collects data from CSV files, and ingests each line as an event. The first line of the file must be the header.
    • Plain Text: The connector collects data from plain text files, and ingests each line as an event.
    • JSON: The connector collects data from JSON files, and ingests each JSON file as an event by default. You can configure the connector to ingest each element from a specific array field in the JSON file as an event. See the Field parameter for more information.
    • Parquet: The connector collects data from Parquet files, and ingests each row as an event. The connector supports row group sizes up to 256 MB, and supports uncompressed files as well as files that are compressed with the Snappy, GZIP, or zstd compression algorithms.
    Field: (Optional) The name of an array field in the JSON files that the connector is collecting data from. The connector ingests each element from this array as an event. If the JSON files are generated by AWS, set this parameter to Records. Otherwise, type the name of a first-level array field. If you leave this field empty, the connector ingests each JSON file as an event. If you type an invalid value, such as the name of an object or a second-level field, the connector does not ingest any data.
    Scheduled: This parameter is on by default, indicating that jobs run automatically. Toggle this parameter off to stop the scheduled job from automatically running. Jobs that are currently running are not affected.
    Schedule: The time-based job schedule that determines when the connector executes jobs for collecting data. Select a predefined value or write a custom CRON schedule. All CRON schedules are based on UTC.
    Workers: The number of workers you want to use to collect data.

    Any credentials that you upload are transmitted securely by HTTPS, encrypted, and securely stored in a secrets manager.

  5. Click Save.

    If you're editing a connection that's being used by an active pipeline, you must reactivate that pipeline after making your changes.
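
To make the File Type and Field settings from step 4 more concrete, the following Python sketch mimics how file contents might be split into events for the CSV, Plain Text, and JSON options. It is an approximation based only on the descriptions above, not the connector's actual implementation. The function name and the file_type strings are hypothetical, representing each CSV row as a dictionary keyed by the header is a choice made for this sketch, and Parquet is omitted because reading it requires a third-party library.

```python
import csv
import io
import json


def split_into_events(file_type, content, field=""):
    """Split file content into a list of events (illustrative only)."""
    if file_type == "csv":
        # The first line is the header; each following line is one event.
        return list(csv.DictReader(io.StringIO(content)))
    if file_type == "text":
        # Each line is one event.
        return content.splitlines()
    if file_type == "json":
        document = json.loads(content)
        if not field:
            # Empty Field: the whole JSON file is a single event.
            return [document]
        # Field must name a first-level array; anything else yields no data.
        value = document.get(field) if isinstance(document, dict) else None
        return value if isinstance(value, list) else []
    raise ValueError(f"Unsupported file type in this sketch: {file_type!r}")


sample = '{"Records": [{"id": 1}, {"id": 2}], "meta": {"tags": ["a"]}}'
print(len(split_into_events("json", sample, field="Records")))
print(len(split_into_events("json", sample)))
print(len(split_into_events("json", sample, field="meta")))
```

In this sketch, setting field to Records yields one event per array element, leaving it empty yields the whole file as one event, and naming an object such as meta yields no events, which mirrors the behavior described for the Field parameter.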

You can now use your connection in an Amazon S3 source function at the start of your data pipeline to get data from Amazon S3. For instructions on how to build a data pipeline, see the Building a pipeline chapter in the Use the Data Stream Processor manual. For information about the source function, see Get data from Amazon S3 in the Function Reference manual.

Last modified on 06 August, 2021

This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1

