Map an Amazon S3 federated index to a Splunk-managed AWS Glue table dataset

Follow this topic to create a federated index that maps to a Splunk-managed AWS Glue table dataset. If you want to define a federated index that maps to a customer-managed AWS Glue table dataset, see Map a federated index to a customer-created AWS Glue table dataset.

After you create an Amazon S3 federated provider for your Splunk Cloud Platform deployment, you create federated indexes for use in federated searches. Each federated index you create maps to a specific AWS Glue table, which in turn references an Amazon S3 dataset. You invoke federated indexes in your federated searches to tell Splunk software which Amazon S3 dataset you intend to search.

The Splunk platform creates federated indexes on the search head of your Splunk Cloud Platform deployment.

You can create federated indexes for two kinds of AWS Glue tables: customer-created AWS Glue tables and Splunk-managed AWS Glue tables. This task guides you through the process of creating a federated index that maps to an AWS Glue table, which Splunk software creates and manages behind the scenes.

In this task, you do these things:

Provide the name of the federated index.
Select an Amazon S3 federated provider that is configured with Amazon S3 locations that point to AWS CloudTrail datasets.
Select the AWS Glue table (Splunk managed) dataset type.
Supply an Amazon S3 location path that points to the AWS CloudTrail dataset to which this federated index will be mapped.
Provide the maximum relative time range for searches of the dataset.
List the AWS Account ID and AWS Region values that can be used as partition keys for your searches of the AWS CloudTrail dataset.

You can map a federated index to only one AWS CloudTrail dataset at a time. If a federated provider has Amazon S3 locations for several AWS CloudTrail datasets over which you want to run federated searches, define a separate federated index for each AWS CloudTrail dataset.

Prerequisites

A role on your Splunk Cloud Platform deployment that has the admin_all_objects capability.
Datasets in your Amazon S3 buckets that are composed entirely of AWS CloudTrail data.
An Amazon S3 federated provider that is set up for the creation of Splunk-managed AWS Glue tables. See Define an Amazon S3 federated provider.

Steps

On your Splunk Cloud Platform deployment, in Splunk Web, select Settings, then Federated search.
On the Federated index tab, select Add federated index.

You might also come to the Add federated index screen directly after creating a federated provider.

Using the following table, specify the settings for your federated index.

Setting	Description	Default Value
Federated index name	Enter a unique name for the federated index. Federated index names have the following restrictions: They can contain only letters, numbers, underscores, and hyphens. They must begin with a letter or number. They cannot be more than 2,048 characters in length. They cannot be named kvstore. You can use this string in a longer name, like abc_kvstore.	No default
Federated provider	Select an Amazon S3 federated provider.	No default
Remote dataset - Dataset type	Select AWS Glue table (Splunk managed).	AWS Glue table (customer created)
Amazon S3 location	Provide the Amazon S3 location path for the AWS CloudTrail dataset that you will search with this federated index. Splunk software will create an AWS Glue table which represents this dataset, and the federated index will map to that AWS Glue table. This Amazon S3 location path must be included in the list of Amazon S3 locations that are defined for the federated provider associated with the federated index. Provide the Amazon S3 location path up to but not including the 12-digit AWS Account ID folder part of the path. For more information about filling out this field, see Get the Amazon S3 location path for an Amazon CloudTrail dataset.	No default
Time settings	The time settings define the time field for the dataset to which the federated index maps. Because AWS CloudTrail datasets have a stable schema, the time settings have default values that you cannot change.	The default event Time field is eventtime.
Time partitions	The time partition settings determine the fields by which the dataset to which the federated index maps is partitioned by time. Because AWS CloudTrail datasets have a stable schema, we can provide one default partition key that represents the three partition keys supported by CloudTrail datasets: year, month, and day.	The default Time partition field is pk_timestamp.
Max search time range	Specify the maximum relative time range within which searches of the AWS CloudTrail dataset return results. Max search time range applies to the time partitions in your data. For example, if you run a search over the last 3 years of time partitions and Max search time range is set to 1 year, your search returns results only for data in the partitions for the last year. Federated searches with time ranges of 2 years or more might suffer from reduced search performance. If you occasionally need to run searches over data that is older than the Max search time range, consider setting up additional federated indexes with larger Max search time range values. For example, you might run most of your searches over a federated index with a Max search time range of 1 year. For the rare searches over data that is between 1 and 2 years old, you can set up a second federated index with a Max search time range of 2 years.	1 year
AWS Account IDs	Provide the 12-digit AWS account IDs by which the AWS CloudTrail dataset to which this federated index maps is partitioned. You must provide at least 1 AWS account ID. Alternatively, you can provide a wildcard symbol (*) to partition the dataset by all available AWS account IDs. If you provide a wildcard for AWS Account IDs, when you invoke this federated index in an `sdselect` search, you must add a WHERE clause that uses a `pk_account_id` field strictly in an equality condition to identify the AWS account ID partitions involved in the search. For example, `WHERE pk_account_id = "123456789012"` is supported, but `WHERE pk_account_id != "123456789012"` is not supported. For more information about obtaining AWS account IDs by which AWS CloudTrail datasets are partitioned, see Identify partitions to optimize searches of AWS CloudTrail datasets.	No default
AWS Regions	Provide the AWS regions by which the AWS CloudTrail dataset to which this federated index maps is partitioned. You must provide at least 1 AWS region. Alternatively, you can provide a wildcard symbol (*) to partition the dataset by all available AWS regions. If you provide a wildcard for AWS Regions, when you invoke this federated index in an `sdselect` search, you must add a WHERE clause that uses a `pk_region` field strictly in an equality condition to identify the AWS region partitions involved in the search. For example, `WHERE pk_region = "us_east_1"` is supported, but `WHERE pk_region != "us_east_1"` is not supported. For more information about obtaining AWS regions by which AWS CloudTrail datasets are partitioned, see Identify partitions to optimize searches of AWS CloudTrail datasets.	No default

Select Save to save the federated index configuration.
(Optional) Give your users access to the federated index. To run searches over the remote dataset to which the federated index maps, your users must have access permissions for the federated index. See Give your users role-based access control of federated indexes.
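The naming restrictions for the Federated index name setting can be checked before you enter a name in Splunk Web. The following is a minimal sketch, not part of the Splunk platform. The function name `is_valid_federated_index_name` is hypothetical, and the sketch assumes that "letters" means ASCII letters and that the `kvstore` restriction applies to that exact lowercase string.

```python
import re

# Restrictions taken from the settings table:
# - only letters, numbers, underscores, and hyphens
# - must begin with a letter or number
# - no more than 2,048 characters
# - cannot be named exactly "kvstore"
# Assumption: "letters" is interpreted as ASCII letters only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")
MAX_NAME_LENGTH = 2048


def is_valid_federated_index_name(name: str) -> bool:
    """Return True if the name satisfies the documented restrictions."""
    if name == "kvstore":
        return False
    if len(name) > MAX_NAME_LENGTH:
        return False
    # fullmatch requires the whole string to fit the pattern.
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["cloudtrail_prod", "abc_kvstore", "kvstore", "_starts_with_underscore"]:
        print(candidate, is_valid_federated_index_name(candidate))
```

Splunk Web performs its own validation when you save the federated index, so treat this check only as a convenience for scripts that generate candidate names.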

Get the Amazon S3 location path for an Amazon CloudTrail dataset

When you set up a federated index that maps to a Splunk-managed AWS Glue table, you must provide an Amazon S3 location path that defines the AWS CloudTrail dataset that you want to search. Splunk software creates an AWS Glue table that represents this dataset, and it is to this AWS Glue table that the federated index you are defining is mapped.

To get the Amazon S3 location, go to the Amazon S3 console and inspect the bucket for the AWS CloudTrail dataset. The bucket contains the full Amazon S3 location path for the dataset.

Splunk software needs only the first few folders of the Amazon S3 location path, up to but not including the folder with the 12-digit AWS account ID. In other words, when the AWS CloudTrail dataset is associated with 1 AWS account ID, its Amazon S3 location value follows this syntax:

s3://<bucket-name>/<additional-prefix-folders>/AWSLogs/

The <additional-prefix-folders> might not be present in the location path. This can be one or more additional folders that people optionally set up when they place objects such as AWS CloudTrail log files in Amazon S3 buckets, to differentiate datasets that are being stored in the same Amazon S3 bucket.

An AWS CloudTrail dataset can be associated with multiple AWS account IDs. When there are multiple AWS account IDs associated with an AWS CloudTrail dataset, its Amazon S3 location path includes an organization ID that is generated by AWS. Here is the Amazon S3 location syntax for such AWS CloudTrail datasets:

s3://<bucket-name>/<additional-prefix-folders>/o-<organization-ID>/AWSLogs/

For detailed information about using the Amazon S3 console to review and manage Amazon S3 bucket contents, see Working with objects in Amazon S3 in the Amazon Simple Storage Service (S3) User Guide.
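If you copy the full path of a log file from the Amazon S3 console, you can trim it to the location path described in this section. The following is a minimal sketch under the assumption that the path contains an `AWSLogs` folder as shown in the syntax examples. The function name `cloudtrail_location_path` and the bucket and prefix names in the usage lines are hypothetical.

```python
def cloudtrail_location_path(full_object_path: str) -> str:
    """Trim a full CloudTrail object path to the part Splunk software needs.

    Keeps everything up to and including the AWSLogs folder, which is
    the last folder before the 12-digit AWS account ID folder.
    """
    marker = "/AWSLogs/"
    index = full_object_path.find(marker)
    if index == -1:
        raise ValueError("path does not contain an AWSLogs folder")
    return full_object_path[: index + len(marker)]


if __name__ == "__main__":
    # Hypothetical bucket and prefix names, for illustration only.
    full_path = (
        "s3://example-bucket/audit/AWSLogs/123456789012/"
        "CloudTrail/us-east-1/2024/01/15/example-log.json.gz"
    )
    print(cloudtrail_location_path(full_path))
```

The sketch only cuts the string at the first `AWSLogs` folder. It does not confirm that the bucket exists or that the result is among the Amazon S3 locations defined for the federated provider.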

Identify partitions to optimize searches of AWS CloudTrail datasets

Partitioning is an organization strategy for large datasets that makes it possible for you to search them efficiently. When you partition your data, you organize it into a hierarchical directory structure based on the distinct values of 1 or more fields in the data. Log files in AWS CloudTrail datasets are partitioned by time, meaning they are organized into folders by year, month, and day. This means all of the files associated with a specific date can easily be searched for.

Because AWS CloudTrail datasets have a stable schema, definitions for federated indexes that map to Splunk-managed AWS Glue tables come with default partition time field values that you cannot change.

However, all AWS CloudTrail datasets are also partitioned by two other fields (or "keys"): AWS account ID and AWS region. Splunk software cannot predict the values for these partition keys, so it is up to you to supply them. When you identify the partition key values in the federated index definition, you help ensure that sdselect searches of the AWS CloudTrail dataset to which the federated index maps are efficient and cost-effective.

Get partition key values for an AWS CloudTrail dataset

All AWS CloudTrail datasets are partitioned by at least 1 AWS account ID and 1 AWS region. This means that when you set up a federated index that maps to a Splunk-managed AWS Glue table, you must provide at least 1 value for the AWS account IDs and AWS regions partition key fields. Splunk software cannot know in advance which AWS account IDs and AWS regions a specific AWS CloudTrail dataset is partitioned by, so these fields do not have default values.

To get values for the AWS account IDs and AWS regions fields, go to the Amazon S3 console and inspect the full Amazon S3 location path for the dataset. The <AWS-account-ID> and <AWS-region> folders in the following AWS CloudTrail location path syntax example show you where these values can be found:

s3://<bucket-name>/<additional-prefix-folders>/AWSLogs/<AWS-account-ID>/CloudTrail/<AWS-region>/<year>/<month>/<day>/<filename>

For example, in the Amazon S3 console, when you open the AWSLogs folder for an AWS CloudTrail dataset, you'll see the AWS account IDs the dataset is associated with. Similarly, when you open the CloudTrail folder for an AWS CloudTrail dataset, you'll see the AWS regions the dataset is associated with.
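If you have a listing of object paths for the dataset, you can collect the distinct AWS account ID and AWS region values from the folder positions shown in the syntax example. The following is a minimal sketch that assumes the paths follow that layout exactly. The function name `partition_keys` and the sample paths are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches the <AWS-account-ID> and <AWS-region> folders in the layout
# .../AWSLogs/<AWS-account-ID>/CloudTrail/<AWS-region>/<year>/...
PARTITION_PATTERN = re.compile(
    r"/AWSLogs/(?P<account_id>\d{12})/CloudTrail/(?P<region>[^/]+)/"
)


def partition_keys(object_paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return sorted, de-duplicated account IDs and regions found in the paths."""
    account_ids = set()
    regions = set()
    for path in object_paths:
        match = PARTITION_PATTERN.search(path)
        if match is None:
            continue  # Skip paths that do not follow the expected layout.
        account_ids.add(match.group("account_id"))
        regions.add(match.group("region"))
    return sorted(account_ids), sorted(regions)


if __name__ == "__main__":
    # Hypothetical object paths, for illustration only.
    sample_paths = [
        "s3://example-bucket/AWSLogs/123456789012/CloudTrail/us-east-1/2024/01/15/a.json.gz",
        "s3://example-bucket/AWSLogs/123456789012/CloudTrail/eu-west-1/2024/01/15/b.json.gz",
        "s3://example-bucket/AWSLogs/210987654321/CloudTrail/us-east-1/2024/01/16/c.json.gz",
    ]
    ids, regions = partition_keys(sample_paths)
    print("AWS account IDs:", ids)
    print("AWS regions:", regions)
```

The resulting lists correspond to the values you would enter in the AWS Account IDs and AWS Regions fields of the federated index definition.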

Optionally identify all possible partition keys with a wildcard

If an AWS CloudTrail dataset is associated with a large number of AWS account IDs or AWS regions and you do not want to enter every key value into those fields, you can save time by entering wildcard symbols (*) into the fields instead. The wildcard symbol indicates that all possible key values for the field are applied to the federated index definition.

When you use a wildcard symbol for either AWS account IDs or AWS regions in a federated index definition, you must include a WHERE clause that filters results by pk_account_id or pk_region when you invoke that federated index in an sdselect search. See sdselect command WHERE clause operations.
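The wildcard rule can be expressed as a small check that reports which partition key fields a search must filter on. The following is a minimal sketch, not Splunk code. The function names are hypothetical, and the sketch produces only the equality condition text in the form shown in the settings table. It makes no assumption about the rest of the sdselect syntax.

```python
from typing import List, Sequence

WILDCARD = "*"


def required_partition_filters(
    account_ids: Sequence[str], regions: Sequence[str]
) -> List[str]:
    """Return the partition key fields that need an equality filter.

    A field needs a filter when its setting in the federated index
    definition is the wildcard symbol.
    """
    required = []
    if WILDCARD in account_ids:
        required.append("pk_account_id")
    if WILDCARD in regions:
        required.append("pk_region")
    return required


def equality_condition(field: str, value: str) -> str:
    """Format one equality condition, for example pk_account_id = "123456789012"."""
    return f'{field} = "{value}"'


if __name__ == "__main__":
    # Wildcard for account IDs, explicit hypothetical region value.
    print(required_partition_filters(["*"], ["us-east-1"]))
    print(equality_condition("pk_account_id", "123456789012"))
```

Only equality conditions are produced because, as the settings table notes, conditions such as `!=` on these partition key fields are not supported when a wildcard is used.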

Search your AWS Glue table datasets

After you set up federated indexes that map to AWS Glue table datasets, you can use the sdselect command to search those datasets. See sdselect command overview.

Delete a federated index

You can delete a federated index that maps to an AWS Glue table that you no longer need to search. You can also delete federated indexes when your data scanning entitlements are depleted, to prevent unintentional usage.

Prerequisites

A role on your Splunk Cloud Platform deployment that has the admin_all_objects capability.
A federated index for Federated Search for Amazon S3 that you want to delete.

Steps

On your Splunk Cloud Platform deployment, in Splunk Web, select Settings, then Federated search.
On the Federated index tab, identify a federated index that you want to delete.
Select Delete for the index you want to delete.
