Set up data ingest and retention rules for data lake indexes

This topic covers the Set up data lake indexes step of the workflow for creating an Amazon Security Lake federated provider. You cannot follow this step until you complete the steps that precede it in the federated provider setup workflow. See the checklist of tasks to set up Federated Analytics.

When you create your Amazon Security Lake federated provider, Federated Analytics creates data lake indexes on your Splunk Cloud Platform deployment and begins ingesting data from your Amazon Security Lake account into those indexes.

Data lake indexes are standard Splunk event indexes that you can search with Splunk search processing language (SPL) and the full range of related search tools. You can manage and set up permissions for data lake indexes, just like other indexes into which you currently ingest data.

Federated Analytics starts you off with a data lake index for each major Open Cybersecurity Schema Framework (OCSF) category. In this step, for each data lake index Federated Analytics generates, you can do the following things:

Rename the data lake index.
Fine-tune the dataset that the data lake index ingests by adjusting its event class filters so each index is ingesting precisely the event data that you want to search later.
Determine how long the data lake index holds onto data by adjusting its retention period.
Remove the data lake index if you do not want to ingest Amazon Security Lake data for its OCSF category.

The Set up data lake indexes step of the workflow for creating an Amazon Security Lake federated provider triggers an indexer restart. Because you must create the Amazon Security Lake provider and its indexes in one session, schedule provider setup outside of peak business hours, when system load is relatively low.

Prerequisites

You must have already created a new subscriber for data ingestion access in your Amazon Security Lake account, and you must have added its AWS Role ARN and Subscription Endpoint fields to the federated provider definition. See Create the Amazon Security Lake subscriber for data ingestion.

Steps

On your Splunk Cloud Platform deployment, in Splunk Web, at the Set up data lake indexes step of the Add a new federated provider workflow, you will find a list of six data lake indexes, one for each OCSF category. Using the following table, optionally adjust the settings for your data lake indexes.

Setting	Description	Default Value
Data lake index name	Optionally change the name of the data lake index. Data lake index names have the following restrictions: They can contain only letters, numbers, underscores, and hyphens. They must begin with a letter or number. They cannot be more than 2,048 characters in length. Make sure that each data lake index has a name that is unique among the names of all of your existing indexes.	Each index has a default name corresponding to its OCSF Category. By default, data lake index names are prefixed with dl_ and appended with _index. For example, the default name for a data lake index that ingests data belonging to the Discovery OCSF category is dl_discovery_index. Splunk software creates an additional dl_main index for ingested Amazon Security Lake datasets that have unrecognized OCSF categories, or that do not follow the AWS source version 2 data schema (OCSF 1.1.0).
OCSF category	Identifies the Open Cybersecurity Schema Framework (OCSF) category of the data lake index. The OCSF category is a filter ensuring that only data matching the selected category is ingested into the data lake index. There are six possible OCSF category values: Application Activity, Discovery, Findings, Identity & Access, Network Activity, and System Activity. The OCSF category field is editable only when one or more of the default six data lake indexes have been removed. You cannot give the same OCSF category to two or more data lake indexes.	By default, each data lake index is assigned one of the six OCSF category values.
Event class	Each OCSF category has a different set of event classes that are specific to it. All Amazon Security Lake events apply to one of these event classes. Select the event classes that you want to ingest into each data lake index. Deselect the event classes that you want to filter out of the ingest pipeline for each data lake index. You must select at least one event class for each data lake index. As a best practice, do not leave all of the event classes selected. Reduce the cost of your data ingestion and improve the performance of your future data lake index searches by ensuring that you are indexing only the event classes that you know you must be able to search.	All event classes are selected for each data lake index.
Retention period	Determine the number of days that ingested Amazon Security Lake data is retained in each data lake index, up to a maximum of 31 days. Splunk software removes data that exceeds the retention period from the data lake index.	30 days
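
The naming restrictions in the table can be checked before you type a new name into Splunk Web. The following Python sketch is illustrative only and is not part of Federated Analytics. The function name is_valid_index_name is hypothetical, and the sketch assumes that "letters" means ASCII letters of either case, which the table does not state. If your deployment accepts only lowercase index names, tighten the pattern accordingly.

```python
import re

# Rules from the table: only letters, numbers, underscores, and hyphens;
# the first character is a letter or number; at most 2,048 characters.
# Assumption: "letters" means ASCII letters of either case.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")
MAX_NAME_LENGTH = 2048


def is_valid_index_name(name: str, existing_names: set[str]) -> bool:
    """Return True if name follows the naming rules and is not already in use."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    if _NAME_PATTERN.fullmatch(name) is None:
        return False
    # Each data lake index name must be unique among all existing index names.
    return name not in existing_names
```

For example, is_valid_index_name("dl_discovery_index", set()) returns True, while a name that begins with an underscore or that duplicates an existing index name returns False.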

(Optional) Select the trash can icon to remove a data lake index that you do not need. When you remove an index, the OCSF category field becomes editable for other data lake indexes. You cannot give the same OCSF category to two or more data lake indexes.
(Optional) When there are fewer than six data lake indexes on the Set up data lake indexes step, you can select Add another index to replace a missing data lake index.
When you are satisfied with your set of data lake index definitions, select Add & Continue to move on to the Set up federated indexes step. See Map federated indexes to AWS Glue tables.
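
Before you select Add & Continue, it can help to review your planned index definitions against the constraints described in this step. The following Python sketch is one way to do that offline. It is not a Splunk API: the IndexDefinition class, the find_problems function, and the event class name used in the example are all hypothetical. The retention ceiling is passed in by the caller rather than hard-coded, and the lower bound of one day is an assumption.

```python
from dataclasses import dataclass

# The six OCSF category values listed in the settings table.
OCSF_CATEGORIES = {
    "Application Activity", "Discovery", "Findings",
    "Identity & Access", "Network Activity", "System Activity",
}


@dataclass(frozen=True)
class IndexDefinition:
    """Hypothetical record of the settings for one data lake index."""
    name: str
    category: str
    event_classes: tuple[str, ...]
    retention_days: int


def find_problems(definitions, max_retention_days):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen_categories = set()
    for d in definitions:
        if d.category not in OCSF_CATEGORIES:
            problems.append(f"{d.name}: unknown OCSF category {d.category!r}")
        elif d.category in seen_categories:
            # Two indexes cannot share the same OCSF category.
            problems.append(f"{d.name}: OCSF category {d.category!r} is already assigned")
        seen_categories.add(d.category)
        if not d.event_classes:
            # At least one event class must be selected for each index.
            problems.append(f"{d.name}: select at least one event class")
        # Assumption: the retention period is at least 1 day.
        if not 1 <= d.retention_days <= max_retention_days:
            problems.append(f"{d.name}: retention must be 1 to {max_retention_days} days")
    return problems
```

For example, passing a single definition such as IndexDefinition("dl_discovery_index", "Discovery", ("example_event_class",), 30) together with the maximum retention period from the settings table returns an empty list.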

After you select Add & Continue, Splunk software initiates an indexer restart to complete data lake index setup.

For more information about the AWS source version 1 and 2 data schemas, see Security Lake queries in the Amazon Security Lake User Guide.

Full index-time field extraction for improved performance of data lake index searches

To help your data lake index searches complete faster on average than searches of ordinary Splunk indexes, Federated Analytics features full index-time field extraction on the Amazon Security Lake datasets that it ingests. How does full index-time field extraction make your data lake index searches more efficient?

Searches that involve fields complete faster when those fields are indexed. Indexed fields are extracted at index time, before you run the search, which means that your search head does not need to do the extra work of extracting the field at search time, while the search is running. See Use fields effectively, in the Search Manual.

Ordinarily, when Splunk software indexes data, it extracts only a few fields by default, most notably the host, source, and sourcetype fields. It does this because Splunk indexes typically ingest data that follows a wide variety of schemas. Reducing the number of fields that need to be extracted from that data is therefore a strategy to improve indexing performance.

To improve data lake index search efficiency, Federated Analytics performs full index-time field extraction on the Amazon Security Lake data that it ingests into its data lake indexes. In other words, each field in every event indexed into a data lake index becomes an indexed field. Federated Analytics can do this because all Amazon Security Lake data conforms to a common OCSF data schema, and all of the fields in that schema are known in advance.

Because Federated Analytics performs full index-time field extraction of all Amazon Security Lake fields, you might find that data model acceleration and similar search acceleration methods do not deliver expected search performance gains when you apply them to data lake index datasets.

Another benefit of applying complete index-time field extraction to the data you ingest into data lake indexes is that it allows you to run tstats searches on the contents of your data lake indexes. See the tstats command reference topic in the Search Reference.
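
As a rough illustration of the kind of tstats search this enables, the following Python sketch assembles a simple tstats count search string for a data lake index. The helper name build_tstats_count is hypothetical. The example groups by sourcetype because that field is indexed by default. Grouping by a field from the OCSF schema should work the same way, but the exact indexed field names in your deployment are an assumption that you need to verify.

```python
def build_tstats_count(index_name: str, group_by_field: str) -> str:
    """Build an SPL tstats search that counts events in an index by one indexed field."""
    # tstats operates on indexed fields, so group_by_field must be an indexed field.
    return f"| tstats count where index={index_name} by {group_by_field}"


# Example: count events in the default Discovery data lake index by sourcetype.
search = build_tstats_count("dl_discovery_index", "sourcetype")
print(search)
```

You can paste the resulting string into the Search bar in Splunk Web, adjusting the index name if you renamed the data lake index.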

For more information about searching with Federated Analytics, see Run Federated Analytics searches.
