About Federated Search for Amazon S3
Use Federated Search for Amazon S3 to search data in your Amazon S3 buckets directly from your Splunk Cloud Platform deployment, without first ingesting or indexing that data.
With Federated Search for Amazon S3, you run searches that apply filtering and statistical functions to AWS Glue tables that represent the data in your Amazon S3 buckets.
Overview of Amazon S3 federated search
Federated Search for Amazon S3 is ideal for searches over Amazon S3 datasets that conform to specific schema structures or that use partition key filters.
With Federated Search for Amazon S3 you can:
- Investigate historical security data in Amazon S3 buckets, potentially over 2 or more years of logs.
- Perform infrequent statistical analysis over historical data in Amazon S3 buckets, also potentially over 2 or more years of logs.
- Enrich data indexed into your Splunk Cloud Platform deployment through lookups of Amazon S3 data.
- Explore Amazon S3 datasets to locate business-critical data for ingestion into your Splunk Cloud Platform deployment.
- Manage infrequently-accessed Amazon S3 datasets that you store for regulation, compliance, or investigatory reasons.
AWS Glue table requirement
Federated Search for Amazon S3 searches apply filtering and statistical functions to AWS Glue tables that contain column and schema definitions for datasets in your Amazon S3 buckets. This means that an AWS Glue table must be created for each Amazon S3 dataset you intend to search.
Splunk software can create and manage AWS Glue tables for Amazon S3 datasets that follow the AWS CloudTrail schema. If you have CloudTrail datasets, all you need to do is set up your federated provider and federated indexes for them, and Splunk software can create and manage the AWS Glue tables for those datasets behind the scenes. See Define an Amazon S3 federated provider.
You must create AWS Glue tables for Amazon S3 datasets that have other source types, such as VPC Flow Log datasets. See Create an AWS Glue table.
You can have a mix of Splunk-managed and customer-created AWS Glue tables.
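To make the idea of a table definition concrete, the following Python sketch builds the kind of TableInput structure that the AWS Glue CreateTable API accepts, here for a hypothetical Parquet-formatted VPC Flow Log dataset. The table name, bucket path, columns, and partition keys are all assumptions made for illustration. This is only one possible approach. See Create an AWS Glue table for the supported creation methods.

```python
from typing import Dict, List, Tuple


def build_parquet_table_input(
    name: str,
    location: str,
    columns: List[Tuple[str, str]],
    partition_keys: List[Tuple[str, str]],
) -> Dict:
    """Build a TableInput dict for an external, Parquet-backed AWS Glue table."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        # Partition keys are declared separately from the data columns.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }


# Hypothetical dataset: every name and path below is illustrative only.
table_input = build_parquet_table_input(
    name="vpc_flow_logs",
    location="s3://example-bucket/vpc-flow-logs/",
    columns=[("srcaddr", "string"), ("dstaddr", "string"), ("bytes", "bigint")],
    partition_keys=[("year", "string"), ("month", "string")],
)

# With boto3 installed and AWS credentials configured, the dict could be passed as:
#   boto3.client("glue").create_table(DatabaseName="example_db", TableInput=table_input)
```

The sketch only assembles the definition and does not contact AWS, so you can review the resulting structure before creating anything in your account.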
Restrictions
Federated Search for Amazon S3 is available only to Splunk Cloud Platform users with deployments in AWS regions.
Federated Search for Amazon S3 does not support the following kinds of Splunk Cloud Platform deployments:
- Deployments in Google Cloud regions.
- FedRAMP High and DoD IL5 deployments.
Federated Search for Amazon S3 does support FedRAMP Moderate deployments.
To run Federated Search for Amazon S3, your Splunk Cloud Platform deployment must have one or more standalone search heads or a search head cluster. Federated Search for Amazon S3 is not supported on single instance deployments.
Each AWS Glue table that you search must be based on Amazon S3 file objects with the same file type and compression type.
Federated Search for Amazon S3 can search objects only in Amazon S3 buckets that have S3 Standard storage classes. Federated Search for Amazon S3 cannot search objects in Amazon S3 buckets that have alternative storage classes, such as S3 Intelligent-Tiering and S3 Glacier.
Federated Search for Amazon S3 cannot search Amazon S3 buckets that you have configured as Requester Pays buckets. See Amazon S3 bucket configuration.
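As a rough pre-check for the storage class and Requester Pays restrictions, the following Python sketch inspects metadata shaped like the responses of the Amazon S3 ListObjectsV2 and GetBucketRequestPayment operations. The function names and sample data are hypothetical, and the sketch works on plain dictionaries rather than calling AWS. Treating a missing StorageClass value as S3 Standard is an assumption of this sketch.

```python
from typing import Dict, Iterable, List


def unsupported_objects(objects: Iterable[Dict]) -> List[str]:
    """Return the keys of objects whose storage class is not S3 Standard."""
    flagged = []
    for obj in objects:
        # Assumption: a missing StorageClass is treated as STANDARD.
        if obj.get("StorageClass", "STANDARD") != "STANDARD":
            flagged.append(obj["Key"])
    return flagged


def is_requester_pays(request_payment: Dict) -> bool:
    """Interpret a GetBucketRequestPayment-style response dict."""
    return request_payment.get("Payer") == "Requester"


# Illustrative listing, shaped like the "Contents" entries of ListObjectsV2.
listing = [
    {"Key": "logs/2022/06/a.parquet", "StorageClass": "STANDARD"},
    {"Key": "logs/2021/01/b.parquet", "StorageClass": "GLACIER"},
    {"Key": "logs/2021/02/c.parquet", "StorageClass": "INTELLIGENT_TIERING"},
]

print(unsupported_objects(listing))
print(is_requester_pays({"Payer": "BucketOwner"}))
```

Any key that the first function returns, or a True result from the second, points to data that falls under the restrictions described above.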
Supported file types and data formats
Federated Search for Amazon S3 supports the following file types and data formats.
- CSV or CSV-type formats
- Newline-delimited JSON
- Parquet (version 2.5.0 or higher)
- ORC
- Avro
- XML
For more information about the file types supported by AWS Glue tables, see Creating tables using the console in the AWS Glue Developer Guide.
Federated Search for Amazon S3 supports data originating from Splunk Cloud Platform features such as the Edge Processor solution and ingest actions. See Use Edge Processors or Use ingest actions to improve the data input process in Getting Data In.
Federated Search for Amazon S3 does not support data in Dynamic Data Self-Storage (DDSS) format.
For more information about accepted file types and data formats, see Identify the Amazon S3 data that you want to search.
Supported compression types
Federated Search for Amazon S3 supports the following compression types:
- GZIP
- BZIP2
- LZ4
- Snappy, both standard and Hadoop formats
Federated searches of compressed files might take longer to complete than federated searches of uncompressed files.
Supported encryption standards
Federated Search for Amazon S3 supports the following encryption standards.
- Server-side encryption with Amazon S3-Managed Keys (SSE-S3)
- Server-side encryption with the AWS Key Management Service (SSE-KMS)
Federated Search for Amazon S3 supports SSE-S3 without any additional setup requirements. For more information, see Using server-side encryption with Amazon S3-managed encryption keys (SSE-S3) in the Amazon Simple Storage Service User Guide.
Federated Search for Amazon S3 supports only customer-managed SSE-KMS keys. See Using server-side encryption with AWS KMS keys (SSE-KMS) in the Amazon Simple Storage Service User Guide.
Federated Search for Amazon S3 supports SSE-KMS encryption at both the Amazon S3 bucket level and the AWS Glue Data Catalog level.
- For more information about applying SSE-KMS encryption to your Amazon S3 buckets, see Specifying server-side encryption with AWS KMS in the Amazon Simple Storage Service User Guide.
- For more information about applying SSE-KMS encryption to your AWS Glue Data Catalog, see Encrypting your Data Catalog in the AWS Glue User Guide.
SSE-KMS support requires some setup when you define your federated provider. See Define an Amazon S3 federated provider.
For more information about KMS encryption pricing, see AWS Key Management Service Pricing on the AWS website.
Partitioning
Federated Search for Amazon S3 supports partitioned and unpartitioned datasets. Examples of supported partitioning styles include Apache Hive and non-Apache Hive. Apache Hive partitions are made up of key-value pairs, while non-Apache Hive partitions are only the values.
Partitioning style | Format | Example |
---|---|---|
Apache Hive | ./<partition_unit>=<value>/<partition_unit>=<value>/ | ./year=2022/month=06/ |
non-Apache Hive | ./<value>/<value>/ | ./2022/06/ |
For more information about creating partitioned AWS Glue tables, see Creating tables in the AWS Glue Developer Guide.
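The difference between the two partitioning styles can be illustrated with a short Python sketch that turns an object prefix into partition key-value pairs. The function names are hypothetical. For the non-Apache Hive style, the key names are not present in the path, so the sketch assumes the caller supplies them in order.

```python
from typing import Dict, List


def _segments(prefix: str) -> List[str]:
    """Split a prefix such as './2022/06/' into its non-empty path segments."""
    return [p for p in prefix.strip("/").split("/") if p and p != "."]


def parse_hive_partitions(prefix: str) -> Dict[str, str]:
    """Parse Apache Hive style segments of the form <partition_unit>=<value>."""
    result = {}
    for segment in _segments(prefix):
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"not a key=value segment: {segment!r}")
        result[key] = value
    return result


def parse_positional_partitions(prefix: str, keys: List[str]) -> Dict[str, str]:
    """Pair non-Apache Hive path values with caller-supplied key names."""
    values = _segments(prefix)
    if len(values) != len(keys):
        raise ValueError("number of path values does not match number of keys")
    return dict(zip(keys, values))


print(parse_hive_partitions("./year=2022/month=06/"))
print(parse_positional_partitions("./2022/06/", ["year", "month"]))
```

Both calls produce the same mapping of year to 2022 and month to 06. The only difference is where the key names come from.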
Search processing language (SPL) requirements
Use the sdselect command to search AWS Glue tables.
The sdselect command has features similar to those of the tstats and sort commands. The sdselect command supports filtering, statistical analysis, and group-by clauses.
See sdselect command overview.
What you need to get started
To get started with federated search of Amazon S3 data, you must have the following things:
- A Splunk Cloud Platform deployment that has Federated Search for Amazon S3 activated.
- An AWS account with data in Amazon S3 buckets that conforms to supported file and compression types.
- One or more AWS Glue tables that reference the data in those Amazon S3 buckets.
If you are new to AWS Glue and do not have AWS Glue tables, you have the following options:
- If you have CloudTrail datasets in your Amazon S3 buckets, you can arrange to have Splunk software create AWS Glue tables that reference those datasets. See Define an Amazon S3 federated provider to get started.
- You can find a list of different ways to manually create AWS Glue tables for Amazon S3 datasets with other source types in Create an AWS Glue table.
Activate federated search
To activate Federated Search for Amazon S3 for your Splunk Cloud Platform deployment, contact your Splunk Sales representative. As part of this activation, you acquire a data scan entitlement that is based on the amount of remote Amazon S3 data, in terabytes, that you are projected to search over the upcoming year. Data scan entitlements are made up of Data Scan Units (DSUs). Each DSU is equivalent to 10 TB of data scanning capabilities.
If you also use Federated Analytics, you have one pool of DSUs that you share between Federated Search for Amazon S3 and Federated Analytics.
For more information about DSUs, see Splunk Offerings Purchase Capacity and Limitations.
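The DSU arithmetic can be sketched in a few lines of Python. The 10 TB per DSU figure comes from the text above. Rounding up to a whole number of DSUs is an assumption of this sketch, and the function names are hypothetical. Your actual entitlement is determined with your Splunk Sales representative.

```python
import math

TB_PER_DSU = 10  # from the text: each DSU covers 10 TB of data scanning


def dsus_needed(projected_scan_tb: float) -> int:
    """Estimate the DSUs that cover a projected yearly scan volume in TB."""
    if projected_scan_tb < 0:
        raise ValueError("projected scan volume cannot be negative")
    # Assumption: DSUs are counted in whole units, so round up.
    return math.ceil(projected_scan_tb / TB_PER_DSU)


def remaining_tb(dsus: int, scanned_tb: float) -> float:
    """Return how many TB of scanning remain in an entitlement of `dsus`."""
    return dsus * TB_PER_DSU - scanned_tb


print(dsus_needed(125))        # 125 TB / 10 TB per DSU, rounded up
print(remaining_tb(13, 42.5))  # capacity left after scanning 42.5 TB
```

Under these assumptions, a projected 125 TB of scanning corresponds to 13 DSUs, and scanning 42.5 TB against that entitlement leaves 87.5 TB.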
Monitor your data scan entitlement
The Federated Search for Amazon S3 dashboard in the Cloud Monitoring Console shows your total data scan entitlement for the current license term and how much of that entitlement your Federated Search for Amazon S3 searches have used to date. See Use the License Usage dashboards in the Splunk Cloud Platform Admin Manual.
Checklist of tasks to set up Federated Search for Amazon S3
Use this checklist to guide you through the cross-account setup of Federated Search for Amazon S3.
Step | Task | Description | Service |
---|---|---|---|
1 | Turn on token authentication | Token authentication allows for the automatic setup of cross-account permissions between your Splunk Cloud Platform deployment and the Amazon S3 account. If token authentication is turned off, you must turn it on. | Splunk Cloud Platform |
2 | Identify the Amazon S3 data that you want to search | Find the data that you want to search in your Amazon S3 account. If it is not there, create buckets and put your data in them. | Amazon S3 |
3 | Create an AWS Glue table | If you do not already have an AWS Glue table that contains column definitions of the data you want to search in your Amazon S3 bucket, there are a variety of methods you can use to create one. You can skip this step if the datasets you are planning to search in your Amazon S3 buckets are composed entirely of AWS CloudTrail data. Splunk software can create and manage AWS Glue tables for CloudTrail datasets. | |
4 | Define an Amazon S3 federated provider | Help your Splunk Cloud Platform deployment access your AWS data by creating a federated provider definition. This task breaks down into the following subtasks: | |
5 | Map a federated index to an AWS Glue table dataset | Create federated indexes and map them to specific customer-created or Splunk-managed AWS Glue table datasets. Define time fields and partition settings for your federated indexes. | Splunk Cloud Platform |
6 | Give your users role-based access control of federated indexes | Determine which AWS Glue table datasets your users can search. Set up role-based access to federated indexes for your users so they can reference the federated indexes in their federated searches. | Splunk Cloud Platform |
7 | Search your AWS Glue table datasets | Learn how to search your AWS Glue table datasets with the sdselect command. | Splunk Cloud Platform |
Amazon S3 bucket configuration
You cannot use Federated Search for Amazon S3 to search Amazon S3 buckets that are configured to be Requester Pays buckets. If Requester Pays is turned on for your Amazon S3 bucket and you try to run a federated search over that bucket, Splunk Cloud Platform rejects your search.
Searches of Amazon S3 buckets that are configured to be Requester Pays buckets incur data transfer charges in accordance with the Amazon S3 pricing schedule located at Amazon S3 Pricing on the AWS website.
Splunk is not liable for any such data transfer charges incurred.
Compliance and certifications for Federated Search for Amazon S3
Splunk Cloud Platform has attained a number of compliance attestations and certifications from industry-leading auditors as part of Splunk's commitment to adhere to industry standards worldwide and Splunk's efforts to safeguard customer data. Generally Available products and features that are currently in scope of Splunk's compliance program may not be a part of the third-party audit report until the next assessment cycle. Federated Search for Amazon S3 is in scope of the following compliance programs and will be audited at the next assessment cycle.
- SOC 2 Type II: The SOC 2 audit assesses an organization's security, availability, processing integrity, and confidentiality processes to provide assurance about the systems that a company uses to protect customers' data. If you require the SOC 2 Type II attestation to review, contact your Splunk sales representative.
- Health Insurance Portability and Accountability Act (HIPAA): HIPAA is a U.S. federal law that sets forth national standards governing the processing of protected health information (PHI). HIPAA is intended to improve the effectiveness and efficiency of healthcare systems by establishing standards for the use of electronic records in healthcare; establishing standards for accessing, storing, and transmitting PHI; and protecting the privacy and security of PHI. Splunk's HIPAA compliance offering is audited annually by a third party for compliance with HIPAA requirements, resulting in annual third-party attestation reports.
- The Payment Card Industry Data Security Standard (PCI DSS): PCI DSS is a global information security standard created to better control cardholder data and reduce credit card fraud. PCI DSS applies to all entities that store, process, or transmit cardholder data and/or sensitive authentication data. Authorized users can access related documentation in the Customer Trust Portal.
- FedRAMP Authorization at the Moderate Impact Level: This authorization allows for the use of Federated Search for Amazon S3 within Splunk Cloud Platform by U.S. Federal Government agencies requiring cloud-based services authorized at the moderate security impact level. Additional information about FedRAMP is available to Splunk customers under non-disclosure agreement from the Customer Trust Portal.
For additional information about compliance and certifications, see Compliance at Splunk.
This documentation applies to the following versions of Splunk Cloud Platform™: 9.3.2408