Create an AWS Glue table
Federated search for Amazon S3 integrates with AWS Glue, Amazon's data integration service. In a later step, you'll define a federated index that maps to a specific AWS Glue table, which in turn contains a column and schema definition for a particular dataset in your Amazon S3 buckets. The federated searches that you run over this AWS Glue table can efficiently return results on the Amazon S3 dataset that it represents, even if the dataset is large.
Each AWS Glue table must reference an Amazon S3 location that contains a dataset you want to search. This dataset must also be composed of data with a uniform data format and compression type. If you have not yet identified the Amazon S3 locations you want to search, see Identify the Amazon S3 data you want to search.
For the creation of AWS Glue tables, you have two options:
- If you have datasets in Amazon S3 that are composed entirely of AWS CloudTrail data, you can let Splunk software create AWS Glue tables for you.
- If you have datasets in Amazon S3 that are composed of data with other source types, such as VPC Flow Log data or a proprietary dataset, you must create AWS Glue tables for those datasets.
Let Splunk software create AWS Glue tables that represent AWS CloudTrail datasets
Splunk software can automatically create AWS Glue tables for Amazon S3 locations that contain datasets composed entirely of AWS CloudTrail data.
If all of the Amazon S3 locations that you want to search contain data with the AWS CloudTrail source type, skip this topic. Go to Define an Amazon S3 federated provider and create an Amazon S3 federated provider that can accommodate Splunk-created Glue tables which reference AWS CloudTrail datasets.
If some of the Amazon S3 locations that you want to search contain datasets that have other source types, such as VPC Flow Logs, continue reading this topic to learn how to create AWS Glue tables for those locations.
Manually create AWS Glue tables that represent other dataset source types
You must create AWS Glue tables for Amazon S3 locations that contain data with source types other than AWS CloudTrail, such as VPC Flow Log datasets.
Manually creating AWS Glue tables involves the following tasks:
- Optionally encrypt your AWS Glue Data Catalog and its connection passwords.
- Create an AWS Glue Data Catalog database within your AWS Glue Data Catalog.
- Define 1 or more AWS Glue tables that are contained by that AWS Glue database.
The following table defines these object types and provides example object names.
Object | Definition | Example |
---|---|---|
AWS Glue Data Catalog | A persistent technical metadata store in the AWS Cloud. Each AWS account has one AWS Glue Data Catalog per AWS region. Each Data Catalog is a highly scalable collection of AWS Glue tables that are organized into AWS Glue databases. | n/a |
AWS Glue database | A namespace that contains 1 or more AWS Glue tables. You must define an AWS Glue database for your AWS Glue tables if you do not already have one. Each AWS Glue table can belong to only 1 AWS Glue database. | vpc_flow_log_tables_us_east |
AWS Glue table | A metadata definition that describes a single searchable dataset in an Amazon S3 location. You must have at least 1 AWS Glue table to run a federated search over Amazon S3 data. | ksmith_vpc_flow_log_table |
For detailed information about AWS Glue Data Catalogs, AWS Glue databases, and AWS Glue tables, start with AWS Glue Components in the AWS Glue User Guide.
For more information about Amazon S3 locations, see Bucket configuration options in the AWS Glue User Guide.
Keep track of the AWS Glue databases and AWS Glue tables that you define, and of the Amazon S3 locations that your AWS Glue tables describe. You need them when you define your Amazon S3 federated provider in the following step. See Define an Amazon S3 federated provider.
Apply AWS KMS encryption to your AWS Glue Data Catalog
(Optional) For additional security over your sensitive data, you can apply AWS Key Management Service (AWS KMS) encryption to your AWS Glue Data Catalog. When you apply AWS KMS encryption to your AWS Glue Data Catalog, all new AWS Glue databases and AWS Glue tables that you create within that catalog are automatically encrypted.
If you already have AWS Glue databases and AWS Glue tables in your AWS Glue Data Catalog when you apply AWS KMS encryption to it, you must recreate those databases and tables to ensure that they are encrypted.
See Encrypting your Data Catalog in the AWS Glue User Guide.
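If you prefer to script this step rather than use the console, the following minimal sketch shows one way the encryption settings could be assembled for the AWS SDK for Python (boto3). It assumes that boto3 is installed and configured, and the function names and the KMS key alias in the usage note are hypothetical. Treat it as an illustration of the settings involved, not as the procedure this documentation prescribes.

```python
def build_catalog_encryption_settings(kms_key_id, encrypt_passwords=True):
    """Build a DataCatalogEncryptionSettings payload that uses SSE-KMS.

    kms_key_id is a placeholder for your own AWS KMS key ID, ARN, or alias.
    """
    settings = {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_id,
        }
    }
    if encrypt_passwords:
        # Optionally encrypt connection passwords with the same key.
        settings["ConnectionPasswordEncryption"] = {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": kms_key_id,
        }
    return settings


def apply_catalog_encryption(glue_client, kms_key_id):
    """Send the settings to AWS Glue using a boto3 Glue client."""
    glue_client.put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings=build_catalog_encryption_settings(kms_key_id)
    )
```

A call might look like apply_catalog_encryption(boto3.client("glue", region_name="us-east-1"), "alias/example-key"), where both the Region and the key alias are assumptions that you replace with your own values.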
Create an AWS Glue Data Catalog database
You can define an AWS Glue Data Catalog database through the Databases tab in the AWS Glue console.
An AWS Glue database that you create for federated search for Amazon S3 has the following restrictions:
- The AWS Glue database must have the same AWS Region as your Splunk Cloud Platform deployment. For example, if your Splunk Cloud Platform deployment is in the US East (N. Virginia) AWS Region, your AWS Glue database must also be in the US East (N. Virginia) AWS Region. When you view your AWS Glue database in the AWS Glue console, the AWS Region displays at the top of the console, near your user ID.
- The name of the AWS Glue database can contain only lowercase letters, numbers, hyphens, and underscores.
- Federated Search for Amazon S3 does not support the AWS Lake Formation permission model. If your AWS Glue database was created with AWS Lake Formation permissions, you must restrict new AWS Glue tables in the database so that they use only IAM access control. See Update an AWS Glue database so that it gives IAM permissions only to new AWS Glue tables.
For more information, including alternate methods for creating AWS Glue databases, see Getting started with the AWS Glue Data Catalog in the AWS Glue Developer Guide.
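The naming and Region restrictions above can be checked before you create the database. The following sketch is one possible pre-flight check. The function names are hypothetical, and the Region arguments are plain strings that you supply yourself, because this example does not look up your Splunk Cloud Platform Region.

```python
import re

# Lowercase letters, numbers, hyphens, and underscores only.
_GLUE_DB_NAME = re.compile(r"[a-z0-9_-]+")


def is_valid_database_name(name):
    """Return True if the name uses only the permitted characters."""
    return _GLUE_DB_NAME.fullmatch(name) is not None


def check_database(name, glue_region, splunk_region):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if not is_valid_database_name(name):
        problems.append("name contains characters that are not permitted")
    if glue_region != splunk_region:
        problems.append("AWS Glue database Region differs from the deployment Region")
    return problems
```

For example, check_database("vpc_flow_log_tables_us_east", "us-east-1", "us-east-1") returns an empty list, while a name such as "VPC Flow Logs" is reported as a problem because it contains uppercase letters and spaces.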
Update an AWS Glue database so that it gives IAM permissions only to new AWS Glue tables
If you have a preexisting AWS Glue database that is set up to give Lake Formation permissions to new AWS Glue tables, and you want to use that database for Federated Search for Amazon S3, you must edit the database so that it gives IAM permissions only to new AWS Glue tables.
- Go to the AWS Lake Formation console and navigate to the Databases page.
- Select the button next to the name of the database that you want to edit. The Default permissions for new tables column lets you quickly identify databases that are set up to give new tables Lake Formation permissions.
- Select Actions and then select Edit.
- Select Use only IAM access control for new tables in this database.
- Select Save to save your change to the database.
For more information, see Creating a database in the AWS Lake Formation Developer Guide.
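The console steps above can also be approximated with the AWS SDK for Python (boto3). The sketch below rests on an assumption: that selecting Use only IAM access control for new tables in this database corresponds to setting the database's CreateTableDefaultPermissions so that the IAM_ALLOWED_PRINCIPALS group receives ALL permissions. Confirm that assumption against the AWS Lake Formation documentation before you rely on it. The function names are hypothetical.

```python
# Assumed equivalent of the console's "Use only IAM access control" setting.
IAM_ONLY_DEFAULT_PERMISSIONS = [
    {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Permissions": ["ALL"],
    }
]

# Fields copied from the existing definition so the update does not drop them.
_COPYABLE_FIELDS = ("Description", "LocationUri", "Parameters")


def build_iam_only_database_input(database):
    """Build a DatabaseInput from an existing database description."""
    database_input = {"Name": database["Name"]}
    for field in _COPYABLE_FIELDS:
        if field in database:
            database_input[field] = database[field]
    database_input["CreateTableDefaultPermissions"] = IAM_ONLY_DEFAULT_PERMISSIONS
    return database_input


def use_iam_only_for_new_tables(glue_client, database_name):
    """Read the database, then update it with IAM-only default permissions."""
    current = glue_client.get_database(Name=database_name)["Database"]
    database_input = build_iam_only_database_input(current)
    glue_client.update_database(Name=database_name, DatabaseInput=database_input)
    return database_input
```

The sketch reads the current definition first because an update replaces the stored definition, so fields that are not copied over could be lost. Only a few common fields are carried across here; extend the list if your database uses others.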
Create AWS Glue tables
If you don't already have AWS Glue tables that represent your Amazon S3 data, you have 4 options for creating AWS Glue table definitions and adding them to AWS Glue databases in the AWS Glue Data Catalog.
- Use the AWS Glue Create Table form in the AWS Glue console to manually design an AWS Glue table definition.
- Enter a JSON schema template in the AWS Glue Create Table form to quickly create an AWS Glue table definition for a known source type.
- Run a SQL DDL CREATE TABLE statement in the Amazon Athena console.
- Use the AWS Glue crawler to generate AWS Glue table definitions and add them to AWS Glue databases in the AWS Glue Data Catalog.
Option | Use for | Description | Knowledge required to create dataset schema | For more information |
---|---|---|---|---|
Use the Create Table form in the AWS Glue console to manually define a table. | Simple datasets, proof-of-concept projects. | Use a guided web form to manually enter each column of the object schema into an AWS Glue table definition. Do not use this method for large AWS Glue tables. | You must know the schema to manually define the AWS Glue table columns. | Search the AWS Glue Developer Guide for "Working with tables on the AWS Glue console". |
Enter a JSON schema template directly into the Create Table form in the AWS Glue console. | Known source types with stable schemas such as Amazon VPC Flow Logs. | If you have a predefined JSON schema template for a known source type such as Amazon VPC Flow Logs, use the Create Table form to add the template directly with the Edit schema as JSON tool, potentially streamlining your AWS Glue table creation process. Requires technical knowledge. | If you supply a predefined JSON schema template, you do not need to know the AWS Glue table schema in advance. | Search the AWS Glue Developer Guide for "Working with tables on the AWS Glue console". |
Use a Hive DDL CREATE TABLE statement to create an AWS Glue table in the Amazon Athena console. | Relatively complex AWS Glue tables, or applying partition projection to a partitioned dataset. | This SQL DDL script method offers control, consistency, automation, and a rich language syntax. Includes features like partition projection and SQL search reuse for known source types. Requires technical knowledge. | If you supply a predefined CREATE TABLE statement for a known source type such as Amazon VPC Flow Logs, you do not need to know the AWS Glue table schema in advance. You do need to know the AWS Glue table schema if you manually define or edit the CREATE TABLE statement. | Search the Amazon Athena User Guide for "Creating tables in Athena". |
Generate an AWS Glue table with the AWS Glue crawler. | File objects with large schemas, or schemas that require constant updates. | The AWS Glue crawler is an automatic AWS Glue table generation tool that can handle large datasets. You can configure the AWS Glue crawler to run on a schedule. | You do not need to know the AWS Glue table schema. The AWS Glue crawler automatically detects dataset schemas, and by default it creates a separate AWS Glue table for each schema it detects within a dataset. You can configure the AWS Glue crawler to combine compatible schemas into a common AWS Glue table definition. However, the AWS Glue crawler is not optimized for the construction of AWS Glue tables from datasets that contain varied JSON schemas. | Search the AWS Glue Developer Guide for "How crawlers work". |
For help with AWS IAM permissions relating to any of these AWS Glue table creation methods, contact your AWS administrator.
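Whichever option you choose, the resulting AWS Glue table definition carries the same kinds of information: a name, an Amazon S3 location, a data format, and a list of columns. The following sketch makes that concrete by assembling a table definition for a JSON dataset in the shape that the AWS SDK for Python (boto3) accepts. Using the SDK is not one of the four options described above, and the function names are hypothetical, as are the table name, location, and columns in the usage note. It is shown only to illustrate what a table definition contains.

```python
def build_json_table_input(table_name, s3_location, columns):
    """Build a Glue TableInput for newline-delimited JSON data.

    columns is a list of (name, type) pairs, for example ("date", "string").
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": col_type} for name, col_type in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


def create_json_table(glue_client, database_name, table_input):
    """Register the table definition in an existing AWS Glue database."""
    glue_client.create_table(DatabaseName=database_name, TableInput=table_input)
```

For instance, build_json_table_input("example_purchases_table", "s3://example-bucket/purchases/", [("date", "string"), ("item", "string")]) produces a definition with two columns. The bucket path and column types here are placeholders, not values taken from this documentation.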
Ensure your AWS Glue table has a column for each field in your data
When you create your AWS Glue table, ensure that it has a column for each field in your dataset schema. This is easy if your dataset has a consistent schema where each event has the same fields, as demonstrated by this collection of JSON-formatted data:
{"date":"2023-02-12", "purchase":{"item": "t-shirt", "size":"M"} }
{"date":"2023-02-13", "purchase":{"item": "vest", "size":"S"} }
{"date":"2023-02-16", "purchase":{"item": "coat", "size":"XL"} }
...
From a dataset that only contains events like these, you can create an AWS Glue table with the following columns: date, purchase, item, and size.
However, you might have a dataset with a variable schema, where different fields appear in different events, like this collection of JSON-formatted data:
{"date":"2023-01-01", "purchase":{"item": "t-shirt", "size":"M"} }
{"date":"2023-01-05", "sale":{"price": 500}, "item": "coffee" }
{"date":"2023-01-08", "trade": true, "item": "coffee" }
...
An AWS Glue table for a dataset like this requires a column for each unique field in the data. An AWS Glue table for this set of events (and events like them) has the following columns: date, purchase, sale, trade, price, item, and size. This AWS Glue table also has blank cells in at least some of its rows.
After you create an AWS Glue table, open it in the AWS Glue table editor and verify that it has the columns that you want to search with federated search for Amazon S3. You can use the AWS Glue table editor to add or remove columns from the AWS Glue table as necessary. For more information, see Creating tables in the AWS Glue Developer Guide.
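To work out which columns a variable-schema dataset needs, you can scan a sample of events and collect every unique field name, including fields nested inside objects. The following sketch does this for newline-delimited JSON. The function name is hypothetical, and the approach assumes that a representative sample of events is available locally.

```python
import json


def collect_columns(json_lines):
    """Return each unique field name found in the events, in first-seen order.

    Nested objects are walked so that their fields are collected as well.
    """
    columns = []

    def visit(obj):
        for key, value in obj.items():
            if key not in columns:
                columns.append(key)
            if isinstance(value, dict):
                visit(value)

    for line in json_lines:
        line = line.strip()
        if line:  # skip blank lines between events
            visit(json.loads(line))
    return columns
```

Applied to the three variable-schema events shown earlier in this section, the function returns the same seven field names listed there: date, purchase, item, size, sale, price, and trade. You can compare this list with the columns in your AWS Glue table to find any that are missing.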
Create a federated provider
After you create an AWS Glue database and some AWS Glue tables that belong to that AWS Glue database, you can use them to define a federated provider for federated search for Amazon S3. See Define an Amazon S3 federated provider.
This documentation applies to the following versions of Splunk Cloud Platform™: 9.3.2408