Create an AWS Glue Data Catalog table

Federated search for Amazon S3 integrates with AWS Glue, Amazon's data integration service. You build tables in AWS Glue that represent the data in your Amazon S3 buckets. Federated searches that run over these AWS Glue Data Catalog tables can efficiently return results from large Amazon S3 datasets.

Each AWS Glue table you create must reference an Amazon S3 location in which all of the data has the same data format and compression type. If you have not yet identified the Amazon S3 locations you want to search, see Identify the Amazon S3 data you want to search.

In this step, you do the following:

Optionally encrypt your AWS Glue Data Catalog and its connection passwords.
Create an AWS Glue Data Catalog database within your AWS Glue Data Catalog.
Define 1 or more AWS Glue tables that are contained by that AWS Glue database.

The following table defines these object types and provides example object names.

Object	Definition	Example
AWS Glue Data Catalog	A persistent technical metadata store in the AWS Cloud. Each AWS account has one AWS Glue Data Catalog per AWS region. Each Data Catalog is a highly scalable collection of AWS Glue tables that are organized into AWS Glue databases.	n/a
AWS Glue database	A namespace that contains 1 or more AWS Glue tables. You must define an AWS Glue database for your AWS Glue tables if you do not already have one. Each AWS Glue table can belong to only 1 AWS Glue database.	`cloudtrail_tables_us_east`
AWS Glue table	A metadata definition that describes a single searchable dataset in an Amazon S3 location. You must have at least 1 AWS Glue table to run a federated search over Amazon S3 data. Each AWS Glue table you create describes a dataset at a specific Amazon S3 location.	`ksmith_cloudtrail_table`

For detailed information about AWS Glue Data Catalogs, AWS Glue databases, and AWS Glue tables, start with AWS Glue Components in the AWS Glue User Guide.

For more information about Amazon S3 locations, see Bucket configuration options in the Amazon S3 User Guide.

Keep track of the AWS Glue databases and AWS Glue tables that you define, and of the Amazon S3 locations that your AWS Glue tables describe. You need them when you define your Amazon S3 federated provider in the following step. See Define an Amazon S3 federated provider.

Apply AWS KMS encryption to your AWS Glue Data Catalog

(Optional) For additional security over your sensitive data, you can apply AWS Key Management Service (AWS KMS) encryption to your AWS Glue Data Catalog. When you apply AWS KMS encryption to your AWS Glue Data Catalog, all new AWS Glue databases and AWS Glue tables that you create within that catalog are automatically encrypted.

If you already have AWS Glue databases and AWS Glue tables in your AWS Glue Data Catalog when you apply AWS KMS encryption to it, you must recreate those databases and tables to ensure that they are encrypted.

See Encrypting your Data Catalog in the AWS Glue User Guide.
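If you prefer to script this setting rather than use the console, the following is a minimal sketch of one way to do it with the AWS SDK for Python (boto3). It assumes that boto3 is installed and that your credentials allow the Glue `PutDataCatalogEncryptionSettings` action. The function names and the key ARN in the usage note are hypothetical placeholders, not values from this topic.

```python
from typing import Any, Dict


def build_encryption_settings(kms_key_id: str) -> Dict[str, Any]:
    """Build the settings dict for Glue's PutDataCatalogEncryptionSettings.

    Covers both parts of the optional step: encryption at rest for the
    Data Catalog and encryption of its connection passwords.
    """
    if not kms_key_id:
        raise ValueError("kms_key_id must not be empty")
    return {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_id,
        },
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": kms_key_id,
        },
    }


def apply_encryption(kms_key_id: str, region_name: str) -> None:
    """Send the settings to AWS. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings=build_encryption_settings(kms_key_id)
    )
```

For example, `apply_encryption("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID", "us-east-1")` would apply the settings in the US East (N. Virginia) AWS Region, where the ARN is a made-up placeholder for your own key.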

Create an AWS Glue Data Catalog database

You can define an AWS Glue Data Catalog database through the Databases tab in the AWS Glue console.

An AWS Glue database that you create for federated search for Amazon S3 has the following restrictions:

The AWS Glue database must have the same AWS Region as your Splunk Cloud Platform deployment. For example, if your Splunk Cloud Platform deployment is in the US East (N. Virginia) AWS Region, your AWS Glue database must also be in the US East (N. Virginia) AWS Region. When you view your AWS Glue database in the AWS Glue console, the AWS Region displays at the top of the console, near your user ID.
The name of the AWS Glue database can contain only lowercase letters, numbers, hyphens, and underscores.
Federated Search for Amazon S3 does not support the AWS Lake Formation permission model. If your AWS Glue database was created with AWS Lake Formation permissions, you must restrict new AWS Glue tables in the database so that they use only IAM access control. See Update an AWS Glue database so that it gives IAM permissions only to new AWS Glue tables.
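The first two restrictions are easy to check before you create anything. The following sketch is illustrative only: the function names are hypothetical, and the region arguments are assumed to be AWS Region codes such as `us-east-1` that you look up yourself.

```python
import re
from typing import List

# Lowercase letters, numbers, hyphens, and underscores only.
_DB_NAME_PATTERN = re.compile(r"[a-z0-9_-]+")


def is_valid_database_name(name: str) -> bool:
    """Return True if the whole name uses only the permitted characters."""
    return _DB_NAME_PATTERN.fullmatch(name) is not None


def check_database(name: str, glue_region: str, splunk_region: str) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if not is_valid_database_name(name):
        problems.append(
            f"Database name {name!r} must contain only lowercase letters, "
            "numbers, hyphens, and underscores."
        )
    if glue_region != splunk_region:
        problems.append(
            f"AWS Glue database region {glue_region!r} does not match the "
            f"Splunk Cloud Platform region {splunk_region!r}."
        )
    return problems
```

Calling `check_database("cloudtrail_tables_us_east", "us-east-1", "us-east-1")` returns an empty list, while a name such as `CloudTrail Tables` fails the name check.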

For more information, including alternate methods for creating AWS Glue databases, see Getting started with the AWS Glue Data Catalog in the AWS Glue Developer Guide.

Update an AWS Glue database so that it gives IAM permissions only to new AWS Glue tables

If you have a preexisting AWS Glue database that is set up to give Lake Formation permissions to new AWS Glue tables, and you want to use that database for Federated Search for Amazon S3, you must edit the database so that it gives IAM permissions only to new AWS Glue tables.

Go to the AWS Lake Formation console and navigate to the Databases page.
Select the button next to the name of the database that you want to edit. The Default permissions for new tables column lets you quickly identify databases that are set up to give new tables Lake Formation permissions.
Select Actions and then select Edit.
Select Use only IAM access control for new tables in this database.
Select Save to save your change to the database.

For more information, see Creating a database in the AWS Lake Formation Developer Guide.
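The console steps above correspond to the `CreateTableDefaultPermissions` setting on the database definition. As a rough sketch, assuming you have retrieved the database definition as a Python dict (for example, from the `Database` element that the boto3 Glue `get_database` call returns), you could check and adjust that setting as follows. The helper names are hypothetical.

```python
import copy
from typing import Any, Dict, List

# Default grant that Lake Formation uses to mean "use only IAM access control".
IAM_ONLY_DEFAULT_PERMISSIONS: List[Dict[str, Any]] = [
    {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Permissions": ["ALL"],
    }
]


def uses_iam_only_defaults(database: Dict[str, Any]) -> bool:
    """Return True if new tables in this database get IAM-only access control."""
    for grant in database.get("CreateTableDefaultPermissions", []):
        principal = grant.get("Principal", {}).get("DataLakePrincipalIdentifier")
        if principal == "IAM_ALLOWED_PRINCIPALS" and "ALL" in grant.get("Permissions", []):
            return True
    return False


def with_iam_only_defaults(database_input: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the definition that gives new tables IAM-only permissions."""
    updated = copy.deepcopy(database_input)
    updated["CreateTableDefaultPermissions"] = copy.deepcopy(IAM_ONLY_DEFAULT_PERMISSIONS)
    return updated
```

The dict that `with_iam_only_defaults` returns is meant to be passed as the `DatabaseInput` argument of the Glue `update_database` call. Because that call replaces the database definition, start from the existing definition and keep only the fields that `DatabaseInput` accepts, rather than building a new dict from scratch.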

Create AWS Glue Data Catalog tables

If you don't already have AWS Glue tables that represent your Amazon S3 data, you have 4 options for creating AWS Glue table definitions and adding them to AWS Glue databases in the AWS Glue Data Catalog.

Use the AWS Glue Create Table form in the AWS Glue console to manually design an AWS Glue table definition.
Enter a JSON schema template in the AWS Glue Create Table form to quickly create an AWS Glue table definition for a known source type.
Run a SQL DDL CREATE TABLE statement in the Amazon Athena console.
Use the AWS Glue crawler to generate AWS Glue table definitions and add them to AWS Glue databases in the AWS Glue Data Catalog.

Option	Use for	Description	Knowledge required to create dataset schema	For more information
Use the Create Table form in the AWS Glue console to manually define a table.	Simple datasets, proof-of-concept projects.	Use a guided web form to manually enter each column of the object schema into an AWS Glue table definition. Do not use this method for large AWS Glue tables.	You must know the schema to manually define the AWS Glue table columns.	Search the AWS Glue Developer Guide for "Working with tables on the AWS Glue console".
Enter a JSON schema template directly into the Create Table form in the AWS Glue console.	Known source types with stable schemas such as AWS CloudTrail or Amazon VPC Flow Logs.	If you have a predefined JSON schema template for a known source type such as AWS CloudTrail or Amazon VPC Flow Logs, use the Create Table form to add the template directly with the Edit schema as JSON tool, potentially streamlining your AWS Glue table creation process. Requires technical knowledge.	If you supply a predefined JSON schema template, you do not need to know the AWS Glue table schema in advance.	Search the AWS Glue Developer Guide for "Working with tables on the AWS Glue console".
Use a Hive DDL `CREATE TABLE` statement to create an AWS Glue table in the Amazon Athena console.	Relatively complex AWS Glue tables, or for application of partition projection to a partitioned dataset.	This SQL DDL script method offers control, consistency, automation, and a rich language syntax. Includes features like partition projection and SQL search reuse for known source types. Requires technical knowledge.	If you supply a predefined `CREATE TABLE` statement for a known source type such as AWS CloudTrail or Amazon VPC Flow Logs, you do not need to know the AWS Glue table schema in advance. You do need to know the AWS Glue table schema if you manually define or edit the `CREATE TABLE` statement.	Search the Amazon Athena User Guide for "Creating tables in Athena".
Generate an AWS Glue table with the AWS Glue crawler.	File objects with large schemas, or schemas that require constant updates.	The AWS Glue crawler is an automatic AWS Glue table generation tool that can handle large datasets. You can configure the AWS Glue crawler to run on a schedule.	You do not need to know the AWS Glue table schema. The AWS Glue crawler automatically detects dataset schemas, and by default it creates a separate AWS Glue table for each schema it detects within a dataset. You can configure the AWS Glue crawler to combine compatible schemas into a common AWS Glue table definition. However, the AWS Glue crawler is not optimized for the construction of AWS Glue tables from datasets that contain varied JSON schemas.	Search the AWS Glue Developer Guide for "How crawlers work".

For help with AWS IAM permissions relating to any of these AWS Glue table creation methods, contact your AWS administrator.
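To give a sense of what the `CREATE TABLE` option involves, the following sketch assembles a simple statement that you could paste into the Amazon Athena console. It is an illustration under several assumptions: the data is newline-delimited JSON read with the OpenX JSON SerDe, the database and table names need no quoting, and the database name, table name, columns, and bucket path in the usage note are all hypothetical. It does not cover partitioning or partition projection.

```python
from typing import Dict


def build_create_table_ddl(
    database: str, table: str, columns: Dict[str, str], location: str
) -> str:
    """Assemble a Hive DDL CREATE EXTERNAL TABLE statement for JSON data.

    columns maps each column name to a Hive type, such as "string" or
    "struct<item:string,size:string>".
    """
    if not columns:
        raise ValueError("at least one column is required")
    if not location.startswith("s3://") or not location.endswith("/"):
        raise ValueError("location must look like s3://bucket/prefix/")
    # Backticks protect column names that are reserved words, such as `date`.
    column_lines = ",\n".join(
        f"  `{name}` {hive_type}" for name, hive_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )
```

For example, `build_create_table_ddl("example_db", "purchases", {"date": "string", "purchase": "struct<item:string,size:string>"}, "s3://example-bucket/purchases/")` produces a statement for a two-column table. Keeping statements like this in version control is one way to get the consistency and reuse that this method offers.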

Ensure your AWS Glue table has a column for each field in your data

When you create your AWS Glue table, ensure that it has a column for each field in your dataset schema. This is easy if your dataset has a consistent schema where each event has the same fields, as demonstrated by this collection of JSON-formatted data:

{"date":"2023-02-12", "purchase":{"item": "t-shirt", "size":"M"} }  
{"date":"2023-02-13", "purchase":{"item": "vest", "size":"S"} } 
{"date":"2023-02-16", "purchase":{"item": "coat", "size":"XL"} }
...

From a dataset that only contains events like these, you can create an AWS Glue table with the following columns: date, purchase, item, and size.

However, you might have a dataset with a variable schema, where different fields appear in different events, like this collection of JSON-formatted data:

{"date":"2023-01-01", "purchase":{"item": "t-shirt", "size":"M"} }  
{"date":"2023-01-05", "sale":{"price": 500}, "item": "coffee" } 
{"date":"2023-01-08", "trade": true, "item": "coffee" }
...

An AWS Glue table for a dataset like this requires a column for each unique field in the data. An AWS Glue table for this set of events (and events like them) has the following columns: date, purchase, sale, trade, price, item, and size. This AWS Glue table also has blank cells in at least some of its rows.
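If you are unsure which fields a variable-schema dataset contains, you can scan a sample of it before you define the table. The following is a minimal sketch that assumes the sample is newline-delimited JSON, like the events above. The function name is hypothetical, and the sketch collects field names only, not data types.

```python
import json
from typing import Any, Iterable, List


def collect_field_names(json_lines: Iterable[str]) -> List[str]:
    """Return every unique field name found in the events, in first-seen order.

    Nested objects are searched too, so fields such as "item" and "size"
    inside "purchase" are included.
    """
    seen: List[str] = []

    def visit(value: Any) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                if key not in seen:
                    seen.append(key)
                visit(child)
        elif isinstance(value, list):
            for element in value:
                visit(element)

    for line in json_lines:
        line = line.strip()
        if line:  # skip blank lines
            visit(json.loads(line))
    return seen


events = [
    '{"date":"2023-01-01", "purchase":{"item": "t-shirt", "size":"M"} }',
    '{"date":"2023-01-05", "sale":{"price": 500}, "item": "coffee" }',
    '{"date":"2023-01-08", "trade": true, "item": "coffee" }',
]
print(collect_field_names(events))
```

For the three sample events, the function finds the same seven fields listed above: date, purchase, sale, trade, price, item, and size.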

After you create an AWS Glue table, open it in the AWS Glue table editor and verify that it has the columns that you want to search with federated search for Amazon S3. You can use the AWS Glue table editor to add or remove columns from the AWS Glue table as necessary. For more information, search on "Working with tables on the console" in the AWS Glue Developer Guide.

Create a federated provider

After you create an AWS Glue database and some AWS Glue tables that belong to that AWS Glue database, you can use them to define a federated provider for federated search for Amazon S3. See Define an Amazon S3 federated provider.
