Splunk Cloud Platform

Federated Search


Create an AWS Glue Data Catalog table

Federated search for Amazon S3 integrates with AWS Glue, Amazon's data integration service. You build tables in AWS Glue that represent the data in your Amazon S3 buckets. The federated searches that you run over these AWS Glue Data Catalog tables can efficiently return results on large Amazon S3 datasets.

The AWS Glue tables you create must reference Amazon S3 locations that contain data with the same data format and compression type. If you have not yet identified the Amazon S3 locations you want to search, see Identify the Amazon S3 data you want to search.

In this step you create an AWS Glue Data Catalog database. Then you define 1 or more AWS Glue tables that are contained by that AWS Glue database. The following table defines these object types and provides example object names.

Object: AWS Glue database
Definition: A namespace that contains 1 or more AWS Glue tables. You must define an AWS Glue database for your AWS Glue tables if you do not already have one. Each AWS Glue table can belong to only 1 AWS Glue database.
Example: cloudtrail_tables_us_east

Object: AWS Glue table
Definition: A metadata definition that describes a single searchable dataset in an Amazon S3 location. You must have at least 1 AWS Glue table to run a federated search over Amazon S3 data. Each AWS Glue table you create describes a dataset at a specific Amazon S3 location.
Example: ksmith_cloudtrail_table

For detailed information about AWS Glue databases, AWS Glue tables, and Amazon S3 locations, search the AWS Glue Developer Guide for those terms.

Keep track of the AWS Glue databases and AWS Glue tables that you define, and of the Amazon S3 locations that your AWS Glue tables describe. You need them when you define your Amazon S3 federated provider in the following step. See Define an Amazon S3 federated provider.

Create an AWS Glue Data Catalog database

You can define an AWS Glue Data Catalog database through the Databases tab in the AWS Glue console.

An AWS Glue database that you create for federated search for Amazon S3 has the following restrictions:

  • The AWS Glue database must have the same AWS Region as your Splunk Cloud Platform deployment. For example, if your Splunk Cloud Platform deployment is in the US East (N. Virginia) AWS Region, your AWS Glue database must also be in the US East (N. Virginia) AWS Region. When you view your AWS Glue database in the AWS Glue console, the AWS Region displays at the top of the console, near your user ID.
  • The name of the AWS Glue database can contain only lowercase letters, numbers, hyphens, and underscores.

For more information, including alternate methods for creating AWS Glue databases, search on "Getting started with the AWS Glue Data Catalog" in the AWS Glue Developer Guide.
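
The two restrictions above can be checked before you create anything. The following Python sketch is illustrative only: the function names are hypothetical, the Region codes are supplied by you, and the optional create_glue_database helper assumes that the boto3 AWS SDK is installed and that AWS credentials are configured. It is one possible alternate method, not a required part of this procedure.

```python
import re

# Pattern derived from the naming restriction above: lowercase letters,
# numbers, hyphens, and underscores only.
_VALID_NAME = re.compile(r"[a-z0-9_-]+")


def check_glue_database(name: str, glue_region: str, splunk_region: str) -> list[str]:
    """Return a list of problems; an empty list means both restrictions hold."""
    problems = []
    if not _VALID_NAME.fullmatch(name):
        problems.append(f"invalid database name: {name!r}")
    if glue_region != splunk_region:
        problems.append(
            f"region mismatch: AWS Glue database in {glue_region}, "
            f"Splunk Cloud Platform deployment in {splunk_region}"
        )
    return problems


def create_glue_database(name: str, region: str) -> None:
    """Create the database with boto3 (hypothetical helper, not called here)."""
    import boto3  # third-party AWS SDK, assumed to be installed

    glue = boto3.client("glue", region_name=region)
    glue.create_database(DatabaseInput={"Name": name})


# us-east-1 is the Region code for US East (N. Virginia).
print(check_glue_database("cloudtrail_tables_us_east", "us-east-1", "us-east-1"))
print(check_glue_database("CloudTrail Tables", "us-west-2", "us-east-1"))
```

The first call prints an empty list. The second call reports both an invalid name and a Region mismatch.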

Create AWS Glue Data Catalog tables

If you don't already have AWS Glue tables that represent your Amazon S3 data, you have 4 options for creating AWS Glue table definitions and adding them to AWS Glue databases in the AWS Glue Data Catalog.

  • Use the AWS Glue Create Table form in the AWS Glue console to manually design an AWS Glue table definition.
  • Enter a JSON schema template in the AWS Glue Create Table form to quickly create an AWS Glue table definition for a known source type.
  • Run a SQL DDL CREATE TABLE statement in the Amazon Athena console.
  • Use the AWS Glue crawler to generate AWS Glue table definitions and add them to AWS Glue databases in the AWS Glue Data Catalog.

The following table compares these options.

Option: Use the Create Table form in the AWS Glue console to manually define a table.
Use for: Simple datasets, proof-of-concept projects.
Description: Use a guided web form to manually enter each column of the object schema into an AWS Glue table definition. Do not use this method for large AWS Glue tables.
Knowledge required to create dataset schema: You must know the schema to manually define the AWS Glue table columns.
For more information: Search the AWS Glue Developer Guide for "Working with tables on the AWS Glue console".

Option: Enter a JSON schema template directly into the Create Table form in the AWS Glue console.
Use for: Known source types with stable schemas such as AWS CloudTrail or Amazon VPC Flow Logs.
Description: If you have a predefined JSON schema template for a known source type such as AWS CloudTrail or Amazon VPC Flow Logs, use the Create Table form to add the template directly with the Edit schema as JSON tool, potentially streamlining your AWS Glue table creation process. Requires technical knowledge.
Knowledge required to create dataset schema: If you supply a predefined JSON schema template, you do not need to know the AWS Glue table schema in advance.
For more information: Search the AWS Glue Developer Guide for "Working with tables on the AWS Glue console".

Option: Use a Hive DDL CREATE TABLE statement to create an AWS Glue table in the Amazon Athena console.
Use for: Relatively complex AWS Glue tables, or for application of partition projection to a partitioned dataset.
Description: This SQL DDL script method offers control, consistency, automation, and a rich language syntax. Includes features like partition projection and SQL search reuse for known source types. Requires technical knowledge.
Knowledge required to create dataset schema: If you supply a predefined CREATE TABLE statement for a known source type such as AWS CloudTrail or Amazon VPC Flow Logs, you do not need to know the AWS Glue table schema in advance. You do need to know the AWS Glue table schema if you manually define or edit the CREATE TABLE statement.
For more information: Search the Amazon Athena User Guide for "Creating tables in Athena".

Option: Generate an AWS Glue table with the AWS Glue crawler.
Use for: File objects with large schemas, or schemas that require constant update.
Description: The AWS Glue crawler is an automatic AWS Glue table generation tool that can handle large datasets. You can configure the AWS Glue crawler to run on a schedule.
Knowledge required to create dataset schema: You do not need to know the AWS Glue table schema. The AWS Glue crawler automatically detects dataset schemas, and by default it creates a separate AWS Glue table for each schema it detects within a dataset. You can configure the AWS Glue crawler to combine compatible schemas into a common AWS Glue table definition. However, the AWS Glue crawler is not optimized for the construction of AWS Glue tables from datasets that contain varied JSON schemas.
For more information: Search the AWS Glue Developer Guide for "How crawlers work".

For help with AWS IAM permissions relating to any of these AWS Glue table creation methods, contact your AWS administrator.
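
To give a sense of what the CREATE TABLE option involves, the following Python sketch renders a statement that you might paste into the Amazon Athena console. Everything specific in it is an assumption made for illustration: the render_create_table helper, the example_db database, the example_events table, the s3://example-bucket/events/ location, and the three flat columns are all hypothetical, and the statement assumes uncompressed JSON data read with the OpenX JSON SerDe. Treat it as a starting point to adapt, not as a statement for any particular source type.

```python
def render_create_table(database, table, columns, s3_location):
    """Render a CREATE EXTERNAL TABLE statement for JSON data in Amazon S3.

    columns is a list of (name, type) pairs. Column names are wrapped in
    backticks because some names, such as date, are reserved words in DDL.
    """
    column_lines = ",\n".join(f"  `{name}` {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical names and columns, chosen only for illustration.
ddl = render_create_table(
    database="example_db",
    table="example_events",
    columns=[("date", "string"), ("trade", "boolean"), ("item", "string")],
    s3_location="s3://example-bucket/events/",
)
print(ddl)
```

Keeping the column list as data makes it easier to reuse the same statement shape across several datasets and to review schema changes before you run the statement.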

Ensure your AWS Glue table has a column for each field in your data

When you create your AWS Glue table, ensure that it has a column for each field in your dataset schema. This is easy if your dataset has a consistent schema where each event has the same fields, as demonstrated by this collection of JSON-formatted data:

{"date":"2023-02-12", "purchase":{"item": "t-shirt", "size":"M"} }  
{"date":"2023-02-13", "purchase":{"item": "vest", "size":"S"} } 
{"date":"2023-02-16", "purchase":{"item": "coat", "size":"XL"} }
...

From a dataset that only contains events like these, you can create an AWS Glue table with the following columns: date, purchase, item, and size.

However, you might have a dataset with a variable schema, where different fields appear in different events, like this collection of JSON-formatted data:

{"date":"2023-01-01", "purchase":{"item": "t-shirt", "size":"M"} }  
{"date":"2023-01-05", "sale":{"price": 500}, "item": "coffee" } 
{"date":"2023-01-08", "trade": true, "item": "coffee" }
...

An AWS Glue table for a dataset like this requires a column for each unique field in the data. An AWS Glue table for this set of events (and events like them) has the following columns: date, purchase, sale, trade, price, item, and size. This AWS Glue table also has blank cells in at least some of its rows.
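
If you are not sure which fields appear in a dataset with a variable schema, you can list them from a sample of events before you define the AWS Glue table. The following Python sketch is one way to do that, assuming the sample is newline-delimited JSON. The function names are hypothetical, and the sample lines are the three events shown above.

```python
import json


def collect_fields(value, found=None):
    """Recursively gather every key that appears in a parsed JSON value."""
    if found is None:
        found = set()
    if isinstance(value, dict):
        for key, child in value.items():
            found.add(key)
            collect_fields(child, found)
    elif isinstance(value, list):
        for child in value:
            collect_fields(child, found)
    return found


def fields_in_events(lines):
    """Union the field names of newline-delimited JSON events."""
    fields = set()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            fields |= collect_fields(json.loads(line))
    return fields


sample = [
    '{"date":"2023-01-01", "purchase":{"item": "t-shirt", "size":"M"} }',
    '{"date":"2023-01-05", "sale":{"price": 500}, "item": "coffee" }',
    '{"date":"2023-01-08", "trade": true, "item": "coffee" }',
]
print(sorted(fields_in_events(sample)))
# prints ['date', 'item', 'price', 'purchase', 'sale', 'size', 'trade']
```

The result matches the seven columns listed above. A sample can miss fields that appear only in rare events, so treat the output as a lower bound on the columns you need.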

After you create an AWS Glue table, open it in the AWS Glue table editor and verify that it has the columns that you want to search with federated search for Amazon S3. You can use the AWS Glue table editor to add or remove columns from the AWS Glue table as necessary. For more information, search on "Working with tables on the console" in the AWS Glue Developer Guide.

Create a federated provider

After you create an AWS Glue database and some AWS Glue tables that belong to that AWS Glue database, you can use them to define a federated provider for federated search for Amazon S3. See Define an Amazon S3 federated provider.

Last modified on 06 March, 2024

This documentation applies to the following versions of Splunk Cloud Platform: 9.0.2305, 9.1.2308 (latest FedRAMP release), 9.1.2312

