sdselect command examples for Amazon S3

The following search examples show how you can use sdselect for Federated Search for Amazon S3. To learn more about the sdselect command, see sdselect command overview.

1. See the schema of the data in an AWS Glue Data Catalog table dataset

The following search returns 5 rows of data from the AWS Glue table dataset represented by a federated index named my_csv_data.

| sdselect * FROM my_csv_data LIMIT 5

Because this search selects all fields ( * ), its results show the schema of a specific AWS Glue table. This type of search is useful when you are not sure what an AWS Glue table contains. You might start with this search before you begin constructing more complex searches.

2. Get a count of all events in an AWS Glue table dataset

The following search returns the count of all events in an AWS Glue table dataset represented by a federated index named my_csv_data.

| sdselect count FROM my_csv_data

3. Perform statistical aggregations on fields and group results by time

The following search gets the count() and sum() of the scan_count field in the AWS Glue table dataset represented by the my_csv_data federated index. The search groups the results by host and by event timestamp, with a 1-day granularity for the timestamps.

To use a GROUPBY clause with the span argument in an sdselect search, you must include the Unix time field in the clause. The Unix time field for the my_csv_data index is set to _time.

| sdselect count(scan_count), sum(scan_count) FROM my_csv_data GROUPBY _time, host span=1d

4. Return the values of a set of fields for a specific username

The following search of the my_csv_data federated index returns the _time, sid, action, and scan_count fields for a user named Tom where the _time values fall within the past 14 days.

As in the previous example, the _time field in this search is the Unix time field.

| sdselect _time, sid, action, scan_count FROM my_csv_data WHERE _time > relative_time(now(), "-14d") AND user="Tom"

5. Get averages of a nested field and group them by time

This example demonstrates the use of a nested field in an sdselect search. The audit.scan_count field contains dot ( . ) character notation but is not enclosed in single or double quotation marks. This indicates that it is a nested struct type field that represents a dataset in a hierarchical structure, such as a JSON array or a directory of files within an Amazon S3 bucket. See Special handling for elements of sdselect syntax.

| sdselect avg(audit.scan_count) FROM mydata GROUPBY _time span=1d

In this example, the Time field for the mydata federated index is set to audit.time. However, because the search uses the span argument in the GROUPBY clause, the search also has to use the Unix time field value of _time in that clause. The Unix time field is an alias of the Time field, so this search is still grouping results by the values of audit.time.

6. Get averages of a flattened field and group them by time

This search is the same as the preceding search, except that the field name is enclosed in single quotation marks. The quotation marks indicate that the field with a dot ( . ) character in its name is a flattened field rather than a nested field, and that the dot does not necessarily indicate any kind of hierarchical notation. See Special handling for elements of sdselect syntax.

| sdselect avg('audit.scan_count') FROM mydata GROUPBY _time span=1d
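The difference between a nested field and a flattened field can be pictured with a short Python sketch. The records and helper functions below are hypothetical illustrations, not sdselect output or Splunk APIs. They only show how the same dotted name can mean either a path through a structure or a single literal field name.

```python
# Hypothetical records, for illustration only; not actual sdselect output.

# Nested: "audit" is a struct that contains a "scan_count" member.
nested_record = {"audit": {"scan_count": 7}}

# Flattened: the literal field name is "audit.scan_count", dot included.
flattened_record = {"audit.scan_count": 7}


def get_nested(record, dotted_path):
    """Walk the record one level per dot-separated segment."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def get_flattened(record, field_name):
    """Treat the whole name, dots and all, as a single key."""
    return record[field_name]


print(get_nested(nested_record, "audit.scan_count"))        # 7
print(get_flattened(flattened_record, "audit.scan_count"))  # 7
```

In this sketch, calling get_nested on the flattened record raises a KeyError because there is no "audit" struct to descend into. Mixing up the two interpretations causes the same kind of mismatch.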

7. Return information for events that have UNIX timestamps earlier than a set date

The following search of the my_csv_data federated index returns information about events grouped into 4-hour spans. The my_csv_data federated index definition declares a Time field named Event_Time, and a Unix time field named _time. Event_Time has UNIX time format timestamps, and _time is an alias of Event_Time that also has UNIX time format timestamps.

In the following search, a WHERE clause retains only events with Event_Time timestamps earlier than a specified date and time. Meanwhile, a GROUPBY clause groups the retained events by their _time timestamps.

| sdselect count, values(user), values(scan_count), min(Event_Time), max(Event_Time) FROM my_csv_data WHERE Event_Time < 1678409481 GROUPBY _time span=4h

Because Event_Time has UNIX time format values, you can use Event_Time directly in the WHERE clause. However, because the GROUPBY clause uses span, that clause must have the Unix time field as a group-by field.
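As a rough illustration of what grouping by _time with span=4h means, the following Python sketch assigns UNIX timestamps to 4-hour buckets. It assumes that buckets are aligned to UTC and that each bucket is labeled by its start time. The actual alignment that sdselect uses is not stated here and might differ. The bucket_start function and the sample timestamps are hypothetical.

```python
from collections import Counter

SPAN_SECONDS = 4 * 60 * 60  # corresponds to span=4h


def bucket_start(unix_time, span=SPAN_SECONDS):
    """Round a UNIX timestamp down to the start of its span (UTC-aligned, by assumption)."""
    return unix_time - (unix_time % span)


# Hypothetical event timestamps, all earlier than or equal to the cutoff 1678409481.
event_times = [1678409481, 1678409000, 1678400000]

# Keep events strictly earlier than the cutoff, then count events per bucket.
cutoff = 1678409481
counts = Counter(bucket_start(t) for t in event_times if t < cutoff)

for start, count in sorted(counts.items()):
    print(start, count)
```

The filter plays the role of the WHERE clause and the bucketing plays the role of the GROUPBY clause with span. Both operate on numeric UNIX time values, which is why no string conversion is needed.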

See Use time fields in sdselect searches.

8. Avoid lexicographical sort order errors in federated searches of Amazon S3 data

The following example is similar to the preceding example except that in this case, CustomStringTime, the Time field for the federated index, has timestamp values that are in a custom string format.

| sdselect count, values(user), values(scan_count), min(CustomStringTime), max(CustomStringTime), min(_time), max(_time) FROM my_csv_data WHERE CustomStringTime < "2023-3-9 16:51:21.000 -0800" GROUPBY _time span=4h

You might run this search, which has CustomStringTime in the WHERE clause, without errors. However, because sdselect compares time fields with time string literals in lexicographical order, it might incorrectly match some time string literals with data points in your AWS Glue table. For example, when sdselect applies lexicographic order to string date values, it considers the date value 2023-3-12 to be smaller than the date value 2023-3-9.
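The ordering problem is a general property of string comparison and is easy to reproduce outside of Splunk. The following Python sketch is only an illustration and does not use sdselect. It compares the two date values from the preceding paragraph, first as strings and then as real dates.

```python
from datetime import date

later = "2023-3-12"
earlier = "2023-3-9"

# String comparison goes character by character: "1" sorts before "9",
# so the later date is considered smaller.
print(later < earlier)  # True

# Comparing actual date values gives the expected chronological order.
print(date(2023, 3, 12) < date(2023, 3, 9))  # False

# Zero-padded strings in the same format do sort chronologically.
print("2023-03-12" < "2023-03-09")  # False
```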

The following example shows the change made to the WHERE clause to avoid lexicographical sorting errors. In this example, the WHERE clause uses _time, the Unix time field value. The _time field has numeric UNIX-time-formatted timestamps, which the sdselect command always sorts in the correct sequential order.

| sdselect count, values(user), values(scan_count), min(CustomStringTime), max(CustomStringTime), min(_time), max(_time) FROM my_csv_data WHERE _time < strptime("2023-3-9 16:51:21.000 -0800", "%Y-%m-%d %H:%M:%S.%Q %z") GROUPBY _time span=4h

This example uses the strptime() function to convert the timestamp in the WHERE clause into a UNIX time value with a numeric datatype so that sdselect can compare the timestamp values against the values of the _time field. This function wasn't necessary in the original example.
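To check the numeric value that such a conversion yields, you can perform a comparable conversion outside of Splunk. The following Python sketch is an illustration, not the sdselect implementation. Python's format codes differ from the ones in the search: %f stands in for the subsecond portion that the search expresses with %Q. The to_unix_time function is a hypothetical helper.

```python
from datetime import datetime


def to_unix_time(text):
    """Parse a 'YYYY-M-D HH:MM:SS.fff +/-HHMM' string into UNIX time (seconds)."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f %z")
    return parsed.timestamp()


cutoff = to_unix_time("2023-3-9 16:51:21.000 -0800")
print(cutoff)  # 1678409481.0
```

The result is a number, so comparisons against it follow numeric order rather than lexicographical order.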

See Use time fields in sdselect searches.

9. Search over data with time values that have the SQL timestamp datatype

The following search returns an event count of the data in the current year. It uses a Time field named timestampTimeField that has values with the SQL timestamp data type.

| sdselect count FROM time_field_int WHERE strftime(timestampTimeField, "%Y") = strftime(now(), "%Y")

See Apply date and time eval functions to fields in the WHERE clause.

10. Use eval functions to convert improperly formatted UNIX timestamps

The search in this example returns an event count of 1 year of data from a timestamp set by a numeric constant. However, there is a problem with the numeric constant. 1682297305000 appears to be a UNIX-formatted timestamp that is extended to include subsecond information, but it does not delimit the subsecond with a dot character ( . ).

To resolve this, the search uses tostring() to convert the numeric value into a string value. Then it uses strptime() to convert that string value into a correctly formatted UNIX timestamp with a dot-delimited subsecond.

| sdselect count FROM time_field_int WHERE strptime(tostring(1682297305000), "%s%3Q") > relative_time(now(), "-1y")
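The same two-step idea, converting the number to a string and then reinterpreting the last three digits as the subsecond, can be sketched in Python. This is an illustration of the conversion only, not of how sdselect evaluates the search. The variable names are hypothetical.

```python
from datetime import datetime, timezone

raw = 1682297305000  # seconds followed by three subsecond digits, with no dot

# Step 1: convert the number to a string, as tostring() does in the search.
digits = str(raw)

# Step 2: treat the last three digits as the subsecond part.
seconds = float(digits[:-3] + "." + digits[-3:])

print(seconds)  # 1682297305.0
print(datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat())
# 2023-04-24T00:48:25+00:00
```

Read as a plain count of seconds, the unconverted value would point to a date tens of thousands of years in the future, which is why the comparison needs the converted value.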

See Apply date and time eval functions to fields in the WHERE clause.

11. Return results based on a partition key

The following search of the my_csv_data federated index returns the _time, sid, action, and scan_count fields for a partition key representing an hour of a day in July 2023. The partition key is defined in the AWS Glue table that is referenced by the my_csv_data federated index.

| sdselect _time, sid, action, scan_count FROM my_csv_data WHERE partition_key_datetime="2023-07-11 06"

For more information, search the AWS Blog website for "Work with partitioned data in AWS Glue".

12. Return a limited set of results, ordered by the values of a specific field

The following search of the my_csv_data federated index gets the count of the scan_count field in the AWS Glue table dataset. The search groups the counts by the values of the sid, action, and user fields. Then it returns the first 10 search results, ordered by the values of the user field.

| sdselect count(scan_count) FROM my_csv_data GROUPBY sid,action,user ORDERBY user LIMIT 10
