sdselect command examples for Amazon S3
The following search examples show how you can use sdselect
for Federated Search for Amazon S3. To learn more about the sdselect
command, see sdselect command overview.
1. See the schema of the data in a AWS Glue Data Catalog table dataset
The following search returns 5 rows of data from the AWS Glue table dataset represented by a federated index named my_csv_data
.
| sdselect * FROM my_csv_data LIMIT 5
This search returns the schema of a specific AWS Glue table. This type of search is useful when you are not sure what an AWS Glue table contains. You might start with this search before you begin constructing more complex searches.
2. Get a count of all events in a AWS Glue table dataset
The following search returns the count of all events in a AWS Glue table dataset represented by a federated index named my_csv_data
.
| sdselect count FROM my_csv_data
3. Perform statistical aggregations on fields and group results by time
The following search gets the count()
and sum()
of the scan_count
field in the AWS Glue table dataset represented by the my_csv_data
federated index. The search organizes the results by their event timestamps into groups with a 1 day granularity.
To use a GROUPBY clause with span
argument in an sdselect
search, you must include the Unix time field in the clause. The Unix time field for the my_csv_data
index is set to _time.
| sdselect count(scan_count), sum(scan_count) FROM my_csv_data GROUPBY _time, host span=1d
4. Return the values of a set of fields for a specific username
The following search of the my_csv_data
federated index returns the _time
, sid
, action
, and scan_count
fields for a user named Tom
where the _time
values fall within the past 14 days.
As in the previous example, the _time
field in this search is the Unix time field.
| sdselect _time, sid, action, scan_count FROM my_csv_data WHERE _time > relative_time(now(), "-14d") AND user="Tom"
5. Get averages of a nested field and group them by time
This example demonstrates usage of a nested field in a sdselect
search. The fact that the audit.scan_count
field contains dot ( . ) character notation but is not surrounded by single or double quotation marks indicates that it is a nested struct type field that represents a dataset in a hierarchical structure, such as a JSON array or a directory of files within an Amazon S3 bucket. See Special handling for elements of sdselect syntax.
| sdselect avg(audit.scan_count) FROM mydata GROUPBY _time span=1d
In this example, the Time field for the mydata federated index is set to audit.time. However, because the search uses the span
argument in the GROUPBY clause, the search also has to use the Unix time field value of _time in that clause. The Unix time field is an alias of the Time field, so this search is still grouping results by the values of audit.time.
6. Get averages of a flattened field and group them by time
This search is the same as the preceding search, except that here, single quotes indicate that the field with a dot ( . ) character in it is a flattened field rather than a nested field, and that the dots do not necessarily indicate any kind of notation. See Special handling for elements of sdselect syntax.
| sdselect avg('audit.scan_count') FROM mydata GROUPBY _time span=1d
7. Return information for events that have UNIX timestamps earlier than a set date
The following search of the my_csv_data
federated index returns information about events grouped into 4-hour spans. The my_csv_data
federated index definition declares a Time field named Event_Time, and a Unix time field named _time. Event_Time has UNIX time format timestamps, and _time is an alias of Event_Time that also has UNIX time format timestamps.
In the following search, a WHERE clause retains only events with Event_Time timestamps earlier than a specified date and time. Meanwhile, a GROUPBY clause groups the retained events by their _time timestamps.
| sdselect count, values(user), values(scan_count), min(Event_Time), max(Event_Time), FROM my_csv_data WHERE Event_Time < 1678409481 GROUPBY _time span=4h
Because Event_Time
has UNIX time format values, you can use Event_Time
directly in the WHERE clause. However, because the GROUPBY clause uses span
, that clause must have the Unix time field as a group-by field.
See Use time fields in sdselect searches.
8. Avoid lexicographical sort order errors in federated searches of Amazon S3 data
The following example is similar to the preceding example except that in this case, CustomStringTime, the Time field for the federated index, has timestamp values that are in a custom string format.
| sdselect count, values(user), values(scan_count), min(CustomStringTime), max(CustomStringTime), min(_time), max(_time) FROM my_csv_data WHERE CustomStringTime < "2023-3-9 16:51:21.000 -0800" GROUPBY _time span=4h
You might run this search, which has CustomStringTime
in the WHERE clause, without errors. However, because sdselect
compares time fields with time string literals in lexicographical order, it might incorrectly match some time string literals with data points in your AWS Glue table. For example, when sdselect
applies lexicographic order to string date values, it considers the date value 2023-3-12
to be smaller than the date value 2023-3-9
.
The following example shows the change made to the WHERE clause to avoid lexicographical sorting errors. In this example, the WHERE clause uses _time
, the Unix time field value. The _time
field has numeric UNIX-time-formatted timestamps, which the sdselect
command always sorts in the correct sequential order.
| sdselect count, values(user), values(scan_count), min(CustomStringTime), max(CustomStringTime), min(_time), max(_time) FROM my_csv_data WHERE _time < strptime("2023-3-9 16:51:21.000 -0800", "%Y-%m-%d %H:%M:%S.%Q %z") GROUPBY _time span=4h
This example uses the strptime()
function to convert the timestamp in the WHERE clause into a UNIX time value with a numeric datatype so that sdselect
can compare the timestamp values against the values of the _time
field. This function wasn't necessary in the original example.
See Use time fields in sdselect searches.
9. Search over data with time values that have the SQL timestamp datatype
The following search returns an event count of the data in the current year. It uses a Time field named timestampTimeField that has values with the SQL timestamp
data type.
| sdselect count FROM time_field_int WHERE strftime(timestampTimeField, "%Y") = strftime(now(), "%Y"))
See Apply date and time eval functions to fields in the WHERE clause.
10. Use eval functions to convert improperly-formatted UNIX timestamps
The search in this example returns an event count of 1 year of data from a timestamp set by a numeric constant. However, there is a problem with the numeric constant. 1682297305000
appears to be a UNIX-formatted timestamp that is extended to include subsecond information, but it does not delimit the subsecond with a dot character ( . ).
To resolve this, the search uses tostring()
to convert the numeric value into a string value. Then it uses strptime()
to convert that string value into a correct UNIX formatted timestamp with a dot-delimited subsecond.
| sdselect count FROM time_field_int WHERE strptime(tostring(1682297305000), "%s%3Q") > relative_time(now(), "-1y")
See Apply date and time eval functions to fields in the WHERE clause.
11. Return results based on a partition key
The following search of the my_csv_data
federated index returns the _time
, sid
, action
, and scan_count
fields for a partition key representing an hour of a day in July 2023. The partition key is defined in the AWS Glue table that is referenced by the my_csv_data
federated index.
| sdselect _time, sid, action, scan_count FROM my_csv_data WHERE partition_key_datetime="2023-07-11 06"
For more information, search the AWS Blog website for "Work with partitioned data in AWS Glue".
12. Return a limited set of results, ordered by the values of a specific field
The following search of the my_csv_data
federated index gets the count of the scan_count
field in the AWS Glue table dataset. The search groups the counts by the values of the sid
and action
fields. Then it returns the first 10 search results, ordered by the values of the user
field.
| sdselect count(scan_count) FROM my_csv_data GROUPBY sid,action,user ORDERBY user LIMIT 10
See also
Use time fields in sdselect searches |
This documentation applies to the following versions of Splunk Cloud Platform™: 9.0.2305, 9.1.2308, 9.1.2312
Feedback submitted, thanks!