Splunk® Supported Add-ons

Splunk Add-on for CrowdStrike FDR


Index time vs search time JSON field extractions

Overview

CrowdStrike FDR uploads events to a dedicated S3 bucket in JSON format. You can configure a sourcetype stanza for automatic JSON parsing and field extraction either at index time or at search time.

  • Index time automatic field extraction extracts field name-value pairs only once, when events enter the Splunk ingest pipeline. Extracted properties and values are then stored in the Splunk index along with the raw event itself. This lets you save resources and speed up searches later, since these fields and values no longer require extraction with every search.

Index time extraction uses more index space and more Splunk license volume, and should typically be configured only if temporal data, such as an IP address or hostname, would otherwise be lost, or if the logs will be used in multiple searches.

  • Search time automatic field extraction runs with every search. This avoids using additional index space but increases the resources and time required for searches to complete.

Default configuration

Historically, the Splunk Add-on for CrowdStrike FDR is configured to do index time automatic extractions. This is implemented using the following settings under the corresponding sourcetype stanza (you can also see this in the add-on sourcetype configurations in the Splunk user interface):

INDEXED_EXTRACTIONS = json
KV_MODE = none

The benefits of the default configuration:

  • Sensor events collected under the S3 bucket data folder belong to several sourcetypes. Historically, sourcetype assignment was implemented using index-time transformations and required the event JSON to be parsed in order to access the properties used for the sourcetype selection decision.
  • Searches over security events are often part of various detection dashboards and visualizations, so repeated searches over the same events are very likely, and index time extractions can be more beneficial.
  • CrowdStrike sensors generate a huge amount of data, up to tens of thousands of events per device per minute. Extracting and indexing event JSON fields enables using event fields in TSTATS searches, which are many times faster than regular STATS searches (see the example search below).
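For example, with index time JSON extractions enabled, a TSTATS search can aggregate directly over indexed fields. The following is only a minimal sketch: the index name is a placeholder, and event_simpleName is used as an example of a field commonly present in FDR sensor events:

| tstats count where index=<your_crowdstrike_index> sourcetype=crowdstrike:* by event_simpleName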

As of version 1.3.0, sourcetype assignment is fully implemented in the modular input part, and index time JSON extraction is no longer a requirement. If the second and third points in the above list do not bring sufficient benefits compared to the savings in Splunk license usage, it is possible to reconfigure the add-on for search-time-only extractions.

Switch from index time to search time JSON field extractions

Turning off index time JSON extraction will not remove indexed properties from old (already ingested) events. However, turning on search time extractions will cause field extraction duplication for the old events (fields extracted at index time plus the same fields extracted at search time). As a result, field types will change from atomic types (number, string, etc.) to multivalue types. This will break the CIM extractions implemented in the add-on and the custom extractions used in your searches for all events ingested before the configuration change.

Avoid changing this configuration for the crowdstrike:inventory:* sourcetypes, because the TSTATS command is used to build KV store lookups for host name resolution (crowdstrike:inventory:aidmaster sourcetype) and for host IP and MAC address resolution (crowdstrike:inventory:managedassets sourcetype). Turning off index time JSON extractions can affect the results of these TSTATS based saved searches.
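After the switch, you can check whether old events return duplicated (multivalue) fields. The following search is only a sketch and not part of the add-on; the index name is a placeholder, and event_simpleName is used as an example of a field commonly present in FDR events, so substitute a field that your own searches rely on:

index=<your_crowdstrike_index> sourcetype=crowdstrike:* | where mvcount(event_simpleName) > 1 | head 100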

Reconfigure using Splunk user interface

  • In the menu, select Settings, then click the Source types item.
  • In the App dropdown list, select Splunk Add-on for CrowdStrike FDR to see only the add-on dedicated sourcetypes.

  • Click the Sourcetype you want to adjust.
  • In the Advanced tab, locate the INDEXED_EXTRACTIONS property and click the button next to the field value to delete the field.
  • Locate the KV_MODE property and change its value from none to json.
  • Click Save.
  • Make sure these changes are applied at all Splunk hosts where this add-on is installed.

Reconfigure using Splunk props.conf

  • In the folder where the Splunk Add-on for CrowdStrike FDR is installed, find the "local" folder. If it does not exist, create it.
  • Inside the local folder, find the "props.conf" file. If it does not exist, create it.
  • Inside the props.conf file, locate the desired sourcetype stanza. If it does not exist, create it (see the example stanza after this list).
  • Assign an empty value to the INDEXED_EXTRACTIONS property:

INDEXED_EXTRACTIONS =

  • Assign the value json to the KV_MODE property:

KV_MODE = json

  • Save your file and restart Splunk.
  • Apply these changes to all Splunk hosts where this add-on is installed.
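For reference, the resulting stanza in local/props.conf might look like the following minimal sketch. The stanza name crowdstrike:events:sensor is only a placeholder; use the name of the sourcetype you are adjusting:

[crowdstrike:events:sensor]
# Empty value disables the default index time JSON extraction
INDEXED_EXTRACTIONS =
# Enable search time JSON extraction instead
KV_MODE = json

To verify which settings take effect on a host, you can run $SPLUNK_HOME/bin/splunk btool props list <your sourcetype> --debug.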

Initial estimation of CrowdStrike FDR data volume

It is possible to estimate the volume of CrowdStrike FDR data before configuring the ingesting modular inputs. The Crowdstrike FDR S3 bucket monitor provides all the necessary information for this purpose. As described earlier, this modular input connects directly to the CrowdStrike FDR dedicated S3 bucket and logs information about all resources located there. A typical log event looks like the following:

FDR S3 bucket monitor, new event file detected: fdr_scan_checkpoint="None", fdr_bucket=bucket-name, fdr_event_batch=fdrv2/userinfo/e1acbcb4-a016-4dd1-9c2f-3b5eaaa79817, fdr_event_file=fdrv2/userinfo/e1acbcb4-a016-4dd1-9c2f-3b5eaaa79817/part-00000.gz, fdr_event_source=s3://bucket-name/fdrv2/userinfo/e1acbcb4-a016-4dd1-9c2f-3b5eaaa79817/part-00000.gz, fdr_event_file_last_modified="2022-09-12 07:58:33+00:00", fdr_event_file_size=647

This event shows information about a single file containing events. By analyzing the fdr_event_file, fdr_event_file_last_modified, and fdr_event_file_size values, you can understand the distribution of incoming data volumes over a day or a wider period of time. For example, the following search can be used:

index=_internal sourcetype=crowdstrike_fdr_ta* "FDR S3 bucket monitor"
| eval event_type_split=split(fdr_event_file, "/")
| eval _time = strptime(fdr_event_file_last_modified, "%Y-%m-%d %H:%M:%S%z")
| timechart sum(fdr_event_file_size) span=1h

Additionally, data can be split by event type (data, aidmaster, userinfo, and so on), as in the following example search:

index=_internal sourcetype=crowdstrike_fdr_ta* "FDR S3 bucket monitor"
| eval event_type_split=split(fdr_event_file, "/")
| eval event_type=if(mvindex(event_type_split, 0) == "fdrv1", mvindex(event_type_split, 1), mvindex(event_type_split, 0))
| eval _time = strptime(fdr_event_file_last_modified, "%Y-%m-%d %H:%M:%S%z")
| timechart sum(fdr_event_file_size) by event_type span=1h

Note that the fdr_event_file_size property value is the size of the event file as it is stored in the S3 bucket, that is, the size of the compressed data. A compressed event file of 25 MB can expand into around 650 MB of uncompressed data and contain around 700,000 events on average.
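Based on the figures above, a compression ratio of roughly 26:1 can be assumed. The following search is a minimal sketch that converts the compressed file sizes into a rough estimate of daily uncompressed volume; the multiplier of 26 is an assumption derived from the example above and should be adjusted to match your data:

index=_internal sourcetype=crowdstrike_fdr_ta* "FDR S3 bucket monitor"
| eval _time = strptime(fdr_event_file_last_modified, "%Y-%m-%d %H:%M:%S%z")
| eval est_uncompressed_bytes = fdr_event_file_size * 26
| timechart span=1d sum(est_uncompressed_bytes) AS estimated_daily_uncompressed_bytes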


Initial setup and scaling

Overview

The following steps are required to start ingesting CrowdStrike events using the Splunk Add-on for CrowdStrike FDR.

  • Install the Splunk Add-on for Crowdstrike FDR in order to create the FDR AWS Collection. Specify connection information for the CrowdStrike FDR feed located in the AWS environment.
  • Configure a filter that allows you to ingest only the events you need. This is an optional step because the add-on already has one predefined filter that drops all heartbeat events. On the CrowdStrike FDR side, you can also configure a filter that controls which collected events are sent to your FDR feed. This makes the amount of data stored in the AWS S3 bucket smaller, which saves additional resources when the add-on downloads, unpacks, and scans event files during ingestion.
  • Configure modular inputs to start ingesting CrowdStrike events. You can use a direct or a distributed ingestion architecture, and therefore one or the other set of ingesting modular input types:

1. Crowdstrike FDR SQS based S3 consumer is a modular input responsible for all steps of the ingest process, from getting the next SQS notification to ingesting all files in the corresponding event batch, one by one. Use this for PoC environments and for CrowdStrike environments generating up to 1-2 TB of events per day in Splunk license usage.

2. Crowdstrike FDR SQS based manager and Crowdstrike FDR managed S3 consumer. These inputs split the responsibilities of monitoring the SQS queue and ingesting events. The Crowdstrike FDR SQS based manager:

  • takes care of the SQS queue
  • gets new batches of event files when needed
  • keeps the journal of received and ingested files
  • updates checkpoints
  • tells available consumer inputs which event file to ingest.

The Crowdstrike FDR managed S3 consumer input ingests event files requested by the manager input.

The manager and consumer modular inputs require a KV store cluster to communicate properly if they run on different hosts. A KV store cluster is available by default in a Splunk Cloud Victoria search head cluster; in other configurations, you must configure it manually.
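As a quick check that the KV store is available on a host, you can query the server info endpoint. The following search is a minimal sketch and is not part of the add-on:

| rest /services/server/info splunk_server=local | table splunk_server kvStoreStatus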

  • If possible, start by creating a single input instance of each modular input type belonging to the selected architecture. This means you should either create a single Crowdstrike FDR SQS based S3 consumer input for the direct architecture, or create one Crowdstrike FDR SQS based manager input and one Crowdstrike FDR managed S3 consumer input if the distributed architecture is selected.
  • Windows OS is not supported as an ingester host for this add-on.
  • Once you have configured your selected inputs, check your ingestion and determine whether it needs additional resources to consume all events. You can do this using the Splunk Add-on for CrowdStrike FDR monitoring dashboard: in Splunk Web, go to the Inputs > Configuration and Search tab. To calculate the required resources, you can use the Modular input's average batch processing time (in seconds) value.
  • New CrowdStrike event batches arrive in the dedicated AWS S3 bucket approximately every 7 minutes (420 seconds). Factoring this into the average time the add-on spends ingesting a single event batch, you can estimate how many ingesting input processes (Crowdstrike FDR SQS based S3 consumer or Crowdstrike FDR managed S3 consumer) are needed.
  • For the Crowdstrike FDR SQS based S3 consumer input, ingesting a batch is the effort of a single modular input process. To figure out the minimal required number of inputs, divide the average batch ingest time by 420 and round up the result.
  • For the Crowdstrike FDR managed S3 consumer, ingesting a batch is the joint effort of all inputs of this type, so the minimal number of required ingest input processes can be calculated as the product of the average batch ingestion time and the current number of ingest input processes, divided by 420 and rounded up (see the worked example after this list).
  • To create your calculated resources, you can:
  • create modular inputs on separate Splunk hosts
  • create all of your inputs on a single, sufficiently powerful Splunk instance
  • spread them between hosts in any proportion, depending on the resources they have.
  • If you plan to run several ingesting input processes on the same host, make sure the host has sufficient processing resources and a proper number of parallel ingestion pipelines configured. Take into account that a single ingesting input process can fully load one Splunk ingestion pipeline and that each Splunk ingestion pipeline can use up to 6 vCPUs. So, for example, if you plan to use two ingesting input processes on the same Splunk host, the host should have at least two parallel ingestion pipelines configured (parallelIngestionPipelines=2 in host server.conf) and at least 12 vCPUs dedicated to ingestion.
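To illustrate these calculations with hypothetical numbers: if the monitoring dashboard shows an average batch processing time of 1500 seconds, the direct architecture needs at least ceil(1500 / 420) = 4 Crowdstrike FDR SQS based S3 consumer inputs. If the same 1500 second average is observed while 2 Crowdstrike FDR managed S3 consumer inputs are already running, the distributed architecture needs at least ceil(1500 * 2 / 420) = 8 consumer inputs. The ingestion pipelines setting mentioned above is configured in server.conf on the ingesting host, for example:

[general]
parallelIngestionPipelines = 2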

Note the following for Splunk Cloud Victoria:

  • To increase the number of ingest pipelines, contact the Splunk Cloud support team to request an exception.
  • The CrowdStrike FDR SQS based S3 consumer and Crowdstrike FDR managed S3 consumer modular inputs are configured by default so that Splunk runs each created input on each cluster search head. So, for example, if your search head cluster has three hosts and you configure a single CrowdStrike FDR managed S3 consumer input, Splunk runs three ingesting processes.

Index time host resolution

Overview

In version 1.5.0, a new modular input, "CrowdStrike Device API Inventory Sync Service", was introduced, which allows users to perform index time host resolution in Splunk Cloud Platform (SCP) stacks. Users can now choose between two types of index time host resolution: "Inventory events" and "CrowdStrike Device API".

Inventory events index time host resolution

Advantages of Inventory events index time host resolution:

  • Host information can be used for search time host resolution

Disadvantages of Inventory events index time host resolution:

  • Host information may arrive with a delay
  • Enrichment of events at ingest time increases the pipeline load by 10%-20%, depending on the resolution table size
  • Corruption of the host resolution table (CSV lookup) breaks ingestion
  • Extra events need to be collected (aidmaster and managedassets) in order to make host resolution work

This variant of index time host resolution is not supported by Splunk Cloud Platform (SCP) stacks.

CrowdStrike Device API index time host resolution

Advantages of CrowdStrike Device API index time host resolution:

  • Improved ingestion performance, because it does not use pipeline resources
  • Comes with a new type of filter, where users can specify the required host information fields
  • No risk of corrupting a CSV lookup table at index time, because none is used
  • Comes with a new modular input that also allows specifying bucket check intervals

Disadvantages of CrowdStrike Device API index time host resolution:

  • Data collected from the Device API can't be used in search time host resolution

This variant of index time host resolution is supported by Splunk Cloud Platform (SCP) stacks and by Splunk Enterprise.

Last modified on 14 December, 2023
