Troubleshoot the Splunk Add-on for CrowdStrike FDR
To troubleshoot your forwarder setup, see "Troubleshoot the forwarder/receiver connection" in the Forwarding Data manual.
Monitor the troubleshooting dashboard
Starting in version 1.3.0, the add-on provides a monitoring dashboard that lets you quickly spot possible issues in the ingest process:
SQS message by event type
This time chart shows the SQS messages received by the add-on per hour. Based on the batch folder that each SQS message points to, the graph splits messages by color into types: data, aidmaster, managedassets, notmanaged, appinfo, and userinfo.
S3 located event files versus event files received in SQS notifications
This time chart uses information collected by the "CrowdStrike FDR S3 bucket monitor" input. It compares the list of batches found in your S3 bucket with the list of batches that the add-on receives in SQS messages. Batches that have not yet been received in SQS messages are marked as "Missed". Depending on the size of the event backlog and the ingestion rate of the Splunk environment, the time chart can show the maximum number of missed batches in the immediate past, with fewer missed batches for older periods. No missed batches should appear beyond that recent period. If you find missed batches further in the past, another unknown process may be consuming SQS events from the same CrowdStrike feed.
Batches seen by the add-on in SQS messages are marked as "Notified". Missed and Notified batches are split into types so that in the time chart legend you can see, for example, "Missed aidmaster" and/or "Notified data".
Bucket resources ingest by stage
This time chart shows the ingestion stages of event files received by the add-on in SQS messages. It helps you verify that events from batch files received in SQS messages appear in the Splunk index by checking the source property of ingested events (all CrowdStrike events have the URL of the S3 bucket resource they originate from as their source value). If events from a given event file are not yet visible in the Splunk index, the add-on tries to show the ingestion stage of this file based on add-on logs:
- Skipped: Means that the whole SQS message containing this event file was ignored. This can happen because this type of event was not selected for ingestion or because an "Ignore SQS messages older than" parameter was configured and the SQS message is too old to be ingested.
- Scheduled: Means that the corresponding SQS message was not skipped. The event file was registered in the ingest journal but no managed consumer input has started to ingest it.
- InProgress: Means that the event file was assigned to a managed consumer input that has started ingesting it.
- Failed: Means that the managed consumer input failed to ingest the event file. If the error was recoverable (for example, a communication issue), the add-on will retry ingesting this event file later.
- Ingested: Means that events from batch files received in SQS messages appeared in the Splunk index.
Event ingestion average delay
Calculates the difference between the time an event was created by CrowdStrike and the time it appeared in the Splunk ingest pipeline, and shows the average per hour. A large average ingestion delay indicates a significant backlog of events to be ingested. Combined with a noticeable number of "Failed" event file ingestions, this can point to Splunk environment configuration and/or communication issues. If the time chart shows the average ingestion delay growing, the environment most likely has insufficient resources. To confirm this, check the "Modular inputs average batch processing time (in seconds)" time chart and make sure that the average time divided by the number of running consuming modular inputs is less than seven minutes, the approximate interval at which CrowdStrike uploads the next event batch to the S3 bucket.
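The capacity check described above can be sketched as follows. The function name is a hypothetical illustration; the seven-minute figure is the approximate batch upload interval mentioned in the text:

```python
# Approximate interval at which CrowdStrike uploads a new batch to S3.
BATCH_UPLOAD_INTERVAL_SEC = 7 * 60

def has_sufficient_throughput(avg_batch_time_sec: float, num_consuming_inputs: int) -> bool:
    """Return True if the average batch processing time, spread across the
    running consuming modular inputs, keeps up with the batch arrival rate."""
    if num_consuming_inputs <= 0:
        raise ValueError("at least one consuming input is required")
    return avg_batch_time_sec / num_consuming_inputs < BATCH_UPLOAD_INTERVAL_SEC

# Example: 1200 s average batch time across 4 inputs -> 300 s per input, OK.
print(has_sufficient_throughput(1200, 4))   # True
print(has_sufficient_throughput(1800, 2))   # False (900 s per input > 420 s)
```

If the check fails, either add consuming inputs (with dedicated pipelines) or scale out the ingesting hosts.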
Ingested vs expected (missing and duplicated events)
This time chart compares the number of events reported by the add-on as sent to Splunk ingest pipeline with the number of events it can find in Splunk index on per event file basis. It will show if any events are duplicated and if any are missing. Ideally, all calculated values are zero and the time chart does not show negative or positive columns.
- Missing: A negative value: the number of events found in the Splunk index minus the number of events sent by the add-on to the Splunk pipeline. This lets you see whether all events sent to the pipeline were actually ingested and helps you pinpoint potential errors in the Splunk ingest pipeline.
A small number of events may show as missing for a short period after ingestion, as the pipeline needs some time to ingest and index an event.
- Duplicated: A positive value of the same difference (events found in the Splunk index minus events sent by the add-on to the Splunk pipeline), showing whether the same event file was ingested more than once. Such duplication indicates an ingestion interruption and can point to an issue or crash, or can be caused by a user changing the input configuration or manually restarting (disabling and then enabling) the input.
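The per-file comparison can be illustrated with a small sketch (the function name and counts are hypothetical):

```python
# Compare events found in the index with events the add-on reported as sent.
# Negative difference -> "Missing", positive -> "Duplicated", zero -> healthy.
def classify_file(indexed_count: int, sent_count: int) -> tuple:
    diff = indexed_count - sent_count
    if diff < 0:
        return ("Missing", diff)
    if diff > 0:
        return ("Duplicated", diff)
    return ("OK", 0)

print(classify_file(599700, 599705))   # ('Missing', -5)
print(classify_file(1199410, 599705))  # ('Duplicated', 599705) - ingested twice
```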
Modular inputs ingest rates (MB/hour)
This time chart shows the size of raw data, in megabytes, sent by each consuming modular input per hour. This number is similar to Splunk license consumption; however, it does not match it exactly because it does not account for the size of internal index structures and index-time extracted fields.
Modular inputs ingest rates (files/hour)
This time chart shows how many event files have been processed by each consuming modular input per hour.
Modular inputs average batch processing time (in seconds)
This time chart shows how much time, in seconds, on average, it takes to ingest a batch of event files. It counts from the moment the add-on receives the SQS message pointing to the event batch to the time the last event file from that batch is successfully ingested. This means that if errors occur during an event file's ingestion, the batch processing time also includes the waiting period and the time required for another ingestion attempt.
Troubleshoot event ingestion
If "CrowdStrike FDR SQS based S3 consumer" is running but you do not see new events appear in your index, try the following to diagnose and mitigate:
- Try making the search time window larger (the time interval to the right of the search expression area). Set it, for example, to seven days. Since the add-on assigns events their creation time, not their ingestion time, ingested events can be several days old and will not appear within the default search time frame.
- Switch search time frame to last 15 or 60 minutes and run the following search:
index="_internal" sourcetype="crowdstrike_fdr_ta*". By default, the add-on is configured to log only informational and error messages, and this search should show you the latest logs and give you an idea of the Splunk Add-on for CrowdStrike FDR's activities. Here are examples of messages that you can find when you run this search:
cs_input_stanza=simple_consumer_input://my_input1, aws_error_message='Proxy connection error': Indicates that the provided proxy configuration does not allow the add-on to communicate with the CrowdStrike AWS environment. Additional information about proxy settings can be found in log messages like:
AWS proxy is disabled, aws_proxy=disabled and
AWS proxy is enabled, aws_proxy=https://*****:*****@proxy.host.fqdn:765
FILE processing summary: cs_input_stanza=simple_consumer_input://my_input1, cs_file_time_taken=223.106, cs_file_path=data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00000.gz, cs_file_size_bytes=24178342, cs_file_error_count=0: Indicates that the event file 'data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00000.gz' of size 24178342 bytes was ingested by input 'my_input1' in 223.106 seconds with 0 errors.
INGEST |< cs_input_stanza=simple_consumer_input://my_input1, cs_ingest_time_taken=229.321, cs_ingest_file_path=s3://crowdstrike-generated-big-batch-us-west-2/data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00016.gz, cs_ingest_total_events=600540, cs_ingest_filter_matches=599705, cs_ingest_error_count=0: Indicates that input 'my_input1' consumed S3 bucket file 's3://crowdstrike-generated-big-batch-us-west-2/data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00016.gz' in 229.321 seconds, and 599705 of the 600540 events in this file matched the filter criteria and were sent to the Splunk index. Pay attention to the number of matching events: if it is 0 for all logged messages, check the selected filter, as it may be incorrectly defined.
BATCH processing summary: cs_input_stanza=simple_consumer_input://si1, cs_batch_time_taken=230.002, cs_batch_path=data/d811c19e-7729-4c9b-abb8-357d539aa4a0, cs_batch_error_count=0: Indicates that the event file batch located at 'data/d811c19e-7729-4c9b-abb8-357d539aa4a0' was ingested by input 'si1' in 230.002 seconds with 0 errors.
simple_consumer_input://si1, Skipping batch fdrv2/aidmaster/d811c19e-7729-4c9b-abb8-357d539aa4a0 according to input configuration: Indicates that the whole batch 'fdrv2/aidmaster/d811c19e-7729-4c9b-abb8-357d539aa4a0' was skipped because the input was configured not to ingest this kind of event. Only inventory events can be skipped like this.
simple_consumer_input://my_input1, Stopping input as EVENT WRITER PIPE IS BROKEN. The add-on will re-try to ingest failed file after AWS SQS visibility_timeout expires: Indicates that communication with the indexers broke during the file ingestion process and input 'my_input1' had to shut down, to be started again by Splunk. In Enterprise Cloud Platform, this often happens when you apply a new configuration to a running input or stop an input; the error is triggered when Splunk tries to restart or stop the input in response. If you cannot correlate this error message with corresponding input enable/configure actions, check communication between the ingesting host (heavy forwarder, IDM, or search head) and the indexers.
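As an illustration, summary lines like those above can be parsed programmatically, for example to flag event files where no events matched the filter. The regex and helper are hypothetical sketches keyed to the sample field names:

```python
import re

# Match the fields of interest in an "INGEST |<" summary line.
INGEST_RE = re.compile(
    r"cs_ingest_file_path=(?P<path>\S+),\s*"
    r"cs_ingest_total_events=(?P<total>\d+),\s*"
    r"cs_ingest_filter_matches=(?P<matches>\d+)"
)

def files_with_no_matches(log_lines):
    """Return paths of event files where no events matched the filter,
    which may indicate a misconfigured filter."""
    suspicious = []
    for line in log_lines:
        m = INGEST_RE.search(line)
        if m and int(m.group("matches")) == 0 and int(m.group("total")) > 0:
            suspicious.append(m.group("path"))
    return suspicious

sample = [
    "INGEST |< cs_input_stanza=simple_consumer_input://my_input1, "
    "cs_ingest_time_taken=229.321, cs_ingest_file_path=s3://bucket/data/x/part-00016.gz, "
    "cs_ingest_total_events=600540, cs_ingest_filter_matches=0, cs_ingest_error_count=0",
]
print(files_with_no_matches(sample))  # ['s3://bucket/data/x/part-00016.gz']
```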
- If none of the above messages appear, try switching the add-on logging level to DEBUG. Go to the Splunk Add-on for CrowdStrike FDR Configuration screen and select the Logging tab. Then select DEBUG in the logging level dropdown box and click Save. Restart the input to make it use the new logging level. Wait several minutes to let the add-on log new information. Re-run the search
index="_internal" sourcetype="crowdstrike_fdr_ta*". Look for the following messages to make sure that the add-on can successfully communicate with AWS infrastructure:
<<< aws_error_code=AWS.SimpleQueueService.NonExistentQueue, aws_error_message='The specified queue does not exist for this wsdl version.': Indicates that an AWS client error has occurred.
aws_error_message can vary depending on the exact AWS client issue.
<<< receive_sqs_messages_time_taken=0.940, receive_sqs_message_count=1: Indicates that a request for a new SQS message was sent and one message was returned. If the value of receive_sqs_message_count is 0, there are no messages in the SQS queue. Check that no other consumers are getting messages from this SQS queue. Also take into account that CrowdStrike FDR does not create new messages in SQS very often, about one SQS message every 7-10 minutes, so you may have to wait for a new message to appear.
- Check for the following message:
<<< check_success_time_taken=0.934, found_SUCCESS=True. If
found_SUCCESS is False, the event batch referenced by the received SQS message will be skipped and no ingestion takes place. To figure out which batch failed the check, look for a preceding log message like
>>> check_success_bucket=crowdstrike-generated-big-batch-us-west-2, check_success_bucket_prefix=data/d811c19e-7729-4c9b-abb8-357d539aa4a0
- If you see the message:
<<< download_file_time_taken=7.107, download_file_path=data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00023.gz, then the add-on is able to successfully download message files from the S3 bucket.
- If you do not see any add-on logs, run the following searches:
index="_internal" ERROR. Look for error messages in the returned logs.
Cannot find the destination field 'ComputerName' in the lookup table: This error can indicate corruption of a CSV lookup used at index time and can prevent ingestion of CrowdStrike events. If you see this error and new CrowdStrike events do not appear in the index, refer to "Recover index time host resolution lookup" below. For Splunk Cloud environments, contact Splunk Support.
Data is not going through all opened pipelines
This issue mostly occurs with multiple pipelines under heavy workload. Each ingesting modular input instance of the add-on can fully load a dedicated Splunk ingest pipeline. To increase throughput, you can increase the number of pipelines if the host hardware allows it. The number of ingesting modular inputs can be increased as well, so that each input has a dedicated ingest pipeline. However, inputs may not be distributed evenly among pipelines, and two or more of the add-on's modular inputs can be assigned to the same ingest pipeline. This decreases the ingestion rate while not fully using available resources. As a possible solution, consider using the weighted_random value for the pipelineSetSelectionPolicy setting in the server.conf file, for example:
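A minimal server.conf fragment illustrating this setting. The file location shown is illustrative; verify stanza placement against the server.conf specification for your Splunk version:

```ini
# server.conf on the ingesting instance
# (for example, $SPLUNK_HOME/etc/system/local/server.conf)
[general]
pipelineSetSelectionPolicy = weighted_random
```

Restart the Splunk instance after changing this setting so the new pipeline selection policy takes effect.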
Recover index time host resolution lookup
CrowdStrike event ingestion can be blocked by corruption of the index-time host resolution lookup CSV. As a result of the corruption, the index-time lookup fails with the error message
Cannot find the destination field 'ComputerName' in the lookup table logged to the _internal index. This corruption can result from running the "Crowdstrike FDR host information sync" input configured with an incorrect source search head or limited user access. In version 1.2.0 of the Splunk Add-on for CrowdStrike, additional validation was added to the "Crowdstrike FDR host information sync" modular input code. This helps prevent damaging the CSV lookup with bad data received during the sync process. If the CSV lookup table has been corrupted, follow the steps below to fix it:
If it's still running, disable the "CrowdStrike FDR host information sync" input.
On each heavy forwarder (IDM or search head in the case of a Splunk Cloud environment), locate the file
Splunk_TA_CrowdStrike_FDR/lookups/crowdstrike_ta_index_time_host_resolution_lookup.csv under $SPLUNK_HOME/etc/apps.
Download Splunk_TA_CrowdStrike_FDR from Splunkbase and unpack it. Locate
lookups/crowdstrike_ta_index_time_host_resolution_lookup.csv and replace the same file on the Splunk instance.
Splunk will reload the updated CSV file automatically.
In a heavily-loaded environment, batches can be processed more than once. This can happen when a message is not processed at the expected time or when an input job is interrupted.
Processed message is visible again in SQS queue
The visibility timeout makes the same SQS message visible again if the software that started processing it is unable to finish the processing or shut down gracefully. When processing a single batch takes longer than the visibility timeout defined for the related SQS message, the message becomes visible in the queue again and other jobs can re-ingest the same data. This results in event duplication on the indexer. To mitigate this:
- When you configure the Splunk Add-on for CrowdStrike FDR, the visibility timeout is set to six hours by default. This value is sufficient to ingest big event batches (300-400 files up to 20 MB each), which is typical of heavily loaded environments with around 10 TB of raw event data per day. If your environment has a different amount of raw event data per day, determine the biggest batch and change the visibility timeout proportionally. The maximum allowed value is 12 hours; the minimum is five minutes. Decreasing the visibility timeout makes an SQS message return to the SQS queue faster, giving it more opportunities to be processed before it expires and is removed from the queue permanently.
- Scale out data collection horizontally by adding additional heavy forwarders (HF)/IDM and use less inputs for each HF/IDM.
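The proportional scaling described above can be sketched as follows. The baseline figures (six hours for roughly 10 TB/day) and the 5-minute/12-hour bounds come from the text; the function name is hypothetical:

```python
# SQS visibility timeout bounds and the add-on's documented baseline.
MIN_TIMEOUT_SEC = 5 * 60          # minimum allowed: five minutes
MAX_TIMEOUT_SEC = 12 * 3600       # maximum allowed: twelve hours
BASELINE_TIMEOUT_SEC = 6 * 3600   # default: six hours
BASELINE_TB_PER_DAY = 10.0        # sized for ~10 TB raw data per day

def recommended_visibility_timeout(tb_per_day: float) -> int:
    """Scale the six-hour baseline to the daily volume, clamped to the
    allowed SQS range."""
    scaled = BASELINE_TIMEOUT_SEC * (tb_per_day / BASELINE_TB_PER_DAY)
    return int(min(MAX_TIMEOUT_SEC, max(MIN_TIMEOUT_SEC, scaled)))

print(recommended_visibility_timeout(10.0))  # 21600 (6 hours)
print(recommended_visibility_timeout(0.1))   # 300 (clamped to 5 minutes)
```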
Visibility timeout troubleshooting
The Splunk Add-on for CrowdStrike FDR logs information that helps you determine whether the selected visibility timeout is adequate for your environment. To check:
- Run the search:
index="_internal" sourcetype="crowdstrike_fdr_ta*" "BATCH processing summary:"
to see time taken to ingest event batches. You will see messages like this:
2022-11-16 08:37:59,814 INFO pid=2228 tid=Thread-2 file=sqs_manager.py:finalize_batch_process:129 | BATCH processing summary: cs_input_stanza=sqs_based_manager://cs_feed1_sqs_man, cs_batch_time_taken=30.982, cs_batch_bucket=cs-prod-cannon-076270a656259f84-c33a6429, cs_batch_path=data/942fae26-bc6d-42cb-ae14-9e1eb84f761e, cs_batch_error_count=0. This tells you how much time, in seconds, it takes to process one event batch. Run this search with a sufficient time frame selected (you can select "All time"), find the largest ingest time taken, and set the visibility timeout to an equal or larger value. This improves the likelihood that all future event batches will be processed within the visibility timeout.
- If the add-on finishes processing an event file after visibility timeout has expired, it logs a warning message like this:
ALERT: data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00018.gz ingested 233.720 seconds after SQS message visibility timeout expiration. Run the following search to see these messages:
index="_internal" sourcetype="crowdstrike_fdr_ta*" "seconds after SQS message visibility timeout expiration"
Use the maximum value from these messages to adjust the input's visibility timeout. Consider creating a Splunk alert based on this log message to be notified about visibility timeout overruns.
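As a rough sketch, the "BATCH processing summary" lines above can be parsed to find the largest batch processing time, which serves as a lower bound when adjusting the visibility timeout (the regex and helper names are hypothetical):

```python
import re

# Extract cs_batch_time_taken from "BATCH processing summary" log lines.
BATCH_TIME_RE = re.compile(r"cs_batch_time_taken=(?P<sec>\d+(?:\.\d+)?)")

def max_batch_time(log_lines):
    """Return the largest batch processing time, in seconds, found in the
    given log lines (0.0 if none are found)."""
    times = [float(m.group("sec"))
             for line in log_lines
             if (m := BATCH_TIME_RE.search(line))]
    return max(times) if times else 0.0

sample = [
    "BATCH processing summary: cs_input_stanza=sqs_based_manager://m1, "
    "cs_batch_time_taken=30.982, cs_batch_error_count=0",
    "BATCH processing summary: cs_input_stanza=sqs_based_manager://m1, "
    "cs_batch_time_taken=412.506, cs_batch_error_count=0",
]
print(max_batch_time(sample))  # 412.506
```

Set the visibility timeout to at least this maximum, with headroom for growth.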
Verify that there is no other process consuming SQS notifications from the CrowdStrike queue
Make sure that no other process is reading and deleting SQS notifications from the SQS queue used for ingesting CrowdStrike events into the Splunk index. If another process is running, the SQS notifications it receives will never reach the ingesting modular inputs, so a significant portion of events collected during each 7-10 minute interval will never be ingested and indexed. Such a process can run on any host capable of connecting to the AWS SQS queue dedicated to the CrowdStrike feed, so it can be hard to catch. Noticeable symptoms of such a process starting to work include:
- Frequency of received SQS messages decreases.
- Amount of ingested CrowdStrike data decreases, which in turn decreases Splunk license usage.
To help identify such a process, a new monitoring modular input called
Crowdstrike FDR S3 bucket monitor was added in version 1.3.0. This modular input is optional and can be used only when monitoring is required. It reads all available CrowdStrike resources in the S3 bucket dedicated to the event feed and logs its findings. These logs can be found by running the following search:
index=_internal sourcetype=crowdstrike_fdr_ta* "FDR S3 bucket monitor, new event file detected:"
Information about found S3 bucket resources can be compared with other add-on logs carrying information about SQS messages received by the add-on:
index=_internal sourcetype=crowdstrike_fdr_ta* "is processing SQS messages: sqs_msg_bucket="
If comparing the S3 bucket resources found with the add-on logs about received SQS messages shows that some event batches (folders) present in the S3 bucket are missing from the SQS messages received by the add-on, this can indicate another process stealing the notifying SQS messages. It can also be evidence of a growing backlog, so verify this assumption with additional checks.
It is not necessary to run the above searches manually: they are encapsulated in the add-on's monitoring dashboard as the "S3 located event files vs event files received in SQS notifications" time chart. For more details, see the documentation section dedicated to the monitoring/troubleshooting dashboard.
To summarize, here are the steps required to spot an external process "stealing" CrowdStrike SQS messages from the SQS queue:
- Make sure "Crowdstrike FDR S3 bucket monitor" modular input is configured and running
- Give it time to collect and log information
- Switch to monitoring dashboard and analyze "S3 located event files vs event files received in SQS notifications" time chart
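The comparison behind the "Missed" marking can be sketched as a simple set difference over batch folder names (all names here are illustrative):

```python
# Batches seen in the S3 bucket but never announced in received SQS
# messages are the "Missed" ones on the dashboard.
def missed_batches(s3_batches, sqs_batches):
    return set(s3_batches) - set(sqs_batches)

s3 = {"data/batch-a", "data/batch-b", "aidmaster/batch-c"}
sqs = {"data/batch-a", "aidmaster/batch-c"}
print(missed_batches(s3, sqs))  # {'data/batch-b'}
```

A persistent, growing set of missed batches in the older past suggests an external consumer rather than a backlog.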
Interrupted input job
When indexer connectivity is lost, messages become available to process again after the configured visibility timeout. Data already ingested by the interrupted job is present on the indexer and will be ingested again on the next processing attempt. Try to avoid unnecessary input reconfiguration and establish stable connections between your instances.
Troubleshooting host resolution
If in search results over crowdstrike:events:sensor sourcetype events you do not see the
aid_computer_name field, then host resolution did not work. Below are the steps to troubleshoot the host resolution process:
Search time host resolution troubleshooting
- Search for all events with
sourcetype=crowdstrike:inventory:aidmaster. AIDmaster events are used as the source for host resolution. If no events are found, no host information has been ingested and there is nothing to use for host resolution. In that case, check the "CrowdStrike FDR SQS based S3 consumer" modular input configuration to make sure AIDmaster events have been chosen for ingestion. If you see AIDmaster events, narrow the search and try to find an unresolved record.
- Check the
aid_master information for aid values in events. You should be able to find AIDmaster events with the same aid. If this information is missing, Splunk has not ingested the information needed to resolve all host names.
- Use an SPL search to check the lookup
crowdstrike_ta_host_resolution_lookup. Look for the AID inside this lookup.
- Find the saved search
crowdstrike_ta_build_host_resolution_table and execute it manually in Splunk Web, then check the lookup again.
Index time host resolution troubleshooting
If you configured and started "CrowdStrike FDR host information sync" input for index time host resolution, run the search
index="_internal" sourcetype="crowdstrike_fdr_ta-inventory_sync_service" to find useful messages. Below is a list of log samples pointing to successful and failed host information sync operations:
Inventory collection successfully synced. File size: 677 bytes. Records count: 2. Time taken: 0.5001018047332764 seconds.
Failed to retrieve collection
Inventory collection is not synced as source collection is empty
Inventory collection is not synced as source collection has unexpected formatting
Unexpected error when retrieving kvstore collection
Failed to authenticate to splunk instance
Inventory collection final rewrite failed with error
Always a good idea to check
Even when CrowdStrike ingestion seems to be running smoothly, there are several log messages that are good to check from time to time:
LineBreakingProcessor - Truncating line because limit of 150000 bytes has been exceeded: Indicates that an actual CrowdStrike event was longer than the maximum value configured in the TRUNCATE setting, so the event was truncated. For sensor events, the TRUNCATE value is set to 150000, which CrowdStrike confirmed as sufficient to handle all possible sensor events, but this can change in the future. It is better to search for this message without reference to the limit value, for example:
index=* "LineBreakingProcessor - Truncating line because limit of ", to be aware of any case of exceeding the limit.
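If you do see truncation, the limit is controlled by the TRUNCATE setting in props.conf for the affected sourcetype. A hedged sketch (verify the actual stanza in the add-on's shipped props.conf before changing anything):

```ini
# props.conf - truncation limit for CrowdStrike sensor events.
# 150000 is the value the text cites for sensor events; raise it only if
# you actually observe truncated events.
[crowdstrike:events:sensor]
TRUNCATE = 150000
```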
Failed to parse timestamp in first MAX_TIMESTAMP_LOOKAHEAD: Another log message worth watching, to make sure event timestamps are extracted correctly and there are no unreadable values or datetime formats.
Release notes for the Splunk Add-on for CrowdStrike FDR