Deep dive: Using ML to identify network traffic anomalies
The goal of this deep dive is to identify periods of time when there is unusual data transfer traffic on your network. Spotting outliers in data transfer traffic can help identify a multitude of issues, ranging from benign events, to performance-impacting misconfigurations, to data exfiltration by a malicious actor.
You can use the following data sources in this deep dive:
This deep dive uses
For best results, use the DensityFunction algorithm.
As an alternative approach, try stats or the DBSCAN algorithm.
Train the model
Before you begin training the model, do the following things:
- Change the index and source type for your environment if needed.
- You must pick a search window that has enough data to be representative of your environment. Search over 30 days at a minimum for this analytic. The more data the better.
Enter the following search into the search bar of the app where you want the analytic in production:
index=botsv2 sourcetype=pan:traffic | eval src_dest_pair=src."|".dest | bin _time span=5m | stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out by _time src_dest_pair | eval HourOfDay=strftime(_time,"%H") | fit DensityFunction bytes_in by HourOfDay as outlier_bytes_in into app:bytes_in_outlier_detection_model | fit DensityFunction bytes_out by HourOfDay as outlier_bytes_out into app:bytes_out_outlier_detection_model
This search sums the bytes in and out per source and destination combination over 5-minute time intervals, enriches the data with the hour of day, and then trains two anomaly detection models: one to detect unusual bytes in by the hour of day, and another to detect unusual bytes out by the hour of day.
After you have run this search and are confident that it is generating results, save it as a report. Schedule the report to periodically retrain the models. As a best practice, retrain the models every week. Schedule model training for a time when your Splunk platform instance has low utilization.
Model training with MLTK can use a high volume of resources.
After training the model, you can select Settings in the top menu bar, then select Lookups, then select Lookup table files, and search for your trained model.
Make sure that the permissions for the model are correct. By default, models are private to the user who has trained them, but since you have used the app: prefix in your search, the model is visible to all users who have access to the app the model was trained in.
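As an alternative to the Settings menu, you can check from the search bar that the models exist. The following is a minimal sketch that assumes the MLTK listmodels command is available in your version. Look for bytes_in_outlier_detection_model and bytes_out_outlier_detection_model in the results. The exact output columns can vary between MLTK versions.

```spl
| listmodels
```

Run this search in the same app where you trained the models, because the results depend on the models your user can access in that app context.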
Apply the model
Now that you have set up the model training cycle and have an accessible model, you can start applying the model to data as it is coming into the Splunk platform. Use the following search to apply the model to data:
index=botsv2 sourcetype=pan:traffic | eval src_dest_pair=src."|".dest | bin _time span=5m | stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out by _time src_dest_pair | eval HourOfDay=strftime(_time,"%H") | apply bytes_in_outlier_detection_model | apply bytes_out_outlier_detection_model
This search can be used to populate a dashboard panel or can be used to generate an alert.
When looking to flag outliers as alerts, you can append the following to the search:
| eval anomaly_score=outlier_bytes_in+outlier_bytes_out | where anomaly_score>0
This addition filters your results to only show those that have been identified as outliers. An anomaly_score of 2 means that both the bytes in and bytes out count appear unusual, whereas an anomaly_score of 1 means that either the bytes in or bytes out count is out of the expected range.
You can then save this search as an alert that triggers when the number of results is greater than 0, and run it on a schedule, such as hourly.
Tune the model
When training and applying your model, you might find that the number of outliers being identified is not proportionate to the data: that the model is either flagging too many or too few outliers. The DensityFunction algorithm has a number of parameters that can be tuned to your data, creating a more manageable set of alerts.
The DensityFunction algorithm has a threshold option that is set to 0.01 by default, which means it identifies the least likely 1% of the data as outliers. This threshold can be configured at the apply stage, so you can increase or decrease it depending on your tolerance for outliers, as shown in the following search:
index=botsv2 sourcetype=pan:traffic | eval src_dest_pair=src."|".dest | bin _time span=5m | stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out by _time src_dest_pair | eval HourOfDay=strftime(_time,"%H") | apply bytes_in_outlier_detection_model threshold=0.005 | apply bytes_out_outlier_detection_model threshold=0.005 | eval anomaly_score=outlier_bytes_in+outlier_bytes_out | where anomaly_score>0
Additional fields can also be extracted and used during the fit and apply stages. For example, if your data has hourly and daily variance, such as significantly more traffic during working hours on a weekday, you can include the hour of the day and the day of the week in the by clause to more finely tune your model to your data, as shown in the following search:
index=botsv2 sourcetype=pan:traffic | eval src_dest_pair=src."|".dest | bin _time span=5m | stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out by _time src_dest_pair | eval HourOfDay=strftime(_time,"%H"), DayOfWeek=strftime(_time,"%a") | fit DensityFunction bytes_in by "HourOfDay,DayOfWeek" as outlier_bytes_in into app:bytes_in_outlier_detection_model | fit DensityFunction bytes_out by "HourOfDay,DayOfWeek" as outlier_bytes_out into app:bytes_out_outlier_detection_model
Make sure that all additional fields that are used for training your model are also included in your model apply search.
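For example, a matching apply search for models trained by hour of day and day of week might look like the following sketch. It reuses the field names and model names from the searches in this deep dive, and calculates DayOfWeek with the same strftime format that was used during training.

```spl
index=botsv2 sourcetype=pan:traffic
| eval src_dest_pair=src."|".dest
| bin _time span=5m
| stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out by _time src_dest_pair
| eval HourOfDay=strftime(_time,"%H"), DayOfWeek=strftime(_time,"%a")
| apply bytes_in_outlier_detection_model
| apply bytes_out_outlier_detection_model
```

If the apply search omits DayOfWeek, or calculates it in a different format than the fit search, the model cannot match incoming data to the groups it was trained on.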
Common questions when running this deep dive
Log sources such as pan:traffic contain lots of other useful information for outlier detection. Can I use this type of approach on other fields?
Although this deep dive example is focused on bytes in and out, there is no reason why you couldn't take a similar approach when analyzing the number of connections, packets transferred, duration, and so on.
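As an illustration, the following sketch trains a model on the number of connections instead of bytes. It assumes that each pan:traffic event represents one connection, which might not hold for your log configuration. The connections and outlier_connections field names and the connections_outlier_detection_model model name are hypothetical.

```spl
index=botsv2 sourcetype=pan:traffic
| eval src_dest_pair=src."|".dest
| bin _time span=5m
| stats count as connections by _time src_dest_pair
| eval HourOfDay=strftime(_time,"%H")
| fit DensityFunction connections by HourOfDay as outlier_connections into app:connections_outlier_detection_model
```

The same pattern applies to other numeric fields: aggregate the field with stats, then replace connections in the fit command with the aggregated field.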
Can I use the src_dest_pair field in the DensityFunction model training by clause?
Although the src_dest_pair field can provide an added layer of granularity to your searches by creating baselines at an IP pair level, it can quickly increase the processing time and resource requirements for DensityFunction. Typically, if there are more than 1,000 different combinations of elements in the by clause, DensityFunction is not the best option. You can take an alternative approach using stats and lookups. For an example of this approach to handling high cardinality data, see the IG Group presentation from .conf21, Anomaly Detection, Sealed with a KISS.
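A stats and lookups approach can be sketched as follows. This is a generic illustration and not necessarily the method shown in the IG Group presentation. The bytes_out_baseline.csv lookup name is hypothetical, and the cutoff of 3 standard deviations above the average is an assumption that you would tune to your data. The first search builds a baseline per IP pair and hour of day:

```spl
index=botsv2 sourcetype=pan:traffic
| eval src_dest_pair=src."|".dest
| bin _time span=5m
| stats sum(bytes_out) as bytes_out by _time src_dest_pair
| eval HourOfDay=strftime(_time,"%H")
| stats avg(bytes_out) as avg_bytes_out stdev(bytes_out) as stdev_bytes_out by src_dest_pair HourOfDay
| outputlookup bytes_out_baseline.csv
```

The second search compares new data against the stored baseline:

```spl
index=botsv2 sourcetype=pan:traffic
| eval src_dest_pair=src."|".dest
| bin _time span=5m
| stats sum(bytes_out) as bytes_out by _time src_dest_pair
| eval HourOfDay=strftime(_time,"%H")
| lookup bytes_out_baseline.csv src_dest_pair HourOfDay OUTPUT avg_bytes_out stdev_bytes_out
| eval upper_bound=avg_bytes_out+3*stdev_bytes_out
| where bytes_out>upper_bound
```

IP pairs that do not appear in the baseline have no avg_bytes_out value, so the where clause drops them. Decide separately whether previously unseen pairs deserve their own alert.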
Calculating a 5 minute aggregate is taking a long time to compute. Is there anything I can do?
Although you are aggregating over 5 minute time spans for the search here, you might find that your search performs better if you look at larger time frame aggregates, such as hourly.
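For example, an hourly version of the training search only changes the span argument of the bin command. The model names in this sketch are hypothetical, and are kept separate so that the hourly models do not overwrite models trained on 5-minute aggregates.

```spl
index=botsv2 sourcetype=pan:traffic
| eval src_dest_pair=src."|".dest
| bin _time span=1h
| stats sum(bytes_in) as bytes_in sum(bytes_out) as bytes_out by _time src_dest_pair
| eval HourOfDay=strftime(_time,"%H")
| fit DensityFunction bytes_in by HourOfDay as outlier_bytes_in into app:bytes_in_hourly_outlier_detection_model
| fit DensityFunction bytes_out by HourOfDay as outlier_bytes_out into app:bytes_out_hourly_outlier_detection_model
```

Use the same span in the apply search. A model trained on hourly totals expects much larger values than 5-minute totals, so mixing the two spans produces misleading results.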
I'm finding too many outliers in my data, what can I do?
See the Tune the model section. In particular, look at how the threshold option can be used to tune the detection sensitivity.
I don't understand how DensityFunction is identifying outliers. How can I find out more about what the algorithm is doing with my data?
You can use the summary command to see information about the models generated using DensityFunction. You can see the distribution type the model has mapped your data to, some statistics about the data distribution, and a cardinality field that tells you how many records have been used to train the model.
A couple of key metrics to investigate are the cardinality and the Wasserstein distance metric. For cardinality, the higher this number is the better. For the Wasserstein distance metric, which tells you how closely the probability distribution matches your actual data, the lower the number the better.
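For example, the following sketch inspects the bytes in model trained earlier in this deep dive. Run it in the app where the model was trained. Depending on your MLTK version and the model permissions, you might need to reference the model with the app: prefix.

```spl
| summary bytes_in_outlier_detection_model
```

When the model was trained with a by clause, expect one row per group, such as one row per hour of day. Groups with a low cardinality or a high distance value are good candidates for a longer training window.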
Are there any gotchas I need to know about?
There are occasions when DensityFunction incorrectly identifies outliers when data is mapped to the beta distribution with certain parameters (for example, alpha=beta=0.5). In this scenario, ignore the results from DensityFunction. You can also choose the distribution type when fitting the model rather than letting it run in auto mode as it does by default, for example by setting dist=norm in the fit search.
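The following sketch shows a fit search that forces the normal distribution for the bytes in model. It assumes that the DensityFunction option is named dist and accepts the value norm, so verify the option name and values against the algorithm documentation for your MLTK version.

```spl
index=botsv2 sourcetype=pan:traffic
| eval src_dest_pair=src."|".dest
| bin _time span=5m
| stats sum(bytes_in) as bytes_in by _time src_dest_pair
| eval HourOfDay=strftime(_time,"%H")
| fit DensityFunction bytes_in by HourOfDay as outlier_bytes_in into app:bytes_in_outlier_detection_model dist=norm
```

After retraining, use the summary command to confirm the distribution type that the model now reports, and compare the distance metric against the model that was trained in auto mode.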
See the following customer use cases from the Splunk .conf archives:
See the following Splunk blog posts on outlier detection:
- Cyclical Statistical Forecasts and Anomalies - Part 1
- Cyclical Statistical Forecasts and Anomalies - Part 4
- Cyclical Statistical Forecasts and Anomalies - Part 5
- Anomalies Are Like a Gallon of Neapolitan Ice Cream - Part 1
- Anomalies Are Like a Gallon of Neapolitan Ice Cream - Part 2
- Building Machine Learning Models with DensityFunction
To learn about implementing analytics and data science projects using Splunk platform statistics, machine learning, and built-in and custom visualization capabilities, see Splunk for Analytics and Data Science.
This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 4.5.0, 5.0.0, 5.1.0, 5.2.0, 5.2.1, 5.2.2, 5.3.0, 5.3.1