Splunk® Machine Learning Toolkit

User Guide

Dataset credits

The Splunk Machine Learning Toolkit contains datasets that were provided by others. We thank and acknowledge the contributors, and provide the licenses that govern the use of their datasets.

Disclaimer

This application may contain certain sample files and datasets, which are provided for your convenience only. Such files and datasets contain information and data compiled by third parties, and Splunk makes no representation or warranty that the data contained in such files and datasets are true, accurate, complete or sanitized. In using the datasets, you understand and agree that the data contained therein are subject to error and cannot be relied upon to perform the task you intend. You understand and agree that your use of the data is at your sole risk. The datasets are made available on an "as is" and "as available" basis without any warranties of any kind, whether express or implied, including without limitation implied warranties of merchantability, fitness for a particular purpose, and non-infringement. In no event will Splunk assume any legal liability or responsibility for loss or damages arising from the sample datasets.

App logons count

Dataset: applogonscount.csv

Used in example:

  • Forecast App Logons with Special Days

License terms: Free to use, collected by Splunk.

App statistics

Dataset: app_usage.csv

Used in examples:

  • Predict VPN Usage
  • Cluster Behavior by App Usage

License terms: Free to use, collected by Splunk.

Authorization log

Dataset: authorization.csv

Used in example:

  • Cluster failed ssh login attempts.

License terms: Free to use with citation. Security Repo (http://www.secrepo.com/) by Mike Sconzo is licensed under a Creative Commons Attribution 4.0 International License.

Bitcoin transactions

Dataset: bitcoin_transactions.csv

Used in example:

  • Detect Outliers in Bitcoin Transactions

License terms: Free to use with citation: http://compbio.cs.uic.edu/data/bitcoin/

Bluetooth devices

Dataset: bluetooth.csv

Used in example:

  • Forecast the Number of Bluetooth Devices

License terms: CRAWDAD Data License

Dear Licensee:

Thank you for your interest in obtaining and using data from the CRAWDAD archive, hereinafter referred to as "Data". CRAWDAD is the Community Resource for Archiving Wireless Data At Dartmouth, and is operated by Dartmouth College under a grant from the National Science Foundation. Data Licensing Information:

Dartmouth College hereby grants a nonexclusive, nontransferable license to use the Data for commercial, educational, and research purposes only. The Data shall not be redistributed without the express written prior approval of Dartmouth College.

Licensee agrees to respect the privacy of those human subjects whose wireless-network activity is captured by the Data. Do not attempt to reverse the anonymization process to identify specific MAC addresses, IP address, telephone number, or other identifiers, or to identify their actual location. Use only the header information in packet traces; do not attempt to extract further information. (Header information specifies the type of information that is being transferred over the network, and specifically excludes the contents of the data, such as usernames, passwords, filenames, files, or URLs.)

Licensee agrees to acknowledge the source of the Data in any publications reporting on Licensee's use of it. For example, "We gratefully acknowledge the use of wireless data from the CRAWDAD archive at Dartmouth College."

Dartmouth expressly reserves the right to use the Data by its faculty, staff and researchers, for educational and research purposes. Dartmouth further reserves the right to provide Data Providers with statistical information regarding licensee's access to and use of the Provider's Data and with the Licensee's name and address.

Dartmouth College provides the Data "AS IS," without any warranty or promise of technical support, and disclaims any liability of any kind for any damages whatsoever resulting from use of Data.

DARTMOUTH MAKES NO WARRANTIES, EXPRESS OR IMPLIED WITH RESPECT TO THE DATA, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY EXPRESSLY DISCLAIMED.

Your acceptance and use of the Data binds you to the terms and conditions of this License as stated herein.

Trustees of Dartmouth College David F. Kotz, Ph.D. Professor of Computer Science 6211 Sudikoff Lab Hanover, NH 03755 USA

E-mail: kotz@cs.dartmouth.edu

http://crawdad.org/nus/bluetooth/20070903/sql/

Call center data

Dataset: call_center.csv

Used in example:

  • Detect Cyclical Outliers in Call Center Data

License terms: Free to use, collected by Splunk.

Churn

Dataset: churn.csv

Used in example:

  • Predict Telecom Customer Churn

License terms: Free to use, with citation request: http://www.sgi.com/tech/mlc/db/churn.all

Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Cyclical business process

Dataset: cyclical_business_process.csv

Used in examples:

  • Detect Cyclical Outliers in Logins
  • Predict Future Logins
  • Cluster Business Anomalies to Reduce Noise

License terms: Free to use, collected by Splunk.

Cyclical business process with anomalies

Dataset: cyclical_business_process_with_external_anomalies.csv

Used in example:

  • Predict External Anomalies

License terms: Free to use, collected by Splunk.

Diabetes

Dataset: diabetes.csv

Used in example:

  • Predict Incidence of Diabetes from Health Metrics

License terms: Free to use with citation: http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names.

Database originally owned by National Institute of Diabetes and Digestive and Kidney Diseases; Database Donor: Sigillito Vincent, Research Center RMI Group Leader, The Johns Hopkins University. 9 May 1990.

Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Diabetic data

Dataset: diabetic.csv

Used in example:

  • Detect Outliers in Diabetes Patient Records

License terms: Free to use with citation:

Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Disk failures

Dataset: disk_failures.csv

Used in examples:

  • Detect Outliers in Disk Failure Events
  • Predict the Failure of Hard Drives using SMART Metrics
  • Cluster Hard Drives by SMART Metrics
  • Find Anomalies in Hard Drives

License terms: Free to use with the following constraints:

A. We are encouraged to cite Backblaze as the source (not a mandatory requirement).

B. We accept that we are solely responsible for how we use the data.

C. We do not sell this data to anyone; it is free.

https://www.backblaze.com/hard-drive-test-data.html

Employee logins

Dataset: logins.csv

Used in examples:

  • Forecast the Number of Employee Logins
  • Detect Outliers in Number of Logins (vs. Predicted Value)

License terms: Free to use, collected by Splunk.

Exchange rate TWI

Dataset: exchange.csv

Used in example:

  • Forecast Exchange Rate TWI using ARIMA

License terms: DataMarket, Default Open License

Source: https://datamarket.com/data/set/22tb/exchange-rate-twi-may-1970-aug-1995

This data release is licensed as follows:

  • You may copy and redistribute the data.
  • You may make derivative works from the data.
  • You may use the data for commercial purposes.
  • You may not sublicense the data when redistributing it.
  • You may not redistribute the data under a different license.
  • Source attribution on any use of this data: Must refer to source.

Firewall traffic

Dataset: firewall_traffic.csv

Used in examples:

  • Predict the Presence of Malware from Firewall Traffic
  • Predict Presence of Known Vulnerability in Data

License terms: Free to use, collected by Splunk.

Housing

Dataset: housing.csv

Used in examples:

  • Predict Median House Value
  • Cluster Neighborhoods by Properties
  • Cluster Houses by Property Descriptions

License terms: Free to use with citation: https://archive.ics.uci.edu/ml/datasets/Housing

Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Internet traffic

Dataset: internet_traffic.csv

Used in example:

  • Forecast Internet Traffic

License terms: Free to use with citation: P. Cortez, M. Rio, M. Rocha and P. Sousa. Multiscale Internet Traffic Forecasting using Neural Networks and Time Series Methods. In Expert Systems, Wiley-Blackwell, In press.

This data release is licensed as follows:

  • You may copy and redistribute the data.
  • You may make derivative works from the data.
  • You may use the data for commercial purposes.
  • You may not sublicense the data when redistributing it.
  • You may not redistribute the data under a different license.
  • Source attribution on any use of this data: Must refer to source.

Milk 2 dataset

Dataset: milk2.csv

Used in examples:

  • Documentation for StateSpaceForecast algorithm
  • Integration tests
  • conftest.py

Mortgage loans

Dataset: mortgage_loan_ny.csv

Used in examples:

  • Detect Outliers in Mortgage Contract Data
  • Cluster Mortgage Loans

License terms: http://www.fhfa.gov/AboutUs/Policies/Pages/API.aspx

This product uses FHFA Data but is neither endorsed nor certified by FHFA.

PDF demo data

Dataset: pdfdemo.csv

Used in example:

  • Detect Outliers on the fitted density functions onto measurements by two different cities in PDF Demo Data

License terms: Free to use, synthesized by Splunk.

Phone usage

Dataset: phone_usage.csv

Used in example:

  • Detect Outliers in Mobile Phone Activity

License terms: CRAWDAD Data License

Dear Licensee:

Thank you for your interest in obtaining and using data from the CRAWDAD archive, hereinafter referred to as "Data". CRAWDAD is the Community Resource for Archiving Wireless Data At Dartmouth, and is operated by Dartmouth College under a grant from the National Science Foundation. Data Licensing Information:

Dartmouth College hereby grants a nonexclusive, nontransferable license to use the Data for commercial, educational, and research purposes only. The Data shall not be redistributed without the express written prior approval of Dartmouth College.

Licensee agrees to respect the privacy of those human subjects whose wireless-network activity is captured by the Data. Do not attempt to reverse the anonymization process to identify specific MAC addresses, IP address, telephone number, or other identifiers, or to identify their actual location. Use only the header information in packet traces; do not attempt to extract further information. (Header information specifies the type of information that is being transferred over the network, and specifically excludes the contents of the data, such as usernames, passwords, filenames, files, or URLs.)

Licensee agrees to acknowledge the source of the Data in any publications reporting on Licensee's use of it. For example, "We gratefully acknowledge the use of wireless data from the CRAWDAD archive at Dartmouth College."

Dartmouth expressly reserves the right to use the Data by its faculty, staff and researchers, for educational and research purposes. Dartmouth further reserves the right to provide Data Providers with statistical information regarding licensee's access to and use of the Provider's Data and with the Licensee's name and address.

Dartmouth College provides the Data "AS IS," without any warranty or promise of technical support, and disclaims any liability of any kind for any damages whatsoever resulting from use of Data.

DARTMOUTH MAKES NO WARRANTIES, EXPRESS OR IMPLIED WITH RESPECT TO THE DATA, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY EXPRESSLY DISCLAIMED.

Your acceptance and use of the Data binds you to the terms and conditions of this License as stated herein.

Trustees of Dartmouth College David F. Kotz, Ph.D. Professor of Computer Science 6211 Sudikoff Lab Hanover, NH 03755 USA

E-mail: kotz@cs.dartmouth.edu

http://crawdad.org/ctu/personal/20120315/

Power plant humidity

Dataset: power_plant.csv

Used in examples:

  • Predict Power Plant Energy Output
  • Detect Outliers in Power Plant Humidity
  • Cluster Power Plant Operating Regimes

License terms: Free to use, with citation request: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.

Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai).

Server power

Dataset: server_power.csv

Used in examples:

  • Predict Server Power Consumption
  • Predict Disk Utilization

License terms: Free to use, with citation request: https://www.usenix.org/legacy/event/hotpower08/tech/full_papers/rivoire/rivoire.pdf

Server response time

Dataset: hostperf.csv

Used in example:

  • Detect Outliers in Server Response Time

License terms: Free to use, collected by Splunk.

Souvenir sales

Dataset: souvenir_sales.csv

Used in example:

  • Forecast Monthly Sales for a Souvenir Shop

License terms: Default open license. This data release is licensed as follows:

  • You may copy and redistribute the data.
  • You may make derivative works from the data.
  • You may use the data for commercial purposes.
  • You may not sublicense the data when redistributing it.
  • You may not redistribute the data under a different license.
  • Source attribution on any use of this data: Must refer to source:

https://datamarket.com/data/set/22mh/monthly-sales-for-a-souvenir-shop-on-the-wharf-at-a-beach-resort-town-in-queensland-australia-jan-1987-dec-1993=!ds=22mh&display=line

Special days

Dataset: specialdays.csv

Used in example:

  • Forecast App Logons with Special Days

License terms: Free to use, collected by Splunk.

Supermarket purchases

Dataset: supermarket.csv

Used in examples:

  • Detect Outliers in Supermarket Purchases
  • Find Anomalies in Supermarket Purchases

License terms: Free to use with citation: Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D. and Giannotti, F., 'Explaining the Product Range Effect in Purchase Data'. In BigData, 2013.

Track day

Dataset: track_day.csv

Used in examples:

  • Predict Vehicle Type from Onboard Metrics
  • Cluster Vehicles by Onboard Metrics

License terms: Free to use, collected by Splunk.

Last modified on 14 November, 2023

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.2.0, 5.2.1, 5.2.2, 5.3.0, 5.3.1, 5.3.3, 5.4.0, 5.4.1, 5.4.2, 5.5.0

