Dataset credits
The Splunk Machine Learning Toolkit contains datasets that were provided by others. We thank and acknowledge the contributors of these datasets and provide the license terms for their use.
Disclaimer
This application may contain certain sample files and datasets, which are provided for your convenience only. Such files and datasets contain information and data compiled by third parties, and Splunk makes no representation or warranty that the data contained in such files and datasets are true, accurate, complete or sanitized. In using the datasets, you understand and agree that the data contained therein are subject to error and cannot be relied upon to perform the task you intend. You understand and agree that your use of the data is at your sole risk. The datasets are made available on an "as is" and "as available" basis without any warranties of any kind, whether express or implied, including without limitation implied warranties of merchantability, fitness for a particular purpose, and non-infringement. In no event will Splunk assume any legal liability or responsibility for loss or damages arising from the sample datasets.
App logons count
Dataset: applogonscount.csv
Used in example:
- Forecast App Logons with Special Days
License terms: Free to use, collected by Splunk.
App statistics
Dataset: app_usage.csv
Used in examples:
- Predict VPN Usage
- Cluster Behavior by App Usage
License terms: Free to use, collected by Splunk.
Authorization log
Dataset: authorization.csv
Used in example:
- Cluster failed ssh login attempts.
License terms: Free to use with citation. Security Repo (http://www.secrepo.com/) by Mike Sconzo is licensed under a Creative Commons Attribution 4.0 International License.
Bitcoin transactions
Dataset: bitcoin_transactions.csv
Used in example:
- Detect Outliers in Bitcoin Transactions
License terms: Free to use with citation: http://compbio.cs.uic.edu/data/bitcoin/
Bluetooth devices
Dataset: bluetooth.csv
Used in example:
- Forecast the Number of Bluetooth Devices
License terms: CRAWDAD Data License
Dear Licensee:
Thank you for your interest in obtaining and using data from the CRAWDAD archive, hereinafter referred to as "Data". CRAWDAD is the Community Resource for Archiving Wireless Data At Dartmouth, and is operated by Dartmouth College under a grant from the National Science Foundation. Data Licensing Information:
Dartmouth College hereby grants a nonexclusive, nontransferable license to use the Data for commercial, educational, and research purposes only. The Data shall not be redistributed without the express written prior approval of Dartmouth College.
Licensee agrees to respect the privacy of those human subjects whose wireless-network activity is captured by the Data. Do not attempt to reverse the anonymization process to identify specific MAC addresses, IP address, telephone number, or other identifiers, or to identify their actual location. Use only the header information in packet traces; do not attempt to extract further information. (Header information specifies the type of information that is being transferred over the network, and specifically excludes the contents of the data, such as usernames, passwords, filenames, files, or URLs.)
Licensee agrees to acknowledge the source of the Data in any publications reporting on Licensee's use of it. For example, "We gratefully acknowledge the use of wireless data from the CRAWDAD archive at Dartmouth College."
Dartmouth expressly reserves the right to use the Data by its faculty, staff and researchers, for educational and research purposes. Dartmouth further reserves the right to provide Data Providers with statistical information regarding licensee's access to and use of the Provider's Data and with the Licensee's name and address.
Dartmouth College provides the Data "AS IS," without any warranty or promise of technical support, and disclaims any liability of any kind for any damages whatsoever resulting from use of Data.
DARTMOUTH MAKES NO WARRANTIES, EXPRESS OR IMPLIED WITH RESPECT TO THE DATA, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY EXPRESSLY DISCLAIMED.
Your acceptance and use of the Data binds you to the terms and conditions of this License as stated herein.
Trustees of Dartmouth College David F. Kotz, Ph.D. Professor of Computer Science 6211 Sudikoff Lab Hanover, NH 03755 USA
E-mail: kotz@cs.dartmouth.edu
http://crawdad.org/nus/bluetooth/20070903/sql/
Call center data
Dataset: call_center.csv
Used in example:
- Detect Cyclical Outliers in Call Center Data
License terms: Free to use, collected by Splunk.
Churn
Dataset: churn.csv
Used in example:
- Predict Telecom Customer Churn
License terms: Free to use, with citation request: http://www.sgi.com/tech/mlc/db/churn.all
Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Cyclical business process
Dataset: cyclical_business_process.csv
Used in examples:
- Detect Cyclical Outliers in Logins
- Predict Future Logins
- Cluster Business Anomalies to Reduce Noise
License terms: Free to use, collected by Splunk.
Cyclical business process with anomalies
Dataset: cyclical_business_process_with_external_anomalies.csv
Used in example:
- Predict External Anomalies
License terms: Free to use, collected by Splunk.
Diabetes
Dataset: diabetes.csv
Used in example:
- Predict Incidence of Diabetes from Health Metrics
License terms: Free to use with citation: http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names.
Database originally owned by National Institute of Diabetes and Digestive and Kidney Diseases; Database Donor: Sigillito Vincent, Research Center RMI Group Leader, The Johns Hopkins University. 9 May 1990.
Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Diabetic data
Dataset: diabetic.csv
Used in example:
- Detect Outliers in Diabetes Patient Records
License terms: Free to use with citation:
Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.
Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Disk failures
Dataset: disk_failures.csv
Used in examples:
- Detect Outliers in Disk Failure Events
- Predict the Failure of Hard Drives using SMART Metrics
- Cluster Hard Drives by SMART Metrics
- Find Anomalies in Hard Drives
License terms: Free to use with the following constraints:
A. We are encouraged to cite Backblaze as the source (not a mandatory requirement).
B. We accept that we are solely responsible for how we use the data.
C. We do not sell this data to anyone, it is free.
https://www.backblaze.com/hard-drive-test-data.html
Employee logins
Dataset: logins.csv
Used in examples:
- Forecast the Number of Employee Logins
- Detect Outliers in Number of Logins (vs. Predicted Value)
License terms: Free to use, collected by Splunk.
Exchange rate TWI
Dataset: exchange.csv
Used in example:
- Forecast Exchange Rate TWI using ARIMA
License terms: DataMarket, Default Open License
Source: https://datamarket.com/data/set/22tb/exchange-rate-twi-may-1970-aug-1995
This data release is licensed as follows:
- You may copy and redistribute the data.
- You may make derivative works from the data.
- You may use the data for commercial purposes.
- You may not sublicense the data when redistributing it.
- You may not redistribute the data under a different license.
- Source attribution on any use of this data: Must refer source.
Firewall traffic
Dataset: firewall_traffic.csv
Used in examples:
- Predict the Presence of Malware from Firewall Traffic
- Predict Presence of Known Vulnerability in Data
License terms: Free to use, collected by Splunk.
Housing
Dataset: housing.csv
Used in examples:
- Predict Median House Value
- Cluster Neighborhoods by Properties
- Cluster Houses by Property Descriptions
License terms: Free to use with citation: https://archive.ics.uci.edu/ml/datasets/Housing
Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Internet traffic
Dataset: internet_traffic.csv
Used in example:
- Forecast Internet Traffic
License terms: Free to use with citation: P. Cortez, M. Rio, M. Rocha and P. Sousa. Multiscale Internet Traffic Forecasting using Neural Networks and Time Series Methods. In Expert Systems, Wiley-Blackwell, In press.
This data release is licensed as follows:
- You may copy and redistribute the data.
- You may make derivative works from the data.
- You may use the data for commercial purposes.
- You may not sublicense the data when redistributing it.
- You may not redistribute the data under a different license.
- Source attribution on any use of this data: Must refer source.
Milk 2 dataset
Dataset: milk2.csv
Used in examples:
- Documentation for StateSpaceForecast algorithm
- Integration tests
- conftest.py
Mortgage loans
Dataset: mortgage_loan_ny.csv
Used in examples:
- Detect Outliers in Mortgage Contract Data
- Cluster Mortgage Loans
License terms: http://www.fhfa.gov/AboutUs/Policies/Pages/API.aspx
This product uses FHFA Data but is neither endorsed nor certified by FHFA.
PDF demo data
Dataset: pdfdemo.csv
Used in example:
- Detect Outliers on the fitted density functions onto measurements by two different cities in PDF Demo Data
License terms: Free to use, synthesized by Splunk.
Phone usage
Dataset: phone_usage.csv
Used in example:
- Detect Outliers in Mobile Phone Activity
License terms: CRAWDAD Data License
Dear Licensee:
Thank you for your interest in obtaining and using data from the CRAWDAD archive, hereinafter referred to as "Data". CRAWDAD is the Community Resource for Archiving Wireless Data At Dartmouth, and is operated by Dartmouth College under a grant from the National Science Foundation. Data Licensing Information:
Dartmouth College hereby grants a nonexclusive, nontransferable license to use the Data for commercial, educational, and research purposes only. The Data shall not be redistributed without the express written prior approval of Dartmouth College.
Licensee agrees to respect the privacy of those human subjects whose wireless-network activity is captured by the Data. Do not attempt to reverse the anonymization process to identify specific MAC addresses, IP address, telephone number, or other identifiers, or to identify their actual location. Use only the header information in packet traces; do not attempt to extract further information. (Header information specifies the type of information that is being transferred over the network, and specifically excludes the contents of the data, such as usernames, passwords, filenames, files, or URLs.)
Licensee agrees to acknowledge the source of the Data in any publications reporting on Licensee's use of it. For example, "We gratefully acknowledge the use of wireless data from the CRAWDAD archive at Dartmouth College."
Dartmouth expressly reserves the right to use the Data by its faculty, staff and researchers, for educational and research purposes. Dartmouth further reserves the right to provide Data Providers with statistical information regarding licensee's access to and use of the Provider's Data and with the Licensee's name and address.
Dartmouth College provides the Data "AS IS," without any warranty or promise of technical support, and disclaims any liability of any kind for any damages whatsoever resulting from use of Data.
DARTMOUTH MAKES NO WARRANTIES, EXPRESS OR IMPLIED WITH RESPECT TO THE DATA, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY EXPRESSLY DISCLAIMED.
Your acceptance and use of the Data binds you to the terms and conditions of this License as stated herein.
Trustees of Dartmouth College David F. Kotz, Ph.D. Professor of Computer Science 6211 Sudikoff Lab Hanover, NH 03755 USA
E-mail: kotz@cs.dartmouth.edu
http://crawdad.org/ctu/personal/20120315/
Power plant humidity
Dataset: power_plant.csv
Used in examples:
- Predict Power Plant Energy Output
- Detect Outliers in Power Plant Humidity
- Cluster Power Plant Operating Regimes
License terms: Free to use, with citation request: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.
Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai)
Server power
Dataset: server_power.csv
Used in examples:
- Predict Server Power Consumption
- Predict Disk Utilization
License terms: Free to use, with citation request: https://www.usenix.org/legacy/event/hotpower08/tech/full_papers/rivoire/rivoire.pdf
Server response time
Dataset: hostperf.csv
Used in example:
- Detect Outliers in Server Response Time
License terms: Free to use, collected by Splunk.
Souvenir sales
Dataset: souvenir_sales.csv
Used in example:
- Forecast Monthly Sales for a Souvenir Shop
License terms: Default open license. This data release is licensed as follows:
- You may copy and redistribute the data.
- You may make derivative works from the data.
- You may use the data for commercial purposes.
- You may not sublicense the data when redistributing it.
- You may not redistribute the data under a different license.
- Source attribution on any use of this data: Must refer to source:
Special days
Dataset: specialdays.csv
Used in example:
- Forecast App Logons with Special Days
License terms: Free to use, collected by Splunk.
Supermarket purchases
Dataset: supermarket.csv
Used in examples:
- Detect Outliers in Supermarket Purchases
- Find Anomalies in Supermarket Purchases
License terms: Free to use with citation: Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D. and Giannotti, F., 'Explaining the Product Range Effect in Purchase Data'. In BigData, 2013.
Track day
Dataset: track_day.csv
Used in examples:
- Predict Vehicle Type from Onboard Metrics
- Cluster Vehicles by Onboard Metrics
License terms: Free to use, collected by Splunk.
This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.2.0, 5.2.1, 5.2.2, 5.3.0, 5.3.1, 5.3.3, 5.4.0, 5.4.1, 5.4.2, 5.5.0