How source types work
This documentation does not apply to the most recent version of Splunk.
Contents
- Sources and source types
- How sourcetype:: values are set
  - Automatic source type classification
  - Alias source types
  - Train the sourcetype auto-classifier
  - Hard-coded source type assignment
- Custom indexing linked to source types
  - Tweak default processing
  - Mask sensitive data
  - Change indexing density
  - Eliminate processing steps
- Field extraction linked to source types
- Configuration files for source types
Sources and source types
A source is a file, stream, or other data input from which Splunk indexes events. For data coming from files and directories, the value of source is the full path, such as /archive/server1/var/log/messages.0 or /var/log/. The value of source for network-based data sources is the protocol and port, such as UDP:514. source:: is a core field that is indexed and stored with every event.
A source type is any common data format produced by a group of sources. sourcetype:: is also a core field that is indexed and stored with every event. It provides an easy way to find all sources that hold the same kind of data but have different source values, either because the data was accessed through a different input method or because an application uses a different configuration or logging path on different systems. For example, you might search for all data of sourcetype::weblogic_stdout even though WebLogic logs from two different domains, to source::/opt/bea92/user_projects/domains/petstore/servers/myserver/logs/myserver.log and source::/opt/bea92/user_projects/domains/invoicing/servers/myserver/logs/myserver.log.
Similarly, you might get Linux syslog data by tailing source::/var/log/messages on some servers while receiving direct syslog input on source::UDP:514 from others, and want to find both by searching for sourcetype::linux_syslog.
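The relationship between the two fields can be pictured with a small sketch. This is an illustration only, not how Splunk stores or searches events: the event dictionaries and the function name sources_by_sourcetype are hypothetical, and the field values are the ones used in the examples above.

```python
from collections import defaultdict

# Hypothetical in-memory events; only the two core fields discussed
# above are modelled, using the example values from this page.
EVENTS = [
    {"source": "/var/log/messages", "sourcetype": "linux_syslog"},
    {"source": "UDP:514", "sourcetype": "linux_syslog"},
    {
        "source": "/opt/bea92/user_projects/domains/petstore/servers/myserver/logs/myserver.log",
        "sourcetype": "weblogic_stdout",
    },
    {
        "source": "/opt/bea92/user_projects/domains/invoicing/servers/myserver/logs/myserver.log",
        "sourcetype": "weblogic_stdout",
    },
]


def sources_by_sourcetype(events):
    """Map each sourcetype value to the sorted list of sources that carry it."""
    grouped = defaultdict(set)
    for event in events:
        grouped[event["sourcetype"]].add(event["source"])
    return {name: sorted(sources) for name, sources in grouped.items()}


grouped = sources_by_sourcetype(EVENTS)
```

Here grouped["linux_syslog"] contains both the tailed file and the network input, which is the point of searching on sourcetype:: rather than source::.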
How sourcetype:: values are set
Automatic source type classification
During indexing, Splunk tries to classify source types automatically by calculating signatures for patterns in the first few thousand lines of any file or stream of network input. These signatures capture features such as repeating patterns of words, punctuation patterns, and line length.
Once Splunk has calculated a signature, it compares it to previously seen signatures. If the pattern is radically new, Splunk establishes a new source type name based on the filename of the first source with that pattern, prefixed by an underscore. Hence a file named error_log that appears to Splunk to be a significantly new format is assigned sourcetype::_error_log. If another file named error_log comes along that is again radically new and different from the log that Splunk classified as sourcetype::_error_log, Splunk names the new one sourcetype::_error_log_u1.
However, when the next file or stream has a signature very close to one Splunk has already seen, Splunk reuses the source type name it previously established, regardless of the filename. So a file called foobar is assigned sourcetype::_error_log if Splunk determines that its pattern is very similar to that of the first file named error_log.
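The naming behavior described above can be sketched in a few lines. This is a toy model, not Splunk's classifier: the punctuation-only signature, the Jaccard similarity measure, the 0.5 threshold, and the class name ToyClassifier are all assumptions made for illustration, and the sample log lines are invented. Only the naming scheme (_filename, then _filename_u1, and reuse of an existing name for a similar signature) comes from the text.

```python
import re


def toy_signature(lines):
    """Assumed stand-in for a signature: the set of punctuation patterns,
    one per line, with letters, digits, and whitespace removed."""
    return frozenset(re.sub(r"[A-Za-z0-9\s]", "", line) for line in lines)


def similarity(sig_a, sig_b):
    """Jaccard overlap between two signatures (an illustrative choice)."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


class ToyClassifier:
    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cut-off for "very close"
        self.known = []             # list of (signature, sourcetype name)

    def classify(self, filename, lines):
        sig = toy_signature(lines)
        # A close match reuses the established name, whatever the filename.
        for known_sig, name in self.known:
            if similarity(sig, known_sig) >= self.threshold:
                return name
        # A radically new pattern gets a name based on the filename,
        # qualified with _u1, _u2, ... if that name is already taken.
        base = "_" + filename
        taken = {name for _, name in self.known}
        name, suffix = base, 0
        while name in taken:
            suffix += 1
            name = f"{base}_u{suffix}"
        self.known.append((sig, name))
        return name


classifier = ToyClassifier()
first = classifier.classify("error_log", ["[Mon] [error] client denied",
                                          "[Tue] [error] file missing"])
similar = classifier.classify("foobar", ["[Wed] [error] timeout"])
different = classifier.classify("error_log", ["a=1;b=2;c=3"])
```

With these invented inputs, the file named foobar picks up the name established by the first error_log, while the second, differently formatted error_log receives the _u1 qualifier.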
If you want to configure your own automatic source type recognition, you can use Splunk's rule-based source type feature. Rule-based source types are assigned automatically based on regular expressions you specify in props.conf. Learn more about how to configure rule-based source types.
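As a rough picture of what a regular-expression rule does, the sketch below assigns a source type when a large enough share of sampled lines matches a pattern. It is not props.conf syntax and not Splunk's implementation: the rule tuple layout, the share threshold, the rule name lots_of_bars, and the function name rule_based_sourcetype are all hypothetical.

```python
import re

# Hypothetical rules: (sourcetype name, regex, minimum share of matching lines).
RULES = [
    ("lots_of_bars", re.compile(r"----"), 0.8),
]


def rule_based_sourcetype(lines, rules=RULES):
    """Return the name of the first rule whose regex matches more than
    the required share of the sampled lines, or None if no rule applies."""
    if not lines:
        return None
    for name, pattern, min_share in rules:
        matched = sum(1 for line in lines if pattern.search(line))
        if matched / len(lines) > min_share:
            return name
    return None


mostly_bars = ["---- a", "---- b", "---- c", "---- d", "---- e"]
half_bars = ["---- a", "plain line"]
```

Returning None stands for "no rule matched", at which point some other method of assigning the source type would have to apply.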
Alias source types
Generally, Splunk's automatic source type classification errs on the side of being too granular rather than too generic. You will likely end up with, for example, sourcetype::_access_log from a farm of web servers running one web application and sourcetype::_access_log_u1 from a farm running another, because the URL patterns common to each application differ. You may want to collapse both into something like sourcetype::access_common to reflect that they are all access logs in NCSA common format.
This is when you will want to create an alias for a source type. Note, however, that aliasing source types is merely cosmetic: it ensures that events are grouped correctly in SplunkWeb and that users can search for sourcetype:: values that make sense. If you have sourcetype-specific indexing properties or extracted fields (both discussed further down this page), you must make sure that the actual stored sourcetype:: value is set correctly at index time through either training or manual source type assignment.
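The difference between an alias and the stored value can be illustrated with a minimal sketch. The alias table, the property registry, and the function names are hypothetical and do not reflect Splunk's configuration format. The sketch only shows that a lookup keyed on the stored value does not see the alias.

```python
# Hypothetical alias table: stored sourcetype value -> display name.
ALIASES = {
    "_access_log": "access_common",
    "_access_log_u1": "access_common",
}

# Hypothetical sourcetype-specific properties, keyed by the STORED value.
PROPERTIES = {
    "access_common": {"custom_indexing": True},
}


def display_sourcetype(stored_value):
    """Name shown to users: the alias if one exists, else the stored value."""
    return ALIASES.get(stored_value, stored_value)


def properties_for(event):
    """Properties are looked up by the stored value, so an alias has no effect."""
    return PROPERTIES.get(event["sourcetype"], {})


event = {"sourcetype": "_access_log_u1"}
shown = display_sourcetype(event["sourcetype"])
applied = properties_for(event)
```

The event is displayed as access_common, but because its stored value is still _access_log_u1, the properties registered under access_common are not applied to it.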
Train the sourcetype auto-classifier
Splunk's auto-classifier can be trained with a set of representative example files so that it is pre-loaded with sourcetype names that are more descriptive than sourcetype::_error_log_u13. If you train it with a wide enough range of files that should share the same sourcetype name, it learns more good rules and unlearns bad ones, so a larger proportion of new files of that sourcetype will be recognized. Pre-training is how Splunk ships with the ability to assign sourcetype::syslog to most syslog files. You can bypass Splunk's auto-classification, skip the training step, and simply hard-code a sourcetype for each data input, but training may still be more effective if you plan to have Splunk index entire directories of mixed sourcetypes, such as /var/log.
If Splunk fails to recognize a common format or, worse yet, misclassifies it, we encourage you to report the problem and send us a sample file so we can improve the product. You can anonymize your file first using Splunk's built-in anonymizer.
Learn how to train Splunk to recognize sourcetypes.
Hard-coded source type assignment
You can bypass automatic source type classification entirely and set a source type yourself when you configure a data input. See setting sourcetype for an input. However, this method is not very granular: all data from the same host or source is assigned the same source type name.
If you need to give different sources within a single directory input different source type names, you can try setting source type for a source.
Custom indexing linked to source types
You can tie custom indexing properties to any source type via props.conf. Just set the source type as the <spec> above a props.conf stanza.
Here are a few things you can do:
Tweak default processing
When Splunk indexes a data source, it automatically breaks the input into distinct events and extracts a host and timestamp for the event. The event boundaries, host, and timestamps are important for analysis. If Splunk is not setting the event boundaries or extracting timestamps and hosts correctly, you can easily modify these settings. See timestamp recognition, how host is assigned, and how events are recognized for more information.
Mask sensitive data
Your logs may contain sensitive personal data, such as social security numbers or passwords, that you want to hide. You can create an event configuration that masks sensitive data as it is processed on input.
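The idea of masking at input time can be sketched as a regular-expression substitution applied to each raw event before it is stored. This is an illustration of the technique only, not Splunk's configuration: the pattern, the replacement text, the function name mask_ssn, and the sample event are assumptions.

```python
import re

# Assumed pattern for a US social security number written as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(raw_event):
    """Replace every SSN-shaped value in the raw event text with a fixed mask."""
    return SSN_PATTERN.sub("XXX-XX-XXXX", raw_event)


# Invented sample event, used only to exercise the function.
masked = mask_ssn("user=jdoe ssn=123-45-6789 action=login")
```

Because the substitution happens before the event is stored, the original value never reaches the index. A pattern this simple will also mask other numbers that happen to share the same shape.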
Change indexing density
When Splunk indexes data, it segments events via major and minor breakers. To save storage space on the indexer, you can change Splunk's default segmentation settings. For example, web proxy logs may contain lengthy URLs that Splunk breaks into many different minor segments. You may wish to change this setting to eliminate unnecessary overhead.
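To see why long URLs inflate the number of segments, consider the simplified model below. The breaker character sets are assumptions chosen for illustration and are not Splunk's defaults. The function name segments and the sample proxy line are also hypothetical.

```python
import re

# Assumed breaker sets; the real defaults are configurable and differ.
MAJOR_BREAKERS = r"[\s&?]+"
MINOR_BREAKERS = r"[/.:=_-]+"


def segments(event, use_minor=True):
    """Return the set of segments produced for one event.

    Major breakers split the event into major segments. If use_minor is
    True, each major segment is split again on the minor breakers and
    the resulting pieces are added as well.
    """
    majors = [s for s in re.split(MAJOR_BREAKERS, event) if s]
    result = set(majors)
    if use_minor:
        for major in majors:
            result.update(p for p in re.split(MINOR_BREAKERS, major) if p)
    return result


# Invented proxy-style line, used only to count segments.
line = "GET http://example.com/a/b/c.html 200"
with_minor = segments(line)
without_minor = segments(line, use_minor=False)
```

In this model the single URL contributes seven extra minor segments, so turning off minor segmentation shrinks the segment set from ten entries to three, at the cost of no longer being able to match on the individual URL parts.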
Eliminate processing steps
Certain processing steps can be eliminated to provide faster indexing and better throughput. For example, if you don't need Splunk to search for timestamps within events, you can turn off timestamp extraction. You can also tune down or eliminate event type auto-discovery.
Field extraction linked to source types
You can also associate field extraction rules with source types. Like custom indexing properties, field extraction rules are based on the stored sourcetype:: value set at index time, so aliasing a source type won't cause it to pick up new rules. You'll need to either hard-code a correct sourcetype:: value for the source or input, or train Splunk.
Configuration files for source types
Source type can be set in inputs.conf. Custom indexing properties and rule-based associations of source types are configured through props.conf. Before manually modifying any configuration file, please read about bundle files.
This documentation applies to the following versions of Splunk: 3.0, 3.0.1, 3.0.2, 3.1, 3.1.1, 3.1.2, 3.1.3, 3.1.4.