Character set
This documentation does not apply to the most recent version of Splunk.
You can use Splunk's built-in character set specification to support internationalization. When a character set is specified for a source, Splunk assumes that input from that source is in the specified encoding. On most Unix systems, the command "iconv -l" lists the valid encodings. If you specify an invalid encoding, Splunk logs a warning during initial configuration and discards further input from that source. If the encoding is valid but some characters in the input are not valid in that encoding, those characters are escaped as hex (e.g., "\xF3").
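The two failure modes described above can be illustrated outside Splunk. The following is a minimal Python sketch, not Splunk's implementation: the helper names are hypothetical, the set of encodings Python knows differs from what "iconv -l" reports, and Python writes the escape in lowercase ("\xf3") where this page shows "\xF3".

```python
import codecs


def is_known_encoding(name: str) -> bool:
    """Return True if this Python build has a codec registered under `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def decode_with_hex_escapes(raw: bytes, charset: str) -> str:
    """Decode `raw`, replacing bytes that are invalid in `charset` with \\xNN escapes."""
    return raw.decode(charset, errors="backslashreplace")


# 0xF3 followed by "b" is not a valid UTF-8 sequence, so the byte is escaped.
escaped = decode_with_hex_escapes(b"a\xf3b", "utf-8")
print(escaped)
print(is_known_encoding("ISO-8859-7"), is_known_encoding("ISO-12345"))
```

An unknown encoding name is rejected up front, while a known encoding with stray bytes still yields readable text with the offending bytes made visible.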
Character set specification
By default, Splunk sets CHARSET to UTF-8 in $SPLUNK_HOME/etc/bundles/default/props.conf. If your files are neither UTF-8 nor ASCII, Splunk needs to convert them to UTF-8, and to do that it must know their encoding.
Manual Specification
You can manually specify the CHARSET for a source in $SPLUNK_HOME/etc/bundles/local/props.conf:
[source::$SOURCE]
CHARSET=ISO-8859-7
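To make the effect of a manual CHARSET concrete, here is a small Python sketch of the conversion it implies: bytes known to be ISO-8859-7 are decoded and re-encoded as UTF-8. The function name and the sample text are assumptions for illustration; Splunk performs this step internally.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Re-encode `raw` from the declared `charset` into UTF-8."""
    return raw.decode(charset).encode("utf-8")


# Hypothetical sample: Greek text as it would be stored in an ISO-8859-7 file.
greek = "Καλημέρα κόσμε"
raw = greek.encode("iso-8859-7")

utf8 = to_utf8(raw, "ISO-8859-7")
print(utf8.decode("utf-8"))
```

The same bytes decoded with the wrong charset would produce different characters, which is why the declared CHARSET has to match the source.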
Automatic Specification
Splunk includes an algorithm that detects the language and character set encoding of a source. Currently, 71 languages are detected, 20 of them in non-UTF-8 character sets. Set CHARSET to AUTO in props.conf to have Splunk automatically determine the language and character encoding of a source.
For example:
If Splunk sees a Greek document in the ISO-8859-7 character set, it will detect that and automatically convert it to UTF-8.
In props.conf set the CHARSET key to AUTO:
[my-foreign-docs]
CHARSET=AUTO
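This page does not describe how the detection works internally. As a rough intuition only, the Python sketch below shows one generic technique, byte n-gram matching against sample text per language and encoding. It is a toy under stated assumptions, not Splunk's algorithm: the function names, the training sentences, and the overlap scoring rule are all hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every run of n consecutive bytes in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def overlap(sample: Counter, model: Counter) -> float:
    """Fraction of the sample's n-grams (by count) that also occur in the model."""
    total = sum(sample.values())
    if total == 0:
        return 0.0
    return sum(count for gram, count in sample.items() if gram in model) / total


def detect(data: bytes, models: dict) -> str:
    """Return the label of the model whose n-grams best cover `data`."""
    profile = byte_ngrams(data)
    return max(models, key=lambda label: overlap(profile, models[label]))


# Hypothetical training samples, labelled <language>-<encoding>.
training = {
    "greek-ISO-8859-7": "καλημέρα κόσμε, αυτό είναι ένα δείγμα ελληνικού κειμένου".encode("iso-8859-7"),
    "russian-KOI8-R": "привет мир, это пример русского текста".encode("koi8-r"),
}
models = {label: byte_ngrams(raw) for label, raw in training.items()}

incoming = "αυτό είναι ελληνικό κείμενο".encode("iso-8859-7")
label = detect(incoming, models)
language, _, charset = label.partition("-")
decoded = incoming.decode(charset)  # text is now ready to be stored as UTF-8
print(label, decoded)
```

A real detector would need far larger samples and a more careful scoring rule; the point here is only that byte patterns differ enough between encodings to tell them apart.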
If Splunk doesn't recognize your character set
If you are using an encoding that Splunk does not support, add a sample file named in the form
$SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt
to train Splunk to recognize the new character set.
After you restart Splunk, it will recognize files in that language and encoding and automatically convert them to UTF-8.
Sample file format
$SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt
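The naming pattern can be checked before you drop a file into the directory. The Python sketch below builds and parses names of the form _<language>-<encoding>.txt. The helper names are hypothetical, and the sketch assumes the language part contains no hyphen, so that the first hyphen separates language from encoding.

```python
from pathlib import PurePosixPath


def model_filename(language: str, encoding: str) -> str:
    """Build a sample file name in the _<language>-<encoding>.txt form."""
    return f"_{language}-{encoding}.txt"


def parse_model_filename(path: str) -> tuple:
    """Split a sample file path into (language, encoding)."""
    name = PurePosixPath(path).name
    if not (name.startswith("_") and name.endswith(".txt")):
        raise ValueError(f"not a sample file name: {name}")
    # Assumption: the first hyphen separates language from encoding.
    language, sep, encoding = name[1:-len(".txt")].partition("-")
    if not (sep and language and encoding):
        raise ValueError(f"missing language or encoding: {name}")
    return language, encoding


parsed = parse_model_filename("$SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt")
print(parsed)
```

Because encodings such as ISO-12345 contain hyphens themselves, splitting on the first hyphen rather than the last keeps the encoding intact.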
Supported character sets
arabic in CP1256
arabic in ISO-8859-6
armenian in ARMSCII-8
belarus in CP1251
bulgarian in ISO-8859-5
czech in ISO-8859-2
georgian in Georgian-Academy
greek in ISO-8859-7
hebrew in ISO-8859-8
japanese in EUC-JP
japanese in SHIFT-JIS
korean in EUC-KR
russian in CP1251
russian in ISO-8859-5
russian in KOI8-R
slovak in CP1250
slovenian in ISO-8859-2
thai in TIS-620
ukrainian in KOI8-U
vietnamese in VISCII
This documentation applies to the following versions of Splunk: 3.1.2, 3.1.3, 3.1.4