Character set

You can use Splunk's built-in character set specification to support internationalization. When a character set is specified for a source, Splunk assumes that input from that source is in the specified encoding. On most Unix systems, you can list the valid encodings with the command "iconv -l". If you specify an invalid encoding, Splunk logs a warning during initial configuration and discards further input from that source. If the encoding is valid but some characters in the input are not valid in that encoding, Splunk escapes those characters as hex (e.g., "\xF3").
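The two failure modes described above (an unknown encoding name versus bytes that are invalid in a known encoding) can be illustrated with a minimal Python sketch. This is not Splunk's implementation: the function name `to_utf8` and the error-handler name `hex_escape_upper` are hypothetical, and Python's codec registry stands in for the list that "iconv -l" would report.

```python
import codecs


def _hex_escape(err):
    # Replace each undecodable byte with an escaped hex form such as \xF3.
    bad = err.object[err.start:err.end]
    return "".join("\\x%02X" % b for b in bad), err.end


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("hex_escape_upper", _hex_escape)


def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode raw bytes from charset and re-encode them as UTF-8.

    Raises LookupError if the charset name is unknown (the "invalid
    encoding" case). Bytes that are not valid in the charset are kept
    as hex escapes instead of being dropped.
    """
    codecs.lookup(charset)  # unknown names raise LookupError here
    text = raw.decode(charset, errors="hex_escape_upper")
    return text.encode("utf-8")


# Greek alpha and beta in ISO-8859-7 become their two-byte UTF-8 forms.
greek_utf8 = to_utf8(b"\xe1\xe2", "iso-8859-7")

# A stray 0xF3 byte is not valid UTF-8 on its own, so it is escaped.
escaped = to_utf8(b"caf\xf3", "utf-8")
```

Raising on an unknown name while escaping bad bytes mirrors the distinction the paragraph draws: the first is a configuration error, the second is a per-character data problem.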


Character set specification

By default, Splunk sets the character set to UTF-8 in $SPLUNK_HOME/etc/bundles/default/props.conf. If your files are neither UTF-8 nor ASCII, Splunk needs to convert them to UTF-8.


Manual Specification

You can manually specify the CHARSET for a source in $SPLUNK_HOME/etc/bundles/local/props.conf:


[source::$SOURCE]
CHARSET=ISO-8859-7
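If you maintain many sources, stanzas of the form shown above can be generated rather than typed by hand. The following is a minimal sketch only: the function name `render_charset_stanzas` and the source paths are hypothetical, and the output still has to be placed in your local props.conf yourself.

```python
def render_charset_stanzas(charsets_by_source: dict) -> str:
    """Render one [source::...] stanza with a CHARSET key per source.

    charsets_by_source maps a source pattern to an encoding name,
    for example {"/var/log/greek.log": "ISO-8859-7"}.
    """
    stanzas = []
    for source, charset in charsets_by_source.items():
        # Reject empty names or names containing whitespace early.
        if not charset or any(ch.isspace() for ch in charset):
            raise ValueError(f"suspicious charset name: {charset!r}")
        stanzas.append(f"[source::{source}]\nCHARSET={charset}\n")
    return "\n".join(stanzas)


# Hypothetical sources; substitute your own paths and encodings.
snippet = render_charset_stanzas({
    "/var/log/greek.log": "ISO-8859-7",
    "/var/log/russian.log": "KOI8-R",
})
```

The sketch only checks that the encoding name looks plausible; it does not confirm that the name is one Splunk accepts.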

Automatic Specification

Splunk includes a language and character set detection algorithm. Currently it detects 71 languages, 20 of them in non-UTF-8 encodings. Set CHARSET to AUTO in props.conf to have Splunk automatically determine the language and character encoding of a source.


For example:


If Splunk sees a Greek document in the ISO-8859-7 character set, it will detect that and automatically convert it to UTF-8.


In props.conf set the CHARSET key to AUTO:


[my-foreign-docs]
CHARSET=AUTO

If Splunk doesn't recognize your character set

If you are using an encoding that Splunk does not support, add a sample file in the form


$SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt

to train Splunk to recognize a new character set.


When you restart, Splunk will recognize files of the above form and automatically convert them to UTF-8 format.


Sample file format

$SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt
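The directory name suggests that the sample files feed n-gram models. Splunk's actual detection algorithm is not described here, so the following is only a rough sketch of how n-gram matching against labeled samples can work in general. All function names are hypothetical, the in-memory samples stand in for sample files, and the cosine similarity score is an assumption chosen for simplicity.

```python
import math
from collections import Counter
from pathlib import Path


def label_from_filename(path: str):
    """Split a name like _vulcan-ISO-12345.txt into (language, encoding)."""
    stem = Path(path).stem.lstrip("_")
    language, _, encoding = stem.partition("-")
    return language, encoding


def byte_ngrams(raw: bytes, n: int = 3) -> Counter:
    """Count overlapping byte n-grams in raw input."""
    return Counter(raw[i:i + n] for i in range(len(raw) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def guess_label(raw: bytes, models: dict) -> str:
    """Return the label of the model most similar to the raw input."""
    profile = byte_ngrams(raw)
    return max(models, key=lambda label: cosine(profile, models[label]))


# Tiny in-memory samples standing in for the contents of sample files.
models = {
    "greek-ISO-8859-7": byte_ngrams(
        "καλημέρα κόσμε, αυτό είναι ένα δείγμα κειμένου".encode("iso-8859-7")),
    "russian-KOI8-R": byte_ngrams(
        "привет мир, это пример текста".encode("koi8-r")),
}

unknown = "αυτό είναι ένα κείμενο".encode("iso-8859-7")
best = guess_label(unknown, models)
```

Working on raw bytes rather than decoded text is deliberate: the encoding is what is being guessed, so the input cannot be decoded first. Real samples would need to be far larger than these one-line strings to give reliable profiles.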

Supported character sets

arabic in CP1256


arabic in ISO-8859-6


armenian in ARMSCII-8


belarus in CP1251


bulgarian in ISO-8859-5


czech in ISO-8859-2


georgian in Georgian-Academy


greek in ISO-8859-7


hebrew in ISO-8859-8


japanese in EUC-JP


japanese in SHIFT-JIS


korean in EUC-KR


russian in CP1251


russian in ISO-8859-5


russian in KOI8-R


slovak in CP1250


slovenian in ISO-8859-2


thai in TIS-620


ukrainian in KOI8-U


vietnamese in VISCII

This documentation applies to the following versions of Splunk: 3.1.2, 3.1.3, 3.1.4.

