Admin Manual

 



Character set

Splunk has a built-in character set specification to support the display of international characters. When you set a character set for a source, Splunk assumes that input from that source is in the specified encoding. On most Unix systems, you can list valid encodings with the command "iconv -l". If you specify an invalid encoding, Splunk logs a warning during initial configuration and discards further input from that source. If the encoding is valid but some characters in the input are not valid in that encoding, Splunk escapes those characters as hex (e.g., "\xF3"). Splunk displays these escaped characters properly, but you cannot search for them.
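The behavior described above (discard input when the encoding name is unknown, escape bytes that are invalid in a known encoding) can be illustrated with a minimal Python sketch. This is not Splunk's implementation: the function name decode_input is hypothetical, and the sketch relies on Python's own codec registry rather than iconv. Python also prints the escapes in lowercase ("\xe9"), where this topic shows uppercase.

```python
import codecs
from typing import Optional


def decode_input(raw: bytes, charset: str) -> Optional[str]:
    """Decode raw bytes with the named charset.

    Returns None when the charset name is unknown (the input is discarded).
    Bytes that are not valid in the charset are escaped as hex.
    """
    try:
        codecs.lookup(charset)  # raises LookupError for an unknown name
    except LookupError:
        return None
    # "backslashreplace" turns each undecodable byte into a \xNN escape
    return raw.decode(charset, errors="backslashreplace")


if __name__ == "__main__":
    print(decode_input(b"caf\xe9", "iso-8859-1"))       # valid byte for this charset
    print(decode_input(b"caf\xe9", "utf-8"))            # invalid byte is escaped
    print(decode_input(b"caf\xe9", "no-such-charset"))  # unknown charset
```

The escaped form keeps the event readable, but the original character is lost, which is consistent with the note that escaped characters cannot be searched.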


Character set specification

By default, Splunk sets the character set to UTF-8 in $SPLUNK_HOME/etc/bundles/default/props.conf. If your sources are in an encoding other than UTF-8 or ASCII, Splunk needs to convert them to UTF-8.


Manual Specification

You can manually specify the CHARSET for a source in $SPLUNK_HOME/etc/bundles/local/props.conf:


[source::$SOURCE]
CHARSET=ISO-8859-7
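To make the effect of a stanza like this concrete, the following Python sketch performs the same kind of conversion by hand: it interprets bytes as ISO-8859-7 and re-encodes them as UTF-8. The helper name to_utf8 is hypothetical and the sketch only illustrates the conversion itself, not how Splunk applies props.conf.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Interpret raw bytes in the given charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")


if __name__ == "__main__":
    # Three Greek letters (alpha, beta, gamma) as single bytes in ISO-8859-7
    greek_bytes = "αβγ".encode("iso-8859-7")
    converted = to_utf8(greek_bytes, "iso-8859-7")
    # Each letter needs two bytes in UTF-8, so the output is twice as long
    print(len(greek_bytes), len(converted))
    print(converted.decode("utf-8"))
```

If the wrong charset is named, the bytes still decode in many cases but produce the wrong characters, which is why the CHARSET value has to match the actual encoding of the source.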

Automatic Specification

Splunk has a sophisticated algorithm for detecting language and character set encoding. It currently detects 71 languages, including 20 that are not UTF-8. Set CHARSET to AUTO in props.conf to have Splunk automatically determine the language and character encoding of a source.


For example:


If Splunk sees a Greek document in the ISO-8859-7 character set, it detects that and automatically converts it to UTF-8.


In props.conf set the CHARSET key to AUTO:


[my-foreign-docs]
CHARSET=AUTO

If Splunk doesn't recognize your character set

If you are using an encoding that Splunk does not support, add a sample file of the form


$SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt

to train Splunk to recognize the new character set.


After you restart Splunk, it recognizes sources in that language and encoding and automatically converts them to UTF-8.
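The directory name ngram-models suggests that detection is based on n-gram statistics gathered from sample text. This topic does not describe the algorithm, so the following Python sketch is only an assumption about how such training and detection could work in principle: it counts byte bigrams in each sample and picks the model that best matches the input. The function names, the scoring rule, and the sample sentences are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every run of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def train(samples: Dict[str, bytes]) -> Dict[str, Counter]:
    """Build one n-gram model per '<language>-<encoding>' label."""
    return {label: byte_ngrams(sample) for label, sample in samples.items()}


def detect(data: bytes, models: Dict[str, Counter]) -> Optional[str]:
    """Return the label whose model best matches data, or None if nothing matches."""
    grams = byte_ngrams(data)

    def score(model: Counter) -> float:
        total = sum(model.values()) or 1
        # Weight each input n-gram by its relative frequency in the model;
        # a Counter returns 0 for n-grams it has never seen.
        return sum(count * model[g] / total for g, count in grams.items())

    best = max(models, key=lambda label: score(models[label]), default=None)
    if best is None or score(models[best]) == 0:
        return None
    return best


if __name__ == "__main__":
    # Hypothetical training samples, one per language and encoding
    models = train({
        "greek-ISO-8859-7": "καλημέρα κόσμε, αυτό είναι ένα δείγμα κειμένου".encode("iso-8859-7"),
        "russian-KOI8-R": "привет мир, это пример текста на русском языке".encode("koi8-r"),
    })
    print(detect("ένα κείμενο".encode("iso-8859-7"), models))
    print(detect("русский текст".encode("koi8-r"), models))
```

A longer, more representative sample gives a model more n-grams to match against, so under this assumption a larger sample file would be expected to detect more reliably than a single sentence.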


Sample file format

$SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt
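The naming pattern above can be built and checked with a small Python sketch. The helper names are hypothetical, "/opt/splunk" is only a placeholder for your SPLUNK_HOME, and the parser assumes that the language part contains no hyphen (everything after the first hyphen is treated as the encoding).

```python
from pathlib import PurePosixPath
from typing import Tuple


def model_sample_path(splunk_home: str, language: str, encoding: str) -> PurePosixPath:
    """Build the sample file path: etc/ngram-models/_<language>-<encoding>.txt."""
    return PurePosixPath(splunk_home) / "etc" / "ngram-models" / f"_{language}-{encoding}.txt"


def parse_model_name(filename: str) -> Tuple[str, str]:
    """Split a sample file name into (language, encoding)."""
    stem = PurePosixPath(filename).stem  # drops the directory and ".txt"
    if not stem.startswith("_") or "-" not in stem:
        raise ValueError(f"not of the form _<language>-<encoding>.txt: {filename}")
    # Assumption: the language has no hyphen, so split on the first one only
    language, encoding = stem[1:].split("-", 1)
    return language, encoding


if __name__ == "__main__":
    print(model_sample_path("/opt/splunk", "vulcan", "ISO-12345"))
    print(parse_model_name("_vulcan-ISO-12345.txt"))
```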

Supported character sets

Language      Character set
Arabic        CP1256
Arabic        ISO-8859-6
Armenian      ARMSCII-8
Belarusian    CP1251
Bulgarian     ISO-8859-5
Czech         ISO-8859-2
Georgian      Georgian-Academy
Greek         ISO-8859-7
Hebrew        ISO-8859-8
Japanese      EUC-JP
Japanese      SHIFT-JIS
Korean        EUC-KR
Russian       CP1251
Russian       ISO-8859-5
Russian       KOI8-R
Slovak        CP1250
Slovenian     ISO-8859-2
Thai          TIS-620
Ukrainian     KOI8-U
Vietnamese    VISCII

This documentation applies to the following versions of Splunk: 3.2, 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.2.5, 3.2.6.

