Splunk® Enterprise

Getting Data In

Splunk version 4.x reached its End of Life on October 1, 2013. Please see the migration information.
This documentation does not apply to the most recent version of Splunk.

Configure character set encoding

Splunk lets you configure character set encoding for your data sources. It has built-in character set specifications to support internationalization of your Splunk deployment, and it supports 71 languages (including 20 that aren't UTF-8 encoded). On most *nix systems, you can retrieve a list of Splunk's valid character encoding specifications by running the iconv -l command.

By default, Splunk attempts to apply UTF-8 encoding to your sources. If a source doesn't use UTF-8 encoding or is a non-ASCII file, Splunk tries to convert the data from that source to UTF-8, unless you specify a character set by setting the CHARSET key in props.conf.
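
As a rough illustration of what converting a source to UTF-8 involves, the following Python sketch decodes bytes with a known source character set and re-encodes them as UTF-8. It is generic Python, not Splunk code; the function name to_utf8 and the sample word are assumptions made for this example.

```python
# Illustrative only: generic charset conversion in Python, not Splunk internals.

def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode raw bytes with the named charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")

# Hypothetical sample data: a Greek word stored as ISO-8859-7 bytes.
greek_text = "ΑΘΗΝΑ"
raw = greek_text.encode("iso-8859-7")   # one byte per character in ISO-8859-7
converted = to_utf8(raw, "iso-8859-7")  # two bytes per Greek letter in UTF-8

print(len(raw), len(converted))
print(converted.decode("utf-8") == greek_text)
```

The text is unchanged by the conversion; only its byte representation differs.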

Supported character sets

Splunk supports an extremely wide range of character sets, including such key ones as UTF-8, UTF-16LE, Latin-1, BIG5, and SHIFT-JIS. See "Comprehensive list of supported character sets" at the end of this topic for the exhaustive list.

Here's a short list of some of the main character sets that Splunk supports, along with the languages they correspond to.

Language      Code
Arabic        CP1256
Arabic        ISO-8859-6
Armenian      ARMSCII-8
Belarusian    CP1251
Bulgarian     ISO-8859-5
Czech         ISO-8859-2
Georgian      Georgian-Academy
Greek         ISO-8859-7
Hebrew        ISO-8859-8
Japanese      EUC-JP
Japanese      SHIFT-JIS
Korean        EUC-KR
Russian       CP1251
Russian       ISO-8859-5
Russian       KOI8-R
Slovak        CP1250
Slovenian     ISO-8859-2
Thai          TIS-620
Ukrainian     KOI8-U
Vietnamese    VISCII
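
Several languages in this list appear with more than one character set, so knowing the language alone does not tell you which CHARSET value to use. The Python sketch below (plain Python, not Splunk code, with a sample word chosen for this example) encodes the same Russian word in the three Russian character sets listed above and shows that reading bytes with the wrong one yields different text.

```python
# Illustrative only: the same text has different bytes in each charset.
text = "привет"  # hypothetical sample word

# Encode the word in the three Russian charsets from the table.
encoded = {cs: text.encode(cs) for cs in ("cp1251", "iso-8859-5", "koi8-r")}

for cs, data in encoded.items():
    print(cs, data.hex())

# Reading KOI8-R bytes as CP1251 raises no error; it silently
# produces the wrong characters.
misread = encoded["koi8-r"].decode("cp1251")
print(misread == text)
```

Because a mismatch like this does not raise an error, it is worth confirming the actual encoding of a source before choosing a value.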

Manually specify a character set

To manually specify a character set to apply to an input, set the CHARSET key in props.conf:

[spec]
CHARSET=<string>

For example, if you have a host that is generating data in Greek (called "GreekSource" in this example) and that uses ISO-8859-7 encoding, set CHARSET=ISO-8859-7 for that host in props.conf:

[host::GreekSource]
CHARSET=ISO-8859-7

Note: Splunk will only parse character encodings that have UTF-8 mappings. Some EUC-JP characters do not have a mapped UTF-8 encoding.
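
Before setting CHARSET for an input, it can help to check that a sample of the data actually decodes under the character set you intend to use. The following Python sketch is one way to do that outside of Splunk. The helper name can_decode and the sample bytes are assumptions for this example, and Python's codec tables are not Splunk's, so a passing result is a hint rather than a guarantee.

```python
# Illustrative pre-flight check in plain Python, not a Splunk API.

def can_decode(raw: bytes, charset: str) -> bool:
    """Return True if every byte sequence in raw maps to a character.

    Returns False if decoding fails or if Python does not know the charset.
    """
    try:
        raw.decode(charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return False
    return True

# Hypothetical sample: Greek text as it would arrive from an
# ISO-8859-7 source such as the "GreekSource" host above.
sample = "ΑΘΗΝΑ".encode("iso-8859-7")

print(can_decode(sample, "iso-8859-7"))  # decodes cleanly
print(can_decode(sample, "utf-8"))       # these bytes are not valid UTF-8
```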

Automatically specify a character set

Splunk can automatically detect the language and character set of an input using its character set detection algorithm.

Configure Splunk to automatically detect the proper language and character set encoding for a particular input by setting CHARSET=AUTO for the input in props.conf. For example, if you want Splunk to automatically detect character set encoding for the host "my-foreign-docs", set CHARSET=AUTO for that host in props.conf:

[host::my-foreign-docs]
CHARSET=AUTO

If Splunk doesn't recognize a character set

If you want to use an encoding that Splunk doesn't recognize, train Splunk to recognize the character set by adding a sample file at the following path:

$SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt

Once you add the character set specification file, you must restart Splunk. After you restart, Splunk can recognize sources that use the new character set, and will automatically convert them to UTF-8 format at index time.

For example, if you want to use the "vulcan-ISO-12345" character set, copy the specification file to the following path:

$SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt
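
The directory name suggests that recognition is based on n-gram models built from sample text. This topic does not describe Splunk's actual algorithm or the format of its model files, so the Python sketch below is only a toy illustration of the general idea: build a byte-bigram profile from a training sample in each character set, then pick the profile that best matches unknown data. All names and sample strings are assumptions for this example.

```python
# Toy n-gram charset guesser. Illustrative only: NOT Splunk's algorithm.
from collections import Counter

def bigram_profile(sample: bytes) -> Counter:
    """Count the overlapping byte bigrams in a training sample."""
    return Counter(sample[i:i + 2] for i in range(len(sample) - 1))

def score(data: bytes, profile: Counter) -> float:
    """Fraction of the bigrams in data that also occur in the profile."""
    grams = [data[i:i + 2] for i in range(len(data) - 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)

def guess_charset(data: bytes, profiles: dict) -> str:
    """Return the name of the profile with the highest score."""
    return max(profiles, key=lambda name: score(data, profiles[name]))

# Hypothetical training text, encoded once per candidate charset.
training_text = "привет мир это пример текста для обучения модели"
profiles = {
    cs: bigram_profile(training_text.encode(cs))
    for cs in ("cp1251", "koi8-r")
}

unknown = "пример текста".encode("koi8-r")
print(guess_charset(unknown, profiles))
```

A real detector needs far larger training samples and a more careful scoring scheme; this sketch only shows why a representative sample file is what makes a new character set recognizable.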

Comprehensive list of supported character sets

The common character sets described earlier are just the tip of Splunk's CHARSET iceberg. Splunk actually supports a long list of character sets and aliases, identical to the list supported by the *nix iconv utility. Here's the full list, with aliases indicated in parentheses:

  • utf-8 (aka, CESU-8, ANSI_X3.4-1968, ANSI_X3.4-1986, ASCII, CP367, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII, CSASCII)
  • utf-16le (aka, UCS-2LE, UNICODELITTLE)
  • utf-16be (aka, ISO-10646-UCS-2, UCS-2, CSUNICODE, UCS-2BE, UNICODE-1-1, UNICODEBIG, CSUNICODE11, UTF-16)
  • utf-32le (aka, UCS-4LE)
  • utf-32be (aka, ISO-10646-UCS-4, UCS-4, CSUCS4, UCS-4BE, UTF-32)
  • utf-7 (aka, UNICODE-1-1-UTF-7, CSUNICODE11UTF7)
  • c99 (aka, java)
  • utf-ebcdic
  • latin-1 (aka, CP819, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1:1987, L1, CSISOLATIN1)
  • latin-2 (aka, ISO-8859-2, ISO-IR-101, ISO_8859-2:1987, L2, CSISOLATIN2)
  • latin-3 (aka, ISO-8859-3, ISO-IR-109, ISO_8859-3:1988, L3, CSISOLATIN3)
  • latin-4 (aka, ISO-8859-4, ISO-IR-110, ISO_8859-4:1988, L4, CSISOLATIN4)
  • latin-5 (aka, ISO-8859-9, ISO-IR-148, ISO_8859-9:1989, L5, CSISOLATIN5)
  • latin-6 (aka, ISO-8859-10, ISO-IR-157, ISO_8859-10:1992, L6, CSISOLATIN6)
  • latin-7 (aka, ISO-8859-13, ISO-IR-179, L7)
  • latin-8 (aka, ISO-8859-14, ISO-CELTIC, ISO-IR-199, ISO_8859-14:1998, L8)
  • latin-9 (aka, ISO-8859-15, ISO-IR-203, ISO_8859-15:1998)
  • latin-10 (aka, ISO-8859-16, ISO-IR-226, ISO_8859-16:2001, L10, LATIN10)
  • ISO-8859-5 (aka, CYRILLIC, ISO-IR-144, ISO_8859-5:1988, CSISOLATINCYRILLIC)
  • ISO-8859-6 (aka, ARABIC, ASMO-708, ECMA-114, ISO-IR-127, ISO_8859-6:1987, CSISOLATINARABIC, MACARABIC)
  • ISO-8859-7 (aka, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-IR-126, ISO_8859-7:1987, ISO_8859-7:2003, CSISOLATINGREEK)
  • ISO-8859-8 (aka, HEBREW, ISO-8859-8, ISO-IR-138, ISO8859-8, ISO_8859-8:1988, CSISOLATINHEBREW)
  • ISO-8859-11
  • roman-8 (aka, HP-ROMAN8, R8, CSHPROMAN8)
  • KOI8-R (aka, CSKOI8R)
  • KOI8-U
  • KOI8-T
  • GEORGIAN-ACADEMY
  • GEORGIAN-PS
  • ARMSCII-8
  • MACINTOSH (aka, MAC, MACROMAN, CSMACINTOSH) [Note: these MAC* charsets are for Mac OS 9; OS X uses Unicode]
  • MACGREEK
  • MACCYRILLIC
  • MACUKRAINE
  • MACCENTRALEUROPE
  • MACTURKISH
  • MACCROATIAN
  • MACICELAND
  • MACROMANIA
  • MACHEBREW
  • MACTHAI
  • NEXTSTEP
  • CP850 (aka, 850, IBM850, CSPC850MULTILINGUAL)
  • CP862 (aka, 862, IBM862, CSPC862LATINHEBREW)
  • CP866 (aka, 866, IBM866, CSIBM866)
  • CP874 (aka, WINDOWS-874)
  • CP932
  • CP936 (aka, MS936, WINDOWS-936)
  • CP949 (aka, UHC)
  • CP950
  • CP1250 (aka, MS-EE, WINDOWS-1250)
  • CP1251 (aka, MS-CYRL, WINDOWS-1251)
  • CP1252 (aka, MS-ANSI, WINDOWS-1252)
  • CP1253 (aka, MS-GREEK, WINDOWS-1253)
  • CP1254 (aka, MS-TURK, WINDOWS-1254)
  • CP1255 (aka, MS-HEBR, WINDOWS-1255)
  • CP1256 (aka, MS-ARAB, WINDOWS-1256)
  • CP1257 (aka, WINBALTRIM, WINDOWS-1257)
  • CP1258 (aka, WINDOWS-1258)
  • CP1361 (aka, JOHAB)
  • BIG-5 (aka, BIG-FIVE, CN-BIG5, CSBIG5)
  • BIG5-HKSCS (aka, BIG5-HKSCS:2001)
  • CN-GB (aka, EUC-CN, EUCCN, GB2312, CSGB2312)
  • EUC-JP (aka, EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE, CSEUCPKDFMTJAPANESE)
  • EUC-KR (aka, CSEUCKR)
  • EUC-TW (aka, CSEUCTW)
  • GB18030
  • GBK
  • GB_1988-80 (aka, ISO-IR-57, ISO646-CN, CSISO57GB1988, CN)
  • HZ (aka, HZ-GB-2312)
  • GB_2312-80 (aka, CHINESE, ISO-IR-58, CSISO58GB231280)
  • SHIFT-JIS (aka, MS_KANJI, SJIS, CSSHIFTJIS)
  • ISO-IR-87 (aka, JIS0208, JIS_C6226-1983, JIS_X0208, JIS_X0208-1983, JIS_X0208-1990, X0208, CSISO87JISX0208, ISO-IR-159, JIS_X0212, JIS_X0212-1990, JIS_X0212.1990-0, X0212, CSISO159JISX02121990)
  • ISO-IR-14 (aka, ISO646-JP, JIS_C6220-1969-RO, JP, CSISO14JISC6220RO)
  • JISX0201-1976 (aka, JIS_X0201, X0201, CSHALFWIDTHKATAKANA)
  • ISO-IR-149 (aka, KOREAN, KSC_5601, KS_C_5601-1987, KS_C_5601-1989, CSKSC56011987)
  • VISCII (aka, VISCII1.1-1, CSVISCII)
  • ISO-IR-166 (aka, TIS-620, TIS620-0, TIS620.2529-1, TIS620.2533-0, TIS620.2533-1)

Note: Splunk ignores punctuation and case when matching CHARSET, so, for example, "utf-8", "UTF-8", and "utf8" are all considered identical.
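
The matching rule in this note can be pictured as normalizing each name before comparing it. The Python sketch below is one possible reading of that rule, not Splunk's implementation: it lowercases the name and drops every character that is not a letter or digit. The function name normalize_charset is an assumption for this example.

```python
# Illustrative only: one way to compare charset names while
# ignoring case and punctuation. Not Splunk's implementation.
import re

def normalize_charset(name: str) -> str:
    """Lowercase the name and remove all non-alphanumeric characters."""
    return re.sub(r"[^0-9a-z]", "", name.lower())

# The three spellings from the note collapse to the same key.
names = ["utf-8", "UTF-8", "utf8"]
print({normalize_charset(n) for n in names})

# The same holds for the two spellings of Big5 used in this topic.
print(normalize_charset("BIG5") == normalize_charset("BIG-5"))
```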

This documentation applies to the following versions of Splunk® Enterprise: 4.3, 4.3.1, 4.3.2, 4.3.3, 4.3.4, 4.3.5, 4.3.6, 4.3.7, 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6, 5.0.7, 5.0.8, 5.0.9, 5.0.10, 5.0.11, 5.0.12, 5.0.13, 5.0.14, 5.0.15, 5.0.16, 5.0.17, 5.0.18