Configure character set encoding

You can configure character set encoding for your data sources. Splunk software has built-in character set specifications to support internationalization of your deployment. Splunk software supports many languages, including some that don't use Universal Coded Character Set Transformation Format - 8-bit (UTF-8) encoding.

Splunk software attempts to apply UTF-8 encoding to your sources by default. If a source doesn't use UTF-8 encoding or is a non-ASCII file, Splunk software tries to convert data from the source to UTF-8 encoding unless you specify a character set to use by setting the CHARSET key in the props.conf file.
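As a rough illustration of what converting a source to UTF-8 involves, the following Python sketch decodes bytes from a known source encoding and re-encodes them as UTF-8. It is not Splunk code. The helper name `to_utf8` and the sample text are assumptions made for this example only.

```python
# Illustration only, not Splunk code: shows what "convert to UTF-8" means
# for data written in a different character set.

def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode raw bytes using source_charset, then re-encode them as UTF-8."""
    text = raw.decode(source_charset)
    return text.encode("utf-8")

# Three Greek letters (alpha, beta, gamma) stored as ISO-8859-7 bytes.
greek_iso = "αβγ".encode("iso-8859-7")
greek_utf8 = to_utf8(greek_iso, "iso-8859-7")

print(len(greek_iso))   # 3: one byte per letter in ISO-8859-7
print(len(greek_utf8))  # 6: two bytes per letter in UTF-8
```

The same bytes read as UTF-8 without conversion would be invalid, which is why the source character set has to be known or detected first.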

You can retrieve a list of the valid character encoding specifications by using the iconv -l command on most *nix systems. A port for iconv on Windows is available.

Supported character sets

Splunk software supports a wide range of character sets, including the following key character sets:

UTF-8
UTF-16LE
Latin-1
BIG5
SHIFT-JIS

The following table shows a short list of common supported character sets and the languages they correspond to.

Language	Code
Arabic	CP1256
Arabic	ISO-8859-6
Armenian	ARMSCII-8
Belarusian	CP1251
Bulgarian	ISO-8859-5
Czech	ISO-8859-2
Georgian	Georgian-Academy
Greek	ISO-8859-7
Hebrew	ISO-8859-8
Japanese	EUC-JP
Japanese	SHIFT-JIS
Korean	EUC-KR
Russian	CP1251
Russian	ISO-8859-5
Russian	KOI8-R
Slovak	CP1250
Slovenian	ISO-8859-2
Thai	TIS-620
Ukrainian	KOI8-U
Vietnamese	VISCII

For more supported character sets, see the Comprehensive list of supported character sets section later in this topic.

Manually specify a character set

To manually specify a character set, edit the props.conf file. If you have a Splunk Cloud Platform deployment, edit the props.conf file on your forwarder. If you have a Splunk Enterprise deployment, edit the file in your Splunk Enterprise deployment.

To apply a character set to an input, set the CHARSET key in the stanza for that input in the props.conf file:

[spec]
CHARSET=<string>

For example, if you have a host that generates data in Greek and uses ISO-8859-7 encoding, set CHARSET=ISO-8859-7 for that host in the props.conf file. The host is called "GreekSource" in this example:

[host::GreekSource]
CHARSET=ISO-8859-7

Splunk software parses only character encodings that have UTF-8 mappings.
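Before you set CHARSET for an input, it can help to confirm that a sample of the data decodes under the character set you plan to use. The following Python sketch is one way to do that outside of Splunk software. The function name `decodes_cleanly` is hypothetical, and Python's codec names are assumed to overlap with, but are not guaranteed to match, the names that Splunk software accepts.

```python
# Illustration only, not Splunk code: checks whether sample bytes decode
# without error under a candidate character set.

def decodes_cleanly(sample: bytes, charset: str) -> bool:
    """Return True if sample decodes under charset with no errors."""
    try:
        sample.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in this character set.
        # LookupError: Python does not know a codec by this name.
        return False
    return True

# Greek letters alpha, beta, gamma encoded as ISO-8859-7.
sample = b"\xe1\xe2\xe3"

print(decodes_cleanly(sample, "iso-8859-7"))  # True
print(decodes_cleanly(sample, "utf-8"))       # False
```

A clean decode does not prove that the character set is the right one, because many single-byte character sets accept almost any byte sequence. Treat the result as a quick sanity check and inspect the decoded text as well.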

Automatically specify a character set

Splunk software can automatically detect languages and proper character sets using its character set encoding algorithm.

To configure Splunk software to automatically detect the proper language and character set encoding for a particular input, set CHARSET=AUTO for the input in the props.conf file. If you have a Splunk Cloud Platform deployment, you can edit this file on your forwarder. If you have a Splunk Enterprise deployment, you can edit this file in your Splunk Enterprise deployment.

For example, to automatically detect character set encoding for the host "my-foreign-docs", set CHARSET=AUTO for that host in the props.conf file:

[host::my-foreign-docs]
CHARSET=AUTO

Train Splunk Enterprise to recognize a character set

If you have Splunk Cloud Platform and want to add a character set encoding to your Splunk deployment, file a Splunk Support ticket. If you have a Splunk Enterprise deployment, you can train Splunk software to recognize the character set.

You can train Splunk Enterprise to recognize the character set by adding a sample file to the following path and restarting Splunk Enterprise:

$SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt

For example, if you want to use the "vulcan-ISO-12345" character set, copy the sample file to the following path:

$SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt

After the sample file is added to the specified path, Splunk software recognizes sources that use the new character set and automatically converts them to UTF-8 format at index time.
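The naming pattern for the sample file can be sketched in Python as follows. The helper name `ngram_model_path` and the `/opt/splunk` installation directory are assumptions for this example. Substitute the value of $SPLUNK_HOME on your system.

```python
# Illustration only: builds the sample file path from the pattern
# $SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt
from pathlib import PurePosixPath

def ngram_model_path(splunk_home: str, language: str, encoding: str) -> PurePosixPath:
    """Return the path where the sample file for a character set belongs."""
    file_name = f"_{language}-{encoding}.txt"
    return PurePosixPath(splunk_home) / "etc" / "ngram-models" / file_name

# "/opt/splunk" is an assumed value for $SPLUNK_HOME.
print(ngram_model_path("/opt/splunk", "vulcan", "ISO-12345"))
# /opt/splunk/etc/ngram-models/_vulcan-ISO-12345.txt
```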

Comprehensive list of supported character sets

The common character sets described earlier in the Supported character sets section are a small subset of what the CHARSET attribute can support. Splunk software supports a long list of character sets and their aliases, identical to the list supported by the *nix iconv utility.

Splunk software ignores punctuation and case when matching CHARSET. For example, utf-8, UTF-8, and utf8 are all considered identical.
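The matching rule can be pictured with a short Python sketch. This is not Splunk's implementation. It assumes that "ignores punctuation and case" means lowercasing the name and dropping every character that is not a letter or digit, and the function name `normalize_charset_name` is hypothetical.

```python
# Illustration only, not Splunk code: one way to compare character set
# names while ignoring case and punctuation.
import re

def normalize_charset_name(name: str) -> str:
    """Lowercase the name and remove every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

names = ["utf-8", "UTF-8", "utf8"]
print({normalize_charset_name(n) for n in names})  # {'utf8'}
```

Under this assumption, all three spellings reduce to the same key, so they refer to the same character set.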

The following list shows all supported character sets with their aliases indicated in parentheses:

utf-8 (CESU-8, ANSI_X3.4-1968, ANSI_X3.4-1986, ASCII, CP367, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII, CSASCII)
utf-16le (UCS-2LE, UNICODELITTLE)
utf-16be (ISO-10646-UCS-2, UCS-2, CSUNICODE, UCS-2BE, UNICODE-1-1, UNICODEBIG, CSUNICODE11, UTF-16)
utf-32le (UCS-4LE)
utf-32be (ISO-10646-UCS-4, UCS-4, CSUCS4, UCS-4BE, UTF-32)
utf-7 (UNICODE-1-1-UTF-7, CSUNICODE11UTF7)
c99 (java)
utf-ebcdic
latin-1 (CP819, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1:1987, L1, CSISOLATIN1)
latin-2 (ISO-8859-2, ISO-IR-101, ISO_8859-2:1987, L2, CSISOLATIN2)
latin-3 (ISO-8859-3, ISO-IR-109, ISO_8859-3:1988, L3, CSISOLATIN3)
latin-4 (ISO-8859-4, ISO-IR-110, ISO_8859-4:1988, L4, CSISOLATIN4)
latin-5 (ISO-8859-9, ISO-IR-148, ISO_8859-9:1989, L5, CSISOLATIN5)
latin-6 (ISO-8859-10, ISO-IR-157, ISO_8859-10:1992, L6, CSISOLATIN6)
latin-7 (ISO-8859-13, ISO-IR-179, L7)
latin-8 (ISO-8859-14, ISO-CELTIC, ISO-IR-199, ISO_8859-14:1998, L8)
latin-9 (ISO-8859-15, ISO-IR-203, ISO_8859-15:1998)
latin-10 (ISO-8859-16, ISO-IR-226, ISO_8859-16:2001, L10, LATIN10)
ISO-8859-5 (CYRILLIC, ISO-IR-144, ISO_8859-5:1988, CSISOLATINCYRILLIC)
ISO-8859-6 (ARABIC, ASMO-708, ECMA-114, ISO-IR-127, ISO_8859-6:1987, CSISOLATINARABIC, MACARABIC)
ISO-8859-7 (ECMA-118, ELOT_928, GREEK, GREEK8, ISO-IR-126, ISO_8859-7:1987, ISO_8859-7:2003, CSISOLATINGREEK)
ISO-8859-8 (HEBREW, ISO-8859-8, ISO-IR-138, ISO8859-8, ISO_8859-8:1988, CSISOLATINHEBREW)
ISO-8859-11
roman-8 (HP-ROMAN8, R8, CSHPROMAN8)
KOI8-R (CSKOI8R)
KOI8-U
KOI8-T
GEORGIAN-ACADEMY
GEORGIAN-PS
ARMSCII-8
MACINTOSH (MAC, MACROMAN, CSMACINTOSH)
Note: The MAC* character sets in this list are for Mac OS 9. Later versions, such as Mac OS X, use Unicode.
MACGREEK
MACCYRILLIC
MACUKRAINE
MACCENTRALEUROPE
MACTURKISH
MACCROATIAN
MACICELAND
MACROMANIA
MACHEBREW
MACTHAI
NEXTSTEP
CP850 (850, IBM850, CSPC850MULTILINGUAL)
CP862 (862, IBM862, CSPC862LATINHEBREW)
CP866 (866, IBM866, CSIBM866)
CP874 (WINDOWS-874)
CP932
CP936 (MS936, WINDOWS-936)
CP949 (UHC)
CP950
CP1250 (MS-EE, WINDOWS-1250)
CP1251 (MS-CYRL, WINDOWS-1251)
CP1252 (MS-ANSI, WINDOWS-1252)
CP1253 (MS-GREEK, WINDOWS-1253)
CP1254 (MS-TURK, WINDOWS-1254)
CP1255 (MS-HEBR, WINDOWS-1255)
CP1256 (MS-ARAB, WINDOWS-1256)
CP1257 (WINBALTRIM, WINDOWS-1257)
CP1258 (WINDOWS-1258)
CP1361 (JOHAB)
BIG-5 (BIG-FIVE, CN-BIG5, CSBIG5)
BIG5-HKSCS (BIG5-HKSCS:2001)
CN-GB (EUC-CN, EUCCN, GB2312, CSGB2312)
EUC-JP (EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE, CSEUCPKDFMTJAPANESE)
EUC-KR (CSEUCKR)
EUC-TW (CSEUCTW)
GB18030
GBK
GB_1988-80 (ISO-IR-57, ISO646-CN, CSISO57GB1988, CN)
HZ (HZ-GB-2312)
GB_2312-80 (CHINESE, ISO-IR-58, CSISO58GB231280)
SHIFT-JIS (MS_KANJI, SJIS, CSSHIFTJIS)
ISO-IR-87 (JIS0208, JIS_C6226-1983, JIS_X0208, JIS_X0208-1983, JIS_X0208-1990, X0208, CSISO87JISX0208, ISO-IR-159, JIS_X0212, JIS_X0212-1990, JIS_X0212.1990-0, X0212, CSISO159JISX02121990)
ISO-IR-14 (ISO646-JP, JIS_C6220-1969-RO, JP, CSISO14JISC6220RO)
JISX0201-1976 (JIS_X0201, X0201, CSHALFWIDTHKATAKANA)
ISO-IR-149 (KOREAN, KSC_5601, KS_C_5601-1987, KS_C_5601-1989, CSKSC56011987)
VISCII (VISCII1.1-1, CSVISCII)
ISO-IR-166 (TIS-620, TIS620-0, TIS620.2529-1, TIS620.2533-0, TIS620.2533-1)
UCS-2-INTERNAL, UCS-2-SWAPPED, UCS-4-INTERNAL, UCS-4-SWAPPED
