
Configure character set encoding
You can configure character set encoding for your data sources. Splunk software has built-in character set specifications to support internationalization of your deployment. It supports many languages, including some that do not use UTF-8 (Universal Coded Character Set Transformation Format, 8-bit) encoding.
Splunk software attempts to apply UTF-8 encoding to your sources by default. If a source does not use UTF-8 encoding or is a non-ASCII file, Splunk software tries to convert data from the source to UTF-8 encoding unless you specify a character set to use by setting the CHARSET key in props.conf.
You can retrieve a list of the valid character encoding specifications by using the iconv -l command on most *nix systems. A port for iconv on Windows is available.
Supported character sets
Splunk software supports a wide range of character sets, including:
- UTF-8
- UTF-16LE
- Latin-1
- BIG5
- SHIFT-JIS
See "Comprehensive list of supported character sets" at the end of this topic for the exhaustive list.
Here is a short list of the main supported character sets and the languages they correspond to.
Language | Code |
Arabic | CP1256 |
Arabic | ISO-8859-6 |
Armenian | ARMSCII-8 |
Belarusian | CP1251 |
Bulgarian | ISO-8859-5 |
Czech | ISO-8859-2 |
Georgian | Georgian-Academy |
Greek | ISO-8859-7 |
Hebrew | ISO-8859-8 |
Japanese | EUC-JP |
Japanese | SHIFT-JIS |
Korean | EUC-KR |
Russian | CP1251 |
Russian | ISO-8859-5 |
Russian | KOI8-R |
Slovak | CP1250 |
Slovenian | ISO-8859-2 |
Thai | TIS-620 |
Ukrainian | KOI8-U |
Vietnamese | VISCII |
Manually specify a character set
To manually specify a character set to apply to an input, set the CHARSET key in props.conf:
[spec]
CHARSET=<string>
For example, if you have a host that generates data in Greek (called "GreekSource" in this example) and that uses ISO-8859-7 encoding, set CHARSET=ISO-8859-7 for that host in props.conf:
[host::GreekSource]
CHARSET=ISO-8859-7
Note: Splunk software parses only character encodings that have UTF-8 mappings. Some EUC-JP characters do not have a mapped UTF-8 encoding.
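To illustrate what converting a source to UTF-8 involves, here is a minimal Python sketch. It uses Python's own codecs rather than Splunk software's converter, and the helper name to_utf8 and the sample bytes are assumptions made for this example only.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode raw bytes with the declared charset, then re-encode as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# The Greek letters alpha, beta, gamma as stored in an ISO-8859-7 file.
greek_raw = bytes([0xE1, 0xE2, 0xE3])

# With the correct charset declared, the text survives the conversion.
converted = to_utf8(greek_raw, "iso-8859-7")

# With the wrong charset declared, the same bytes become different characters.
misread = greek_raw.decode("latin-1")
```

The same bytes produce Greek text under ISO-8859-7 and accented Latin letters under Latin-1, which is why a source that is not UTF-8 needs the correct CHARSET value.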
Automatically specify a character set
Splunk software can automatically detect the language and character set of a source using its character set detection algorithm.
To configure Splunk software to automatically detect the proper language and character set encoding for a particular input, set CHARSET=AUTO for the input in props.conf. For example, to automatically detect character set encoding for the host "my-foreign-docs", set CHARSET=AUTO for that host in props.conf:
[host::my-foreign-docs]
CHARSET=AUTO
Train Splunk software to recognize a character set
If you want to use a character set encoding that Splunk software does not recognize, train it to recognize the character set by adding a sample file to the following path and restarting Splunk Enterprise:
$SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt
For example, if you want to use the "vulcan-ISO-12345" character set, copy the sample file to the following path:
$SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt
After the sample file is added to the specified path, Splunk software recognizes sources that use the new character set, and automatically converts them to UTF-8 format at index time.
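The file naming pattern for sample files can be sketched in Python as follows. The helper ngram_model_path is hypothetical, and the /opt/splunk install location in the usage line is an assumption; substitute your own $SPLUNK_HOME value.

```python
from pathlib import PurePosixPath


def ngram_model_path(splunk_home: str, language: str, encoding: str) -> PurePosixPath:
    """Build the sample file path: <home>/etc/ngram-models/_<language>-<encoding>.txt"""
    filename = f"_{language}-{encoding}.txt"
    return PurePosixPath(splunk_home) / "etc" / "ngram-models" / filename


# Assumed install location, for illustration only.
example_path = ngram_model_path("/opt/splunk", "vulcan", "ISO-12345")
```

Note the leading underscore in the file name and the hyphen that separates the language from the encoding.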
If you have Splunk Cloud and want to add a character set encoding to your Splunk deployment, file a Support ticket.
Comprehensive list of supported character sets
The common character sets described earlier are a small subset of what the CHARSET attribute can support. Splunk software also supports a long list of character sets and aliases, identical to the list supported by the *nix iconv utility.
Note: Splunk software ignores punctuation and case when matching CHARSET, so, for example, "utf-8", "UTF-8", and "utf8" are all considered identical.
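The matching rule in this note can be approximated with a short Python sketch. This is not Splunk software's implementation. The helpers normalize_charset and same_charset are hypothetical, and the sketch assumes that "punctuation" means any character that is not a letter or a digit.

```python
def normalize_charset(name: str) -> str:
    """Lowercase the name and drop every character that is not a letter or digit."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def same_charset(a: str, b: str) -> bool:
    """Return True when two CHARSET spellings normalize to the same string."""
    return normalize_charset(a) == normalize_charset(b)
```

Under this assumption, "utf-8", "UTF-8", and "utf8" all normalize to the same string, while "UTF-8" and "UTF-16LE" do not.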
Here is the full list, with aliases indicated in parentheses:
- utf-8 (aka, CESU-8, ANSI_X3.4-1968, ANSI_X3.4-1986, ASCII, CP367, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII, CSASCII)
- utf-16le (aka, UCS-2LE, UNICODELITTLE)
- utf-16be (aka, ISO-10646-UCS-2, UCS-2, CSUNICODE, UCS-2BE, UNICODE-1-1, UNICODEBIG, CSUNICODE11, UTF-16)
- utf-32le (aka, UCS-4LE)
- utf-32be (aka, ISO-10646-UCS-4, UCS-4, CSUCS4, UCS-4BE, UTF-32)
- utf-7 (aka, UNICODE-1-1-UTF-7, CSUNICODE11UTF7)
- c99 (aka, java)
- utf-ebcdic
- latin-1 (aka, CP819, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1:1987, L1, CSISOLATIN1)
- latin-2 (aka, ISO-8859-2, ISO-IR-101, ISO_8859-2:1987, L2, CSISOLATIN2)
- latin-3 (aka, ISO-8859-3, ISO-IR-109, ISO_8859-3:1988, L3, CSISOLATIN3)
- latin-4 (aka, ISO-8859-4, ISO-IR-110, ISO_8859-4:1988, L4, CSISOLATIN4)
- latin-5 (aka, ISO-8859-9, ISO-IR-148, ISO_8859-9:1989, L5, CSISOLATIN5)
- latin-6 (aka, ISO-8859-10, ISO-IR-157, ISO_8859-10:1992, L6, CSISOLATIN6)
- latin-7 (aka, ISO-8859-13, ISO-IR-179, L7)
- latin-8 (aka, ISO-8859-14, ISO-CELTIC, ISO-IR-199, ISO_8859-14:1998, L8)
- latin-9 (aka, ISO-8859-15, ISO-IR-203, ISO_8859-15:1998)
- latin-10 (aka, ISO-8859-16, ISO-IR-226, ISO_8859-16:2001, L10, LATIN10)
- ISO-8859-5 (aka, CYRILLIC, ISO-IR-144, ISO_8859-5:1988, CSISOLATINCYRILLIC)
- ISO-8859-6 (aka, ARABIC, ASMO-708, ECMA-114, ISO-IR-127, ISO_8859-6:1987, CSISOLATINARABIC, MACARABIC)
- ISO-8859-7 (aka, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-IR-126, ISO_8859-7:1987, ISO_8859-7:2003, CSISOLATINGREEK)
- ISO-8859-8 (aka, HEBREW, ISO-8859-8, ISO-IR-138, ISO8859-8, ISO_8859-8:1988, CSISOLATINHEBREW)
- ISO-8859-11
- roman-8 (aka, HP-ROMAN8, R8, CSHPROMAN8)
- KOI8-R (aka, CSKOI8R)
- KOI8-U
- KOI8-T
- GEORGIAN-ACADEMY
- GEORGIAN-PS
- ARMSCII-8
- MACINTOSH (aka, MAC, MACROMAN, CSMACINTOSH) [Note: these MAC* character sets are for Mac OS 9; OS X uses Unicode]
- MACGREEK
- MACCYRILLIC
- MACUKRAINE
- MACCENTRALEUROPE
- MACTURKISH
- MACCROATIAN
- MACICELAND
- MACROMANIA
- MACHEBREW
- MACTHAI
- NEXTSTEP
- CP850 (aka, 850, IBM850, CSPC850MULTILINGUAL)
- CP862 (aka, 862, IBM862, CSPC862LATINHEBREW)
- CP866 (aka, 866, IBM866, CSIBM866)
- CP874 (aka, WINDOWS-874)
- CP932
- CP936 (aka, MS936, WINDOWS-936)
- CP949 (aka, UHC)
- CP950
- CP1250 (aka, MS-EE, WINDOWS-1250)
- CP1251 (aka, MS-CYRL, WINDOWS-1251)
- CP1252 (aka, MS-ANSI, WINDOWS-1252)
- CP1253 (aka, MS-GREEK, WINDOWS-1253)
- CP1254 (aka, MS-TURK, WINDOWS-1254)
- CP1255 (aka, MS-HEBR, WINDOWS-1255)
- CP1256 (aka, MS-ARAB, WINDOWS-1256)
- CP1257 (aka, WINBALTRIM, WINDOWS-1257)
- CP1258 (aka, WINDOWS-1258)
- CP1361 (aka, JOHAB)
- BIG-5 (aka, BIG-FIVE, CN-BIG5, CSBIG5)
- BIG5-HKSCS (aka, BIG5-HKSCS:2001)
- CN-GB (aka, EUC-CN, EUCCN, GB2312, CSGB2312)
- EUC-JP (aka, EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE, CSEUCPKDFMTJAPANESE)
- EUC-KR (aka, CSEUCKR)
- EUC-TW (aka, CSEUCTW)
- GB18030
- GBK
- GB_1988-80 (aka, ISO-IR-57, ISO646-CN, CSISO57GB1988, CN)
- HZ (aka, HZ-GB-2312)
- GB_2312-80 (aka, CHINESE, ISO-IR-58, CSISO58GB231280)
- SHIFT-JIS (aka, MS_KANJI, SJIS, CSSHIFTJIS)
- ISO-IR-87 (aka, JIS0208, JIS_C6226-1983, JIS_X0208, JIS_X0208-1983, JIS_X0208-1990, X0208, CSISO87JISX0208, ISO-IR-159, JIS_X0212, JIS_X0212-1990, JIS_X0212.1990-0, X0212, CSISO159JISX02121990)
- ISO-IR-14 (aka, ISO646-JP, JIS_C6220-1969-RO, JP, CSISO14JISC6220RO)
- JISX0201-1976 (aka, JIS_X0201, X0201, CSHALFWIDTHKATAKANA)
- ISO-IR-149 (aka, KOREAN, KSC_5601, KS_C_5601-1987, KS_C_5601-1989, CSKSC56011987)
- VISCII (aka, VISCII1.1-1, CSVISCII)
- ISO-IR-166 (aka, TIS-620, TIS620-0, TIS620.2529-1, TIS620.2533-0, TIS620.2533-1)
This documentation applies to the following versions of Splunk® Enterprise: 6.0, 6.0.1, 6.0.2, 6.0.3, 6.0.4, 6.0.5, 6.0.6, 6.0.7, 6.0.8, 6.0.9, 6.0.10, 6.0.11, 6.0.12, 6.0.13, 6.0.14, 6.0.15, 6.1, 6.1.1, 6.1.2, 6.1.3, 6.1.4, 6.1.5, 6.1.6, 6.1.7, 6.1.8, 6.1.9, 6.1.10, 6.1.11, 6.1.12, 6.1.13, 6.1.14, 6.2.0, 6.2.1, 6.2.2, 6.2.3, 6.2.4, 6.2.5, 6.2.6, 6.2.7, 6.2.8, 6.2.9, 6.2.10, 6.2.11, 6.2.12, 6.2.13, 6.2.14, 6.2.15, 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.3.14, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.8, 6.4.10, 6.4.11, 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.5.10, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 6.6.6, 6.6.7, 6.6.8, 6.6.9, 6.6.10, 6.6.11, 6.6.12, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 8.0.0, 8.0.1, 6.4.7, 6.4.9