Anonymize data
You might need to anonymize, or mask, sensitive personal information from the data that you index into Splunk Enterprise, such as credit card or Social Security numbers. You can anonymize parts of confidential fields in events to protect privacy while providing enough remaining data for use in event tracking. You can configure Splunk Enterprise indexers or heavy forwarders to anonymize data as it arrives and before the software indexes it.
There are two ways to anonymize data with Splunk Enterprise:
- With a sed script. This method is easier to do, takes less time to configure, and is slightly faster, but has limits in how many times you can invoke it and what it can do. For instructions on this method, see Anonymize data with a sed script.
- With a regular expression (regex) transform. This method takes longer to configure, but is easier to modify after the initial configuration and can be assigned to multiple data inputs more easily. For instructions on this method, see Anonymize data with a regular expression transform.
To anonymize data with Splunk Cloud, you must configure a Splunk Enterprise instance as a heavy forwarder and anonymize the incoming data with that instance before sending it to Splunk Cloud. You can follow the instructions in this topic on the heavy forwarder.
Key points to anonymizing data
Before you can anonymize data, you must select a set of events to anonymize.
- You use props.conf to select the events to anonymize.
- You then use props.conf to anonymize the events with a sed script.
- Or, you use props.conf and transforms.conf to anonymize the events with a regular expression transform.
Select events to anonymize
You can anonymize event data based on whether the data comes from a specific source or host, or is tagged with a specific source type. You must specify which method to use in props.conf. The stanza name that you specify in props.conf determines how Splunk Enterprise selects and processes events for anonymization.
- [host::<host>] matches events that come from the specified host.
- [source::<source>] matches events with the specified source.
- [<sourcetype>] matches events with the specified source type. You must specify the source type in inputs.conf for this stanza type to work. This option is a Splunk best practice.
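For example, the following stanza headers show the three selection forms. The host name, source path, and source type name here are placeholders for illustration only, not values from this topic's walkthroughs:
[host::appserver01]
[source::/var/log/accounts.log]
[my_custom_sourcetype]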
Replace strings in events with SEDCMD
You can use the SEDCMD method to replace strings or substitute characters.
The syntax for a sed replace is:
SEDCMD-<class> = s/<regex>/<replacement>/flags
The SEDCMD command has the following components:
- regex is a Perl language regular expression.
- replacement is a string to replace the regular expression match.
- flags can be either the letter g to replace all matches or a number to replace a specified match.
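As a sketch of the flag styles, the following hypothetical entry masks every IP address in an event; changing the trailing g to a number such as 2 would replace only the second match. The source type name, class name, and pattern are illustrative only:
[my_custom_sourcetype]
SEDCMD-mask-ips = s/\d{1,3}(\.\d{1,3}){3}/x.x.x.x/g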
Substitute characters in events with SEDCMD
The syntax for a sed character substitution is:
SEDCMD-<class> = y/<string1>/<string2>/
This substitutes each occurrence of the characters in string1 with the characters in string2.
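For example, a hypothetical entry like the following maps every digit in an event to the character X, which is one way to blank out numeric identifiers wholesale. The class name is illustrative only:
SEDCMD-digits-to-X = y/0123456789/XXXXXXXXXX/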
Use a regular expression transform with transforms.conf to anonymize events
Each stanza in transforms.conf defines a transform class that you can reference from props.conf for a given source type, source, or host.
Transforms have several settings and variables that let you specify what changes and where, but the following are the most important:
- The REGEX setting specifies the regular expression that points to the string in the event that you want to anonymize.
- The FORMAT setting specifies the masked values.
- The $1 variable represents the text of the event that comes before the string that you want to mask.
- The $2 variable represents the text of the event that comes after the string that you want to mask.
- DEST_KEY = _raw writes the value from FORMAT to the raw value in the log, which anonymizes the event.
The regular expression processor does not handle multiline events. In cases where events span multiple lines, specify that the event is multiline by placing (?m) before the regular expression in transforms.conf.
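Putting these pieces together, a transform that masks a hypothetical apikey field except for its last four characters might look like the following sketch. The stanza name, field name, and pattern are illustrative only and do not come from the walkthroughs later in this topic:
[apikey-anonymizer]
REGEX = (?m)^(.*)apikey=\w+(\w{4}&.*)$
FORMAT = $1apikey=########$2
DEST_KEY = _raw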
Anonymize data with a sed script
You can anonymize data by using a sed script to replace or substitute strings in events.
Sed is a *nix utility that reads a file and modifies the input based on commands within or arguments that you supply to the utility. Many *nix users use the utility for fast transformation of incoming data because the utility is so versatile. Splunk Enterprise lets you use a sed-like syntax in props.conf to script the masking of your data.
Prerequisites for anonymizing data with a sed script
You must have the following to anonymize data with a sed script:
- Data that you want to anonymize
- An understanding of how regular expressions work. See regular-expressions.info for details on regular expressions
- An inputs.conf file with a configuration that tells Splunk Enterprise where this data is located
- A props.conf file that references the sed script that anonymizes the data
For example, if you have a log file called accounts.log that contains Social Security and credit card numbers:
...
ss=123456789, cc=1234-5678-9012-3456
ss=123456790, cc=2234-5678-9012-3457
ss=123456791, cc=3234-5678-9012-3458
ss=123456792, cc=4234-5678-9012-3459
...
And you want to mask the fields, so that they appear like this:
...
ss=xxxxx6789, cc=xxxx-xxxx-xxxx-3456
ss=xxxxx6790, cc=xxxx-xxxx-xxxx-3457
ss=xxxxx6791, cc=xxxx-xxxx-xxxx-3458
ss=xxxxx6792, cc=xxxx-xxxx-xxxx-3459
...
You can use inputs.conf and props.conf to change the data that comes in from accounts.log as Splunk Enterprise accesses it. These configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory.
Configure inputs.conf to use a sed script
In this example, you create the source type SSN-CC-Anon and assign it to the data input for accounts.log. The transform that you create uses this source type to know what data to transform. While there are other options available for using SEDCMD to transform incoming data from a log file, the best practice is to create a source type, then assign the transform to that source type in props.conf.
- On the machine that runs Splunk Enterprise, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
- Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
- Add the following stanza to reference accounts.log and assign a source type to the accounts.log data.
[monitor:///opt/appserver/logs/accounts.log]
sourcetype = SSN-CC-Anon
- Save the file and close it.
Define the sed script in props.conf
In this example, props.conf uses the SEDCMD setting to perform the transformation directly.
The "-Anon" clause after the "SEDCMD" stem can be any string that helps you identify what the transformation script does. The clause must exist because it and the SEDCMD stem form the class name for the script. The text after the = is the sed expression that performs the transformation.
- On the machine that runs Splunk Enterprise, create a props.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
- Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
- Add the following stanza to define the sed script that masks the data.
[SSN-CC-Anon]
SEDCMD-Anon = s/ss=\d{5}(\d{4})/ss=xxxxx\1/g s/cc=(\d{4}-){3}(\d{4})/cc=xxxx-xxxx-xxxx-\2/g
- Save the file and close it.
- Restart Splunk Enterprise.
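After the restart, the masking applies only to data that Splunk Enterprise indexes from that point on; events that were already indexed are not changed. One way to spot-check the result is to search the new source type in Splunk Web, for example:
sourcetype=SSN-CC-Anon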
Anonymize data with a regular expression transform
You can mask data by creating a transform. Transforms take incoming data and change it based on configurations you supply. In this case, the transformation is the replacement of portions of the data with characters that obscure the real, sensitive data, while retaining the original data format.
Prerequisites for anonymizing data with a regular expression transform
To mask sensitive data, you need the following items:
- Data that you want to anonymize
- An understanding of how regular expressions work
- An inputs.conf file, with a configuration that tells Splunk Enterprise where this data is located
- A transforms.conf file that does the data masking
- A props.conf file that references the transforms.conf file for the data that you want to mask
For example, if you have an application server log file called MyAppServer.log that contains events like the following:
"2006-09-21, 02:57:11.58", 122, 11, "Path=/LoginUser Query=CrmId=ClientABC&
ContentItemId=TotalAccess&SessionId=3A1785URH117BEA&Ticket=646A1DA4STF896EE&
SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195,
Content=", ""
"2006-09-21, 02:57:11.60", 122, 15, "UserData:<User CrmId="clientabc" UserId="p12345678"><EntitlementList></EntitlementList></User>", ""
"2006-09-21, 02:57:11.60", 122, 15, "New Cookie: SessionId=3A1785URH117BEA&
Ticket=646A1DA4STF896EE&CrmId=clientabcUserId=p12345678&AccountId=&AgentHost=man&
AgentId=man, MANUser: Version=1&Name=&Debit=&Credit=&AccessTime=&BillDay=&Status=
&Language=&Country=&Email=&EmailNotify=&Pin=&PinPayment=&PinAmount=&PinPG=
&PinPGRate=&PinMenu=&", ""
And you want to change the data so that the SessionId and Ticket fields are masked, and the events appear as follows:
"2006-09-21, 02:57:11.58", 122, 11, "Path=/LoginUser Query=CrmId=ClientABC&
ContentItemId=TotalAccess&SessionId=###########7BEA&Ticket=############96EE&
SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195,
Content=", ""
Use the inputs.conf, props.conf, and transforms.conf files to change the data that comes in from MyAppServer.log as Splunk Enterprise accesses it. All of these configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory.
Configure inputs.conf
In this example, you create the MyAppServer-Anon source type. The transform you create uses this source type to know what data to transform. There are other options for selecting the data to transform, as described in Select events to anonymize earlier in this topic.
- On the machine that runs Splunk Enterprise, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
- Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
- Add the following stanza to reference MyAppServer.log and assign a source type to the MyAppServer.log data.
[monitor:///opt/MyAppServer/logs/MyAppServer.log]
sourcetype = MyAppServer-Anon
- Save the file and close it.
Configure transforms.conf
Transforms.conf is the file that Splunk Enterprise uses to perform the transformation of the data.
- On the machine that runs Splunk Enterprise, create a transforms.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
- Open $SPLUNK_HOME/etc/system/local/transforms.conf with a text editor.
- Add the following text to define the transform that anonymizes the SessionId field so that only the last four characters in the field are exposed:
[session-anonymizer]
REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
FORMAT = $1SessionId=########$2
DEST_KEY = _raw
- Add the following text directly underneath the session-anonymizer stanza to define the transform for the Ticket field, similar to the SessionId field:
[ticket-anonymizer]
REGEX = (?m)^(.*)Ticket=\w+(\w{4}&.*)$
FORMAT = $1Ticket=########$2
DEST_KEY = _raw
- Save the file and close it.
Configure props.conf
Props.conf specifies the transforms to use to anonymize your data. It references one or more transform classes that you define in a transforms.conf file.
In this example, session-anonymizer and ticket-anonymizer are the transform class names that you defined in the transforms.conf file.
- On the machine that runs Splunk Enterprise, create a props.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
- Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
- Add the following stanza to reference the transforms that you created in transforms.conf to do the masking transformation.
[MyAppServer-Anon]
TRANSFORMS-anonymize = session-anonymizer, ticket-anonymizer
- Save the file and close it.
- Restart Splunk Enterprise.
Example
You have a file you want to index, abc.log, and you want to substitute the capital letters "A", "B", and "C" for every lowercase "a", "b", or "c" in your events. Add the following stanza and settings to your props.conf:
[source::.../abc.log]
SEDCMD-abc = y/abc/ABC/
Splunk Enterprise substitutes "A" for each "a", "B" for each "b", and "C" for each "c". When you search for source="*/abc.log", the lowercase letters "a", "b", and "c" do not appear in your data.
Caveats for anonymizing data
Restrictions for using the sed script to anonymize data
If you use the SEDCMD method to anonymize the data, the following restrictions apply:
- The SEDCMD script applies only to the _raw field at index time. With the regular expression transform, you can apply changes to other fields.
- You cannot use more than one SEDCMD type transformation for the same host, source, or source type in a single props.conf file.
Restrictions for using the regular expression transform to anonymize data
If you use the regular expression transform to anonymize data, the following restriction applies: include the LOOKAHEAD setting when you define the transform, and set it to a number that is larger than your largest expected event. Otherwise, anonymization could fail.
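For example, building on the session-anonymizer transform from earlier in this topic, the stanza might look like the following sketch. The 32768 value is only an illustrative figure; choose a number that comfortably exceeds your longest events:
[session-anonymizer]
REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
FORMAT = $1SessionId=########$2
DEST_KEY = _raw
LOOKAHEAD = 32768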
Splunk indexers do not parse structured data
When you forward structured data to an indexer, the indexer does not parse it, even if you configured props.conf on that indexer with the INDEXED_EXTRACTIONS setting. Forwarded data skips the following queues on the indexer, which precludes data parsing:
- parsing
- aggregation
- typing
The forwarded data must arrive at the indexer already parsed. To achieve this, you must set up props.conf on the forwarder that sends the data. This includes configuring the INDEXED_EXTRACTIONS setting and any other parsing, filtering, anonymizing, and routing rules.
Universal forwarders can parse structured data only. See Forward data extracted from structured data files.
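As a sketch of what this can look like on a heavy forwarder, the following hypothetical props.conf stanza (the source type name and sed expression are illustrative only) parses a CSV file and masks nine-digit numbers before the data leaves the forwarder:
[csv_accounts]
INDEXED_EXTRACTIONS = csv
SEDCMD-mask-ss = s/\b\d{5}(\d{4})\b/xxxxx\1/g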