Splunk® Enterprise

Getting Data In

Anonymize data

You might need to anonymize, or mask, sensitive personal information from the data that you index into Splunk Enterprise, such as credit card or Social Security numbers. You can anonymize parts of confidential fields in events to protect privacy while providing enough remaining data for use in event tracking. You can configure Splunk Enterprise indexers or heavy forwarders to anonymize data as it arrives and before the software indexes it.

There are two ways to anonymize data with Splunk Enterprise:

  • With a sed script. This method is easier to do, takes less time to configure, and is slightly faster, but has limits in how many times you can invoke it and what it can do. For instructions on this method, see Anonymize data with a sed script.
  • With a regular expression (regex) transform. This method takes longer to configure, but is easier to modify after the initial configuration and can be assigned to multiple data inputs more easily. For instructions on this method, see Anonymize data with a regular expression transform.

To anonymize data with Splunk Cloud, you must configure a Splunk Enterprise instance as a heavy forwarder and anonymize the incoming data with that instance before sending it to Splunk Cloud. You can follow the instructions in this topic on the heavy forwarder.

Key points to anonymizing data

Before you can anonymize data, you must select a set of events to anonymize.

  • You use props.conf to select the events to anonymize
  • You then use props.conf to anonymize the events with a sed script
  • Or, you use props.conf and transforms.conf to anonymize the events with a regular expression transform

Select events to anonymize

You can anonymize event data based on whether the data comes from a specific source or host, or is tagged with a specific source type. You must specify which method to use in props.conf. The stanza name that you specify in props.conf determines how Splunk Enterprise selects and processes events for anonymization.

  • [host::<host>] matches events from the specified host
  • [source::<source>] matches events from the specified source
  • [<sourcetype>] matches events with the specified source type. You must specify the source type in inputs.conf for this stanza type to work. This option is a Splunk best practice.

Replace strings in events with SEDCMD

You can use the SEDCMD method to replace strings or substitute characters. The syntax for a sed replace is:

SEDCMD-<class> = s/<regex>/<replacement>/flags

The SEDCMD command has the following components:

  • regex is a Perl language regular expression
  • replacement is a string to replace the regular expression match.
  • flags can be either the letter g to replace all matches or a number to replace a specified match.
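The effect of the two flag forms can be tried offline before you edit props.conf. The following is a minimal Python sketch, not Splunk code: the helper name sed_substitute and the sample event are hypothetical, and it assumes that a numeric flag replaces only the match with that 1-based position, as in standard sed.

```python
import re

def sed_substitute(text, pattern, replacement, flag="g"):
    """Emulate s/<pattern>/<replacement>/<flag> where flag is 'g' or a match number."""
    if flag == "g":
        # 'g' replaces every match in the text
        return re.sub(pattern, replacement, text)

    target = int(flag)  # 1-based position of the single match to replace
    seen = 0

    def replace_nth(match):
        nonlocal seen
        seen += 1
        # expand() applies backreferences such as \1 in the replacement string
        return match.expand(replacement) if seen == target else match.group(0)

    return re.sub(pattern, replace_nth, text)

# Hypothetical event text, for illustration only
event = "pin=1111 pin=2222 pin=3333"
all_masked = sed_substitute(event, r"pin=\d{4}", "pin=XXXX", "g")
second_masked = sed_substitute(event, r"pin=\d{4}", "pin=XXXX", "2")
print(all_masked)      # pin=XXXX pin=XXXX pin=XXXX
print(second_masked)   # pin=1111 pin=XXXX pin=3333
```

Python's re module is not the regular expression engine that Splunk Enterprise uses, so treat a passing check here as a sanity test and confirm the result on a test index.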

Substitute characters in events with SEDCMD

The syntax for a sed character substitution is:

SEDCMD-<class> = y/<string1>/<string2>/

This substitutes each occurrence of the characters in string1 with the characters in string2.
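Character substitution maps characters by position: the first character of string1 becomes the first character of string2, and so on. A minimal Python sketch of that mapping follows. The helper name sed_translate and the sample event are hypothetical, and the equal-length check is an assumption carried over from standard sed.

```python
def sed_translate(text, string1, string2):
    """Emulate y/<string1>/<string2>/ as a per-character mapping."""
    if len(string1) != len(string2):
        # Assumption: as in standard sed, both strings must have the same length
        raise ValueError("string1 and string2 must have the same length")
    return text.translate(str.maketrans(string1, string2))

# Hypothetical event text, for illustration only
event = "user=abc123 action=backup"
translated = sed_translate(event, "abc", "ABC")
print(translated)  # user=ABC123 ACtion=BACkup
```

Note that the mapping applies to every occurrence of each character in the event, not only to the field that you had in mind.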

Use a regular expression transform with transforms.conf to anonymize events

Each stanza in transforms.conf defines a transform class that you can reference from props.conf for a given source type, source, or host.

Transforms have several settings and variables that let you specify what changes and where, but the following are the most important:

  • The REGEX setting specifies the regular expression that points to the string in the event that you want to anonymize
  • The FORMAT setting specifies the masked values
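  • The $1 variable holds the text that the first capture group in REGEX matches. In the examples in this topic, that is the text of the event before the string that you want to mask
  • The $2 variable holds the text that the second capture group matches. In the examples in this topic, that is the last four characters of the masked value plus the rest of the event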
  • DEST_KEY = _raw says to write the value from FORMAT to the raw value in the log. This anonymizes the event.

The regular expression processor does not handle multiline events. In cases where events span multiple lines, specify that the event is multiline by placing (?m) before the regular expression in transforms.conf.
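The effect of the (?m) prefix can be seen with a small offline check. This Python sketch assumes that the anchors ^ and $ behave as they do in Python's re module, where they match only at the start and end of the whole text unless multiline mode is on. The three-line event is made up for illustration.

```python
import re

# Pattern in the style of the transforms in this topic, without the (?m) prefix
PATTERN = r'^(.*)SessionId=\w+(\w{4}[&"].*)$'

# Hypothetical three-line event; the sensitive value is on the second line
event = "line one of the event\nQuery=SessionId=3A1785URH117BEA&Ticket=x\nline three"

without_flag = re.search(PATTERN, event)          # ^ can only match at the very start
with_flag = re.search("(?m)" + PATTERN, event)    # ^ and $ match at each line boundary

print(without_flag)          # None: the pattern never reaches the second line
print(with_flag.group(1))    # Query=
print(with_flag.group(2))    # 7BEA&Ticket=x
```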

Anonymize data with a sed script

You can anonymize data by using a sed script to replace or substitute strings in events.

Sed is a *nix utility that reads a file and modifies the input based on commands that you use within or arguments that you supply to the utility. Many *nix users use the utility for fast transformation of incoming data because the utility is so versatile. Splunk Enterprise lets you use a sed-like syntax in props.conf to script the masking of your data.

Prerequisites for anonymizing data with a sed script

You must have the following to anonymize data with a sed script:

  • Data that you want to anonymize
  • An understanding of how regular expressions work. See regular-expressions.info for details on regular expressions
  • An inputs.conf file with a configuration that tells Splunk Enterprise where this data is located
  • A props.conf file that references the sed script that anonymizes the data

For example, if you have a log file called accounts.log that contains Social Security and credit card numbers:

...
ss=123456789, cc=1234-5678-9012-3456
ss=123456790, cc=2234-5678-9012-3457
ss=123456791, cc=3234-5678-9012-3458
ss=123456792, cc=4234-5678-9012-3459
...

And you want to mask the fields, so that they appear like this:

...
ss=XXXXX6789, cc=XXXX-XXXX-XXXX-3456
ss=XXXXX6790, cc=XXXX-XXXX-XXXX-3457
ss=XXXXX6791, cc=XXXX-XXXX-XXXX-3458
ss=XXXXX6792, cc=XXXX-XXXX-XXXX-3459
...

You can use inputs.conf and props.conf to change the data that comes in from accounts.log as Splunk Enterprise accesses it. These configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory.

Configure inputs.conf to use a sed script

In this example, you create the source type SSN-CC-Anon, and assign it to the data input for accounts.log. The transform that you create uses this source type to know what data to transform. While there are other options available for using SEDCMD to transform incoming data from a log file, best practice is to create a source type, then assign the transform to that source type in props.conf.

  1. On the machine that runs Splunk Enterprise, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
  2. Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
  3. Add the following stanza to reference accounts.log and assign a source type to the accounts.log data.
    [monitor:///opt/appserver/logs/accounts.log]
    sourcetype = SSN-CC-Anon
    
  4. Save the file and close it.

Define the sed script in props.conf

In this example, props.conf uses the SEDCMD setting to perform the transformation directly.

The "-Anon" clause after the "SEDCMD" stem can be any string that helps you identify what the transformation script does. The clause must exist because it and the SEDCMD stem form the class name for the script. The text after the = is the sed script that performs the transformation.

  1. On the machine that runs Splunk Enterprise, create a props.conf in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
  2. Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
  3. Add the following stanza to define the sed script that masks the data for the SSN-CC-Anon source type.
    [SSN-CC-Anon]
    SEDCMD-Anon = s/ss=\d{5}(\d{4})/ss=XXXXX\1/g s/cc=(\d{4}-){3}(\d{4})/cc=XXXX-XXXX-XXXX-\2/g
    
  4. Save the file and close it.
  5. Restart Splunk Enterprise.
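Before you restart, you can check the two substitutions against the sample accounts.log lines offline. The following Python sketch is an approximation and not Splunk code: the helper name mask is hypothetical, the patterns use the ss= key and uppercase X characters as they appear in the sample data in this topic, and it assumes that the two expressions run one after the other on each event.

```python
import re

# Sample lines copied from the accounts.log example in this topic
SAMPLE = [
    "ss=123456789, cc=1234-5678-9012-3456",
    "ss=123456790, cc=2234-5678-9012-3457",
]

def mask(line):
    """Apply the two substitutions in order, keeping the last four digits of each value."""
    # Keep the last four digits of the Social Security number
    line = re.sub(r"ss=\d{5}(\d{4})", r"ss=XXXXX\1", line)
    # Keep the last group of four digits of the credit card number
    line = re.sub(r"cc=(\d{4}-){3}(\d{4})", r"cc=XXXX-XXXX-XXXX-\2", line)
    return line

masked = [mask(line) for line in SAMPLE]
for line in masked:
    print(line)
# ss=XXXXX6789, cc=XXXX-XXXX-XXXX-3456
# ss=XXXXX6790, cc=XXXX-XXXX-XXXX-3457
```

If the key in the pattern does not match the key in your data exactly, for example ssn= instead of ss=, the substitution silently does nothing, so it is worth running a check like this against real sample lines.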

Anonymize data with a regular expression transform

You can mask data by creating a transform. Transforms take incoming data and change it based on configurations you supply. In this case, the transformation is the replacement of portions of the data with characters that obscure the real, sensitive data, while retaining the original data format.

Prerequisites for anonymizing data with a regular expression transform

To mask sensitive data, you need the following items:

  • Data that you want to anonymize
  • An understanding of how regular expressions work
  • An inputs.conf file, with a configuration that tells Splunk Enterprise where this data is located
  • A transforms.conf file that does the data masking
  • A props.conf file that references the transforms.conf file for the data that you want to mask

For example, if you have an application server log file called MyAppServer.log that contains events like the following:

"2006-09-21, 02:57:11.58",  122, 11, "Path=/LoginUser Query=CrmId=ClientABC&
ContentItemId=TotalAccess&SessionId=3A1785URH117BEA&Ticket=646A1DA4STF896EE&
SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195,
Content=", ""
"2006-09-21, 02:57:11.60",  122, 15, "UserData:<User CrmId="clientabc" 
UserId="p12345678"><EntitlementList></EntitlementList></User>", ""
"2006-09-21, 02:57:11.60",  122, 15, "New Cookie: SessionId=3A1785URH117BEA&
Ticket=646A1DA4STF896EE&CrmId=clientabcUserId=p12345678&AccountId=&AgentHost=man&
AgentId=man, MANUser: Version=1&Name=&Debit=&Credit=&AccessTime=&BillDay=&Status=
&Language=&Country=&Email=&EmailNotify=&Pin=&PinPayment=&PinAmount=&PinPG=
&PinPGRate=&PinMenu=&", ""

And you want to change the data so that the "SessionId" and "Ticket" fields are masked, and the events appear as follows:

"2006-09-21, 02:57:11.58",  122, 11, "Path=/LoginUser Query=CrmId=ClientABC&
ContentItemId=TotalAccess&SessionId=########7BEA&Ticket=########96EE&
SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195,
Content=", ""

Use the inputs.conf, props.conf, and transforms.conf files to change the data that comes in from MyAppServer.log as Splunk Enterprise accesses it. All of these configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory.

Configure inputs.conf

In this example, you create the MyAppServer-Anon source type. The transform you create uses this source type to know what data to transform. For other ways to select the data to transform, see Select events to anonymize in this topic.

  1. On the machine that runs Splunk Enterprise, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
  2. Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
  3. Add the following stanza to reference MyAppServer.log and assign a source type to the MyAppServer.log data.
    [monitor:///opt/MyAppServer/logs/MyAppServer.log]
    sourcetype = MyAppServer-Anon
    
  4. Save the file and close it.

Configure transforms.conf

Transforms.conf is the file that Splunk Enterprise uses to perform the transformation of the data.

  1. On the machine that runs Splunk Enterprise, create a transforms.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
  2. Open $SPLUNK_HOME/etc/system/local/transforms.conf with a text editor.
  3. Add the following text to define the transform that anonymizes the SessionId field so that only the last four characters in the field are exposed:
    [session-anonymizer]
    REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
    FORMAT = $1SessionId=########$2
    DEST_KEY = _raw
    
  4. Add the following text directly underneath the session-anonymizer stanza to define the transform for the Ticket field, similar to the SessionId field:
    [ticket-anonymizer]
    REGEX = (?m)^(.*)Ticket=\w+(\w{4}&.*)$
    FORMAT = $1Ticket=########$2
    DEST_KEY = _raw
    
  5. Save the file and close it.
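The two transforms can be approximated offline to see what they do to the first sample event. This Python sketch is not Splunk code and rests on several assumptions: the helper name apply_transform is hypothetical, the sample event is treated as a single line (the line breaks in the sample are taken to be display wrapping), a matching transform is assumed to replace the raw text with the expanded FORMAT string, and the transforms are assumed to run in the order listed.

```python
import re

# REGEX and FORMAT values copied from the two stanzas in this topic
SESSION = (r'(?m)^(.*)SessionId=\w+(\w{4}[&"].*)$', "$1SessionId=########$2")
TICKET = (r'(?m)^(.*)Ticket=\w+(\w{4}&.*)$', "$1Ticket=########$2")

def apply_transform(regex, fmt, raw):
    """Return FORMAT with $1, $2 filled in if regex matches, else raw unchanged."""
    match = re.search(regex, raw)
    if match is None:
        return raw
    # Replace each $<n> in FORMAT with the text of capture group n
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), fmt)

# First sample event from this topic, joined into one line (an assumption)
EVENT = (
    '"2006-09-21, 02:57:11.58",  122, 11, "Path=/LoginUser Query=CrmId=ClientABC&'
    'ContentItemId=TotalAccess&SessionId=3A1785URH117BEA&Ticket=646A1DA4STF896EE&'
    'SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195,'
    'Content=", ""'
)

masked = apply_transform(*SESSION, EVENT)
masked = apply_transform(*TICKET, masked)
print(masked)
```

In this sketch, each FORMAT string writes a fixed run of eight # characters followed by the last four characters of the original value, so the masked value does not reveal the length of the original.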

Configure props.conf

Props.conf specifies the transforms to use to anonymize your data. It references one or more transform classes that you define in a transforms.conf file.

In this example, session-anonymizer and ticket-anonymizer are the transform class names that you defined in the transforms.conf file.

  1. On the machine that runs Splunk Enterprise, create a props.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.
  2. Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
  3. Add the following stanza to reference the transforms that you created in transforms.conf to do the masking transformation.
    [MyAppServer-Anon]
    TRANSFORMS-anonymize = session-anonymizer, ticket-anonymizer
    
  4. Save the file and close it.
  5. Restart Splunk Enterprise.

Example of character substitution with SEDCMD

You have a file you want to index, abc.log, and you want to substitute the capital letters "A", "B", and "C" for every lowercase "a", "b", or "c" in your events. Add the following stanza and settings to your props.conf:

[source::.../abc.log]
SEDCMD-abc = y/abc/ABC/

Splunk Enterprise substitutes "A" for each "a", "B" for each "b", and "C" for each "c". When you search for source="*/abc.log", the lowercase letters "a", "b", and "c" do not appear in your data.

Caveats for anonymizing data

Restrictions for using the sed script to anonymize data

If you use the SEDCMD method to anonymize the data, the following restrictions apply:

  • The SEDCMD script applies only to the _raw field at index time. With the regular expression transform, you can apply changes to other fields.
  • You cannot use more than one SEDCMD type transformation for the same host, source, or source type in a single props.conf file.

Restrictions for using the regular expression transform to anonymize data

If you use the regular expression transform to anonymize data, include the LOOKAHEAD setting when you define the transform and set it to a number that is larger than the largest expected event. Otherwise, anonymization could fail.
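One way to picture why a small LOOKAHEAD value causes a failure is to model the setting as a limit on how many characters of the event the regular expression can see. The following Python sketch rests on that assumption and is not a description of Splunk internals. The helper name matches_within, the window sizes, and the padded event are all hypothetical.

```python
import re

# Pattern copied from the session-anonymizer example in this topic
REGEX = r'(?m)^(.*)SessionId=\w+(\w{4}[&"].*)$'

def matches_within(raw, lookahead):
    """Return True if REGEX matches when only the first `lookahead` characters are searched."""
    return re.search(REGEX, raw[:lookahead]) is not None

# Hypothetical long event: the sensitive value starts after 200 characters of padding
event = ("x" * 200) + " SessionId=3A1785URH117BEA&rest"

small_window = matches_within(event, 100)             # value lies outside the window
large_window = matches_within(event, len(event) + 1)  # window covers the whole event

print(small_window)  # False: nothing is masked
print(large_window)  # True
```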

Splunk indexers do not parse structured data

When you forward structured data to an indexer, the indexer does not parse it, even if you configured props.conf on that indexer with the INDEXED_EXTRACTIONS setting. Forwarded data skips the following queues on the indexer, which precludes data parsing:

  • parsing
  • aggregation
  • typing

The forwarded data must arrive at the indexer already parsed. To achieve this, you must set up props.conf on the forwarder that sends the data. This includes configuring the INDEXED_EXTRACTIONS setting and any other parsing, filtering, anonymizing, and routing rules.

Universal forwarders can parse structured data only. See Forward data extracted from structured data files.


This documentation applies to the following versions of Splunk® Enterprise: 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 6.6.6, 6.6.7, 6.6.8, 6.6.9, 6.6.10, 6.6.11, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4

