Splunk® Enterprise

Getting Data In

Anonymize data

You might need to anonymize, or mask, sensitive personal information from the data that you index into the Splunk platform, such as credit card or Social Security numbers. You can anonymize parts of confidential fields in events to protect privacy while providing enough remaining data for use in event tracking.

To anonymize data with Splunk Enterprise, you can configure a Splunk Enterprise instance as a heavy forwarder and anonymize the incoming data with that instance before sending it to Splunk Enterprise. If you have access to the Edge Processor solution, you can also use Edge Processors to anonymize data before sending it to your destination. For more information, see Filter and mask data using an Edge Processor and About the Edge Processor solution in the Use Edge Processors manual.

There are two ways to anonymize data with a heavy forwarder:

  • Use the SEDCMD setting. This setting exists in the props.conf configuration file, which you configure on the heavy forwarder. It acts like a sed *nix script to do replacements and substitutions. This method is more straightforward, takes less time to configure, and is slightly faster than a regular expression transform. But there are limits to how many times you can invoke the SEDCMD setting and what it can do. For instructions on this method, see Anonymize data with a sed script.
  • Use a regular expression (regex) transform. This method takes longer to configure, but is less complex to modify after the initial configuration. You can also assign this method to multiple data inputs more flexibly. For instructions on this method, see Anonymize data with a regular expression transform.

Both of these options are also available in Splunk Enterprise, where you can complete the configuration on either a heavy forwarder or an indexer.

You can use field filters instead of the SEDCMD setting to anonymize raw data at index time. Just migrate the sed expression used with the SEDCMD setting by copying and pasting the sed expression into a new _raw field filter in Splunk Web. For information about field filters, see Protect PII, PHI, and other sensitive data with field filters in Securing the Splunk Platform.

Prerequisites to anonymize data

Before you can anonymize data, you must select a set of events to anonymize.

  • First, you select the events to anonymize
  • Then, you either:
    • Use the props.conf configuration file to anonymize the events with a sed script
    • Use the props.conf and transforms.conf configuration files to anonymize the events with a regular expression transform

Select events to anonymize

You can anonymize event data based on whether the data comes from a specific source or host, or whether the data is tagged with a specific source type. You specify the selection method in the props.conf configuration file: the stanza name that you use determines how the Splunk platform selects and processes events for anonymization.

Refer to the following stanza specifications:

  • The [host::<host>] stanza matches events that contain the specified host
  • The [source::<source>] stanza matches events with the specified source
  • The [<sourcetype>] stanza matches events with the specified source type
    • You must subsequently specify the source type in the inputs.conf file for this stanza type to work. Selecting events by source type is a Splunk best practice.
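
To make the three stanza forms concrete, the following Python sketch classifies a props.conf stanza header by the selection method it implies. The function name classify_stanza and the sample values are hypothetical. The sketch ignores wildcards and any other stanza forms that the Splunk platform recognizes, so treat it as an illustration rather than a description of how the platform parses the file.

```python
def classify_stanza(header):
    """Return (selection_method, value) for a props.conf stanza header.

    Simplified illustration: only the three forms listed above are handled.
    """
    name = header.strip()
    if not (name.startswith("[") and name.endswith("]")):
        raise ValueError(f"not a stanza header: {header!r}")
    name = name[1:-1]
    for prefix, method in (("host::", "host"), ("source::", "source")):
        if name.startswith(prefix):
            return method, name[len(prefix):]
    # Assumption for this sketch: any other name is a source type.
    return "sourcetype", name


for header in ("[host::web01]", "[source::.../abc.log]", "[SSN-CC-Anon]"):
    print(header, "->", classify_stanza(header))
```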

Replace strings in events with a sed script

You can use the SEDCMD setting with sed-style syntax to either replace strings or substitute characters. Refer to the following syntax for a sed-style replacement:

SEDCMD-<class> = s/<regex>/<replacement>/flags

The SEDCMD setting has the following components:

  • regex is a Perl-compatible regular expression. It represents what you want to replace.
  • replacement is the string that replaces whatever the regular expression matches.
  • flags can be either the letter g to replace all matches or a number to replace a specified match.
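
The effect of the two flag forms can be approximated offline with Python's re module before you deploy a SEDCMD setting. This is a minimal sketch, not the Splunk implementation: the helper name sed_substitute is hypothetical, it assumes that a numeric flag N means "replace only the Nth match", and it assumes that Python's regular expression engine behaves like the Splunk engine for simple patterns such as these.

```python
import re


def sed_substitute(text, regex, replacement, flag="g"):
    """Approximate s/<regex>/<replacement>/<flag> for flag "g" or a number."""
    pattern = re.compile(regex)
    if flag == "g":
        # Replace every match.
        return pattern.sub(replacement, text)

    target = int(flag)
    seen = 0

    def replace_nth(match):
        nonlocal seen
        seen += 1
        # Expand \1-style references only for the requested match;
        # leave every other match exactly as it was.
        return match.expand(replacement) if seen == target else match.group(0)

    return pattern.sub(replace_nth, text)


sample = "id=1 id=2 id=3"
print(sed_substitute(sample, r"id=\d", "id=X", "g"))
print(sed_substitute(sample, r"id=\d", "id=X", "2"))
```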

Anonymize multiline events using sed expressions

The Splunk platform doesn't support applying sed expressions in multiline mode. To use a sed expression to anonymize multiline events, use 2 sed expressions in succession by first removing the newlines and then performing additional replacements. For example, the following search uses the rex command to replace all newline characters in a multiline event containing HTML content, and then redacts all of the HTML content.

index=main html | rex mode=sed field=_raw "s/\\n/NEWLINE_REMOVED/g" | rex mode=sed field=_raw "s/<html.*html>/REDACTED/g"
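
The reason for the two-step approach can be reproduced outside the Splunk platform. The following Python sketch uses a made-up event and assumes that Python's re module treats the dot and newline characters the same way for these patterns. It shows that the redaction pattern matches nothing until the newlines are removed.

```python
import re

# Hypothetical multiline event containing HTML content.
event = "before\n<html>\n<body>secret</body>\n</html>\nafter"

# Applied directly, the pattern fails because "." does not match a newline.
untouched = re.sub(r"<html.*html>", "REDACTED", event)

# Step 1: replace the newlines. Step 2: redact the HTML content.
flattened = re.sub(r"\n", "NEWLINE_REMOVED", event)
redacted = re.sub(r"<html.*html>", "REDACTED", flattened)

print(untouched == event)
print(redacted)
```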

Substitute characters in events with a sed script

Refer to the following syntax for a sed character substitution:

SEDCMD-<class> = y/<string1>/<string2>/

This substitutes each occurrence of a character in string1 with the character at the same position in string2.

Use a regular expression transform with transforms.conf to anonymize events

Each stanza in the transforms.conf configuration file defines a transform class that you can reference from the props.conf file for a given source type, source, or host.

Transforms have several settings and variables that let you specify what changes and where, but the following settings are the most important:

  • The REGEX setting specifies the regular expression that points to the string in the event that you want to anonymize
  • The FORMAT setting specifies the masked values
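  • The $1 variable represents the first capture group in the regular expression. In the examples in this topic, it holds the text of the event before the string that you want to mask.
  • The $2 variable represents the second capture group. In the examples in this topic, it holds the characters that stay visible and the text of the event after them.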
  • DEST_KEY = _raw writes the value from FORMAT to the raw value in the log. This anonymizes the event.

The regular expression processor does not handle multiline events. In cases where events span multiple lines, specify that the event is multiline by placing the string (?m) before the regular expression in the transforms.conf file.
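
The effect of the (?m) prefix on the ^ and $ anchors can be seen in a short Python sketch. It uses a made-up two-line event and assumes that Python's re module handles (?m) the same way as the engine that processes transforms.conf, which holds for this general anchor behavior but is not verified here against a Splunk instance.

```python
import re

# Hypothetical event that spans two lines.
event = "SessionId=3A1785URH117BEA\nTicket=646A1DA4STF896EE"

# Without (?m), ^ and $ anchor only at the start and end of the whole event.
without_flag = re.findall(r"^Ticket=\w+$", event)

# With (?m), ^ and $ also anchor at each line boundary inside the event.
with_flag = re.findall(r"(?m)^Ticket=\w+$", event)

print(without_flag)
print(with_flag)
```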

Anonymize data with a sed script

You can anonymize data by using a sed script to replace or substitute strings in events.

Sed is a *nix utility that reads a file and modifies the input based on commands that you use within or arguments that you supply to the utility. Many *nix users use the utility for its versatility and fast transformation of incoming data. You can use a sed-like syntax in the props.conf file to script the masking of your data in the Splunk platform.

The following example shows how to mask data in a log file.

Suppose you have a log file called accounts.log that contains Social Security and credit card numbers:

...
ss=123456789, cc=1234-5678-9012-3456
ss=123456790, cc=2234-5678-9012-3457
ss=123456791, cc=3234-5678-9012-3458
ss=123456792, cc=4234-5678-9012-3459
...

You want to mask the fields, so that they appear like this:

...
ss=XXXXX6789, cc=XXXX-XXXX-XXXX-3456
ss=XXXXX6790, cc=XXXX-XXXX-XXXX-3457
ss=XXXXX6791, cc=XXXX-XXXX-XXXX-3458
ss=XXXXX6792, cc=XXXX-XXXX-XXXX-3459
...

You can use the inputs.conf and props.conf configuration files to change the data that comes in from the accounts.log file as the Splunk platform accesses it. These configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory on a heavy forwarder or on a Splunk Enterprise indexer.

Requirements for anonymizing data with a sed script

You must meet the following requirements to anonymize data with a sed script:

  • Have data that you want to anonymize
  • Have an understanding of how regular expressions work
  • Have an inputs.conf configuration file that points to where the data you want to anonymize is located
  • Have a props.conf configuration file that references the sed script that anonymizes the data

Configure the inputs.conf file to use a sed script

In this example, you create the source type SSN-CC-Anon and assign it to the data input for the accounts.log file. The transform that you create uses this source type to know what data to transform. While there are other options available for using SEDCMD to transform incoming data from a log file, as a best practice, create a source type, then assign the transform to that source type in the props.conf file.

  1. On the machine that runs the heavy forwarder, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory if it doesn't already exist.
  2. Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
  3. Add the following stanza to reference the accounts.log file and assign a source type to the accounts.log data.
    [monitor:///opt/appserver/logs/accounts.log]
    sourcetype = SSN-CC-Anon
    
  4. Save the file and close it.

Define the sed script in props.conf

In this example, props.conf uses the SEDCMD setting to perform the transformation directly.

The -Anon clause after the SEDCMD stem can be any string that helps you identify what the transformation script does. The clause must exist because it and the SEDCMD stem form the class name for the script. The text after the equal sign (=) is the sed script that performs the transformation. In this example, it consists of two substitution expressions separated by a space.

  1. On the machine that runs the heavy forwarder, create a props.conf file in the $SPLUNK_HOME/etc/system/local directory if it doesn't already exist.
  2. Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
  3. Add the following stanza to reference the source type that you created in the inputs.conf file to do the masking transformation.
    [SSN-CC-Anon]
    SEDCMD-Anon = s/ss=\d{5}(\d{4})/ss=XXXXX\1/g s/cc=(\d{4}-){3}(\d{4})/cc=XXXX-XXXX-XXXX-\2/g
    
  4. Save the file and close it.
  5. Restart the heavy forwarder.
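
Before you restart the forwarder, you can check the two substitution expressions against the sample accounts.log lines. The following Python sketch is a rough offline check, not a replacement for testing on a Splunk instance. The names SED_RULES and mask_event are hypothetical, the sketch assumes that Python's re module interprets these patterns the same way as the SEDCMD setting, and the replacement strings use uppercase X so that the result matches the masked sample output shown earlier.

```python
import re

# (regex, replacement) pairs mirroring the two s/.../.../g expressions.
SED_RULES = [
    (r"ss=\d{5}(\d{4})", r"ss=XXXXX\1"),
    (r"cc=(\d{4}-){3}(\d{4})", r"cc=XXXX-XXXX-XXXX-\2"),
]


def mask_event(raw):
    """Apply each rule to every match in the event, in order."""
    for pattern, replacement in SED_RULES:
        raw = re.sub(pattern, replacement, raw)
    return raw


for line in (
    "ss=123456789, cc=1234-5678-9012-3456",
    "ss=123456790, cc=2234-5678-9012-3457",
):
    print(mask_event(line))
```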

Anonymize data with a regular expression transform

You can mask data by creating a transform. Transforms take incoming data and change it based on configurations you supply. In this case, the transformation is the replacement of portions of the data with characters that obscure the real, sensitive data, while retaining the original data format.

Suppose you have an application server log file called MyAppServer.log that contains events like the following:

"2006-09-21, 02:57:11.58",  122, 11, "Path=/LoginUser Query=CrmId=ClientABC&
ContentItemId=TotalAccess&SessionId=3A1785URH117BEA&Ticket=646A1DA4STF896EE&
SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195,
Content=", ""
"2006-09-21, 02:57:11.60",  122, 15, "UserData:<User CrmId="clientabc" 
UserId="p12345678"><EntitlementList></EntitlementList></User>", ""
"2006-09-21, 02:57:11.60",  122, 15, "New Cookie: SessionId=3A1785URH117BEA&
Ticket=646A1DA4STF896EE&CrmId=clientabcUserId=p12345678&AccountId=&AgentHost=man&
AgentId=man, MANUser: Version=1&Name=&Debit=&Credit=&AccessTime=&BillDay=&Status=
&Language=&Country=&Email=&EmailNotify=&Pin=&PinPayment=&PinAmount=&PinPG=
&PinPGRate=&PinMenu=&", ""

You want to change the data so that the SessionId and Ticket fields are masked and the events appear as follows:

"2006-09-21, 02:57:11.58",  122, 11, "Path=/LoginUser Query=CrmId=ClientABC&
ContentItemId=TotalAccess&SessionId=########7BEA&Ticket=########96EE&
SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195,
Content=", ""

You can use the inputs.conf, props.conf, and transforms.conf files to change the data that comes in from the MyAppServer.log file as the Splunk platform accesses it. All of these configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory on a heavy forwarder or on a Splunk Enterprise indexer.

Requirements for anonymizing data with a regular expression transform

To mask sensitive data, you must meet the following requirements:

  • Have data that you want to anonymize
  • Have an understanding of how regular expressions work
  • Have an inputs.conf configuration file that points to where this data is located
  • Have a transforms.conf configuration file that does the data masking
  • Have a props.conf configuration file that references the transforms.conf file for the data that you want to mask

Configure inputs.conf

In this example, you create the MyAppServer-Anon source type. The transform you create uses this source type to know what data to transform. You can choose from other options for selecting the data to transform.

Follow these steps to configure the inputs.conf file for this example:

  1. On the machine that runs the heavy forwarder, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory if the file doesn't already exist.
  2. Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
  3. Add the following stanza to reference the MyAppServer.log file and assign a source type to the MyAppServer.log data.
    [monitor:///opt/MyAppServer/logs/MyAppServer.log]
    sourcetype = MyAppServer-Anon
    
  4. Save the file and close it.

Configure the transforms.conf file

Splunk Enterprise uses the transforms.conf file to perform the transformation of the data. Follow these steps to configure the transforms.conf file for this example:

  1. On the machine that runs the heavy forwarder, create a transforms.conf file in the $SPLUNK_HOME/etc/system/local directory if the file doesn't already exist.
  2. Open $SPLUNK_HOME/etc/system/local/transforms.conf with a text editor.
  3. Add the following text to define the transform that anonymizes the SessionId field so that only the last four characters in the field are exposed:
    [session-anonymizer]
    REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
    FORMAT = $1SessionId=########$2
    DEST_KEY = _raw
    
  4. Add the following text directly underneath the session-anonymizer stanza to define the transform for the Ticket field, similar to the SessionId field:
    [ticket-anonymizer]
    REGEX = (?m)^(.*)Ticket=\w+(\w{4}&.*)$
    FORMAT = $1Ticket=########$2
    DEST_KEY = _raw
    
  5. Save the file and close it.

Configure the props.conf configuration file

The props.conf file specifies the transforms to use to anonymize your data. It references one or more transform classes that you define in a transforms.conf file.

In this example, session-anonymizer and ticket-anonymizer are the transform class names that you defined in the transforms.conf file.

Follow these steps to configure the props.conf file for this example:

  1. On the machine that runs the heavy forwarder, create a props.conf file in the $SPLUNK_HOME/etc/system/local directory if the file doesn't already exist.
  2. Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
  3. Add the following stanza to reference the transforms that you created in the transforms.conf file to do the masking transformation.
    [MyAppServer-Anon]
    TRANSFORMS-anonymize = session-anonymizer, ticket-anonymizer
    
  4. Save the file and close it.
  5. Restart the heavy forwarder.
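
To preview what the two transforms do when they run in the order listed in TRANSFORMS-anonymize, you can model them in Python. This is a minimal sketch under stated assumptions: the names TRANSFORMS, apply_transform, and anonymize are hypothetical, the event is a single-line excerpt of the sample log, Python's re module is assumed to interpret the patterns like the Splunk platform does, and the sketch models DEST_KEY = _raw by returning the expanded FORMAT string as the new raw event.

```python
import re

# (name, REGEX, FORMAT) values copied from the transforms.conf example.
TRANSFORMS = [
    ("session-anonymizer",
     r'(?m)^(.*)SessionId=\w+(\w{4}[&"].*)$',
     "$1SessionId=########$2"),
    ("ticket-anonymizer",
     r"(?m)^(.*)Ticket=\w+(\w{4}&.*)$",
     "$1Ticket=########$2"),
]


def apply_transform(raw, regex, fmt):
    """Return FORMAT with $1 and $2 filled in, or raw if REGEX does not match."""
    match = re.search(regex, raw)
    if match is None:
        return raw
    return re.sub(r"\$(\d)", lambda ref: match.group(int(ref.group(1))), fmt)


def anonymize(raw):
    """Apply the transforms in the order they are listed."""
    for _name, regex, fmt in TRANSFORMS:
        raw = apply_transform(raw, regex, fmt)
    return raw


event = ("ContentItemId=TotalAccess&SessionId=3A1785URH117BEA"
         "&Ticket=646A1DA4STF896EE&SessionTime=25368")
print(anonymize(event))
```

Because each FORMAT value contains a literal run of eight # characters, this sketch always emits eight # characters followed by the last four characters of the original value, whatever the original length.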

Example of substitution using a sed/SEDCMD script

Suppose you want to index the file abc.log, and you want to substitute the capital letters "A", "B", and "C" for every lowercase "a", "b", or "c" in your events.

Add the following stanza and settings to your props.conf file:

[source::.../abc.log]
SEDCMD-abc = y/abc/ABC/

The Splunk platform substitutes "A" for each "a", "B" for each "b", and "C" for each "c". When you search for source="*/abc.log", the lowercase letters "a", "b", and "c" do not appear in your data.
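
The character-for-character mapping that y/abc/ABC/ performs corresponds to a translation table. The following Python sketch uses a made-up sample string to show the equivalent operation with str.maketrans and str.translate. It illustrates the mapping only and does not involve the Splunk platform.

```python
# Build a table that maps a->A, b->B, c->C, like y/abc/ABC/.
table = str.maketrans("abc", "ABC")

sample = "a cab backs up"
substituted = sample.translate(table)

print(substituted)
```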

Caveats for anonymizing data

Anonymizing data can come with the following caveats.

Restrictions for using the sed script to anonymize data

If you use the SEDCMD method to anonymize the data, the following restrictions apply:

  • The SEDCMD script applies only to the _raw field at index time. With the regular expression transform, you can apply changes to other fields.

Restrictions for using the regular expression transform to anonymize data

If you use the regular expression transform to anonymize data, include the LOOKAHEAD setting when you define the transform and set it to a number that is larger than the largest expected event. Otherwise, anonymization might fail.
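
The following Python sketch models why a LOOKAHEAD value that is too small can cause anonymization to fail. It is a simplified model, not the Splunk implementation: it assumes that LOOKAHEAD limits how many characters of the event the regular expression can see and that the default is 4096 characters (check the transforms.conf specification for your version). The function name search_with_lookahead and the sample event are hypothetical.

```python
import re

ASSUMED_DEFAULT_LOOKAHEAD = 4096  # assumption; verify in transforms.conf.spec


def search_with_lookahead(raw, regex, lookahead=ASSUMED_DEFAULT_LOOKAHEAD):
    """Search only the first `lookahead` characters of the event."""
    return re.search(regex, raw[:lookahead])


# A long event whose sensitive field starts after the assumed default window.
event = "x" * 5000 + "&SessionId=3A1785URH117BEA&"

missed = search_with_lookahead(event, r"SessionId=\w+")
found = search_with_lookahead(event, r"SessionId=\w+", lookahead=len(event) + 1)

print(missed is None)
print(found.group(0))
```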

Splunk Platform and Splunk Enterprise indexers do not parse structured data

When you forward structured data to the Splunk platform or a Splunk Enterprise indexer, the platform does not parse it, even if you configured a props.conf file on that indexer with the INDEXED_EXTRACTIONS setting. Forwarded data skips the following processing queues on the indexer, which precludes data parsing:

  • parsing
  • aggregation
  • typing

The forwarder must parse the data before it sends that data onward to the Splunk platform or the Splunk Enterprise indexer. To achieve this, you must set up a props.conf file on the forwarder that sends the data. This includes configuring the INDEXED_EXTRACTIONS setting and any other parsing, filtering, anonymizing, and routing rules.

Universal forwarders can only parse structured data. See Forward data extracted from structured data files.

Last modified on 13 March, 2024

This documentation applies to the following versions of Splunk® Enterprise: 9.3.0, 9.3.1, 9.3.2, 9.4.0

