Anonymize data
You might need to anonymize, or mask, sensitive personal information from the data that you index into the Splunk platform, such as credit card or Social Security numbers. You can anonymize parts of confidential fields in events to protect privacy while providing enough remaining data for use in event tracking.
To anonymize data with Splunk Cloud Platform, you must configure a Splunk Enterprise instance as a heavy forwarder and anonymize the incoming data with that instance before sending it to Splunk Cloud Platform.
There are two ways to anonymize data with a heavy forwarder:
- Use the SEDCMD setting. This setting exists in the props.conf configuration file, which you configure on the heavy forwarder. It acts like a sed *nix script to do replacements and substitutions. This method is more straightforward, takes less time to configure, and is slightly faster than a regular expression transform. But there are limits to how many times you can invoke the SEDCMD setting and what it can do. For instructions on this method, see Anonymize data with a sed script.
- Use a regular expression (regex) transform. This method takes longer to configure, but is less complex to modify after the initial configuration. You can also assign this method to multiple data inputs more flexibly. For instructions on this method, see Anonymize data with a regular expression transform.
Both of these options are also available in Splunk Enterprise, where you can complete the configuration on either a heavy forwarder or an indexer.
Prerequisites to anonymize data
Before you can anonymize data, you must select a set of events to anonymize.
- First, you select the events to anonymize.
- Then, you either:
  - Use the props.conf configuration file to anonymize the events with a sed script
  - Use the props.conf and transforms.conf configuration files to anonymize the events with a regular expression transform
Select events to anonymize
You can anonymize event data based on whether the data comes from a specific source or host, or whether the data is tagged with a specific source type. You specify how to select the data in the props.conf configuration file. The stanza name that you specify in the props.conf file determines how the Splunk platform selects and processes events for anonymization.
Refer to the following stanza specifications:
- The [host::<host>] stanza matches events from the specified host.
- The [source::<source>] stanza matches events with the specified source.
- The [<sourcetype>] stanza matches events with the specified source type. You must subsequently specify the source type in the inputs.conf file for this stanza type to work. This option is a Splunk best practice.
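The three stanza types can be sketched in a props.conf file as follows. This is a minimal illustration, not a recommended configuration: the host name and the source type name are hypothetical placeholders, and in practice you would typically use only one of the three stanzas for a given set of events.

```ini
# props.conf on the heavy forwarder.
# The host name and source type name below are hypothetical placeholders.

# Select events by host.
[host::appserver01]
SEDCMD-mask_ssn = s/ss=\d{5}(\d{4})/ss=XXXXX\1/g

# Select events by source (the path that the data comes from).
[source::/opt/appserver/logs/accounts.log]
SEDCMD-mask_ssn = s/ss=\d{5}(\d{4})/ss=XXXXX\1/g

# Select events by source type. The same source type must also be
# assigned to the data input in the inputs.conf file.
[example_sourcetype]
SEDCMD-mask_ssn = s/ss=\d{5}(\d{4})/ss=XXXXX\1/g
```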
Replace strings in events with a sed script
You can use a sed script with the SEDCMD setting either to replace strings or to substitute characters.
Refer to the following syntax for a sed-style replacement:
SEDCMD-<class> = s/<regex>/<replacement>/flags
The SEDCMD setting has the following components:
- regex is a Perl-compatible regular expression. It represents what you want to replace.
- replacement is the string that replaces whatever the regular expression matches.
- flags can be either the letter g to replace all matches or a number to replace a specified match.
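As a rough illustration of the two kinds of flags, the following props.conf sketch masks every token value in an event but only the first user value. The source type name, the class names, and the token and user fields are all hypothetical.

```ini
# props.conf (hypothetical source type and field names)
[example_sourcetype]
# g flag: mask every token=<value> pair in the event.
SEDCMD-mask_tokens = s/token=\w+/token=MASKED/g
# Numeric flag: mask only the first user=<value> pair in the event.
SEDCMD-mask_first_user = s/user=\w+/user=MASKED/1
```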
Anonymize multiline events using sed expressions
The Splunk platform doesn't support applying sed expressions in multiline mode. To use a sed expression to anonymize multiline events, use two sed expressions in succession: first remove the newlines, and then perform the additional replacements. For example, the following search uses the rex command to replace all newline characters in a multiline event that contains HTML content, and then redacts all of the HTML content.
index=main html
| rex mode=sed field=_raw "s/\\n/NEWLINE_REMOVED/g"
| rex mode=sed field=_raw "s/<html.*html>/REDACTED/g"
Substitute characters in events with a sed script
Refer to the following syntax for a sed character substitution:
SEDCMD-<class> = y/<string1>/<string2>/
This substitutes each occurrence of a character in string1 with the character at the same position in string2.
Use a regular expression transform with transforms.conf to anonymize events
Each stanza in the transforms.conf configuration file defines a transform class that you can reference from the props.conf file for a given source type, source, or host.
Transforms have several settings and variables that let you specify what changes and where, but the following settings are the most important:
- The REGEX setting specifies the regular expression that points to the string in the event that you want to anonymize.
- The FORMAT setting specifies the masked values.
- The $1 variable represents the text of the event before the string that you want to mask. It holds the text that the first capture group in REGEX matches.
- The $2 variable represents the text of the event after the portion that you want to mask. It holds the text that the second capture group in REGEX matches.
- The DEST_KEY = _raw setting writes the value from FORMAT to the raw value in the log. This anonymizes the event.
The regular expression processor does not handle multiline events. In cases where events span multiple lines, specify that the event is multiline by placing the string (?m) before the regular expression in the transforms.conf file.
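The settings described above can be sketched together as follows. This is a minimal example under stated assumptions: the account-anonymizer stanza name, the AccountId field, and the source type name are hypothetical and do not come from a real deployment.

```ini
# transforms.conf (the stanza name and the AccountId field are hypothetical)
[account-anonymizer]
# Group 1 captures everything before the value. Group 2 captures the
# last four digits of the value and the rest of the event.
REGEX = (?m)^(.*)AccountId=\d+(\d{4}.*)$
# Rebuild the event with a fixed-width mask between the two groups.
FORMAT = $1AccountId=####$2
DEST_KEY = _raw

# props.conf (the source type name is hypothetical)
[example_sourcetype]
TRANSFORMS-anonymize = account-anonymizer
```

With this sketch, an event that contains AccountId=1234567890 would be rewritten to contain AccountId=####7890. Because the pattern requires at least one digit before the final four, a value with fewer than five digits is left unchanged.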
Anonymize data with a sed script
You can anonymize data by using a sed script to replace or substitute strings in events.
Sed is a *nix utility that reads a file and modifies the input based on commands that you use within or arguments that you supply to the utility. Many *nix users use the utility for its versatility and fast transformation of incoming data. You can use a sed-like syntax in the props.conf file to script the masking of your data in the Splunk platform.
The following example shows how you can mask data in a file.
Suppose you have a log file called accounts.log that contains Social Security and credit card numbers:
...
ss=123456789, cc=1234-5678-9012-3456
ss=123456790, cc=2234-5678-9012-3457
ss=123456791, cc=3234-5678-9012-3458
ss=123456792, cc=4234-5678-9012-3459
...
You want to mask the fields, so that they appear like this:
...
ss=XXXXX6789, cc=XXXX-XXXX-XXXX-3456
ss=XXXXX6790, cc=XXXX-XXXX-XXXX-3457
ss=XXXXX6791, cc=XXXX-XXXX-XXXX-3458
ss=XXXXX6792, cc=XXXX-XXXX-XXXX-3459
...
You can use the inputs.conf and props.conf configuration files to change the data that comes in from the accounts.log file as the Splunk platform accesses it. These configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory on a heavy forwarder or on a Splunk Enterprise indexer.
Requirements for anonymizing data with a sed script
You must meet the following requirements to anonymize data with a sed script:
- Have data that you want to anonymize
- Have an understanding of how regular expressions work
- Have an inputs.conf configuration file that points to where the data you want to anonymize is located
- Have a props.conf configuration file that references the sed script that anonymizes the data
Configure the inputs.conf file to use a sed script
In this example, you create the source type SSN-CC-Anon and assign it to the data input for the accounts.log file. The transform that you create uses this source type to know what data to transform. While there are other options available for using SEDCMD to transform incoming data from a log file, as a best practice, create a source type, then assign the transform to that source type in the props.conf file.
- On the machine that runs the heavy forwarder, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory if it doesn't already exist.
- Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
- Add the following stanza to reference the accounts.log file and assign a source type to the accounts.log data.
[monitor:///opt/appserver/logs/accounts.log]
sourcetype = SSN-CC-Anon
- Save the file and close it.
Define the sed script in props.conf
In this example, props.conf uses the SEDCMD setting to perform the transformation directly.
The -Anon clause after the SEDCMD stem can be any string that helps you identify what the transformation script does. The clause must exist because it and the SEDCMD stem form the class name for the script. The text after the equal sign (=) is the sed script that performs the transformation.
- On the machine that runs the heavy forwarder, create a props.conf file in the $SPLUNK_HOME/etc/system/local directory if it doesn't already exist.
- Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
- Add the following stanza to reference the source type that you created in the inputs.conf file to do the masking transformation.
[SSN-CC-Anon]
SEDCMD-Anon = s/ss=\d{5}(\d{4})/ss=XXXXX\1/g s/cc=(\d{4}-){3}(\d{4})/cc=XXXX-XXXX-XXXX-\2/g
- Save the file and close it.
- Restart the heavy forwarder.
Anonymize data with a regular expression transform
You can mask data by creating a transform. Transforms take incoming data and change it based on configurations you supply. In this case, the transformation is the replacement of portions of the data with characters that obscure the real, sensitive data, while retaining the original data format.
Suppose you have an application server log file called MyAppServer.log that contains events like the following:
"2006-09-21, 02:57:11.58", 122, 11, "Path=/LoginUser Query=CrmId=ClientABC& ContentItemId=TotalAccess&SessionId=3A1785URH117BEA&Ticket=646A1DA4STF896EE& SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195, Content=", ""
"2006-09-21, 02:57:11.60", 122, 15, "UserData:<User CrmId="clientabc" UserId="p12345678"><EntitlementList></EntitlementList></User>", ""
"2006-09-21, 02:57:11.60", 122, 15, "New Cookie: SessionId=3A1785URH117BEA& Ticket=646A1DA4STF896EE&CrmId=clientabcUserId=p12345678&AccountId=&AgentHost=man& AgentId=man, MANUser: Version=1&Name=&Debit=&Credit=&AccessTime=&BillDay=&Status= &Language=&Country=&Email=&EmailNotify=&Pin=&PinPayment=&PinAmount=&PinPG= &PinPGRate=&PinMenu=&", ""
You want to change the data so that the SessionId and Ticket fields are masked, with only the last four characters of each value exposed, and the events appear as follows:
"2006-09-21, 02:57:11.58", 122, 11, "Path=/LoginUser Query=CrmId=ClientABC& ContentItemId=TotalAccess&SessionId=########7BEA&Ticket=########96EE& SessionTime=25368&ReturnUrl=http://www.clientabc.com, Method=GET,IP=209.51.249.195, Content=", ""
You can use the inputs.conf, props.conf, and transforms.conf files to change the data that comes in from the MyAppServer.log file as the Splunk platform accesses it. All of these configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory on a heavy forwarder or on a Splunk Enterprise indexer.
Requirements for anonymizing data with a regular expression transform
To mask sensitive data, you must meet the following requirements:
- Have data that you want to anonymize
- Have an understanding of how regular expressions work
- Have an inputs.conf configuration file that points to where this data is located
- Have a transforms.conf configuration file that does the data masking
- Have a props.conf configuration file that references the transforms.conf file for the data that you want to mask
Configure inputs.conf
In this example, you create the MyAppServer-Anon source type. The transform you create uses this source type to know what data to transform. You can choose from other options for selecting the data to transform.
Follow these steps to configure the inputs.conf file for this example:
- On the machine that runs the heavy forwarder, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory if the file doesn't already exist.
- Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.
- Add the following stanza to reference the MyAppServer.log file and assign a source type to the MyAppServer.log data.
[monitor:///opt/MyAppServer/logs/MyAppServer.log]
sourcetype = MyAppServer-Anon
- Save the file and close it.
Configure the transforms.conf file
Splunk Enterprise uses the transforms.conf file to perform the transformation of the data. Follow these steps to configure the transforms.conf file for this example:
- On the machine that runs the heavy forwarder, create a transforms.conf file in the $SPLUNK_HOME/etc/system/local directory if the file doesn't already exist.
- Open $SPLUNK_HOME/etc/system/local/transforms.conf with a text editor.
- Add the following text to define the transform that anonymizes the SessionId field so that only the last four characters in the field are exposed:
[session-anonymizer]
REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
FORMAT = $1SessionId=########$2
DEST_KEY = _raw
- Add the following text directly underneath the session-anonymizer stanza to define the transform for the Ticket field, similar to the SessionId field:
[ticket-anonymizer]
REGEX = (?m)^(.*)Ticket=\w+(\w{4}&.*)$
FORMAT = $1Ticket=########$2
DEST_KEY = _raw
- Save the file and close it.
Configure the props.conf configuration file
The props.conf file specifies the transforms to use to anonymize your data. It references one or more transform classes that you define in a transforms.conf file.
In this example, session-anonymizer and ticket-anonymizer are the transform class names that you defined in the transforms.conf file.
Follow these steps to configure the props.conf file for this example:
- On the machine that runs the heavy forwarder, create a props.conf file in the $SPLUNK_HOME/etc/system/local directory if the file doesn't already exist.
- Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.
- Add the following stanza to reference the transforms that you created in the transforms.conf file to do the masking transformation.
[MyAppServer-Anon]
TRANSFORMS-anonymize = session-anonymizer, ticket-anonymizer
- Save the file and close it.
- Restart the heavy forwarder.
Example of substitution using a sed/SEDCMD script
Suppose you want to index the file abc.log, and you want to substitute the capital letters "A", "B", and "C" for every lowercase "a", "b", or "c" in your events.
Add the following stanza and settings to your props.conf file:
[source::.../abc.log]
SEDCMD-abc = y/abc/ABC/
The Splunk platform substitutes "A" for each "a", "B" for each "b", and "C" for each "c". When you search for source="*/abc.log", the lowercase letters "a", "b", and "c" do not appear in your data.
Caveats for anonymizing data
Anonymizing data can come with the following caveats.
Restrictions for using the sed script to anonymize data
If you use the SEDCMD method to anonymize the data, the following restrictions apply:
- The SEDCMD script applies only to the _raw field at index time. With the regular expression transform, you can apply changes to other fields.
Restrictions for using the regular expression transform to anonymize data
If you use the regular expression transform to anonymize data, include the LOOKAHEAD setting when you define the transform and set it to a number that is larger than the largest expected event. Otherwise, anonymization might fail.
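For instance, the session-anonymizer transform from this topic could be extended as in the following sketch. The LOOKAHEAD value shown is an illustrative assumption, not a recommendation. Choose a value based on the size of your own events.

```ini
# transforms.conf (the LOOKAHEAD value is an illustrative assumption)
[session-anonymizer]
REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
FORMAT = $1SessionId=########$2
DEST_KEY = _raw
# How many characters into the event the regular expression searches.
# Set this higher than the length of your largest expected event.
LOOKAHEAD = 16384
```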
Splunk Platform and Splunk Enterprise indexers do not parse structured data
When you forward structured data to the Splunk platform or a Splunk Enterprise indexer, the platform does not parse it, even if you configured a props.conf file on that indexer with the INDEXED_EXTRACTIONS setting. Forwarded data skips the following processing queues on the indexer, which precludes data parsing:
- parsing
- aggregation
- typing
The forwarder must parse the data before it sends that data onward to the Splunk platform or the Splunk Enterprise indexer. To achieve this, you must set up a props.conf file on the forwarder that sends the data. This includes configuring the INDEXED_EXTRACTIONS setting and any other parsing, filtering, anonymizing, and routing rules.
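A forwarder-side props.conf file for structured data might look like the following minimal sketch. The source type name is hypothetical, and the example assumes that the incoming files are in CSV format.

```ini
# props.conf on the forwarder that sends the data.
# The source type name is hypothetical, and CSV input is an assumption.
[example_csv_sourcetype]
# Parse the structured file on the forwarder before sending it onward.
INDEXED_EXTRACTIONS = csv
# Any other parsing, filtering, anonymizing, and routing rules for this
# data also belong in the configuration on this forwarder.
```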
Universal forwarders can only parse structured data. See Forward data extracted from structured data files.
This documentation applies to the following versions of Splunk Cloud Platform™: 9.1.2308, 8.2.2201, 8.2.2202, 8.2.2203, 9.0.2205, 9.0.2208, 9.0.2209, 9.0.2303, 9.0.2305, 8.2.2112