Admin Manual

 



This documentation does not apply to the most recent version of Splunk.

Crawl

Use crawl to search your filesystem for new data sources to add to your index. Configure one or more types of crawlers in crawl.conf to define the type of data sources to include in or exclude from your results.


Configuration

Edit $SPLUNK_HOME/etc/system/local/crawl.conf to configure one or more crawlers that browse your data sources when you run the crawl command. Define each crawler by specifying values for each of the crawl attributes. Enable the crawler by adding it to crawlers_list.

Crawl logging

The crawl command produces a log of crawl activity that's stored in $SPLUNK_HOME/var/log/splunk/crawl.log. Set the logging level with the logging key in the [default] stanza of crawl.conf:

[default]
logging = <warn | error | info | debug>

Enable crawlers

Enable a crawler by listing the crawler specification stanza name in the crawlers_list key of the [crawlers] stanza.

Use a comma-separated list to specify multiple crawlers.

For example, to enable the crawlers defined in the stanzas [file_crawler], [port_crawler], and [db_crawler]:

[crawlers]
crawlers_list = file_crawler, port_crawler, db_crawler
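
To check which crawlers a given crawl.conf enables, you can read the crawlers_list key outside of Splunk. The following is a minimal sketch, not Splunk's own parser: it assumes the file is close enough to INI syntax for Python's configparser to read it, and the function name enabled_crawlers and the SAMPLE text are hypothetical.

```python
import configparser

# Sample crawl.conf content mirroring the stanzas shown above.
SAMPLE = """
[default]
logging = warn

[crawlers]
crawlers_list = file_crawler, port_crawler, db_crawler
"""


def enabled_crawlers(conf_text):
    """Return the crawler names listed in crawlers_list, in order."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback="" covers a missing [crawlers] stanza or a missing key.
    raw = parser.get("crawlers", "crawlers_list", fallback="")
    # Split the comma-separated list and drop surrounding whitespace.
    return [name.strip() for name in raw.split(",") if name.strip()]


print(enabled_crawlers(SAMPLE))
```

Each name returned should match a crawler definition stanza in the same file; a name without a matching stanza is worth checking for typos.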

Define crawlers

Define a crawler by adding a definition stanza to crawl.conf. To define more crawlers, add more stanzas.

Example crawler stanzas in crawl.conf:

[Example_crawler_name]
....
[Another_crawler_name]
....

Add key/value pairs to crawler definition stanzas to set a crawler's behavior. The following keys are available for defining a file_crawler:

Argument                 Description
bad_directories_list     Specify directories to exclude.
bad_extensions_list      Specify file extensions to exclude.
bad_file_matches_list    Specify a string, or a comma-separated list of strings that filenames must contain to be excluded. You can use wildcards (examples: foo*.*,foo*bar, *baz*).
packed_extensions_list   Specify extensions of common archive filetypes to include. Splunk unpacks compressed files before it reads them. It can handle tar, gz, bz2, tar.gz, tgz, tbz, tbz2, zip, and z files. Leave this empty if you don't want to add any archive filetypes.
collapse_threshold       Specify the minimum number of files a source must have to be considered a directory.
days_sizek_pairs_list    Specify a comma-separated list of age (days) and size (kb) pairs to constrain what files are crawled. For example: days_sizek_pairs_list = 7-0, 30-1000 tells Splunk to crawl only files last modified within 7 days and at least 0kb in size, or modified within the last 30 days and at least 1000kb in size.
big_dir_filecount        Set the maximum number of files a directory can have in order to be crawled. crawl excludes directories that contain more than the maximum number you specify.
index                    Specify the name of the index to which you want to add crawled file and directory contents.
max_badfiles_per_dir     Specify how far to crawl into a directory for files. If Splunk crawls a directory and doesn't find valid files within the specified max_badfiles_per_dir, then Splunk excludes the directory.
root                     Specify directories for a crawler to crawl through.
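
The days_sizek_pairs_list rule is easier to reason about when written out as logic. The following sketch follows the description above: a file qualifies if it satisfies at least one age-size pair. It is an illustration only, not Splunk's implementation; the function names parse_pairs and passes are hypothetical, and treating "within N days" as an inclusive bound is an assumption.

```python
def parse_pairs(value):
    """Turn a string such as '7-0, 30-1000' into [(7, 0), (30, 1000)]."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        days, size_kb = item.split("-", 1)
        pairs.append((int(days), int(size_kb)))
    return pairs


def passes(age_days, size_kb, pairs):
    """True if the file is recent enough AND large enough for any one pair."""
    # Assumption: "within N days" means age_days <= N (inclusive).
    return any(age_days <= max_days and size_kb >= min_kb
               for max_days, min_kb in pairs)


pairs = parse_pairs("7-0, 30-1000")
print(passes(3, 1, pairs))      # small file modified 3 days ago
print(passes(10, 500, pairs))   # 10 days old, under the 1000kb minimum
print(passes(10, 2000, pairs))  # 10 days old, over the 1000kb minimum
```

Under this reading, older files have to be larger to qualify, which keeps small stale files out of the crawl results.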


Example

Here's what an example crawler called simple_file_crawler may look like:

[simple_file_crawler]
bad_directories_list = bin, sbin, boot, mnt, proc, tmp, temp, home, mail, .thumbnails, cache, old
bad_extensions_list = mp3, mpg, jpeg, jpg, m4, mcp, mid
bad_file_matches_list = *example*, *makefile, core.*
packed_extensions_list = gz, tgz, tar, zip
collapse_threshold = 10
days_sizek_pairs_list = 3-0, 7-1000, 30-10000
big_dir_filecount = 100
index = main
max_badfiles_per_dir = 100
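
To get a feel for which filenames the exclusion lists in this example would reject, the sketch below applies the bad_extensions_list and bad_file_matches_list values to a few sample paths. It is a rough approximation rather than Splunk's matching code: it assumes patterns are shell-style wildcards compared case-sensitively against the file's base name, and the helper names split_list and is_excluded are hypothetical.

```python
import fnmatch
import os

# Values copied from the simple_file_crawler example above.
BAD_EXTENSIONS = "mp3, mpg, jpeg, jpg, m4, mcp, mid"
BAD_FILE_MATCHES = "*example*, *makefile, core.*"


def split_list(value):
    """Split a comma-separated conf value into trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def is_excluded(path, bad_extensions, bad_patterns):
    """True if the file's extension or base name matches an exclusion."""
    base = os.path.basename(path)
    ext = os.path.splitext(base)[1].lstrip(".")
    if ext in bad_extensions:
        return True
    # Assumption: wildcards behave like shell globs on the base name.
    return any(fnmatch.fnmatchcase(base, pattern) for pattern in bad_patterns)


extensions = split_list(BAD_EXTENSIONS)
patterns = split_list(BAD_FILE_MATCHES)
for candidate in ["/var/log/messages", "/music/song.mp3", "/tmp/core.1234", "/src/makefile"]:
    print(candidate, is_excluded(candidate, extensions, patterns))
```

Running a handful of representative paths through a check like this before enabling a crawler can help catch patterns that are broader than intended, such as *example* matching legitimate log files.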

This documentation applies to the following versions of Splunk: 3.3, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4, 3.4.1, 3.4.2, 3.4.3, 3.4.5, 3.4.6, 3.4.8, 3.4.9, 3.4.10, 3.4.11, 3.4.12, 3.4.13, 3.4.14

