Crawl
This documentation does not apply to the most recent version of Splunk.
Use crawl to search your filesystem for new data sources to add to your index. Configure one or more types of crawlers in crawl.conf to define the type of data sources to include in or exclude from your results.
Configuration
Edit $SPLUNK_HOME/etc/system/local/crawl.conf to configure one or more crawlers that browse your data sources when you run the crawl command. Define each crawler by specifying values for each of the crawl attributes. Enable the crawler by adding it to crawlers_list.
Crawl logging
The crawl command produces a log of crawl activity that's stored in $SPLUNK_HOME/var/log/splunk/crawl.log. Set the logging level with the logging key in the [default] stanza of crawl.conf:
    [default]
    logging = <warn | error | info | debug>
Enable crawlers
Enable a crawler by listing the crawler specification stanza name in the crawlers_list key of the [crawlers] stanza.
Use a comma-separated list to specify multiple crawlers.
For example, to enable the crawlers defined in the stanzas [file_crawler], [port_crawler], and [db_crawler]:

    [crawlers]
    crawlers_list = file_crawler, port_crawler, db_crawler
Define crawlers
Define a crawler by adding a definition stanza to crawl.conf. To define more crawlers, add one stanza per crawler.
Example crawler stanzas in crawl.conf:
    [Example_crawler_name]
    ....

    [Another_crawler_name]
    ....
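Because a crawler must be both listed in crawlers_list and defined in its own stanza, it can help to check that the two agree before running crawl. The following Python sketch is not part of Splunk; it assumes crawl.conf is simple enough to be read by Python's configparser module, and the sample configuration and the function name check_crawlers are hypothetical.

```python
import configparser

# Hypothetical sample: port_crawler is enabled but has no definition stanza.
SAMPLE_CONF = """
[crawlers]
crawlers_list = file_crawler, port_crawler

[file_crawler]
index = main
"""


def check_crawlers(conf_text):
    """Return (enabled, missing).

    enabled: names listed in crawlers_list of the [crawlers] stanza.
    missing: enabled names that have no stanza of their own.
    """
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("crawlers", "crawlers_list", fallback="")
    enabled = [name.strip() for name in raw.split(",") if name.strip()]
    missing = [name for name in enabled if not parser.has_section(name)]
    return enabled, missing


enabled, missing = check_crawlers(SAMPLE_CONF)
print("enabled:", enabled)
print("missing definition:", missing)
```

To check a real file, read $SPLUNK_HOME/etc/system/local/crawl.conf into a string and pass it to the function. Files that use syntax configparser does not accept would need a different parser.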
Add key/value pairs to crawler definition stanzas to set a crawler's behavior. The following keys are available for defining a file_crawler:
| Argument | Description |
|---|---|
| bad_directories_list | Specify directories to exclude. |
| bad_extensions_list | Specify file extensions to exclude. |
| bad_file_matches_list | Specify a string, or a comma-separated list of strings, that filenames must contain to be excluded. You can use wildcards (examples: foo*.*, foo*bar, *baz*). |
| packed_extensions_list | Specify extensions of common archive filetypes to include. Splunk unpacks compressed files before it reads them. It can handle tar, gz, bz2, tar.gz, tgz, tbz, tbz2, zip, and z files. Leave this empty if you don't want to add any archive filetypes. |
| collapse_threshold | Specify the minimum number of files a source must have to be considered a directory. |
| days_sizek_pairs_list | Specify a comma-separated list of age (days) and size (kb) pairs to constrain what files are crawled. For example: days_sizek_pairs_list = 7-0, 30-1000 tells Splunk to crawl only files last modified within 7 days and at least 0kb in size, or modified within the last 30 days and at least 1000kb in size. |
| big_dir_filecount | Set the maximum number of files a directory can have in order to be crawled. crawl excludes directories that contain more than the maximum number you specify. |
| index | Specify the name of the index to which you want to add crawled file and directory contents. |
| max_badfiles_per_dir | Specify how far to crawl into a directory for files. If Splunk crawls a directory and doesn't find valid files within the specified max_badfiles_per_dir, then Splunk excludes the directory. |
| root | Specify directories for a crawler to crawl through. |
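The age and size rule for days_sizek_pairs_list is easier to follow when written out as logic. The Python sketch below only illustrates the rule as described in the table; it is not Splunk's implementation. The function names are hypothetical, and treating both limits as inclusive is an assumption.

```python
def parse_days_sizek_pairs(value):
    """Parse a value such as '7-0, 30-1000' into [(7, 0), (30, 1000)]."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        days, size_kb = item.split("-")
        pairs.append((int(days), int(size_kb)))
    return pairs


def is_crawlable(age_days, size_kb, pairs):
    """Return True if the file satisfies at least one (days, size) pair.

    A pair is satisfied when the file was modified within `days` days
    AND is at least `size` kb. Inclusive bounds are an assumption.
    """
    return any(
        age_days <= max_days and size_kb >= min_kb
        for max_days, min_kb in pairs
    )


pairs = parse_days_sizek_pairs("7-0, 30-1000")
print(is_crawlable(3, 1, pairs))      # recent file, any size
print(is_crawlable(20, 500, pairs))   # older file, under 1000 kb
print(is_crawlable(20, 2000, pairs))  # older file, at least 1000 kb
```

Under these assumptions, a 3-day-old file qualifies at any size, while a 20-day-old file qualifies only if it is at least 1000 kb. A file older than 30 days never qualifies.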
Example
Here's what an example crawler called simple_file_crawler might look like:
    [simple_file_crawler]
    bad_directories_list = bin, sbin, boot, mnt, proc, tmp, temp, home, mail, .thumbnails, cache, old
    bad_extensions_list = mp3, mpg, jpeg, jpg, m4, mcp, mid
    bad_file_matches_list = *example*, *makefile, core.*
    packed_extensions_list = gz, tgz, tar, zip
    collapse_threshold = 10
    days_sizek_pairs_list = 3-0, 7-1000, 30-10000
    big_dir_filecount = 100
    index = main
    max_badfiles_per_dir = 100
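To get a feel for which filenames the name-based filters in this example would exclude, the following Python sketch applies the bad_extensions_list and bad_file_matches_list values from simple_file_crawler to a few sample paths. It is an approximation, not Splunk's matching code: it substitutes Python's fnmatch module for the wildcard matching and assumes case-sensitive comparison against the filename only. The function name is_excluded and the sample paths are hypothetical.

```python
import fnmatch
import os

# Values copied from the simple_file_crawler example.
BAD_EXTENSIONS = ["mp3", "mpg", "jpeg", "jpg", "m4", "mcp", "mid"]
BAD_FILE_MATCHES = ["*example*", "*makefile", "core.*"]


def is_excluded(path):
    """Return True if the filename hits an excluded extension or pattern."""
    base = os.path.basename(path)
    # Extension without the leading dot, e.g. 'mp3' for 'song.mp3'.
    ext = os.path.splitext(base)[1].lstrip(".")
    if ext in BAD_EXTENSIONS:
        return True
    # fnmatchcase is a stand-in for Splunk's wildcard handling (assumption).
    return any(fnmatch.fnmatchcase(base, pat) for pat in BAD_FILE_MATCHES)


for sample in ["/var/log/messages", "/music/song.mp3",
               "/src/GNUmakefile", "/data/core.1234"]:
    print(sample, "excluded" if is_excluded(sample) else "kept")
```

In this sketch only /var/log/messages is kept; the other three paths match an excluded extension or pattern. Directory filters such as bad_directories_list are not modeled.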
This documentation applies to the following versions of Splunk: 3.3, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4, 3.4.1, 3.4.2, 3.4.3, 3.4.5, 3.4.6, 3.4.8, 3.4.9, 3.4.10, 3.4.11, 3.4.12, 3.4.13, 3.4.14.