Index data
This topic assumes that you use syslog-ng to write all your log files to a directory structure on your Splunk machine. This is neither the "best" deployment (Splunk forwarding agents on each logging machine) nor the quickest and dirtiest one (plain syslog, with Splunk grabbing the data directly from the UDP port). It suits this example because it delivers the data with a reasonable amount of structure and reliability without throwing the whole forwarder structure at you when you may be just beginning.
If you use syslog-ng over a TCP port, you can configure syslog-ng to write to a directory structure on the Splunk machine. It's easiest for Splunk if you split this out by source type and host -- for example, /var/log/j2eelog/j2eeserver1/, /var/log/j2eelog/j2eeserver2/, /var/log/weblog/weblogserver1/, /var/log/weblog/weblogserver2/, etc.
Note: In 4.0.x, you may have some problems trying to input files from /var/log due to contention with the *Nix app. The *Nix app is pretty cool -- it lets you see all kinds of operations data for your Unix box in Splunk -- but it generates a lot of data and uses /var/log for itself. You can get rid of it by going to $SPLUNK_HOME/etc/apps and deleting the unix directory or moving it out of the apps directory completely and backing it up somewhere else.
If you want to know a little more about the considerations around these different ways of inputting data, read the next section. If you just want to see how to input data, skip ahead.
Collecting data from remote machines
To get data into Splunk, you must add each data source as an input. The recommended method for a production environment is to install a Splunk forwarder on each machine where you want to collect data. However, if you are new to Splunk, or if you are running a test deployment and do not want to set up Splunk forwarders, you can configure syslog or syslog-ng to forward input data to a central Splunk index.
If you use syslog, you send all the logs from your servers over UDP to port 514 and set Splunk up to grab that stream directly. There are some problems with this:
- UDP has no handshaking and doesn't care if it drops a few packets every now and then. That means your logs can end up with holes in them.
- You don't get the full value of Splunk's information handling. Normally Splunk detects the host that sends the data, but if you use syslog, the default is to assume that the host is the syslog box. There are ways around this.
- Splunk also determines the source type -- the type of log the data came from, such as weblog or j2ee -- and uses this to detect useful patterns in the data. If you are using syslog, unless you configure it explicitly, Splunk is going to look at all your data and say "Hmm, it's from syslog. Must be syslog data (sourcetype=syslog)." There are also ways around this -- Splunk is incredibly configurable -- but it's a little much all at once.
Grab the web log data
If you are writing your logs to a central location, the easiest way to index the data is to monitor the log directory. When you select a file or directory to monitor, Splunk uploads and indexes the specified data, then continues to index new data in the file or directory as it comes in. You can specify a mounted or shared directory as long as the Splunk server has permission to read from the directory. Splunk detects log file rotation and does not process renamed files it has already indexed. See Monitor files and directories for more information.
This section steps through how to monitor the files in the example setup. Here are a couple of sample entries from the web logs:
2010-03-17 10:13:46,918 [WEB] INFO messageType = POST, messageStatus = INIT, accountNumber = COT4808718813, host = 10.52.60.56, messageDetails = Begin posting message to content store
2010-03-17 10:13:46,954 [WEB] INFO messageType = POST, messageStatus = TASK, accountNumber = COT4808718813, host = 10.52.60.56, messageDetails = Opening connection to host: [ www.contentstore.com:80 ]
Splunk loves these files. It eats them like jam (or chocolate). Each log record is on its own line and because of the nice way the key/value pairs are set up with the = sign, they are easy to deal with even after indexing. There isn't a lot of fuss or complication here.
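To see why this layout is easy to work with, here is a minimal Python sketch that splits one of the sample records into its timestamp, tag, level, and key/value pairs. It only illustrates the record structure; it is not how Splunk extracts fields, and the function name parse_weblog_line and both regular expressions are assumptions made for this example.

```python
import re

# Hypothetical patterns written for the sample entries above,
# not taken from any Splunk configuration.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<tag>[^\]]+)\] (?P<level>\w+) (?P<rest>.*)$"
)
# A pair looks like "key = value"; the value runs until the next
# ", key = " or the end of the line.
PAIR_RE = re.compile(r"(\w+) = (.*?)(?=, \w+ = |$)")


def parse_weblog_line(line):
    """Return a dict of fields for one web log record, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {name: match.group(name) for name in ("timestamp", "tag", "level")}
    # findall yields (key, value) tuples, which dict.update accepts directly.
    record.update(PAIR_RE.findall(match.group("rest")))
    return record


if __name__ == "__main__":
    sample = (
        "2010-03-17 10:13:46,918 [WEB] INFO messageType = POST, "
        "messageStatus = INIT, accountNumber = COT4808718813, "
        "host = 10.52.60.56, messageDetails = Begin posting message to content store"
    )
    for key, value in parse_weblog_line(sample).items():
        print(key, "->", value)
```

Because each record sits on a single line, the sketch needs no logic for deciding where one record ends and the next begins.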
To monitor the web log files:
1. Click Add Data in the Launcher.
OR
2. Go to the Search app, navigate to Manager (link at top right of the screen) and select Data inputs.
The Data input window is displayed.
3. Click Add New at the right of the Files & Directories row.
4. Select Monitor a file or directory.
5. For Full path on the server, enter /var/log/weblog/.... This syntax uses the ellipsis (...) wildcard to represent an arbitrary directory.
6. Select segment in path from the Set host menu and enter 4 for the segment #. This tells Splunk to use the name of the fourth directory (e.g., weblogserver1) in the path as the host name.
7. Select Manual from the Set sourcetype menu and enter weblog.
Note: Splunk recognizes a number of standard log formats. To assign one of Splunk's pretrained source types to a log, select From list and choose the correct source type, for example, log4j or weblogic_stdout. If you have logs in multiple formats in the directory you are monitoring and they are all pretrained source types, you can use Automatic to make your life easier.
8. Select test from the Index menu.
9. Make sure the Advanced options are blank. You can use these options if you are monitoring a directory with lots of subdirectories -- for example all of /var/log -- and you want to be selective about which subdirectories you bring into Splunk. You can also use these to tell Splunk to ignore existing data in the directory and only eat new data.
10. Click Save.
You need to restart Splunk for it to actually start loading data. But let's add the other logs first.
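The segment number in step 6 is easier to check with a small example. The Python sketch below counts path segments the way that step describes (1-based, starting after the leading slash). It is an illustration of the counting only, not Splunk's implementation; the helper name host_from_segment and the file name access.log are hypothetical.

```python
def host_from_segment(path, segment):
    """Return the Nth directory segment of path (1-based), as in 'segment in path'."""
    # Drop empty strings produced by the leading slash or doubled slashes.
    parts = [part for part in path.split("/") if part]
    if not 1 <= segment <= len(parts):
        raise ValueError(f"path has {len(parts)} segments, cannot take segment {segment}")
    return parts[segment - 1]


if __name__ == "__main__":
    # access.log is a made-up file name used only for this example.
    print(host_from_segment("/var/log/weblog/weblogserver1/access.log", 4))
    print(host_from_segment("/var/log/weblog/weblogserver2/access.log", 4))
```

For the example layout, segment 4 yields weblogserver1 and weblogserver2, while segment 3 would yield weblog, which is the source type directory rather than the host.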
Get the J2EE logs
Now get the J2EE logs in the same way. Here are some sample entries from those logs:
<TRANSACTION date="2010-03-17 10:13:49,756" activityCode="1060" subscriberID="103298280" accountNumber="COT2167944500" callerID="MAR10159LA" transactionStatus="COMPLETE" result="SUCCESS" host="10.34.51.89" comment="Invocation of Content API for sequenceNumber 103298280 Successful" />
<TRANSACTION date="2010-03-17 10:13:52,008" activityCode="1010" subscriberID="109000446" accountNumber="COT9138634144" callerID="MAR10249LA" transactionStatus="COMPLETE" result="SUCCESS" host="10.25.50.49" comment="Invocation of Content API for sequenceNumber 109000446 Successful" />
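Each of these entries is a self-closing XML element whose attributes carry the data. As a rough sketch of that structure, the Python below reads one entry with the standard library XML parser and returns its attributes as a dictionary. The function name parse_transaction is an assumption for this example, and this is not how Splunk itself processes the events.

```python
import xml.etree.ElementTree as ET


def parse_transaction(line):
    """Return the attributes of one <TRANSACTION ... /> entry as a dict."""
    element = ET.fromstring(line.strip())
    if element.tag != "TRANSACTION":
        raise ValueError(f"unexpected element: {element.tag}")
    return dict(element.attrib)


if __name__ == "__main__":
    sample = (
        '<TRANSACTION date="2010-03-17 10:13:49,756" activityCode="1060" '
        'subscriberID="103298280" accountNumber="COT2167944500" '
        'callerID="MAR10159LA" transactionStatus="COMPLETE" result="SUCCESS" '
        'host="10.34.51.89" comment="Invocation of Content API for '
        'sequenceNumber 103298280 Successful" />'
    )
    fields = parse_transaction(sample)
    print(fields["date"], fields["result"], fields["host"])
```

The sketch assumes one complete element per line, as in the samples.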
To monitor these logs:
1. In the Monitor a file or directory window, locate your web log data input and click Clone.
2. Change Full path on the server to /var/log/j2eelog/....
3. Change source type to j2eelog.
4. Click Save.
Get the API logs
Now for the API logs. Here are a couple of sample entries:
#### 2010-03-17 10:13:47,543
nameSpace: content.static.API
subscriberID: 107018813
callerID: TTCOV104435254-7305027
driver: content.jdbc.ContentDriver
callerAction: MAR10354LA
host: 10.52.60.28
connectionResult: SUCCESS
Details: Successfully updated contentDB
#### 2010-03-17 10:13:48,626
nameSpace: content.static.API
subscriberID: 3238231843
callerID: TTCOV106842965-5744617
driver: content.jdbc.ContentDriver
callerAction: MAR10899LA
host: 10.52.60.27
connectionResult: SUCCESS
Details: Successfully updated contentDB
Note: The next topic, Configure linebreaking, describes how to ensure Splunk figures out the correct boundaries between records.
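To make the record boundary problem concrete, here is a short Python sketch that groups the sample lines into records by treating every line that starts with #### as the beginning of a new record. It only illustrates the idea of a boundary marker and does not show Splunk's linebreaking configuration; the function name split_api_records is hypothetical.

```python
def split_api_records(text):
    """Group multi-line API log text into one dict per '####'-delimited record."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("####"):
            # A header line opens a new record and carries its timestamp.
            current = {"timestamp": line.lstrip("#").strip()}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return records


if __name__ == "__main__":
    sample = """#### 2010-03-17 10:13:47,543
nameSpace: content.static.API
subscriberID: 107018813
connectionResult: SUCCESS
Details: Successfully updated contentDB
#### 2010-03-17 10:13:48,626
nameSpace: content.static.API
subscriberID: 3238231843
connectionResult: SUCCESS
Details: Successfully updated contentDB"""
    for record in split_api_records(sample):
        print(record["timestamp"], record["subscriberID"])
```

Without a rule like this, a line-oriented reader would treat each of the nine lines in a record as a separate event.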
First, monitor these logs:
1. In the Monitor a file or directory window, locate your web log data input and click Clone.
2. Change Full path on the server to /var/log/apilog/....
3. Change source type to apilog.
4. Click Save to go back to the Data Inputs (Files) window.
Get the database error logs
1. Go back to Splunk Web. The Data Inputs (Files) window should still be available.
2. Once again, locate your web log data input and click Clone.
3. For Full path on the server, enter /var/log/mysqld/....
4. Change source type to mysqld.
5. Click Save.
Other ways to get data
You do not have to configure each input directly. In a production environment, you would probably choose to monitor all of /var/log and use configurations to set the host, source, and source type of the different files. See About default fields for more information.
This documentation applies to the following versions of Splunk: 4.1, 4.1.1, 4.1.2, 4.1.3, 4.1.4, 4.1.5, 4.1.6, 4.1.7, 4.1.8