Troubleshoot Splunk Analytics for Hadoop
This topic describes some of the issues you may have with various components of your configuration and possible ways to resolve those issues.
For more troubleshooting questions and answers, and to post questions yourself, search Splunk Answers.
Cluster issues
Issue: The NFS Gateway does not come up when you first bring up the cluster
Check the logs to see if it's a license issue. It's possible for the NFS Gateway to try to come up before you are able to apply your license and fail as a result. In such a case, bring up the cluster again once your license is installed.
Issue: Services fail to come up on a node
This could be a network problem. Try disabling iptables.
ZooKeeper
Issue: ZooKeeper is in a bad state
Try to reset the ZooKeeper using the following steps:
1. Shut down the service
2. Create a temporary backup:
mkdir /tmp/zkdata_backup
mv $MAPR_HOME/zkdata/version-2/* /tmp/zkdata_backup
3. Start up the MapR ZooKeeper service again
Search issues
Issue: Searches run very slowly
Searches take much longer than a comparable Hadoop job normally would. For example, a Hive job takes 6 minutes to complete, but Splunk Analytics for Hadoop takes 30 minutes to complete a similar job.
To resolve this, make sure Splunk Analytics for Hadoop is running an actual MapReduce Job and not simply streaming the results back from Hadoop:
- Splunk streaming results from Hadoop (not a MapReduce job):
Index=xyz
- Splunk streaming results from Hadoop (not a MapReduce job):
Index=xyz | stats count, run in Verbose Mode
- MapReduce jobs that leverage the report from Hadoop:
Index=xyz | stats count, run in Smart Mode
Issue: A reporting search throws an error message
If a reporting search throws the following error:
INFO mapred.JobClient: Cleaning up the staging area hdfs://qa-centos-amd64-26.sv.splunk.com:8020/user/apatil/.staging/job_201303061716_0033
ERROR security.UserGroupInformation: PriviledgedActionException as:apatil cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: job_201303061716_0033(-1 memForMapTasks -1 memForReduceTasks): Invalid job requirements.
at org.apache.hadoop.mapred.JobTracker.checkMemoryRequirements(JobTracker.java:5019)
Try adding the following parameters to indexes.conf:
vix.mapred.job.map.memory.mb = 2048
vix.mapred.job.reduce.memory.mb = 256
Issue: Splunk throws a failure message
For example:
[APACHE] External result provider name=APACHE asked to finalize the search
[APACHE] MapReduce job id=job_201303081521_0020 failed, state=FAILED, message=# of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201303081521_0020_m_000000
This sort of error appears because Java child processes are also running. Check the MapReduce logs for something like the following:
TaskTree [pid=7535,tipID=attempt_201303061716_0093_m_000000_0] is running beyond memory-limits. Current usage : 2467721216bytes. Limit : 2147483648bytes. Killing task.
To resolve this, edit indexes.conf as follows:
vix.mapred.child.java.opts = -server -Xmx1024m
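When checking the MapReduce logs, it can help to see how far over its limit the task went. The following is a minimal sketch, not part of Splunk Analytics for Hadoop: the function name `memory_overage` is hypothetical, and the pattern assumes the log line has the same shape as the "running beyond memory-limits" message shown above.

```python
import re

# Matches the usage and limit figures in a "running beyond memory-limits" line.
# Assumption: the line is shaped like the example message in this topic.
LIMIT_RE = re.compile(
    r"Current usage\s*:\s*(\d+)\s*bytes\.\s*Limit\s*:\s*(\d+)\s*bytes"
)

BYTES_PER_MB = 1024 * 1024


def memory_overage(log_line):
    """Return (usage_mb, limit_mb, over_mb), or None if the line does not match."""
    match = LIMIT_RE.search(log_line)
    if match is None:
        return None
    usage, limit = (int(group) for group in match.groups())
    return usage / BYTES_PER_MB, limit / BYTES_PER_MB, (usage - limit) / BYTES_PER_MB


line = (
    "TaskTree [pid=7535,tipID=attempt_201303061716_0093_m_000000_0] is running "
    "beyond memory-limits. Current usage : 2467721216bytes. "
    "Limit : 2147483648bytes. Killing task."
)
usage_mb, limit_mb, over_mb = memory_overage(line)
print(f"usage={usage_mb:.0f} MB, limit={limit_mb:.0f} MB, over by {over_mb:.0f} MB")
```

For the example line, the limit works out to 2048 MB, so the task exceeded its limit by a few hundred megabytes.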
Debugging
Issue: You need to debug searches
For example, you run a search and do not receive Hadoop results.
This is an indication that Splunk Analytics for Hadoop is not properly configured. To resolve this, enable debugging to find any configuration errors, then open the Job inspector:
1. In the menu, select Provider, then click edit for the provider.
2. Enable debugging by setting vix.splunk.search.debug to 1.
3. Rerun your search.
4. Open the Job inspector and click the link to the search.log file.
5. Search for the word DEBUG in the search.log file to find your error.
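If you have saved a copy of search.log locally, the last step can be scripted. This is a minimal sketch under that assumption: the helper names `debug_lines` and `debug_lines_in_file` are hypothetical, and the sample text consists of placeholder lines, not real Splunk output.

```python
from pathlib import Path


def debug_lines(log_text, keyword="DEBUG"):
    """Return (line_number, line) pairs for every line containing the keyword."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if keyword in line
    ]


def debug_lines_in_file(path, keyword="DEBUG"):
    """Same as debug_lines, but reads the text from a saved copy of search.log."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return debug_lines(text, keyword)


# Placeholder content only; a real search.log looks different.
sample = "\n".join([
    "INFO placeholder line one",
    "DEBUG placeholder line two",
    "INFO placeholder line three",
])
for number, line in debug_lines(sample):
    print(f"{number}: {line}")
```

To scan a real file, call `debug_lines_in_file` with the path where you saved the log.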
Authentication
Issue: Authentication errors occur when using Splunk Analytics for Hadoop with Kerberos
For example, you run MapReduce Jobs or try to access Hadoop data and get Kerberos exceptions.
Examples of errors:
java.lang.IllegalArgumentException: Server has invalid Kerberos principal
- In the Server.log you see:
SplunkMR - Failed to start MapReduce job. Please consult search.log for more information. Message: [ Failed to start MapReduce job, name=SPLK_<hunk_server>_1435283732.28_0 ] and [ Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: rm/<master_server>@<REALM>; Host Details : local host is: "<hunk_server>/192.X.X.X"; destination host is: "<master_server>":8050; ]
To resolve this:
1. Make sure that Kerberos has been installed on the Splunk node, then complete the following steps.
2. Configure user permissions. As the root user, add the Splunk user to the Kerberos database. Start the kadmin service by typing the following:
kadmin.local
Type the following command to create a Splunk user principal:
kadmin.local: addprinc -randkey splunk@EXAMPLE.COM
3. Create a keytab file in the /etc/security/keytabs directory using the following commands:
kadmin.local: xst -norandkey -k /etc/security/keytabs/splunk.headless.keytab splunk@EXAMPLE.COM
kadmin.local: exit
4. Set permissions for the keytab file for the Splunk user:
# chown splunk:hadoop /etc/security/keytabs/splunk.headless.keytab
# chmod 440 /etc/security/keytabs/splunk.headless.keytab
5. As the Splunk user, initialize the keytab file:
# su - splunk
$ kinit -kt /etc/security/keytabs/splunk.headless.keytab splunk@EXAMPLE.COM
6. Use the Splunk documentation to configure Kerberos. See Configure Kerberos Authentication.
7. If this does not help, try using the DEBUG command.
Hadoop-specific issues
Issue: Too many results from Hadoop
Example A: You examine HDFS and discover cases where the month and day are sometimes single-digit and sometimes double-digit:
/some/path/customer2/year=2016/month=3/day=23/somefiles
or
/some/path/customer2/year=2016/month=10/day=4/somefiles
This can be caused by HDFS time partitioning. The Splunk Analytics for Hadoop documentation describes how to capture the time by building a regex for a known number of digits.
For the majority of cases you can simply use the regex as shown in the documentation: Add a virtual index.
You can resolve this by capturing more than just the digits in the regex and using only a single digit in the format. For example:
vix.input.1.accept = \.avro$
vix.input.1.et.format = y/M/d
vix.input.1.et.regex = .*?/customer2/year=(\d+/)month=(\d+/)day=(\d+)/.*
vix.input.1.lt.format = y/M/d
vix.input.1.lt.offset = 86400
vix.input.1.lt.regex = .*?/customer2/year=(\d+/)month=(\d+/)day=(\d+)/.*
vix.input.1.path = /some/path/customer2/...
vix.provider = hdp23provider
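To see why capturing the trailing slash helps, the following sketch applies the same idea in Python. It rests on several assumptions: the capture groups are concatenated before the format is applied, Python's `%Y/%m/%d` stands in for the Java-style `y/M/d` format, and the pattern is written with lowercase year=, month=, and day= so that it matches the example paths above. The function name `partition_times` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Each group captures the digits plus the slash that follows, so the joined
# groups read "2016/3/23" whether the month and day have one digit or two.
ET_REGEX = re.compile(r".*?/customer2/year=(\d+/)month=(\d+/)day=(\d+)/.*")

# Mirrors vix.input.1.lt.offset: latest time is one day after earliest time.
LT_OFFSET_SECONDS = 86400


def partition_times(path):
    """Return (stamp, earliest, latest) for a path, or None if it does not match."""
    match = ET_REGEX.match(path)
    if match is None:
        return None
    stamp = "".join(match.groups())
    # %m and %d accept one or two digits, like a single M or d in the format.
    earliest = datetime.strptime(stamp, "%Y/%m/%d")
    latest = earliest + timedelta(seconds=LT_OFFSET_SECONDS)
    return stamp, earliest, latest


for example in (
    "/some/path/customer2/year=2016/month=3/day=23/somefiles",
    "/some/path/customer2/year=2016/month=10/day=4/somefiles",
):
    print(partition_times(example))
```

Both example paths parse with the same pattern and format, which is the point of capturing the separator instead of a fixed number of digits.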
Example B: You discover cases where the path contains multiple epoch timestamps for the HDFS time-capturing regex and time format, that is, the timestamps are in epoch time and the latest time (LT) and earliest time (ET) are both in the path:
/user/root/myarchive/db_1359855960_1357027260_0/journal.gz
In inputs.conf, check whether you have two epoch timestamps set in HDFS time partitioning. The following indexes.conf settings capture the first and second epoch timestamps in the path:
vix.input.1.et.format = epoch
vix.input.1.et.regex = /user/root/myarchive/db_\d+_(\d+)_.*
vix.input.1.lt.format = epoch
vix.input.1.lt.regex = /user/root/myarchive/db_(\d+)_\d+_.*
vix.input.1.path = /user/root/myarchive/...
vix.provider = 62hdprovider
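The two regular expressions can be checked against the example path before you put them in indexes.conf. This is a minimal sketch in Python, assuming Python's `re` module treats these particular patterns the same way as the regex engine used for the virtual index; the function name `epoch_bounds` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Same patterns as the et.regex and lt.regex settings above.
ET_REGEX = re.compile(r"/user/root/myarchive/db_\d+_(\d+)_.*")
LT_REGEX = re.compile(r"/user/root/myarchive/db_(\d+)_\d+_.*")


def epoch_bounds(path):
    """Return (earliest, latest) as epoch seconds, or None if the path does not match."""
    et_match = ET_REGEX.match(path)
    lt_match = LT_REGEX.match(path)
    if et_match is None or lt_match is None:
        return None
    return int(et_match.group(1)), int(lt_match.group(1))


path = "/user/root/myarchive/db_1359855960_1357027260_0/journal.gz"
earliest, latest = epoch_bounds(path)
print("earliest:", earliest, datetime.fromtimestamp(earliest, tz=timezone.utc))
print("latest:  ", latest, datetime.fromtimestamp(latest, tz=timezone.utc))
```

In the example path the first number is the latest time and the second is the earliest time, so the earliest-time pattern skips the first number and captures the second.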
Issue: Hadoop fails to start
Make sure that the user account has proper permission to the needed Hadoop directories.
Issue: Hadoop Server Job fails
Here are some examples that indicate that a Job crashed on the Hadoop Server side, and not on the Splunk Analytics for Hadoop side:
- A MapReduce error:
Error while waiting for MapReduce job to complete, job_id=XYZ
- Splunk Web errors while running a search:
Error while waiting for MapReduce job to complete, job_id=
To find the errors on the Hadoop Server side:
1. Click the link to the Job inspector.
2. In the Job inspector, click the link to the Hadoop Server logs.
3. Look for any Hadoop Server log errors. For example, if the Task Map Job timed out:
task_1465506338167_0007_m_000000 [MAP] AttemptID:attempt_1465506338167_0007_m_000000_0 Info:Error: java.io.IOException: Hunk timed out while waiting for package=/notvaliddirectory/splunk-6.4.1-debde650d26e-linux-2.6-x86_64.tgz to be installed.
If Hadoop is limited on resources, it may make sense to run more jobs but reduce the number of files each job processes.
By default the first job will process 100 blocks (Hadoop files), the second job will process 1,000 blocks, and all the remaining jobs will process 10,000 blocks.
Changing vix.splunk.search.mr.maxsplits to 5000 roughly doubles the number of jobs that Splunk Analytics for Hadoop produces, but each job takes fewer resources.
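The effect of the setting on the job count can be estimated with some arithmetic. The sketch below assumes the behavior described above: the first job takes up to 100 blocks, the second up to 1,000, and every later job takes up to the maxsplits value. Treating the first two sizes as fixed and capped by maxsplits is an assumption of this sketch, and the function name `count_jobs` is hypothetical.

```python
def count_jobs(total_blocks, max_splits=10000, ramp=(100, 1000)):
    """Estimate how many jobs are needed to process total_blocks blocks."""
    jobs = 0
    remaining = total_blocks
    # The first jobs use the smaller ramp-up sizes.
    for size in ramp:
        if remaining <= 0:
            return jobs
        remaining -= min(size, max_splits, remaining)
        jobs += 1
    # Every remaining job processes up to max_splits blocks (ceiling division).
    if remaining > 0:
        jobs += -(-remaining // max_splits)
    return jobs


total = 101_100  # illustrative block count, not taken from a real cluster
print("maxsplits=10000:", count_jobs(total, max_splits=10000), "jobs")
print("maxsplits=5000: ", count_jobs(total, max_splits=5000), "jobs")
```

Because the first two jobs keep their size, the total does not exactly double, but for large inputs the count of full-size jobs does.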
Issue: Hadoop jobs are not running fast enough or Splunk Analytics for Hadoop is processing too many files
Use the Job inspector to view Duration, Component, Invocations, Input Count, and Output Count for every phase of the search process.
Examine some of the following key components to find your performance issues.
Description of overall performance.
Fields | Comments | Examples |
---|---|---|
"This search has completed and has returned 5 results by scanning 5,438 events in 33.045 seconds" | | |
Command.stdin. | The number of events considered for the analytics based on time range | Duration: 23.15; Output Count (Event Count): 3,124 |
command.stdin.cpd2sr | The total number of events. Ideally this number should be the same as Command.stdin. If it is not, that may indicate a problem with the Time Capture Regex | Duration: 2.20; Output Count (Event Count): 5,438 |
Erp.<provider>.cache.bytes | The total number of bytes returned from the HDFS cache | Duration: 0.02; Input Count (split length): 4,696; Output Count (stream length in bytes, the important value): 19,973 |
Erp.<provider>.report.bytes | The total number of bytes returned from the HDFS MapReduce job | Duration: 0.01; Input Count (split length): 1,538; Output Count (stream length in bytes, the important value): 4,111 |
Erp.<provider>.stream.bytes | The total number of bytes returned from Splunk Analytics for Hadoop streaming. Normally the first few events are based on stream.bytes and the remaining are based on report.bytes | Duration: 26.95; Input Count (split length): 59,367,278; Output Count (stream length in bytes, the important value): 671,088,641 |
Erp.<provider>.vix.<vix>.dirs.listed | The total number of Hadoop directories Splunk Analytics for Hadoop must scan | Invocations: 365 (Number of Dirs) |
Erp.<provider>.vix.<vix>.files.listed | The total number of Hadoop files Splunk Analytics for Hadoop must scan | Invocations: 8,760 (Number of Files) |
Erp.<provider>.vix.<vix>.dir.filter.time | The total number of Hadoop directories removed from consideration due to time range. The number of directories actually used is dirs.listed minus dir.filter.time | Invocations: 211 (Number of Dirs) |
Erp.<provider>.MR.SPLK_<host>_<SID> | The MapReduce job generated in Hadoop and the time it took to run it | Duration: 43.94 |
Erp.<provider>.vix.<vix>.splits.generation.time | The time it takes to calculate the splits. Also see the provider flags maxsplits and minsplits | Duration: 0.18 |
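Two of the checks described in the table are simple arithmetic on values you read from the Job inspector. The sketch below applies them to the example values from the table; the function name `review_counts` and its parameter names are hypothetical, and you enter the numbers by hand because the sketch does not read anything from Splunk.

```python
def review_counts(stdin_events, cpd2sr_events, dirs_listed, dirs_filtered):
    """Apply two checks from the table to values copied from the Job inspector."""
    return {
        # A mismatch may indicate a problem with the time capture regex.
        "event_counts_match": stdin_events == cpd2sr_events,
        # Directories actually used: dirs.listed minus dir.filter.time.
        "dirs_used": dirs_listed - dirs_filtered,
    }


# Example values taken from the table above.
report = review_counts(
    stdin_events=3124,
    cpd2sr_events=5438,
    dirs_listed=365,
    dirs_filtered=211,
)
print(report)
```

With the example values, the event counts differ, which is the situation the table flags as a possible time capture regex problem.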
Issue: Hadoop database is out of space
When archiving data to HDFS, Splunk Analytics for Hadoop does not delete the files from HDFS. File cleanup must be done by the Hadoop Administrator. The HDFS directory structure that Splunk creates can help the Hadoop Administrator build their own purging scripts.
For more information about purging files, see Archive cold buckets to frozen.
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 8.0.0, 8.0.1