Troubleshoot Splunk Analytics for Hadoop
This topic describes some of the issues you may have with various components of your configuration and possible ways to resolve those issues.
For more troubleshooting questions and answers, and to post questions yourself, search Splunk Answers.
Issue: The NFS Gateway does not come up when you first bring up the cluster
Check the logs to see if it's a license issue. It's possible for the NFS Gateway to try to come up before you are able to apply your license and fail as a result. In such a case, bring up the cluster again once your license is installed.
Issue: Services fail to come up on a node
This could be a network problem. Try disabling iptables.
Issue: ZooKeeper is in a bad state
Try resetting ZooKeeper using the following steps:
1. Shut down the service
2. Create a temporary backup:
mkdir /tmp/zkdata_backup
mv $MAPR_HOME/zkdata/version-2/* /tmp/zkdata_backup
3. Start up the MapR ZooKeeper service again
Issue: Searches run very slowly
Searches take much longer than a Hadoop Job normally would. For example, a Hive job takes 6 minutes to complete, but Splunk Analytics for Hadoop takes 30 minutes to complete a similar Job.
To resolve this, make sure Splunk Analytics for Hadoop is running an actual MapReduce Job and not simply streaming the results back from Hadoop:
- Splunk streaming results from Hadoop (Not MapReduce Jobs)
index=xyz | stats count, run in Verbose Mode
- MapReduce Jobs that leverage the report from Hadoop
index=xyz | stats count, run in Smart Mode
Issue: A reporting search throws an error message
If a reporting search throws the following error:
INFO mapred.JobClient: Cleaning up the staging area hdfs://qa-centos-amd64-26.sv.splunk.com:8020/user/apatil/.staging/job_201303061716_0033
ERROR security.UserGroupInformation: PriviledgedActionException as:apatil cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: job_201303061716_0033(-1 memForMapTasks -1 memForReduceTasks): Invalid job requirements.
at org.apache.hadoop.mapred.JobTracker.checkMemoryRequirements(JobTracker.java:5019)
Try adding the following parameters to
vix.mapred.job.map.memory.mb = 2048
vix.mapred.job.reduce.memory.mb = 256
Issue: Splunk throws a failure message
[APACHE] External result provider name=APACHE asked to finalize the search
[APACHE] MapReduce job id=job_201303081521_0020 failed, state=FAILED, message=# of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201303081521_0020_m_000000
This sort of error appears because Java child processes are also running. Check the MapReduce logs for something like the following:
TaskTree [pid=7535,tipID=attempt_201303061716_0093_m_000000_0] is running beyond memory-limits. Current usage : 2467721216bytes. Limit : 2147483648bytes. Killing task.
To resolve this, edit indexes.conf as follows:
vix.mapred.child.java.opts = -server -Xmx1024m
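When reading memory-limit messages like the one above, it can help to convert the byte counts to megabytes. The following Python sketch does that for the sample line. The regular expression is an assumption derived only from that one sample, and `parse_memory_limit_line` is a hypothetical helper, not part of Splunk or Hadoop.

```python
import re

# Pattern assumed from the single sample log line above; real logs may differ.
_TASKTREE = re.compile(
    r"Current usage\s*:\s*(\d+)\s*bytes\.\s*Limit\s*:\s*(\d+)\s*bytes"
)

def parse_memory_limit_line(line):
    """Return (usage_mb, limit_mb, over_mb), or None if the line does not match."""
    m = _TASKTREE.search(line)
    if m is None:
        return None
    usage, limit = int(m.group(1)), int(m.group(2))
    mb = 1024 * 1024
    return usage / mb, limit / mb, (usage - limit) / mb

sample = (
    "TaskTree [pid=7535,tipID=attempt_201303061716_0093_m_000000_0] is running "
    "beyond memory-limits. Current usage : 2467721216bytes. "
    "Limit : 2147483648bytes. Killing task."
)
result = parse_memory_limit_line(sample)
```

For the sample line, the limit works out to 2048 MB and the task exceeded it by roughly 305 MB.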
Issue: You need to debug searches
For example, you run a search and do not receive Hadoop results.
This is an indication that Splunk Analytics for Hadoop is not properly configured. To resolve this, enable debugging to find any configuration errors, then open the Job inspector:
1. In the menu, select Provider, then click Edit for the provider.
2. Enable debugging by changing the value of
vix.splunk.search.debug = to
3. Rerun your search.
4. Open the Job inspector and click the link to the
5. Search for the word DEBUG in the search.log file to find your error.
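If you have saved a copy of search.log, the last step can also be done outside the Job inspector. The following is a minimal Python sketch that keeps only the lines containing the word DEBUG. The `debug_lines` helper and the sample log text are hypothetical and only illustrate the filtering step.

```python
def debug_lines(log_text):
    """Return the lines of a search.log dump that contain the word DEBUG."""
    return [line for line in log_text.splitlines() if "DEBUG" in line]

# Hypothetical log excerpt for illustration; not real Splunk output.
sample_log = "\n".join([
    "INFO  example: search started",
    "DEBUG example: provider setting loaded",
    "ERROR example: something failed",
])
matches = debug_lines(sample_log)
```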
Issue: Authentication errors occur when using Splunk Analytics for Hadoop with Kerberos
For example, you run MapReduce Jobs or try to access Hadoop data and get Kerberos exceptions.
Examples of errors:
java.lang.IllegalArgumentException: Server has invalid Kerberos principal
- In the
SplunkMR - Failed to start MapReduce job. Please consult search.log for more information. Message: [ Failed to start MapReduce job, name=SPLK_<hunk_server>_1435283732.28_0 ] and [ Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: rm/<master_server>@<REALM>; Host Details : local host is: "<hunk_server>/192.X.X.X"; destination host is: "<master_server>":8050; ]
To resolve this:
1. Make sure that Kerberos has been installed on the Splunk node.
2. Configure user permissions.
As the root user, add the Splunk user to the Kerberos database. Start the kadmin service, then type the following command to create a Splunk user principal:
kadmin.local: addprinc -randkey splunk@EXAMPLE.COM
3. Create a keytab file in the /etc/security/keytabs directory using the following commands:
kadmin.local: xst -norandkey -k /etc/security/keytabs/splunk.headless.keytab splunk@EXAMPLE.COM
kadmin.local: exit
4. Set permissions for the keytab file for the Splunk user:
# chown splunk:hadoop /etc/security/keytabs/splunk.headless.keytab
# chmod 440 /etc/security/keytabs/splunk.headless.keytab
5. As the Splunk user, initialize the keytab file:
# su - splunk
$ kinit -kt /etc/security/keytabs/splunk.headless.keytab splunk@EXAMPLE.COM
6. Use the Splunk documentation to configure Kerberos. See Configure Kerberos Authentication.
7. If this does not help, try using the DEBUG command.
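After setting the keytab permissions in the steps above, you may want to confirm that the file mode is what you expect. The following Python sketch checks the permission bits of a file against 440. The `keytab_mode_ok` helper is hypothetical and is not a Splunk or Kerberos tool. It does not check ownership.

```python
import os
import stat

def keytab_mode_ok(path, expected=0o440):
    """Return True if the file's permission bits equal the expected mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == expected
```

For example, `keytab_mode_ok("/etc/security/keytabs/splunk.headless.keytab")` returns True only when the mode is exactly 440.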
Issue: Too many results from Hadoop
Example A: You examine HDFS and discover cases where the month and day are sometimes single digit and sometimes double digit.
This can be caused by HDFS time partitioning. The Splunk Analytics for Hadoop documentation describes how to capture the time by building a regex for a known number of digits.
For the majority of cases, you can simply use the regex as shown in the documentation: Add a virtual index.
You can resolve this by capturing more than just the digits in the regex and using only a single digit in the format. For example:
vix.input.1.accept = \.avro$
vix.input.1.et.format = y/M/d
vix.input.1.et.regex = .*?/customer2/Year=(\d+/)Month=(\d+/)Day=(\d+)/.*
vix.input.1.lt.format = y/M/d
vix.input.1.lt.offset = 86400
vix.input.1.lt.regex = .*?/customer2/Year=(\d+/)Month=(\d+/)Day=(\d+)/.*
vix.input.1.path = /some/path/customer2/...
vix.provider = hdp23provider
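To see why capturing the slashes helps, the following Python sketch applies the same regex to two hypothetical paths, one with single-digit and one with double-digit month and day. It assumes the captured groups are concatenated before the time format is applied, and it uses Python's `strptime` as a stand-in for the `y/M/d` format, so it only approximates what Splunk Analytics for Hadoop does. The `partition_time_range` helper and the sample paths are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Regex copied from the configuration above.
ET_REGEX = r".*?/customer2/Year=(\d+/)Month=(\d+/)Day=(\d+)/.*"

def partition_time_range(path, lt_offset=86400):
    """Return (earliest, latest) datetimes for a path, or None if it does not match."""
    m = re.match(ET_REGEX, path)
    if m is None:
        return None
    # The slashes are part of the captures, so joining the groups
    # gives a string such as "2016/1/5".
    stamp = "".join(m.groups())
    # %m and %d accept one or two digits when parsing.
    earliest = datetime.strptime(stamp, "%Y/%m/%d")
    # The latest time is the same day plus the offset in seconds.
    return earliest, earliest + timedelta(seconds=lt_offset)

# Hypothetical paths for illustration.
single = partition_time_range("/some/path/customer2/Year=2016/Month=1/Day=5/data.avro")
double = partition_time_range("/some/path/customer2/Year=2016/Month=11/Day=25/data.avro")
```

Both paths parse with the same regex and format, which is the point of capturing the separators instead of a fixed number of digits.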
Example B: You discover cases with multiple epoch timestamps for the HDFS time capturing regex and time format, including cases where the timestamp is marked as epoch time and the LT and ET are both in the path:
In inputs.conf, check whether you have two epoch timestamps set in HDFS time partitioning.
The following indexes.conf configuration captures the first and second parts of the epoch timestamp:
vix.input.1.et.format = epoch
vix.input.1.et.regex = /user/root/myarchive/db_\d+_(\d+)_.*
vix.input.1.lt.format = epoch
vix.input.1.lt.regex = /user/root/myarchive/db_(\d+)_\d+_.*
vix.input.1.path = /user/root/myarchive/...
vix.provider = 62hdprovider
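The following Python sketch shows which part of a path each of the two regexes above captures. The sample path and the `epoch_range` helper are hypothetical. The sketch only demonstrates the regex behavior, not how Splunk Analytics for Hadoop applies it.

```python
import re

# Regexes copied from the configuration above.
ET_REGEX = r"/user/root/myarchive/db_\d+_(\d+)_.*"
LT_REGEX = r"/user/root/myarchive/db_(\d+)_\d+_.*"

def epoch_range(path):
    """Return (earliest, latest) epoch seconds, or None if the path does not match."""
    et = re.match(ET_REGEX, path)
    lt = re.match(LT_REGEX, path)
    if et is None or lt is None:
        return None
    return int(et.group(1)), int(lt.group(1))

# Hypothetical archive path for illustration.
sample = "/user/root/myarchive/db_1457000000_1456000000_12_ABC/journal.gz"
earliest, latest = epoch_range(sample)
```

Here the second number in the directory name becomes the earliest time and the first number becomes the latest time.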
Issue: Hadoop fails to start
Make sure that the user account has proper permission to the needed Hadoop directories.
Issue: Hadoop Server Job fails
Here are some examples that indicate that a Job crashed on the Hadoop Server side, and not on the Splunk Analytics for Hadoop side:
- A MapReduce error:
Error while waiting for MapReduce job to complete, job_id=XYZ
- Splunk Web errors while running a search.
- Error while waiting for MapReduce job to complete,
To find the errors on the Hadoop Server side:
1. Click the link to the Job inspector.
2. In the Job inspector, click the link to the Hadoop Server logs.
3. Look for any Hadoop Server log errors. For example, if the Task Map Job timed out:
task_1465506338167_0007_m_000000 [MAP] AttemptID:attempt_1465506338167_0007_m_000000_0 Info:Error: java.io.IOException: Hunk timed out while waiting for package=/notvaliddirectory/splunk-6.4.1-debde650d26e-linux-2.6-x86_64.tgz to be installed.
If Hadoop is limited on resources, it may make sense to run more jobs but reduce the number of files each job processes.
By default the first job will process 100 blocks (Hadoop files), the second job will process 1,000 blocks, and all the remaining jobs will process 10,000 blocks.
Setting vix.splunk.search.mr.maxsplits to 5000 doubles the number of Jobs Splunk Analytics for Hadoop will produce, but each job takes fewer resources.
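The effect on the job count can be estimated with a short calculation. The following Python sketch assumes that the first two jobs take 100 and 1,000 blocks, and that the maxsplits value caps every later job (10,000 by default, as described above). The `count_jobs` helper and the total of 101,100 blocks are hypothetical, and the real scheduler may behave differently.

```python
def count_jobs(total_splits, maxsplits=10000, ramp=(100, 1000)):
    """Estimate how many MapReduce jobs are needed under the assumed schedule."""
    jobs = 0
    remaining = total_splits
    # The first jobs process a small, fixed number of blocks.
    for size in ramp:
        if remaining <= 0:
            return jobs
        remaining -= min(size, maxsplits)
        jobs += 1
    # Every remaining job processes up to maxsplits blocks (ceiling division).
    if remaining > 0:
        jobs += -(-remaining // maxsplits)
    return jobs

default_jobs = count_jobs(101_100)        # 2 ramp-up jobs + 10 full jobs
halved_jobs = count_jobs(101_100, 5000)   # 2 ramp-up jobs + 20 smaller jobs
```

Under these assumptions the count goes from 12 jobs to 22, which is close to double once the ramp-up jobs are a small share of the total.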
Issue: Hadoop jobs are not running fast enough or Splunk Analytics for Hadoop is processing too many files
Use the Job inspector to view Duration, Component, Invocations, Input Count, and Output Count for every phase of the search process.
Examine some of the following key components to find your performance issues.
Description of overall performance: "This search has completed and has returned 5 results by scanning 5,438 events in 33.045 seconds"
|Command.stdin.||The number of events considered for the analytics based on time range||Duration: 23.15; Output Count (Event Count): 3,124|
|command.stdin.cpd2sr||The total number of events. Ideally this number should be the same as Command.stdin. If it is not, that may indicate a problem with the Time Capture Regex||Duration: 2.20; Output Count (Event Count): 5,438|
|Erp.<provider>.cache.bytes||The total number of bytes returned from the HDFS Cache||Duration: 0.02; Input Count (split Length): 4,696; Output Count (stream Length in bytes - The important value): 19,973|
|Erp.<provider>.report.bytes||The total number of bytes returned from the HDFS MapReduce Job||Duration: 0.01; Input Count (split Length): 1,538; Output Count (stream Length in bytes - The important value): 4,111|
|Erp.<provider>.stream.bytes||The total number of bytes returned from Splunk Analytics for Hadoop streaming. Normally the first few events are based on stream.bytes and the remaining are based on report.bytes||Duration: 26.95; Input Count (split Length): 59,367,278; Output Count (stream Length in bytes - The important value): 671,088,641|
|Erp.<provider>.vix.<vix>.dirs.listed||The total number of Hadoop directories Splunk Analytics for Hadoop must scan||Invocations: 365 (Number of Dirs)|
|Erp.<provider>.vix.<vix>.files.listed||The total number of Hadoop Files Splunk Analytics for Hadoop must scan||Invocations: 8,760 (Number of Files)|
|Erp.<provider>.vix.<vix>.dir.filter.time||The total number of Hadoop files removed from consideration due to time range. The total number actually used is dirs.listed - dir.filter.time||Invocations: 211 (Number of Dirs)|
|Erp.<provider>.MR.SPLK_<host>_<SID>||The MapReduce Job generated in Hadoop and the time it took to run it.||Duration: 43.94|
|Erp.<provider>.vix.<vix>.splits.generation.time||The time it takes to calculate the splits. Also see the Provider flags maxsplits and minsplits.|
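Two of the checks described in the table are simple arithmetic on the Job inspector values. The following Python sketch applies them to the example numbers shown above. The `summarize` helper and the dictionary keys are hypothetical names chosen for illustration, not Job inspector field names.

```python
def summarize(metrics):
    """Derive two quick indicators from Job inspector values."""
    # Directories actually used: dirs.listed minus dir.filter.time.
    dirs_used = metrics["dirs_listed"] - metrics["dir_filter_time"]
    # A mismatch between the two event counts may point at the time capture regex.
    counts_match = metrics["stdin_events"] == metrics["cpd2sr_events"]
    return {"dirs_used": dirs_used, "event_counts_match": counts_match}

# Example values taken from the table above.
summary = summarize({
    "dirs_listed": 365,
    "dir_filter_time": 211,
    "stdin_events": 3124,
    "cpd2sr_events": 5438,
})
```

With the example values, 154 directories remain in use and the two event counts do not match.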
Issue: Unable to save files in Hive file formats
If you get errors when you try to save files in Hive file formats, you may have version mismatches between your Hadoop instances and your Hive instances. Hadoop 2.x or below requires versions of Hive that are 2.x or below. Hadoop 3.x requires versions of Hive that are at least 3.x. If you configure a Hadoop cluster that is version 3.x with a Hive instance that is version 2.x or lower, you will run into connectivity issues when you try to save a file in a Hive file format.
For more information, see Configure Hive connectivity.
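The version rule above can be expressed as a small check. The following Python sketch compares major version numbers only. The `hive_compatible` helper is hypothetical, and it encodes nothing beyond the rule stated in this section.

```python
def hive_compatible(hadoop_version, hive_version):
    """Return True if the Hadoop and Hive major versions follow the rule above."""
    hadoop_major = int(hadoop_version.split(".")[0])
    hive_major = int(hive_version.split(".")[0])
    if hadoop_major >= 3:
        # Hadoop 3.x requires Hive 3.x or later.
        return hive_major >= 3
    # Hadoop 2.x or below requires Hive 2.x or below.
    return hive_major <= 2
```

For example, `hive_compatible("3.1.0", "2.3.9")` returns False, which corresponds to the mismatch described above.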
Issue: Hadoop database is out of space
Splunk Analytics for Hadoop cannot automatically delete files that have been archived to HDFS. It is up to the Hadoop Administrator to clean up old or obsolete files. The Hadoop Administrator can create file purging scripts that utilize the HDFS structure created by Splunk.
For more information about purging files, see Archive cold buckets to frozen.
This documentation applies to the following versions of Splunk® Enterprise: 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10, 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 9.0.0