Troubleshoot Splunk Analytics for Hadoop

This topic describes some of the issues you may have with various components of your configuration and possible ways to resolve those issues.

For more troubleshooting questions and answers, and to post questions yourself, search Splunk Answers.

Cluster issues

Issue: The NFS Gateway does not come up when you first bring up the cluster

Check the logs to see if it's a license issue. It's possible for the NFS Gateway to try to come up before you are able to apply your license and fail as a result. In such a case, bring up the cluster again once your license is installed.

Issue: Services fail to come up on a node

This could be a network problem. Try disabling iptables.

ZooKeeper

Issue: ZooKeeper is in a bad state

Try resetting ZooKeeper using the following steps:

1. Shut down the ZooKeeper service.

2. Create a temporary backup:

mkdir /tmp/zkdata_backup
mv $MAPR_HOME/zkdata/version-2/* /tmp/zkdata_backup

3. Start up the MapR ZooKeeper service again
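The backup in step 2 can also be scripted. The following is a minimal Python sketch, assuming the ZooKeeper service is already stopped. The function name backup_zk_data and its parameters are hypothetical and are not part of any MapR or Splunk tooling.

```python
import shutil
from pathlib import Path


def backup_zk_data(version_dir, backup_dir):
    """Move every entry in version_dir into backup_dir and return the moved names."""
    src = Path(version_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)  # like: mkdir /tmp/zkdata_backup
    moved = []
    # Unlike the shell glob "*", iterdir() also returns hidden files.
    for entry in sorted(src.iterdir()):
        shutil.move(str(entry), str(dst / entry.name))
        moved.append(entry.name)
    return moved
```

For the paths used above, the call would be backup_zk_data(os.path.join(os.environ["MAPR_HOME"], "zkdata", "version-2"), "/tmp/zkdata_backup"), with os imported.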

Search issues

Issue: Searches run very slowly

Searches take much longer than a comparable Hadoop job normally would. For example, a Hive job takes 6 minutes to complete, but Splunk Analytics for Hadoop takes 30 minutes to complete a similar job.

To resolve this, make sure Splunk Analytics for Hadoop is running an actual MapReduce job and not simply streaming the results back from Hadoop. For example:

Splunk streams results from Hadoop (no MapReduce job):

index=xyz

Splunk streams results from Hadoop (no MapReduce job) when the search runs in Verbose Mode:

index=xyz | stats count

Splunk runs a MapReduce job that leverages the report from Hadoop when the search runs in Smart Mode:

index=xyz | stats count

Issue: A reporting search throws an error message

If a reporting search throws the following error:

INFO mapred.JobClient: Cleaning up the staging area hdfs://qa-centos-amd64-26.sv.splunk.com:8020/user/apatil/.staging/job_201303061716_0033 
ERROR security.UserGroupInformation: PriviledgedActionException as:apatil cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException:
job_201303061716_0033(-1 memForMapTasks -1 memForReduceTasks): Invalid job requirements.
at org.apache.hadoop.mapred.JobTracker.checkMemoryRequirements(JobTracker.java:5019)

Try adding the following parameters to indexes.conf:

vix.mapred.job.map.memory.mb = 2048
vix.mapred.job.reduce.memory.mb = 256

Issue: Splunk throws a failure message

For example:

[APACHE] External result provider name=APACHE asked to finalize the search
[APACHE] MapReduce job id=job_201303081521_0020 failed, state=FAILED, message=# of failed Map Tasks exceeded allowed limit. FailedCount: 1.   
LastFailedTask: task_201303081521_0020_m_000000

This sort of error can appear when a task, together with the Java child processes it runs, uses more memory than the configured limit. Check the MapReduce logs for something like the following:

TaskTree [pid=7535,tipID=attempt_201303061716_0093_m_000000_0] is running beyond memory-limits. 
Current usage : 2467721216bytes. Limit : 2147483648bytes. Killing task.

To resolve this, edit indexes.conf as follows:

vix.mapred.child.java.opts = -server -Xmx1024m
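To see how far a task exceeded its limit, you can convert the byte counts in the log message to megabytes. The following is a minimal Python sketch that parses the sample message above. The function name memory_overage is hypothetical, and the pattern assumes the message wording shown in this topic.

```python
import re

# Matches the usage and limit figures in a "beyond memory-limits" message.
USAGE_RE = re.compile(
    r"Current usage\s*:\s*(\d+)\s*bytes\.\s*Limit\s*:\s*(\d+)\s*bytes"
)


def memory_overage(log_text):
    """Return (usage_mib, limit_mib, over_mib), or None if no figures are found."""
    match = USAGE_RE.search(log_text)
    if match is None:
        return None
    usage, limit = (int(group) for group in match.groups())
    mib = 1024 * 1024
    return usage / mib, limit / mib, (usage - limit) / mib


sample = (
    "TaskTree [pid=7535,tipID=attempt_201303061716_0093_m_000000_0] "
    "is running beyond memory-limits. \n"
    "Current usage : 2467721216bytes. Limit : 2147483648bytes. Killing task."
)
usage_mib, limit_mib, over_mib = memory_overage(sample)
print(f"usage={usage_mib:.0f} MiB limit={limit_mib:.0f} MiB over={over_mib:.0f} MiB")
# prints: usage=2353 MiB limit=2048 MiB over=305 MiB
```

In the sample, the limit of 2147483648 bytes is exactly 2048 MiB, and the task exceeded it by about 305 MiB.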

Debugging

Issue: You need to debug searches

For example, you run a search and do not receive Hadoop results.

This is an indication that Splunk Analytics for Hadoop is not properly configured. To resolve this, enable debugging to find any configuration errors, then open the Job inspector:

1. In the menu, select Provider, then click Edit for the provider.

2. Enable debugging by setting vix.splunk.search.debug to 1.

3. Rerun your search.

4. Open the Job inspector and click the link to the search.log file.

5. Search for the word DEBUG in the search.log file to find your error.
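If you save a copy of search.log locally, step 5 can be scripted. The following is a minimal Python sketch. The function name debug_lines is hypothetical, and the file path you pass in is whatever location you saved the log to.

```python
def debug_lines(path, needle="DEBUG"):
    """Yield (line_number, text) for each line in the file that contains needle."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                yield number, line.rstrip("\n")
```

For example, iterating over debug_lines("search.log") and printing each pair lists every DEBUG entry with its line number.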

Authentication

Issue: Authentication errors occur when using Splunk Analytics for Hadoop with Kerberos

For example, you run MapReduce Jobs or try to access Hadoop data and get Kerberos exceptions.

Examples of errors:

java.lang.IllegalArgumentException: Server has invalid Kerberos principal

In the Server.log you see:

SplunkMR - Failed to start MapReduce job.  Please consult search.log for more information. Message: [ Failed to start MapReduce job, name=SPLK_<hunk_server>_1435283732.28_0 ] and [ Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: rm/<master_server>@<REALM>; Host Details : local host is: "<hunk_server>/192.X.X.X"; destination host is: "<master_server>":8050;  ]

To resolve this:

1. Make sure that Kerberos has been installed on the Splunk node, then complete the following steps.

2. Configure user permissions. As the root user, add the Splunk user to the Kerberos database. Start kadmin by typing the following:

kadmin.local

Then type the following command to create a Splunk user principal:

kadmin.local: addprinc -randkey splunk@EXAMPLE.COM

3. Create a keytab file in the /etc/security/keytabs directory using the following commands:

kadmin.local: xst -norandkey -k /etc/security/keytabs/splunk.headless.keytab splunk@EXAMPLE.COM
kadmin.local: exit

4. Set permissions for the keytab file for the Splunk user:

# chown splunk:hadoop /etc/security/keytabs/splunk.headless.keytab
# chmod 440 /etc/security/keytabs/splunk.headless.keytab

5. As the Splunk user, initialize the keytab file:

# su - splunk
$ kinit -kt /etc/security/keytabs/splunk.headless.keytab splunk@EXAMPLE.COM

6. Configure Kerberos as described in "Configure Kerberos Authentication" in the Splunk documentation.

7. If this does not help, try using the DEBUG command.
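Before you rerun kinit, it can help to confirm that the keytab permissions from step 4 are still in place. The following is a minimal Python sketch. The function name keytab_mode_ok is hypothetical, and it checks only the permission bits, not the owner or group.

```python
import os
import stat


def keytab_mode_ok(path, expected=0o440):
    """Return True if the file's permission bits equal expected (default 440)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == expected
```

For the keytab created above, the call would be keytab_mode_ok("/etc/security/keytabs/splunk.headless.keytab").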

Hadoop-specific issues

Issue: Too many results from Hadoop

Example A: You examine HDFS and discover cases where the month and day are sometimes single-digit and sometimes double-digit:

/some/path/customer2/year=2016/month=3/day=23/somefiles

or

/some/path/customer2/year=2016/month=10/day=4/somefiles

This can be caused by HDFS time partitioning. The Splunk Analytics for Hadoop documentation describes how to capture the time by building a regex for a known number of digits.

In the majority of cases, you can simply use the regex as shown in the documentation. See "Add a virtual index".

When the number of digits varies, you can resolve this by capturing more than just the digits in the regex and using only a single digit in the format. For example:

vix.input.1.accept = \.avro$
vix.input.1.et.format = y/M/d
vix.input.1.et.regex = .*?/customer2/Year=(\d+/)Month=(\d+/)Day=(\d+)/.*
vix.input.1.lt.format = y/M/d
vix.input.1.lt.offset = 86400
vix.input.1.lt.regex = .*?/customer2/Year=(\d+/)Month=(\d+/)Day=(\d+)/.*
vix.input.1.path = /some/path/customer2/...
vix.provider = hdp23provider
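You can try a time-capturing regex against sample paths before you add it to indexes.conf. The following is a minimal Python sketch, not the provider's own implementation. It assumes that the captured groups are concatenated before the format is applied, which the "/" inside the groups suggests. The pattern is case-sensitive, so the sketch uses lowercase year=, month=, and day= keys to match the example paths above. The function name partition_date is hypothetical, and %Y/%m/%d stands in for the y/M/d format.

```python
import re
from datetime import datetime

# Assumption: lowercase keys, to match the example paths; matching is case-sensitive.
PATH_RE = re.compile(r".*?/customer2/year=(\d+/)month=(\d+/)day=(\d+)/.*")


def partition_date(path):
    """Join the captured groups and parse them as year/month/day, or return None."""
    match = PATH_RE.match(path)
    if match is None:
        return None
    joined = "".join(match.groups())  # for example "2016/3/23"
    return datetime.strptime(joined, "%Y/%m/%d")  # %m and %d accept 1 or 2 digits


for sample_path in (
    "/some/path/customer2/year=2016/month=3/day=23/somefiles",
    "/some/path/customer2/year=2016/month=10/day=4/somefiles",
):
    print(partition_date(sample_path).date())
# prints 2016-03-23, then 2016-10-04
```

A path whose keys use different capitalization, such as Year=, returns None in this sketch. Check that the capitalization in your regex matches your actual directory names.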

Example B: You discover paths that contain multiple epoch timestamps, where both the latest time (LT) and the earliest time (ET) appear in the path as epoch time:

/user/root/myarchive/db_1359855960_1357027260_0/journal.gz

In inputs.conf, check whether you have two epoch timestamps set in HDFS time partitioning. The following indexes.conf settings capture the first and second epoch timestamps in the path:

vix.input.1.et.format = epoch
vix.input.1.et.regex = /user/root/myarchive/db_\d+_(\d+)_.*
vix.input.1.lt.format = epoch
vix.input.1.lt.regex = /user/root/myarchive/db_(\d+)_\d+_.*
vix.input.1.path = /user/root/myarchive/...
vix.provider = 62hdprovider
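To confirm which number each regex captures, you can run both patterns against the sample path. The following is a minimal Python sketch that reuses the two regexes from the settings above. The function name bucket_time_range is hypothetical.

```python
import re

# The second number in the directory name is the earliest time (ET).
ET_RE = re.compile(r"/user/root/myarchive/db_\d+_(\d+)_.*")
# The first number in the directory name is the latest time (LT).
LT_RE = re.compile(r"/user/root/myarchive/db_(\d+)_\d+_.*")


def bucket_time_range(path):
    """Return (earliest, latest) epoch seconds from the path, or None if no match."""
    et = ET_RE.match(path)
    lt = LT_RE.match(path)
    if et is None or lt is None:
        return None
    return int(et.group(1)), int(lt.group(1))


print(bucket_time_range("/user/root/myarchive/db_1359855960_1357027260_0/journal.gz"))
# prints (1357027260, 1359855960)
```

The earliest time is smaller than the latest time, which confirms that the ET regex must capture the second number and the LT regex the first.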

Issue: Hadoop fails to start

Make sure that the user account has the proper permissions for the required Hadoop directories.

Issue: Hadoop Server Job fails

Here are some examples that indicate that a job crashed on the Hadoop Server side, and not on the Splunk Analytics for Hadoop side:

A MapReduce error: Error while waiting for MapReduce job to complete, job_id=XYZ
Splunk Web errors while running a search: Error while waiting for MapReduce job to complete, job_id=

To find the errors on the Hadoop Server side:

1. Click the link to the Job inspector.

2. In the Job inspector, click the link to the Hadoop Server logs.

3. Look for any Hadoop Server log errors. For example, if the map task timed out:

task_1465506338167_0007_m_000000 [MAP]
AttemptID:attempt_1465506338167_0007_m_000000_0 Info:Error: java.io.IOException: Hunk timed out while waiting for package=/notvaliddirectory/splunk-6.4.1-debde650d26e-linux-2.6-x86_64.tgz to be installed.

Issue: Jobs crash Hadoop during long-running jobs, or you see many resource-related errors on the Hadoop server

If Hadoop is limited on resources, it may make sense to run more jobs but reduce the number of files each job processes.

By default the first job will process 100 blocks (Hadoop files), the second job will process 1,000 blocks, and all the remaining jobs will process 10,000 blocks.

Changing vix.splunk.search.mr.maxsplits to 5000 doubles the number of Jobs Splunk Analytics for Hadoop will produce, but each job takes fewer resources.
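The effect on the number of jobs can be estimated with a simplified model. The following is a minimal Python sketch, assuming each job takes ten times as many splits as the previous one, starting at 100 and capped at maxsplits. That schedule is inferred from the defaults described above and is an assumption, not the product's documented algorithm. The function name count_jobs and the first and growth parameters are hypothetical.

```python
def count_jobs(total_splits, maxsplits=10000, first=100, growth=10):
    """Count jobs when each job takes growth times more splits, capped at maxsplits."""
    jobs = 0
    per_job = first
    remaining = total_splits
    while remaining > 0:
        take = min(per_job, maxsplits)  # splits handled by this job
        remaining -= take
        jobs += 1
        per_job = min(per_job * growth, maxsplits)  # next job's size
    return jobs


print(count_jobs(1_000_000))                  # prints 102
print(count_jobs(1_000_000, maxsplits=5000))  # prints 202
```

Under this model, a search over 1,000,000 splits goes from 102 jobs to 202 jobs, which is roughly double. For small searches the difference is smaller, because the first two jobs are the same size in both cases.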

Issue: Hadoop jobs are not running fast enough or Splunk Analytics for Hadoop is processing too many files

Use the Job inspector to view Duration, Component, Invocations, Input Count, and Output Count for every phase of the search process.

Examine some of the following key components to find your performance issues.

Description of overall performance.

Fields	Comments	Examples
	"This search has completed and has returned 5 results by scanning 5,438 events in 33.045 seconds"
Command.stdin.	The number of events considered for the analytics based on time range	Duration: 23.15 Output Count (Event Count): 3,124
command.stdin.cpd2sr	The total number of events. Ideally this number should be the same as Command.stdin. If it is not, that may indicate a problem with the Time Capture Regex	Duration: 2.20 Output Count (Event Count): 5,438
Erp.<provider>.cache.bytes	The total number of Bytes that returned from the HDFS Cache	Duration: 0.02 Input Count (split Length): 4,696 Output Count (stream Length in bytes - The important value): 19,973
Erp.<provider>.report.bytes	The total number of Bytes that returned from the HDFS MapReduce Job	Duration: 0.01 Input Count (split Length): 1,538 Output Count (stream Length in bytes - The important value): 4,111
Erp.<provider>.stream.bytes	The total number of Bytes that returned from Splunk Analytics for Hadoop streaming. Normally the first few events are based on stream.bytes and the remaining are based on report.bytes	Duration: 26.95 Input Count (split Length): 59,367,278 Output Count (stream Length in bytes - The important value): 671,088,641
Erp.<provider>.vix.<vix>.dirs.listed	The total number of Hadoop directories Splunk Analytics for Hadoop must scan	Invocations: 365 (Number of Dirs)
Erp.<provider>.vix.<vix>.files.listed	The total number of Hadoop Files Splunk Analytics for Hadoop must scan	Invocations: 8,760 (Number of Files)
Erp.<provider>.vix.<vix>.dir.filter.time	The total number of Hadoop directories removed from consideration due to time range. The number actually used is dirs.listed minus dir.filter.time	Invocations: 211 (Number of Dirs)
Erp.<provider>.MR.SPLK_<host>_<SID>	The MapReduce job generated in Hadoop and the time it took to run	Duration: 43.94
Erp.<provider>.vix.<vix>.splits.generation.time	The time it takes to calculate the splits. Also see the provider flags maxsplits and minsplits	Duration: 0.18

Issue: Hadoop database is out of space

When archiving data to HDFS, Splunk Analytics for Hadoop does not delete the files from HDFS. File cleanup must be done by the Hadoop administrator. The directory structure that Splunk creates in HDFS can help the Hadoop administrator build purging scripts.

For more information about purging files, see "Archive cold buckets to frozen".
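As a starting point for a purging script, the following is a minimal Python sketch that selects archived bucket directories by age. It assumes directory names of the form db_<latest>_<earliest>_<id>, as in the Example B path earlier in this topic. Verify the layout of your own archive before relying on it. The function name expired_buckets is hypothetical, and the sketch only selects names. It does not delete anything.

```python
import re

# Assumed naming: db_<latest epoch>_<earliest epoch>_<id>
BUCKET_RE = re.compile(r"^db_(\d+)_(\d+)_\d+$")


def expired_buckets(names, cutoff_epoch):
    """Return the bucket names whose latest event time is before cutoff_epoch."""
    expired = []
    for name in names:
        match = BUCKET_RE.match(name)
        if match and int(match.group(1)) < cutoff_epoch:
            expired.append(name)
    return expired
```

Names that do not match the assumed pattern are never selected, so unrelated directories are left alone.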
