Splunk® Hadoop Connect

Deploy and Use Splunk Hadoop Connect

Install Hadoop CLI

Splunk Hadoop Connect communicates with Hadoop clusters through the Hadoop Distributed File System (HDFS) Command-Line Interface, or Hadoop CLI. Before you deploy Hadoop Connect, install Hadoop CLI on each Splunk instance on which you want to run Hadoop Connect.

For information about the Hadoop CLI, see "Hadoop Commands Guide" in the Apache Hadoop documentation.

You can configure Splunk Hadoop Connect to communicate with multiple Hadoop clusters of differing distributions and versions. To support this, you can install multiple Hadoop CLI packages on a single Splunk instance.
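
For example, you might extract one Hadoop CLI package per cluster into its own directory. The paths below are illustrative:

/opt/hadoop/hadoop-1.0.3             # CLI for an Apache Hadoop 1.0.3 cluster
/opt/hadoop/hadoop-2.0.0-cdh4.0.1    # CLI for a CDH4 cluster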

Collect Hadoop environment information

For each Hadoop cluster that you want to connect to, collect the following information. An example checklist follows the list.

  • Hadoop distribution and version.
  • HDFS NameNode Uniform Resource Identifier (URI).
  • NameNode HTTP port.
  • NameNode IPC port.
  • Whether the cluster requires secure authentication.
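
For example, a completed checklist for one cluster might look like the following. The host name is illustrative, and 50070 and 8020 are common defaults for the NameNode HTTP and IPC ports; verify the actual values for your cluster:

Hadoop distribution and version:  CDH4 (hadoop-2.0.0-cdh4.0.1)
HDFS NameNode URI:                hdfs://namenode.example.com
NameNode HTTP port:               50070
NameNode IPC port:                8020
Secure authentication:            No (Kerberos not required)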

Verify Java version

Hadoop CLI requires Oracle Java 6u31 or later. Before you install Hadoop CLI, verify that Oracle Java 6u31 or later is installed on each Splunk instance on which you plan to run Splunk Hadoop Connect.
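
To check which Java version is on the path, run:

java -version

If the command is not found, or the reported version is older than 1.6.0_31 (Java 6u31), install the JDK as described in the next section.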

Download and install the Oracle Java Development Kit (JDK)

Install the correct Java version.

1. Download the recommended Oracle Standard Edition (SE) JDK from the Oracle Java SE downloads site (http://www.oracle.com/technetwork/java/javase/downloads/index.html). For Java SE 6 releases, go to the archive page at http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html.

Important: Download the JDK, which includes the JRE.

2. Follow the installation instructions on the Oracle web site for Java SE to install the JDK onto your system.
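
The Hadoop CLI locates Java through the JAVA_HOME environment variable (or through the JAVA_HOME setting in its hadoop-env.sh configuration file). As a sketch, assuming the JDK is installed under /usr/java/jdk1.6.0_31 (the path is illustrative):

export JAVA_HOME=/usr/java/jdk1.6.0_31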

Download the Hadoop package

After you know the specific Hadoop distribution and version for each Hadoop cluster in your environment, determine the correct Hadoop CLI tar file to download.

Download Apache Hadoop

1. Go to the Apache download archive site:

http://archive.apache.org/dist/hadoop/core/

2. Select the correct tar file for your version of Apache Hadoop. For example, version 1.0.3:

http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz

Download Cloudera Distribution including Apache Hadoop (CDH)

1. Go to the CDH downloads site: https://www.cloudera.com/downloads/cdh/5-12-1.html

Or use the CDH archives:

For CDH3, use http://archive.cloudera.com/cdh/3/
For CDH4, use http://archive.cloudera.com/cdh4/

2. Locate the correct CDH version, and click the Tarball Download link.

3. Find the correct hadoop-<version> in the Component column, and click Download.

For example, http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.0.1.tar.gz

Download Hortonworks Data Platform (HDP)

1. Go to the Hortonworks Data Platform Release Repository:

http://s3.amazonaws.com/public-repo-1.hortonworks.com/index.html

2. Select the correct HDP version.

3. Navigate to and download the associated tar file (tar.gz). For example:

http://s3.amazonaws.com/public-repo-1.hortonworks.com/HDP-1.0.3/hadoop-1.0.1.tar.gz

Note: If your system has wget installed, you can also pull Hadoop CLI packages directly from the Splunk instance with the wget command. For example:

wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u4.tar.gz

Extract the Hadoop package

To install the Hadoop CLI package, extract the archive that you downloaded. For example, run the command:

tar -xvzf <archive_name>.tar.gz

Download and extract the correct Hadoop CLI for each Hadoop cluster that Splunk Hadoop Connect communicates with. If you have multiple distributions and versions of Hadoop in your environment, install multiple Hadoop CLI packages on one Splunk instance.
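
The test commands in the next section reference the $HADOOP_HOME environment variable. As a sketch, assuming you extracted the package under /opt/hadoop (the path is illustrative), point the variable at the extracted directory:

export HADOOP_HOME=/opt/hadoop/hadoop-1.0.3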

Test the Hadoop setup

Test your Hadoop CLI installation to make sure that:

  • There is network connectivity between your Splunk instance and your Hadoop environment.
  • The Hadoop utilities are unpacked and installed correctly.
  • The CLI can properly run Java (see the quick check after this list).
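
A quick check for the second and third items, without touching the cluster, is to run the CLI itself. It prints its version only if the utilities are unpacked correctly and the CLI can find Java:

$HADOOP_HOME/bin/hadoop version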

Test network connectivity

To test that your Hadoop CLI is set up properly and can connect to your Hadoop cluster, run the command:

$HADOOP_HOME/bin/hadoop fs -ls <namenode>:<ipc_port>/

where:

  • <namenode> is the URI of the HDFS NameNode of your Hadoop cluster.
  • <ipc_port> is the interprocess communication (IPC) port that your Hadoop cluster listens on.

If the Hadoop CLI returns a directory listing without an error message, your setup is correct and the connection is successful.
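
For example, with the illustrative NameNode URI and IPC port from the checklist earlier in this topic:

$HADOOP_HOME/bin/hadoop fs -ls hdfs://namenode.example.com:8020/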

Test write access to the Hadoop cluster

To test write access to your Hadoop cluster, run these commands against the path where you want to export data:

$HADOOP_HOME/bin/hadoop fs -touchz <namenode>:<ipc_port>/<dir_path>/foo.txt
$HADOOP_HOME/bin/hadoop fs -rm <namenode>:<ipc_port>/<dir_path>/foo.txt

If Hadoop CLI does not return an error message, then your setup is correct.
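
For example, to verify write access under an illustrative /user/splunk export path:

$HADOOP_HOME/bin/hadoop fs -touchz hdfs://namenode.example.com:8020/user/splunk/foo.txt
$HADOOP_HOME/bin/hadoop fs -rm hdfs://namenode.example.com:8020/user/splunk/foo.txt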
