Of course! The hadoop jar command is a fundamental utility in the Hadoop ecosystem for running applications packaged as JAR files on a Hadoop cluster. It's the primary way to launch MapReduce and other JVM-based Hadoop jobs.
Here's a comprehensive breakdown covering the command itself, its usage, practical examples, and best practices.
The Basic Command Structure
The general syntax of the hadoop jar command is:
hadoop jar <jar_file> [main_class] [args...]
Let's break down each component:
- hadoop: The Hadoop command-line script, located in the $HADOOP_HOME/bin directory.
- jar <jar_file>: Tells Hadoop to execute the code packaged inside the specified JAR (Java Archive) file. This is the executable for your job.
- [main_class] (optional): The fully qualified name of the main class to run (e.g., com.example.MyDriver). If your JAR file contains a Main-Class attribute in its META-INF/MANIFEST.MF file, you can omit this argument and Hadoop will use the class named in the manifest. It's a best practice to always specify it explicitly to avoid ambiguity.
- [args...] (optional): Any arguments you want to pass to your application's main method. These are application-specific.
A Practical Example: WordCount
The "Hello, World!" of Hadoop is the WordCount example. Let's see how to run it.
Step 1: Prepare Input Data
First, you need some input data. Let's create a text file.
# Create a directory in HDFS for our input
hdfs dfs -mkdir -p /user/hadoop/wordcount/input

# Create a local input file
echo "Hello Hadoop World Hadoop is great" > my_local_input.txt
echo "Hello Java World Java is powerful" >> my_local_input.txt

# Copy the local file to HDFS
hdfs dfs -put my_local_input.txt /user/hadoop/wordcount/input/
Step 2: Run the WordCount Job
Hadoop comes with a pre-built hadoop-mapreduce-examples.jar that contains the WordCount class.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount \
    /user/hadoop/wordcount/input \
    /user/hadoop/wordcount/output
Let's dissect this command:
- hadoop jar $HADOOP_HOME/.../hadoop-mapreduce-examples-*.jar: We are telling Hadoop to use the JAR file containing the examples.
- wordcount: The name of the example program to run. The JAR's manifest points at a driver class (org.apache.hadoop.examples.ExampleDriver) that maps this short name to the actual main class, org.apache.hadoop.examples.WordCount.
- /user/hadoop/wordcount/input: The first argument passed to the program, the input path in HDFS.
- /user/hadoop/wordcount/output: The second argument, the output path in HDFS.
Important: The output directory must not exist before you run the job. Hadoop creates it and writes the results into it; if it already exists, the job fails immediately. When re-running a job, delete the old output first with hdfs dfs -rm -r /user/hadoop/wordcount/output.
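When you control the driver code (unlike with the pre-built examples JAR), a common convenience is to have the driver delete a stale output directory before submitting the job. Below is a minimal sketch using the Hadoop FileSystem API; the class and method names are illustrative, not part of any Hadoop library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {

    // Deletes the output directory if it already exists, so the job can be re-run.
    public static void deleteIfExists(Configuration conf, String outputDir) throws Exception {
        Path output = new Path(outputDir);
        FileSystem fs = output.getFileSystem(conf); // resolves the filesystem that owns this path
        if (fs.exists(output)) {
            fs.delete(output, true); // 'true' means recursive delete
        }
    }
}

A driver would typically call this right before submitting the job, passing the job's Configuration and the output path argument.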
Step 3: Check the Output
After the job completes (you'll see a "Job completed successfully" message), you can view the results.
# View the output directory in HDFS
hdfs dfs -ls /user/hadoop/wordcount/output

# You should see a file named 'part-r-00000' (or similar)
# This file contains the word counts
hdfs dfs -cat /user/hadoop/wordcount/output/part-r-00000
Expected Output:
Hadoop 2
Hello 2
Java 2
World 2
great 1
is 2
powerful 1
Common Command-Line Options (Flags)
The hadoop jar command also accepts a set of generic options that control the execution environment of your job. With hadoop jar, these go after the main class and before your application-specific arguments, and they only take effect if your driver parses them with GenericOptionsParser, which ToolRunner does for you automatically (see the driver sketch in the packaging section below).
Here are the most important ones:
| Option | Description | Example |
|---|---|---|
| -D <property=value> | Sets a Hadoop configuration property for the job. This is the most common way to customize your job. | -D mapreduce.job.queuename=production |
| -conf <config_file> | Specifies an additional XML configuration file that is applied on top of the defaults in $HADOOP_HOME/etc/hadoop. | -conf my-cluster-config.xml |
| -fs <namenode:port> | Specifies the NameNode to use. Overrides fs.defaultFS from the configuration. | -fs hdfs://namenode.example.com:8020 |
| -jt <resourcemanager:port> | (Legacy name) Originally specified the JobTracker; under YARN it points at the ResourceManager, or local to run in-process. | -jt resourcemanager.example.com:8032 |
| -files <local_files> | Uploads comma-separated local files to the working directory of each task on the cluster. Useful for distributing config files or lookup tables (see the sketch after this table). | -files /path/to/local/config.properties |
| -archives <local_archives> | Similar to -files, but for archives (.zip, .tar.gz, .jar); the archive is unpacked on the cluster. | -archives /path/to/local/my_data.zip |
| -libjars <local_jars> | Uploads comma-separated local JAR files and adds them to the task classpath. Essential if your job depends on third-party libraries. | -libjars /path/to/guava-31.1.jar,/path/to/json-simple.jar |
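As referenced in the -files row, a file shipped with -files is symlinked into each task's working directory under its base name, so task code can open it by that plain name. Here is a minimal sketch of a mapper doing this in setup(); the class name, the tag property key, and the output types are illustrative, assuming the job was launched with -files /path/to/local/config.properties:

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Properties lookup = new Properties();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // A file distributed with '-files /path/to/local/config.properties' is
        // symlinked into the task's working directory under its base name,
        // so it can be opened like a local file.
        try (FileReader reader = new FileReader("config.properties")) {
            lookup.load(reader);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the loaded properties however the job needs; this sketch just tags each line.
        String tag = lookup.getProperty("tag", "untagged");
        context.write(new Text(tag), value);
    }
}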
Example with Options
Let's say your job needs a specific JAR file for JSON parsing and you want to run it on a specific queue.
hadoop jar my-complex-job.jar \
    com.mycompany.driver.MyComplexDriver \
    -libjars /path/to/json-simple-1.1.1.jar \
    -D mapreduce.job.queuename=data_science \
    -D mapreduce.map.memory.mb=2048 \
    /user/hadoop/input \
    /user/hadoop/output_complex
How to Package Your Own Application
To run your own Java code, you need to package it into a JAR file.
Project Structure (using Maven)
A standard Maven project structure is highly recommended.
my-hadoop-project/
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/
        │       └── example/
        │           ├── MyDriver.java   (your main class)
        │           ├── MyMapper.java
        │           └── MyReducer.java
        └── resources/
            └── log4j.properties        (good for controlling logs)
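For reference, here is a minimal sketch of what MyDriver.java could look like. It extends Configured and implements Tool, so ToolRunner parses the generic options (-D, -libjars, -files) before run() is invoked. MyMapper and MyReducer are the classes from the layout above (not shown here), and the Text/IntWritable output types are assumptions for a WordCount-style job:

package com.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // By the time run() is called, ToolRunner has stripped the generic
        // options out of args and applied them to the Configuration
        // returned by getConf(), so only the application arguments remain.
        if (args.length != 2) {
            System.err.println("Usage: MyDriver <input path> <output path>");
            return 2;
        }

        Job job = Job.getInstance(getConf(), "my-example-job");
        job.setJarByClass(MyDriver.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(exitCode);
    }
}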
The pom.xml (Maven Configuration)
This is the crucial part. You need to include Hadoop dependencies and configure the Maven Shade Plugin to create a "fat" JAR.
<project ...>
<properties>
<hadoop.version>3.3.6</hadoop.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<!-- provided: the cluster supplies the Hadoop libraries at runtime -->
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.4</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.example.MyDriver</mainClass>
</transformer>
</transformers>
<!-- Strip signature files from signed dependencies so the shaded JAR doesn't fail with an invalid signature error -->
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Build and Run
- Build the JAR:

  mvn clean package

  This will create a fat JAR in the target/ directory, e.g., my-hadoop-project-1.0-SNAPSHOT.jar.

- Run the JAR:

  hadoop jar target/my-hadoop-project-1.0-SNAPSHOT.jar \
      com.example.MyDriver \
      /user/hadoop/input \
      /user/hadoop/output_myapp
Best Practices
- Use a Fat JAR (Uber-JAR): The Maven Shade Plugin (or Gradle Shadow Plugin) bundles your application and its third-party dependencies into a single JAR, which simplifies deployment because you only need to upload one file to the cluster. Keep hadoop-client at provided scope so you don't re-bundle libraries the cluster already supplies.
- Always Specify the Main Class: Even if it's in the manifest, explicitly stating the main class (com.example.MyDriver) makes your command more robust and easier to read.
- Use -libjars for External Dependencies: If you have many dependencies or prefer not to create a fat JAR, use the -libjars option. Hadoop will distribute these JARs to all nodes for you.
- Control Logging: Hadoop's logging can be verbose. Configure log4j.properties in your project's resources and set the HADOOP_ROOT_LOGGER environment variable to control the log level from the command line:

  export HADOOP_ROOT_LOGGER=INFO,console
  hadoop jar my-app.jar ...

- Use -D for Configuration: Avoid hardcoding configurations like queue names, memory settings, or file paths in your Java code. Use the -D flag to make your job command flexible and environment-agnostic (see the sketch after this list).
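To illustrate that last point, a task can read values supplied with -D from the job Configuration instead of hardcoding them. Below is a minimal sketch of a reducer that filters results by a threshold passed as -D mycompany.min.count=5; the class name and the mycompany.min.count property are illustrative, not standard Hadoop names:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int minCount;

    @Override
    protected void setup(Context context) {
        // A value passed on the command line as '-D mycompany.min.count=5'
        // is visible to every task through the job Configuration,
        // so nothing needs to be hardcoded or recompiled per environment.
        Configuration conf = context.getConfiguration();
        minCount = conf.getInt("mycompany.min.count", 1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Only emit keys whose total meets the configured threshold.
        if (sum >= minCount) {
            context.write(key, new IntWritable(sum));
        }
    }
}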
