
java hadoop jar

The hadoop jar command is a fundamental utility in the Hadoop ecosystem for running Java (and other JVM-based) applications on a Hadoop cluster. It is the primary way to launch MapReduce and other YARN-based jobs.

Here's a comprehensive breakdown covering the command itself, its usage, practical examples, and best practices.


The Basic Command Structure

The general syntax of the hadoop jar command is:

hadoop jar <jar_file> [main_class] [args...]

Let's break down each component:

  • hadoop: This is the Hadoop command-line script. It's located in the $HADOOP_HOME/bin directory.
  • jar <jar_file>: This tells Hadoop to execute the code packaged inside the specified JAR (Java Archive) file. This is the executable for your job.
  • [main_class] (Optional): The fully qualified name of the main class to run (e.g., com.example.MyDriver). If your JAR file contains a Main-Class attribute in its META-INF/MANIFEST.MF file, you can omit this argument, and Hadoop will use the one specified in the manifest. It's a best practice to always specify it explicitly to avoid ambiguity.
  • [args...] (Optional): Any arguments you want to pass to your application's main method. These are application-specific.
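
To see how the arguments flow, here is a minimal sketch of a driver entry point (the class name com.example.MyDriver is illustrative; it matches the packaging example later in this article):

package com.example;

// Minimal entry point: everything after the class name on the
// `hadoop jar` command line arrives here as args, in order.
public class MyDriver {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: MyDriver <input path> <output path>");
            System.exit(2);
        }
        String inputPath = args[0];   // e.g. /user/hadoop/input
        String outputPath = args[1];  // e.g. /user/hadoop/output
        // ... configure and submit the job here ...
    }
}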

A Practical Example: WordCount

The "Hello, World!" of Hadoop is the WordCount example. Let's see how to run it.

Step 1: Prepare Input Data

First, you need some input data. Let's create a text file.

# Create a directory in HDFS for our input
hdfs dfs -mkdir -p /user/hadoop/wordcount/input
# Create a local input file
echo "Hello Hadoop World Hadoop is great" > my_local_input.txt
echo "Hello Java World Java is powerful" >> my_local_input.txt
# Copy the local file to HDFS
hdfs dfs -put my_local_input.txt /user/hadoop/wordcount/input/

Step 2: Run the WordCount Job

Hadoop comes with a pre-built hadoop-mapreduce-examples.jar that contains the WordCount class.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount \
  /user/hadoop/wordcount/input \
  /user/hadoop/wordcount/output

Let's dissect this command:

  • hadoop jar $HADOOP_HOME/.../hadoop-mapreduce-examples-*.jar: We are telling Hadoop to use the JAR file containing the examples.
  • wordcount: This is not a class name. The JAR's manifest Main-Class is org.apache.hadoop.examples.ExampleDriver, a small dispatcher that maps the program name wordcount to org.apache.hadoop.examples.WordCount and passes along the remaining arguments.
  • /user/hadoop/wordcount/input: This is the first argument to our main method—the input path in HDFS.
  • /user/hadoop/wordcount/output: This is the second argument—the output path in HDFS.

Important: The output directory must not exist before running the job. Hadoop will create it and write the results into it. If it exists, the job will fail.
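
If you prefer the job to clean up for itself, a common pattern is to delete the output path from the driver before submission. A minimal standalone sketch using the HDFS FileSystem API (the class name is illustrative, and deleting unconditionally is a deliberate choice here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up cluster config from the classpath
        Path output = new Path(args[0]);          // e.g. /user/hadoop/wordcount/output
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);              // true = recursive
            System.out.println("Deleted " + output);
        }
    }
}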

Step 3: Check the Output

After the job completes (the console log will include a line like "Job job_... completed successfully"), you can view the results.

# View the output directory in HDFS
hdfs dfs -ls /user/hadoop/wordcount/output
# You should see a file named 'part-r-00000' (or similar)
# This file contains the word counts
hdfs dfs -cat /user/hadoop/wordcount/output/part-r-00000

Expected Output:

Hadoop      2
Hello       2
Java        2
World       2
great       1
is          2
powerful    1

Common Command-Line Options (Flags)

The hadoop jar command chain also accepts a set of generic options that control the execution environment of your job. For hadoop jar, they go after the main class and before your application's own arguments, and they take effect only if your program parses them; ToolRunner does this for you (see the sketch after the example below).

Here are the most important ones:

  • -D <property=value>: Sets a Hadoop configuration property for the job; this is the most common way to customize a job. Example: -D mapreduce.job.queuename=production
  • -conf <config_file>: Adds an application configuration file on top of the default cluster configuration (values in it override the defaults). Example: -conf my-cluster-config.xml
  • -fs <namenode:port>: Specifies the NameNode to use, overriding fs.defaultFS from the configuration. Example: -fs hdfs://namenode.example.com:8020
  • -jt <local|resourcemanager:port>: A legacy name; it originally pointed at the JobTracker, and under YARN it specifies the ResourceManager. Example: -jt resourcemanager.example.com:8032
  • -files <local_files>: Copies comma-separated local files into the working directory of each task on the cluster. Useful for distributing config files or lookup tables. Example: -files /path/to/local/config.properties
  • -archives <local_archives>: Like -files, but for archives (.zip, .tar.gz, .jar), which are unpacked on the cluster. Example: -archives /path/to/local/my_data.zip
  • -libjars <local_jars>: Adds comma-separated local JAR files to the job's classpath on the cluster. Essential if your job depends on third-party libraries. Example: -libjars /path/to/guava-31.1.jar,/path/to/json-simple.jar

Example with Options

Let's say your job needs a specific JAR file for JSON parsing and you want to run it on a specific queue.

hadoop jar my-complex-job.jar \
  com.mycompany.driver.MyComplexDriver \
  -libjars /path/to/json-simple-1.1.1.jar \
  -D mapreduce.job.queuename=data_science \
  -D mapreduce.map.memory.mb=2048 \
  /user/hadoop/input \
  /user/hadoop/output_complex
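
One caveat: hadoop jar itself does not interpret -D, -files, -archives, or -libjars. They are parsed by GenericOptionsParser inside your program, which happens automatically only when the driver implements Hadoop's Tool interface and is launched through ToolRunner. A sketch of that pattern (the job wiring is a WordCount-style placeholder; MyMapper and MyReducer are sketched in the packaging section below):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyComplexDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects -D/-conf/-fs/-libjars; args holds only
        // the leftover application arguments (input and output paths here).
        Job job = Job.getInstance(getConf(), "my complex job");
        job.setJarByClass(MyComplexDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options, then calls run() with the rest.
        System.exit(ToolRunner.run(new MyComplexDriver(), args));
    }
}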

How to Package Your Own Application

To run your own Java code, you need to package it into a JAR file.

Project Structure (using Maven)

A standard Maven project structure is highly recommended.

my-hadoop-project/
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/
        │       └── example/
        │           └── MyDriver.java      (Your main class)
        │           └── MyMapper.java
        │           └── MyReducer.java
        └── resources/
            └── log4j.properties          (Good for controlling logs)
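
The tree above references three classes. To make it concrete, here is a minimal WordCount-style sketch of the mapper and reducer (all names are illustrative; the driver can follow the ToolRunner pattern shown earlier):

// MyMapper.java: emits (word, 1) for every token in a line
package com.example;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// MyReducer.java: sums the counts emitted for each word
package com.example;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}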

The pom.xml (Maven Configuration)

This is the crucial part. You need to include the Hadoop dependencies and configure the Maven Shade Plugin to build a "fat" JAR. Note the provided scope on hadoop-client: the cluster already supplies those classes at runtime, so bundling them only bloats the JAR and invites version conflicts.

<project ...>
    <properties>
        <hadoop.version>3.3.6</hadoop.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <!-- provided: the cluster supplies these classes at runtime -->
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.example.MyDriver</mainClass>
                                </transformer>
                            </transformers>
                            <!-- Strip signature files from signed dependencies to
                                 avoid "Invalid signature file digest" errors -->
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Build and Run

  1. Build the JAR:

    mvn clean package

    This will create a fat JAR in the target/ directory, e.g., my-hadoop-project-1.0-SNAPSHOT.jar.

  2. Run the JAR:

    hadoop jar target/my-hadoop-project-1.0-SNAPSHOT.jar \
      com.example.MyDriver \
      /user/hadoop/input \
      /user/hadoop/output_myapp

Best Practices

  1. Use a Fat JAR (Uber-JAR): The Maven Shade Plugin (or Gradle Shadow Plugin) bundles your application's dependencies into a single JAR, which simplifies deployment because you only upload one file to the cluster. Keep Hadoop's own artifacts (hadoop-client) at provided scope, since the cluster supplies them at runtime.
  2. Always Specify the Main Class: Even if it's in the manifest, explicitly stating the main class (com.example.MyDriver) makes your command more robust and easier to read.
  3. Use -libjars for External Dependencies: If you have many dependencies or prefer not to create a fat JAR, use the -libjars option. Hadoop will distribute these JARs to all nodes for you.
  4. Control Logging: Hadoop's logging can be verbose. Configure log4j.properties in your project's resources and set the HADOOP_ROOT_LOGGER environment variable to control the log level from the command line.
    export HADOOP_ROOT_LOGGER=INFO,console
    hadoop jar my-app.jar ...
  5. Use -D for Configuration: Avoid hardcoding configurations like queue names, memory settings, or file paths in your Java code. Use the -D flag to make your job command flexible and environment-agnostic.
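
For example, a value passed on the command line as -D my.app.lookup.path=/data/lookup.txt (a hypothetical, application-defined property) can be read from the job configuration instead of being hardcoded:

// Fragment from a Tool-based driver's run() method (see the sketch earlier).
// getConf() reflects everything supplied via -D on the command line.
String lookupPath = getConf().get("my.app.lookup.path", "/user/hadoop/defaults/lookup.txt");
int mapMemoryMb = getConf().getInt("mapreduce.map.memory.mb", 1024);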