Of course! The hadoop jar command is a fundamental utility in the Hadoop ecosystem for running applications packaged as JAR files on a Hadoop cluster. It's the primary way to launch MapReduce and other JVM-based Hadoop jobs.
Here's a comprehensive breakdown covering the command itself, its usage, practical examples, and best practices.
The Basic Command Structure
The general syntax of the hadoop jar command is:
hadoop jar <jar_file> [main_class] [args...]
Let's break down each component:
- hadoop: The Hadoop command-line script, located in the $HADOOP_HOME/bin directory.
- jar <jar_file>: Tells Hadoop to execute the code packaged inside the specified JAR (Java Archive) file. This is the executable for your job.
- [main_class] (optional): The fully qualified name of the main class to run (e.g., com.example.MyDriver). If your JAR file contains a Main-Class attribute in its META-INF/MANIFEST.MF file, you can omit this argument and Hadoop will use the class named in the manifest. It's a best practice to always specify it explicitly to avoid ambiguity.
- [args...] (optional): Any arguments you want to pass to your application's main method. These are application-specific.
A Practical Example: WordCount
The "Hello, World!" of Hadoop is the WordCount example. Let's see how to run it.
Step 1: Prepare Input Data
First, you need some input data. Let's create a text file.
# Create a directory in HDFS for our input
hdfs dfs -mkdir -p /user/hadoop/wordcount/input

# Create a local input file
echo "Hello Hadoop World Hadoop is great" > my_local_input.txt
echo "Hello Java World Java is powerful" >> my_local_input.txt

# Copy the local file to HDFS
hdfs dfs -put my_local_input.txt /user/hadoop/wordcount/input/
Step 2: Run the WordCount Job
Hadoop comes with a pre-built hadoop-mapreduce-examples.jar that contains the WordCount class.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount \
    /user/hadoop/wordcount/input \
    /user/hadoop/wordcount/output
Let's dissect this command:
- hadoop jar $HADOOP_HOME/.../hadoop-mapreduce-examples-*.jar: We are telling Hadoop to use the JAR file containing the examples.
- wordcount: The name of the example program to run. The JAR's manifest points at a driver class (org.apache.hadoop.examples.ExampleDriver) that maps this short name to the actual main class, org.apache.hadoop.examples.WordCount.
- /user/hadoop/wordcount/input: The first argument passed to the program, the input path in HDFS.
- /user/hadoop/wordcount/output: The second argument, the output path in HDFS.
Important: The output directory must not exist before you run the job. Hadoop creates it and writes the results into it; if it already exists, the job fails immediately. When re-running a job, delete the old output first with hdfs dfs -rm -r /user/hadoop/wordcount/output.
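When you control the driver code (unlike with the pre-built examples JAR), a common convenience is to have the driver delete a stale output directory before submitting the job. Below is a minimal sketch using the Hadoop FileSystem API; the class and method names are illustrative, not part of any Hadoop library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {

    // Deletes the output directory if it already exists, so the job can be re-run.
    public static void deleteIfExists(Configuration conf, String outputDir) throws Exception {
        Path output = new Path(outputDir);
        FileSystem fs = output.getFileSystem(conf); // resolves the filesystem that owns this path
        if (fs.exists(output)) {
            fs.delete(output, true); // 'true' means recursive delete
        }
    }
}

A driver would typically call this right before submitting the job, passing the job's Configuration and the output path argument.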
Step 3: Check the Output
After the job completes (you'll see a "Job completed successfully" message), you can view the results.
# View the output directory in HDFS
hdfs dfs -ls /user/hadoop/wordcount/output

# You should see a file named 'part-r-00000' (or similar)
# This file contains the word counts
hdfs dfs -cat /user/hadoop/wordcount/output/part-r-00000
Expected Output:
Hadoop 2
Hello 2
Java 2
World 2
great 1
is 2
powerful 1
Common Command-Line Options (Flags)
The hadoop jar command also accepts a set of generic options that control the execution environment of your job. With hadoop jar, these go after the main class and before your application-specific arguments, and they only take effect if your driver parses them with GenericOptionsParser, which ToolRunner does for you automatically (see the driver sketch in the packaging section below).
Here are the most important ones:
| Option | Description | Example |
|---|---|---|
| -D <property=value> | Sets a Hadoop configuration property for the job. This is the most common way to customize your job. | -D mapreduce.job.queuename=production |
| -conf <config_file> | Specifies an additional XML configuration file that is applied on top of the defaults in $HADOOP_HOME/etc/hadoop. | -conf my-cluster-config.xml |
| -fs <namenode:port> | Specifies the NameNode to use. Overrides fs.defaultFS from the configuration. | -fs hdfs://namenode.example.com:8020 |
| -jt <resourcemanager:port> | (Legacy name) Originally specified the JobTracker; under YARN it points at the ResourceManager, or local to run in-process. | -jt resourcemanager.example.com:8032 |
| -files <local_files> | Uploads comma-separated local files to the working directory of each task on the cluster. Useful for distributing config files or lookup tables (see the sketch after this table). | -files /path/to/local/config.properties |
| -archives <local_archives> | Similar to -files, but for archives (.zip, .tar.gz, .jar); the archive is unpacked on the cluster. | -archives /path/to/local/my_data.zip |
| -libjars <local_jars> | Uploads comma-separated local JAR files and adds them to the task classpath. Essential if your job depends on third-party libraries. | -libjars /path/to/guava-31.1.jar,/path/to/json-simple.jar |
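As referenced in the -files row, a file shipped with -files is symlinked into each task's working directory under its base name, so task code can open it by that plain name. Here is a minimal sketch of a mapper doing this in setup(); the class name, the tag property key, and the output types are illustrative, assuming the job was launched with -files /path/to/local/config.properties:

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Properties lookup = new Properties();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // A file distributed with '-files /path/to/local/config.properties' is
        // symlinked into the task's working directory under its base name,
        // so it can be opened like a local file.
        try (FileReader reader = new FileReader("config.properties")) {
            lookup.load(reader);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the loaded properties however the job needs; this sketch just tags each line.
        String tag = lookup.getProperty("tag", "untagged");
        context.write(new Text(tag), value);
    }
}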
Example with Options
Let's say your job needs a specific JAR file for JSON parsing and you want to run it on a specific queue.
hadoop jar my-complex-job.jar \
    com.mycompany.driver.MyComplexDriver \
    -libjars /path/to/json-simple-1.1.1.jar \
    -D mapreduce.job.queuename=data_science \
    -D mapreduce.map.memory.mb=2048 \
    /user/hadoop/input \
    /user/hadoop/output_complex
How to Package Your Own Application
To run your own Java code, you need to package it into a JAR file.
Project Structure (using Maven)
A standard Maven project structure is highly recommended.
my-hadoop-project/
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/
        │       └── example/
        │           ├── MyDriver.java   (your main class)
        │           ├── MyMapper.java
        │           └── MyReducer.java
        └── resources/
            └── log4j.properties        (good for controlling logs)
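For reference, here is a minimal sketch of what MyDriver.java could look like. It extends Configured and implements Tool, so ToolRunner parses the generic options (-D, -libjars, -files) before run() is invoked. MyMapper and MyReducer are the classes from the layout above (not shown here), and the Text/IntWritable output types are assumptions for a WordCount-style job:

package com.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // By the time run() is called, ToolRunner has stripped the generic
        // options out of args and applied them to the Configuration
        // returned by getConf(), so only the application arguments remain.
        if (args.length != 2) {
            System.err.println("Usage: MyDriver <input path> <output path>");
            return 2;
        }

        Job job = Job.getInstance(getConf(), "my-example-job");
        job.setJarByClass(MyDriver.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(exitCode);
    }
}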
The pom.xml (Maven Configuration)
This is the crucial part. You need to include Hadoop dependencies and configure the Maven Shade Plugin to create a "fat" JAR.
<project ...>
<properties>
<hadoop.version>3.3.6</hadoop.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<!-- provided: the cluster supplies the Hadoop libraries at runtime -->
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.4</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.example.MyDriver</mainClass>
</transformer>
</transformers>
<!-- Strip signature files from signed dependencies so the shaded JAR doesn't fail with an invalid signature error -->
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Build and Run
- Build the JAR:

  mvn clean package

  This will create a fat JAR in the target/ directory, e.g., my-hadoop-project-1.0-SNAPSHOT.jar.

- Run the JAR:

  hadoop jar target/my-hadoop-project-1.0-SNAPSHOT.jar \
      com.example.MyDriver \
      /user/hadoop/input \
      /user/hadoop/output_myapp
Best Practices
- Use a Fat JAR (Uber-JAR): The Maven Shade Plugin (or Gradle Shadow Plugin) bundles your application and its third-party dependencies into a single JAR, which simplifies deployment because you only need to upload one file to the cluster. Keep hadoop-client at provided scope so you don't re-bundle libraries the cluster already supplies.
- Always Specify the Main Class: Even if it's in the manifest, explicitly stating the main class (com.example.MyDriver) makes your command more robust and easier to read.
- Use -libjars for External Dependencies: If you have many dependencies or prefer not to create a fat JAR, use the -libjars option. Hadoop will distribute these JARs to all nodes for you.
- Control Logging: Hadoop's logging can be verbose. Configure log4j.properties in your project's resources and set the HADOOP_ROOT_LOGGER environment variable to control the log level from the command line:

  export HADOOP_ROOT_LOGGER=INFO,console
  hadoop jar my-app.jar ...

- Use -D for Configuration: Avoid hardcoding configurations like queue names, memory settings, or file paths in your Java code. Use the -D flag to make your job command flexible and environment-agnostic (see the sketch after this list).
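To illustrate that last point, a task can read values supplied with -D from the job Configuration instead of hardcoding them. Below is a minimal sketch of a reducer that filters results by a threshold passed as -D mycompany.min.count=5; the class name and the mycompany.min.count property are illustrative, not standard Hadoop names:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int minCount;

    @Override
    protected void setup(Context context) {
        // A value passed on the command line as '-D mycompany.min.count=5'
        // is visible to every task through the job Configuration,
        // so nothing needs to be hardcoded or recompiled per environment.
        Configuration conf = context.getConfiguration();
        minCount = conf.getInt("mycompany.min.count", 1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Only emit keys whose total meets the configured threshold.
        if (sum >= minCount) {
            context.write(key, new IntWritable(sum));
        }
    }
}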
