Of course! Using XPath with HTML in Java is a very common task for web scraping, data extraction, and automated testing. The most popular and recommended library for this is Jsoup.

Jsoup is a fantastic library that parses HTML into a Document Object Model (DOM), and then you can use XPath expressions to query that DOM.
Here's a complete guide, from setting up your project to practical examples.
The Core Concept: Jsoup + XPath
- Jsoup: Parses raw HTML into a structured Document object. It handles all the messy parts of HTML (malformed tags, encoding issues, etc.) and provides a clean API to navigate the DOM.
- XPath: A standard language for selecting nodes from an XML or HTML document. It's powerful and expressive for finding elements based on their structure, attributes, or text content.
Since version 1.16.1, Jsoup has built-in XPath support through the selectXpath() method on Document and Element, so no separate bridge library is needed for standard XPath 1.0 queries.
Project Setup (Maven)
You only need one dependency in your pom.xml: jsoup itself. Since version 1.16.1 it includes XPath support, so no bridge library is required.

<dependencies>
<!-- For parsing HTML and for XPath queries via selectXpath() -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.17.2</version> <!-- Use the latest version (1.16.1 or newer for selectXpath) -->
</dependency>
</dependencies>

(If you specifically want the third-party JsoupXpath helper library and its extended syntax, note that its Maven coordinates are cn.wanghaomiao:JsoupXpath; the artifactId is JsoupXpath, not jsoup-xpath.)
Step-by-Step Guide with Examples
Let's use a sample HTML string for our examples.
Sample HTML:
<!DOCTYPE html>
<html>
<head>
<title>My Web Page</title>
</head>
<body>
<h1>Welcome to the Page</h1>
<div id="main-content">
<p class="intro">This is the first paragraph.</p>
<p class="intro">This is the second paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</div>
<div id="sidebar">
<a href="/about.html">About Us</a>
<a href="/contact.html">Contact</a>
</div>
<p class="footer">Copyright 2025</p>
</body>
</html>
Step 1: Parse the HTML with Jsoup
First, you need to load your HTML into a Jsoup Document object. This is the foundation.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class XPathExample {
public static void main(String[] args) {
String html = "<!DOCTYPE html>..."; // Paste the sample HTML here
// Parse the HTML string into a Jsoup Document
Document doc = Jsoup.parse(html);
// Now you can use 'doc' with XPath
}
}
Step 2: Selecting Elements with XPath
Once parsed, call selectXpath() directly on the Document (or on any Element, to scope the search). It takes an XPath expression and returns the matching elements.

A. Selecting by ID
Match on the id attribute with [@id='id-value'].
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.List;
// ... inside main method ...
// Select the element with id 'main-content'
List<Element> mainContent = doc.selectXpath("//div[@id='main-content']");
if (!mainContent.isEmpty()) {
Element mainDiv = mainContent.get(0);
System.out.println("Found main content div: " + mainDiv.text());
}
// Output: Found main content div: This is the first paragraph. This is the second paragraph. Item 1 Item 2 Item 3
B. Selecting by Class
Match classes with contains(@class, 'class-name') or with an exact [@class='class-name']. contains() also matches elements that carry multiple classes (e.g. class="intro highlight"), but beware that it is plain substring matching: 'intro' would also match class="introduction".
// Select all <p> elements that have the class 'intro'
List<Element> introParagraphs = doc.selectXpath("//p[contains(@class, 'intro')]");
System.out.println("\nFound intro paragraphs:");
for (Element p : introParagraphs) {
System.out.println("- " + p.text());
}
/*
Output:
Found intro paragraphs:
- This is the first paragraph.
- This is the second paragraph.
*/
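Because contains() does plain substring matching, it can over-match. Here is a small standalone sketch of the pitfall using the JDK's built-in javax.xml.xpath engine on a well-formed fragment (the XPath expressions are exactly what you would pass to selectXpath; the class name and fragment are invented for illustration):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.NodeList;

public class ContainsPitfall {
    public static void main(String[] args) throws Exception {
        // A well-formed stand-in for the HTML: note 'introduction' vs 'intro'
        String xml = "<body><p class='intro'>A</p><p class='introduction'>B</p></body>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Substring match: catches 'introduction' too
        NodeList loose = (NodeList) xpath.evaluate(
                "//p[contains(@class,'intro')]", doc, XPathConstants.NODESET);
        // Exact match: only the literal class value 'intro'
        NodeList exact = (NodeList) xpath.evaluate(
                "//p[@class='intro']", doc, XPathConstants.NODESET);

        System.out.println("contains matched: " + loose.getLength()); // 2
        System.out.println("exact matched: " + exact.getLength());    // 1
    }
}
```

XPath 1.0 has no class-aware operator; when this over-matching matters, the classic defensive idiom is contains(concat(' ', normalize-space(@class), ' '), ' intro ').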
C. Selecting by Tag and Attribute
// Select all <li> elements inside a <ul> that is inside a <div>
List<Element> listItems = doc.selectXpath("//div//ul//li");
System.out.println("\nFound list items:");
for (Element li : listItems) {
System.out.println("- " + li.text());
}
/*
Output:
Found list items:
- Item 1
- Item 2
- Item 3
*/
D. Selecting by Text Content
Use text() to match an element's text. Note that exact comparisons are whitespace-sensitive; normalize-space() is a safer variant when the markup may contain stray whitespace.
// Select the <h1> element whose text is exactly 'Welcome to the Page'
List<Element> heading = doc.selectXpath("//h1[text()='Welcome to the Page']");
System.out.println("\nFound heading: " + heading.get(0).text());
// Output: Found heading: Welcome to the Page
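To see the whitespace sensitivity concretely, here is a minimal sketch using the JDK's XPath engine on an invented fragment whose heading carries stray spaces, as real markup often does:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.NodeList;

public class TextMatching {
    public static void main(String[] args) throws Exception {
        // The heading text has leading/trailing whitespace
        String xml = "<body><h1>  Welcome to the Page  </h1></body>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Exact comparison fails because of the surrounding spaces
        NodeList exact = (NodeList) xpath.evaluate(
                "//h1[text()='Welcome to the Page']", doc, XPathConstants.NODESET);
        // normalize-space() trims and collapses whitespace before comparing
        NodeList norm = (NodeList) xpath.evaluate(
                "//h1[normalize-space()='Welcome to the Page']", doc, XPathConstants.NODESET);

        System.out.println("exact matches: " + exact.getLength());     // 0
        System.out.println("normalized matches: " + norm.getLength()); // 1
    }
}
```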
E. Selecting by Attribute Value (other than class/id)
// Select all <a> elements that have an 'href' attribute
List<Element> links = doc.selectXpath("//a[@href]");
System.out.println("\nFound links:");
for (Element a : links) {
System.out.println("Text: " + a.text() + ", Href: " + a.attr("href"));
}
/*
Output:
Found links:
Text: About Us, Href: /about.html
Text: Contact, Href: /contact.html
*/
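Attribute predicates can also test the attribute's value, not just its presence, and XPath can even select attribute nodes directly. A hedged sketch with the JDK engine (with Jsoup's selectXpath(String), which returns elements, you would instead select the <a> elements and read attr("href") as shown above):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.NodeList;

public class AttributePredicates {
    public static void main(String[] args) throws Exception {
        String xml = "<div><a href='/about.html'>About Us</a>"
                   + "<a href='https://example.com/'>External</a></div>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Predicate on the attribute's value: site-relative links only
        NodeList internal = (NodeList) xpath.evaluate(
                "//a[starts-with(@href,'/')]", doc, XPathConstants.NODESET);
        System.out.println("internal links: " + internal.getLength()); // 1

        // Selecting the attribute nodes themselves
        NodeList hrefs = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);
        System.out.println("first href: " + hrefs.item(0).getNodeValue()); // /about.html
    }
}
```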
Step 3: Extracting Data (Text, Attributes, HTML)
Once you have a list of Element objects, you can easily extract information.
// Let's get the first intro paragraph again
Element firstP = doc.selectXpath("//p[contains(@class, 'intro')]").get(0);
// 1. Get the text of the element
System.out.println("\nElement text: " + firstP.text());
// 2. Get an attribute's value
// For example, if we had an <img src="...">:
// String imageUrl = imgElement.attr("src");
// 3. Get the element's own HTML
System.out.println("Element HTML: " + firstP.html());
// 4. Get the element's outer HTML (including the tag itself)
System.out.println("Element outer HTML: " + firstP.outerHtml());
Practical Example: Scraping a Real Website
Let's scrape all the headlines from the Hacker News homepage.
Goal: Get the title and the URL of each news item on the first page.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.util.List;
public class HackerNewsScraper {
public static void main(String[] args) {
try {
// 1. Fetch the HTML from the URL
String url = "https://news.ycombinator.com";
System.out.println("Fetching HTML from: " + url);
// Setting a User-Agent header is good practice to avoid being blocked
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.get();
// 2. Define the XPath to select all story links.
// On Hacker News, each story row is a <tr> with class 'athing', and
// (as of this writing) the title link sits inside a <span class="titleline">.
// Verify this in your browser's dev tools first; markup like this changes over time.
String xpath = "//tr[contains(@class,'athing')]//span[@class='titleline']/a";
// 3. Execute the XPath query
List<Element> storyLinks = doc.selectXpath(xpath);
// 4. Iterate and print the results
System.out.println("\n--- Found " + storyLinks.size() + " stories ---");
for (Element link : storyLinks) {
String title = link.text();
String href = link.attr("href");
// Handle relative URLs by making them absolute
if (href.startsWith("/")) {
href = "https://news.ycombinator.com" + href;
}
System.out.println("Title: " + title);
System.out.println("URL: " + href);
System.out.println("----------------------");
}
} catch (IOException e) {
System.err.println("Error fetching the URL: " + e.getMessage());
e.printStackTrace();
}
}
}
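The startsWith("/") check above only covers absolute-path links. java.net.URI implements the full RFC 3986 resolution rules, sketched below. (Jsoup can also resolve links itself via link.attr("abs:href") when the document has a base URI, which it does when fetched with Jsoup.connect.)

```java
import java.net.URI;

public class ResolveLinks {
    public static void main(String[] args) {
        // Trailing slash makes the base act as a directory for relative references
        URI base = URI.create("https://news.ycombinator.com/");

        // Absolute-path reference
        System.out.println(base.resolve("/about.html"));
        // -> https://news.ycombinator.com/about.html

        // Relative reference: a plain startsWith("/") check would miss this
        System.out.println(base.resolve("item?id=42"));
        // -> https://news.ycombinator.com/item?id=42

        // Already-absolute URLs pass through unchanged
        System.out.println(base.resolve("https://example.com/page"));
        // -> https://example.com/page
    }
}
```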
Important Considerations and Best Practices
- Robustness: Websites change, and your XPath can break when a site updates its HTML structure. It's often more resilient to select a stable parent element first, then use simpler relative selectors on its children (for example XPath's positional [n] predicate, or Jsoup's CSS :eq(index)).
- Performance: For very large HTML documents, complex XPath queries can be slower than simple Jsoup CSS selectors. For most web scraping tasks, though, the performance is more than adequate.
- Handling Dynamic Content: Jsoup and XPath can only see the initial HTML source. If a website loads content using JavaScript after the page loads (like many modern single-page apps), you will need a different tool, such as Selenium or Playwright, which can control a real web browser.
- Ethics and robots.txt: Always be respectful to website owners. Check the site's robots.txt file (e.g., https://news.ycombinator.com/robots.txt) to see which parts of the site they ask crawlers not to access, and don't send too many requests in a short period.
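The "stable parent" tactic from the Robustness bullet can be sketched with the JDK XPath engine, which lets you evaluate a relative expression from a context node (with Jsoup, selectXpath called on an Element is similarly evaluated relative to that element):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class StableParent {
    public static void main(String[] args) throws Exception {
        String xml = "<body><div id='main-content'><ul>"
                   + "<li>Item 1</li><li>Item 2</li><li>Item 3</li>"
                   + "</ul></div></body>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // 1. Anchor on a stable, id-bearing parent once...
        Node main = (Node) xpath.evaluate(
                "//div[@id='main-content']", doc, XPathConstants.NODE);

        // 2. ...then use short relative queries from that context node
        NodeList items = (NodeList) xpath.evaluate(
                "./ul/li", main, XPathConstants.NODESET);
        System.out.println("items: " + items.getLength()); // 3

        // Positional predicate, relative to the anchor
        Node second = (Node) xpath.evaluate("./ul/li[2]", main, XPathConstants.NODE);
        System.out.println("second item: " + second.getTextContent()); // Item 2
    }
}
```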
