
html xpath java

Of course! This is a very common and powerful combination for web scraping and data extraction. Here's a comprehensive guide covering how to use HTML parsing with XPath in Java.

The Core Concept: Why HTML + XPath + Java?

  1. HTML: The structure of a web page. It's a tree of nested elements (tags).
  2. XPath (XML Path Language): A query language for selecting nodes from an XML or HTML document. It's like SQL for your document's tree structure.
  3. Java: The programming language that will load the HTML, parse it into a navigable tree, and use XPath to find the exact data you need.

Step 1: Choose a Java HTML Parser

You can't use XPath directly on a raw HTML string. You need a parser to convert the HTML into a Document Object Model (DOM), which is a tree structure that XPath can understand.
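To see why a DOM is required, here is a minimal sketch using only the JDK's built-in XML tooling (the class name `WellFormedOnly` is hypothetical). It works because the sample markup is well-formed; real-world HTML with unclosed tags would make `DocumentBuilder` throw, which is exactly the gap a tolerant HTML parser fills.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class WellFormedOnly {
    // Parse well-formed markup into a DOM, then run an XPath query on it.
    public static String titleOf(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return XPathFactory.newInstance().newXPath()
                .evaluate("/html/head/title", dom);
    }

    public static void main(String[] args) throws Exception {
        String wellFormed = "<html><head><title>Test</title></head><body><p>Hi</p></body></html>";
        System.out.println(titleOf(wellFormed)); // prints "Test"
    }
}
```

The strict parser is fine for XHTML-clean input; the moment a page omits a closing tag, this pipeline fails, and that is the motivation for the next choice.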

For Java, the most popular and effective library is Jsoup.

Why Jsoup?

  • Tolerant: It's excellent at parsing real-world, messy HTML (which is often not valid XML).
  • Easy to Use: It has a very intuitive API.
  • CSS Selector Support: While we'll focus on XPath, Jsoup also supports CSS selectors, which many developers find easier.

Add Jsoup to your project:

Maven (pom.xml):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- Check for the latest version -->
</dependency>

Gradle (build.gradle):

implementation 'org.jsoup:jsoup:1.17.2' // Check for the latest version

Step 2: Parse HTML into a Jsoup Document

First, you need to get your HTML content into a Document object. Jsoup provides several ways to do this.

Example HTML (index.html): Let's use this sample HTML for our examples.

<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="content">
        <p class="intro">This is the first paragraph.</p>
        <p class="intro">This is the second paragraph.</p>
        <div class="product">
            <h2>Product A</h2>
            <span class="price">$19.99</span>
        </div>
        <div class="product">
            <h2>Product B</h2>
            <span class="price">$24.50</span>
        </div>
    </div>
    <ul id="links">
        <li><a href="https://example.com/page1">Page 1</a></li>
        <li><a href="https://example.com/page2">Page 2</a></li>
    </ul>
</body>
</html>

Java Code to Parse:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import java.io.File;
import java.io.IOException;
public class HtmlXPathExample {
    public static void main(String[] args) {
        try {
            // --- Method 1: Parse from a String ---
            String html = "<html><head><title>Test</title></head><body><p>Hello World</p></body></html>";
            Document docFromString = Jsoup.parse(html);
            System.out.println("Parsed from String Title: " + docFromString.title());
            // --- Method 2: Parse from a URL (most common for scraping) ---
            // Document docFromUrl = Jsoup.connect("https://example.com").get();
            // System.out.println("Parsed from URL Title: " + docFromUrl.title());
            // --- Method 3: Parse from a local file ---
            // Document docFromFile = Jsoup.parse(new File("path/to/your/index.html"), "UTF-8");
            // System.out.println("Parsed from File Title: " + docFromFile.title());
            // We will use the string version for our examples
            Document doc = Jsoup.parse(html); // Replace with docFromUrl or docFromFile for real use
            // Now, let's use XPath on this Jsoup Document
            useXPathOnJsoupDocument(doc);
        } catch (Exception e) { // covers IOException from connect() and XPathExpressionException
            e.printStackTrace();
        }
    }
    // ... rest of the code will go here ...
}

Step 3: Using XPath with Jsoup (The Bridge)

Jsoup's Document object doesn't have a W3C-style evaluate() method. (Recent jsoup versions, 1.14.3 and later, do offer a built-in selectXpath() method.) The classic and most flexible approach is to convert the Jsoup Document into a standard W3C org.w3c.dom.Document and use the JDK's own XPath engine.

The Conversion Process:

  1. Parse the HTML with Jsoup into an org.jsoup.nodes.Document.
  2. Convert it to a W3C org.w3c.dom.Document using org.jsoup.helper.W3CDom.
  3. Evaluate XPath expressions against the W3C Document with javax.xml.xpath.

Here is the helper method and the main logic:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import java.io.File;
import java.io.IOException;
public class HtmlXPathExample {
    public static void main(String[] args) {
        // Let's use the full HTML string from our example
        String html = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                "    <title>Sample Page</title>\n" +
                "</head>\n" +
                "<body>\n" +
                "    <h1>Welcome to the Page</h1>\n" +
                "    <div id=\"content\">\n" +
                "        <p class=\"intro\">This is the first paragraph.</p>\n" +
                "        <p class=\"intro\">This is the second paragraph.</p>\n" +
                "        <div class=\"product\">\n" +
                "            <h2>Product A</h2>\n" +
                "            <span class=\"price\">$19.99</span>\n" +
                "        </div>\n" +
                "        <div class=\"product\">\n" +
                "            <h2>Product B</h2>\n" +
                "            <span class=\"price\">$24.50</span>\n" +
                "        </div>\n" +
                "    </div>\n" +
                "    <ul id=\"links\">\n" +
                "        <li><a href=\"https://example.com/page1\">Page 1</a></li>\n" +
                "        <li><a href=\"https://example.com/page2\">Page 2</a></li>\n" +
                "    </ul>\n" +
                "</body>\n" +
                "</html>";
        try {
            Document jsoupDoc = Jsoup.parse(html);
            useXPathOnJsoupDocument(jsoupDoc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void useXPathOnJsoupDocument(org.jsoup.nodes.Document jsoupDoc) throws XPathExpressionException {
        // 1. Convert the Jsoup Document to a W3C Document.
        //    Disable namespace awareness (jsoup 1.14.3+) so plain XPath like //p
        //    matches; W3CDom otherwise places elements in the XHTML namespace.
        org.jsoup.helper.W3CDom w3cDom = new org.jsoup.helper.W3CDom().namespaceAware(false);
        org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);
        // 2. Create an XPath factory and an XPath object
        XPathFactory xPathFactory = XPathFactory.newInstance();
        XPath xpath = xPathFactory.newXPath();
        // 3. Define and evaluate XPath expressions
        // Get the page title
        String title = xpath.evaluate("/html/head/title", w3cDoc);
        System.out.println("1. Page Title: " + title);
        // Get all paragraph texts
        NodeList paragraphNodes = (NodeList) xpath.evaluate("//p", w3cDoc, XPathConstants.NODESET);
        System.out.println("\n2. All Paragraphs:");
        for (int i = 0; i < paragraphNodes.getLength(); i++) {
            Node node = paragraphNodes.item(i);
            System.out.println("   - " + node.getTextContent().trim());
        }
        // Get the text of the first product's title
        // (parentheses group the node-set so [1] picks the first match document-wide)
        String firstProductTitle = xpath.evaluate("(//div[@class='product'])[1]/h2", w3cDoc);
        System.out.println("\n3. First Product Title: " + firstProductTitle.trim());
        // Get the price of 'Product B'
        String productBPrice = xpath.evaluate("//*[@class='product'][h2='Product B']/span[@class='price']", w3cDoc);
        System.out.println("\n4. Price of Product B: " + productBPrice.trim());
        // Get all links (href attributes)
        NodeList linkNodes = (NodeList) xpath.evaluate("//a", w3cDoc, XPathConstants.NODESET);
        System.out.println("\n5. All Links:");
        for (int i = 0; i < linkNodes.getLength(); i++) {
            // These are W3C nodes, not Jsoup Elements, so use the W3C DOM API
            org.w3c.dom.Element element = (org.w3c.dom.Element) linkNodes.item(i);
            System.out.println("   - Text: " + element.getTextContent() + ", Href: " + element.getAttribute("href"));
        }
    }
}

Step 4: Essential XPath Cheat Sheet

Here are the most common XPath expressions you'll use.

| Expression | Description | Example |
| :--- | :--- | :--- |
| nodename | Selects all nodes with the given name. | p |
| / | Selects from the root node. | /html |
| // | Selects matching nodes anywhere in the document, regardless of position. | //p (selects all <p> elements) |
| . | Selects the current node. | |
| .. | Selects the parent of the current node. | |
| @ | Selects attributes. | @id, @class |
| * | Matches any element node. | div/* (all children of a div) |
| @* | Matches any attribute node. | //@* (all attributes) |
| [] | Used for predicates to filter nodes. | //p[@class='intro'] |
| text() | Selects the text of the current node. | //p/text() |
| contains() | Returns true if the string contains the specified substring. | //*[contains(@class, 'product')] |

Common Predicates:

| Expression | Description | Example |
| :--- | :--- | :--- |
| [1] | Selects the first node. | //p[1] |
| [last()] | Selects the last node. | //p[last()] |
| [position() < 3] | Selects the first two nodes. | //p[position() < 3] |
| [@id='main'] | Selects nodes with an id attribute equal to 'main'. | //*[@id='main'] |
| [@class='intro'] | Selects nodes with a class attribute equal to 'intro'. | //p[@class='intro'] |
| [starts-with(@id, 'section')] | Selects nodes where the id attribute starts with 'section'. | //*[starts-with(@id, 'section')] |
| [contains(@href, 'example.com')] | Selects nodes where the href attribute contains 'example.com'. | //a[contains(@href, 'example.com')] |
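The expressions above can be exercised with the JDK's built-in XPath engine alone, no jsoup needed, as long as the input is well-formed. This sketch (the class `CheatSheetDemo` and its sample markup are made up for illustration) runs a few entries from the cheat sheet:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class CheatSheetDemo {
    // Hypothetical sample document (well-formed, so the strict JDK parser accepts it)
    static final String XML =
        "<html><body>"
        + "<p class='intro'>First</p>"
        + "<p class='intro'>Second</p>"
        + "<p class='outro'>Third</p>"
        + "<a href='https://example.com/page1'>Page 1</a>"
        + "</body></html>";

    static Document dom() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
    }

    // Evaluate an expression as a string (the text of the first matching node)
    static String eval(String expr) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate(expr, dom());
    }

    // Evaluate an expression as a node-set and return how many nodes matched
    static int count(String expr) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, dom(), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(eval("//p[1]"));                              // First
        System.out.println(eval("//p[last()]"));                         // Third
        System.out.println(count("//p[@class='intro']"));                // 2
        System.out.println(eval("//a[contains(@href, 'example.com')]")); // Page 1
    }
}
```

Note that evaluating as a plain string returns only the first match's text; request `XPathConstants.NODESET` when you want every match.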


Step 5: Jsoup's Built-in CSS Selectors (A Simpler Alternative)

While XPath is extremely powerful, for many common tasks, Jsoup's built-in CSS selector engine is faster and often more readable. If you don't have a complex, absolute-path requirement, consider using CSS selectors.

Here's how you'd achieve the same results as the XPath examples above using Jsoup's select() method.

// Assuming 'jsoupDoc' is our parsed Jsoup Document from before
// 1. Get the page title
String title = jsoupDoc.title(); // Easiest way
System.out.println("1. Page Title (CSS): " + title);
// 2. Get all paragraph texts
Elements paragraphs = jsoupDoc.select("p");
System.out.println("\n2. All Paragraphs (CSS):");
for (Element p : paragraphs) {
    System.out.println("   - " + p.text());
}
// 3. Get the text of the first product's title
// select() returns all matches in document order; first() takes the first one
String firstProductTitle = jsoupDoc.select(".product > h2").first().text();
System.out.println("\n3. First Product Title (CSS): " + firstProductTitle);
// 4. Get the price of 'Product B'
// Matching on element text is clumsy in pure CSS; XPath is better here.
// In Jsoup you can use the :contains pseudo-class, or filter with a stream:
Element productB = jsoupDoc.select(".product").stream()
                           .filter(el -> el.select("h2").text().equals("Product B"))
                           .findFirst()
                           .orElse(null);
if (productB != null) {
    String productBPrice = productB.select(".price").text();
    System.out.println("\n4. Price of Product B (CSS): " + productBPrice);
}
// 5. Get all links (href attributes)
Elements links = jsoupDoc.select("a[href]");
System.out.println("\n5. All Links (CSS):");
for (Element link : links) {
    System.out.println("   - Text: " + link.text() + ", Href: " + link.attr("href"));
}

Summary: XPath vs. CSS Selectors in Jsoup

| Feature | XPath | CSS Selectors (Jsoup) |
| :--- | :--- | :--- |
| Power | Extremely powerful; can navigate up the tree (..) and do arithmetic. | Very powerful for selecting down the tree; cannot navigate up. |
| Readability | Can be complex and verbose for some cases. | Generally more concise and readable for layout-based selection. |
| Navigation | Bidirectional (up and down the tree). | Unidirectional (down the tree only). |
| Functions | Rich set of functions (string(), number(), boolean(), contains(), etc.). | Limited; relies on pseudo-classes such as :not() and :has(). |
| Use Case | Best for complex queries, finding data based on content relationships, or when you need to select a parent node. | Best for most scraping tasks; faster and simpler for finding elements by tag, ID, class, or attribute. |
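The "navigate up" row is the one CSS genuinely cannot match. A minimal sketch of the parent axis (the class `ParentAxisDemo` and its inline sample are hypothetical, using only the JDK's XPath engine): locate an element by its text, step up with `..`, then back down to a sibling.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ParentAxisDemo {
    public static String priceOf(String productName) throws Exception {
        String xml = "<div>"
            + "<div class='product'><h2>Product A</h2><span class='price'>$19.99</span></div>"
            + "<div class='product'><h2>Product B</h2><span class='price'>$24.50</span></div>"
            + "</div>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // Find the <h2> by its text, step UP to its parent with '..',
        // then back down to the sibling price span.
        String expr = "//h2[text()='" + productName + "']/../span[@class='price']";
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(priceOf("Product B")); // $24.50
    }
}
```

With CSS selectors you would instead select every product block and filter in Java, as the stream-based example in Step 5 does.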

Recommendation:

  • Use CSS Selectors by default. They are faster and easier to write for 90% of web scraping tasks.
  • Use XPath when you need its unique power, such as selecting a parent element or performing complex conditional logic that is difficult with CSS.