Of course! This is a very common and powerful combination for web scraping and data extraction. Here's a comprehensive guide covering how to use HTML parsing with XPath in Java.
The Core Concept: Why HTML + XPath + Java?
- HTML: The structure of a web page. It's a tree of nested elements (tags).
- XPath (XML Path Language): A query language for selecting nodes from an XML or HTML document. It's like SQL for your document's tree structure.
- Java: The programming language that will load the HTML, parse it into a navigable tree, and use XPath to find the exact data you need.
Step 1: Choose a Java HTML Parser
You can't use XPath directly on a raw HTML string. You need a parser to convert the HTML into a Document Object Model (DOM), which is a tree structure that XPath can understand.
For Java, the most popular and effective library is Jsoup.
Why Jsoup?
- Tolerant: It's excellent at parsing real-world, messy HTML (which is often not valid XML).
- Easy to Use: It has a very intuitive API.
- CSS Selector Support: While we'll focus on XPath, Jsoup also supports CSS selectors, which many developers find easier.
Add Jsoup to your project:
Maven (pom.xml):
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- Check for the latest version -->
</dependency>
Gradle (build.gradle):
implementation 'org.jsoup:jsoup:1.17.2' // Check for the latest version
Step 2: Parse HTML into a Jsoup Document
First, you need to get your HTML content into a Document object. Jsoup provides several ways to do this.
Example HTML (index.html):
Let's use this sample HTML for our examples.
<!DOCTYPE html>
<html>
<head>
  <title>Sample Page</title>
</head>
<body>
  <h1>Welcome to the Page</h1>
  <div id="content">
    <p class="intro">This is the first paragraph.</p>
    <p class="intro">This is the second paragraph.</p>
    <div class="product">
      <h2>Product A</h2>
      <span class="price">$19.99</span>
    </div>
    <div class="product">
      <h2>Product B</h2>
      <span class="price">$24.50</span>
    </div>
  </div>
  <ul id="links">
    <li><a href="https://example.com/page1">Page 1</a></li>
    <li><a href="https://example.com/page2">Page 2</a></li>
  </ul>
</body>
</html>
Java Code to Parse:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import java.io.File;
import java.io.IOException;
public class HtmlXPathExample {

    public static void main(String[] args) {
        try {
            // --- Method 1: Parse from a String ---
            String html = "<html><head><title>Test</title></head><body><p>Hello World</p></body></html>";
            Document docFromString = Jsoup.parse(html);
            System.out.println("Parsed from String Title: " + docFromString.title());

            // --- Method 2: Parse from a URL (most common for scraping) ---
            // Document docFromUrl = Jsoup.connect("https://example.com").get();
            // System.out.println("Parsed from URL Title: " + docFromUrl.title());

            // --- Method 3: Parse from a local file ---
            // Document docFromFile = Jsoup.parse(new File("path/to/your/index.html"), "UTF-8");
            // System.out.println("Parsed from File Title: " + docFromFile.title());

            // We will use the string version for our examples
            Document doc = Jsoup.parse(html); // Replace with docFromUrl or docFromFile for real use

            // Now, let's use XPath on this Jsoup Document
            useXPathOnJsoupDocument(doc);
        } catch (Exception e) {
            // Catch Exception, not IOException: nothing in the active code throws
            // IOException (that would be a compile error), but the XPath helper
            // throws XPathExpressionException.
            e.printStackTrace();
        }
    }

    // ... rest of the code will go here ...
}
Step 3: Using XPath with Jsoup (The Bridge)
Jsoup's Document object doesn't have a built-in evaluate() method like standard W3C XML parsers. The classic, most portable way to use XPath with Jsoup is to convert the Jsoup Document into a standard W3C org.w3c.dom.Document. (Recent Jsoup releases, 1.14.3 and later, also ship a built-in selectXpath() method; the manual conversion below works on any version and shows what happens under the hood.)
The Conversion Process:
- Jsoup Document -> (converted via Jsoup's org.jsoup.helper.W3CDom helper) -> W3C org.w3c.dom.Document
- W3C Document -> XPath evaluation with javax.xml.xpath
Here is the helper method and the main logic:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import java.io.File;
import java.io.IOException;
public class HtmlXPathExample {

    public static void main(String[] args) {
        // Let's use the full HTML string from our example
        String html = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                "  <title>Sample Page</title>\n" +
                "</head>\n" +
                "<body>\n" +
                "  <h1>Welcome to the Page</h1>\n" +
                "  <div id=\"content\">\n" +
                "    <p class=\"intro\">This is the first paragraph.</p>\n" +
                "    <p class=\"intro\">This is the second paragraph.</p>\n" +
                "    <div class=\"product\">\n" +
                "      <h2>Product A</h2>\n" +
                "      <span class=\"price\">$19.99</span>\n" +
                "    </div>\n" +
                "    <div class=\"product\">\n" +
                "      <h2>Product B</h2>\n" +
                "      <span class=\"price\">$24.50</span>\n" +
                "    </div>\n" +
                "  </div>\n" +
                "  <ul id=\"links\">\n" +
                "    <li><a href=\"https://example.com/page1\">Page 1</a></li>\n" +
                "    <li><a href=\"https://example.com/page2\">Page 2</a></li>\n" +
                "  </ul>\n" +
                "</body>\n" +
                "</html>";

        try {
            Document jsoupDoc = Jsoup.parse(html);
            useXPathOnJsoupDocument(jsoupDoc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void useXPathOnJsoupDocument(org.jsoup.nodes.Document jsoupDoc) throws XPathExpressionException {
        // 1. Convert Jsoup Document to W3C Document.
        // namespaceAware(false) (available in jsoup 1.14.3+) keeps the output out of
        // the XHTML default namespace; without it, unprefixed XPath names like //p
        // would silently match nothing.
        org.w3c.dom.Document w3cDoc = new org.jsoup.helper.W3CDom().namespaceAware(false).fromJsoup(jsoupDoc);

        // 2. Create an XPath factory and an XPath object
        XPathFactory xPathFactory = XPathFactory.newInstance();
        XPath xpath = xPathFactory.newXPath();

        // 3. Define and evaluate XPath expressions

        // Get the page title
        String title = xpath.evaluate("/html/head/title", w3cDoc);
        System.out.println("1. Page Title: " + title);

        // Get all paragraph texts
        NodeList paragraphNodes = (NodeList) xpath.evaluate("//p", w3cDoc, XPathConstants.NODESET);
        System.out.println("\n2. All Paragraphs:");
        for (int i = 0; i < paragraphNodes.getLength(); i++) {
            Node node = paragraphNodes.item(i);
            System.out.println("  - " + node.getTextContent().trim());
        }

        // Get the text of the first product's title
        String firstProductTitle = xpath.evaluate("(//*[@class='product'])[1]/h2", w3cDoc);
        System.out.println("\n3. First Product Title: " + firstProductTitle.trim());

        // Get the price of 'Product B'
        String productBPrice = xpath.evaluate("//*[@class='product'][h2='Product B']/span[@class='price']", w3cDoc);
        System.out.println("\n4. Price of Product B: " + productBPrice.trim());

        // Get all links (href attributes)
        NodeList linkNodes = (NodeList) xpath.evaluate("//a", w3cDoc, XPathConstants.NODESET);
        System.out.println("\n5. All Links:");
        for (int i = 0; i < linkNodes.getLength(); i++) {
            // These are W3C DOM nodes, not Jsoup Elements, so cast to org.w3c.dom.Element
            // (casting to org.jsoup.nodes.Element would throw a ClassCastException).
            org.w3c.dom.Element element = (org.w3c.dom.Element) linkNodes.item(i);
            System.out.println("  - Text: " + element.getTextContent() + ", Href: " + element.getAttribute("href"));
        }
    }
}
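Note: if you are on Jsoup 1.14.3 or newer, you can skip the manual W3C conversion entirely, because Element.selectXpath(String) evaluates an XPath expression directly and returns ordinary Jsoup Elements. A minimal sketch (the product markup mirrors the sample page; priceOf is a hypothetical helper, not a Jsoup API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectXpathExample {

    // Looks up a product's price by matching the sibling <h2> text in XPath.
    static String priceOf(Document doc, String productName) {
        Elements price = doc.selectXpath(
                "//div[@class='product'][h2='" + productName + "']/span[@class='price']");
        return price.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div id='content'>"
              + "<div class='product'><h2>Product A</h2><span class='price'>$19.99</span></div>"
              + "<div class='product'><h2>Product B</h2><span class='price'>$24.50</span></div>"
              + "</div>");
        System.out.println(priceOf(doc, "Product B"));
    }
}
```

Because selectXpath() hands back Jsoup Elements, you keep the convenient text()/attr() API instead of dealing with W3C nodes.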
Step 4: Essential XPath Cheat Sheet
Here are the most common XPath expressions you'll use.
| Expression | Description | Example |
|---|---|---|
| nodename | Selects all nodes with the given name. | p |
| / | Selects from the root node. | /html |
| // | Selects matching nodes anywhere in the document, no matter where they are. | //p (selects all <p> elements) |
| . | Selects the current node. | |
| .. | Selects the parent of the current node. | |
| @ | Selects attributes. | @id, @class |
| * | Matches any element node. | div/* (all children of a div) |
| @* | Matches any attribute node. | //@* (all attributes) |
| [] | Used for predicates to filter nodes. | //p[@class='intro'] |
| text() | Selects the text of the current node. | //p/text() |
| contains() | Function that returns true if the string contains the given substring. | //*[contains(@class, 'product')] |
Common Predicates:
| Expression | Description | Example |
| :--- | :--- | :--- |
| [1] | Selects the first node. | //p[1] |
| [last()] | Selects the last node. | //p[last()] |
| [position() < 3] | Selects the first two nodes. | //p[position() < 3] |
| [@id='main'] | Selects nodes with an id attribute equal to 'main'. | //*[@id='main'] |
| [@class='intro'] | Selects nodes with a class attribute equal to 'intro'. | //p[@class='intro'] |
| [starts-with(@id, 'section')] | Selects nodes where the id attribute starts with 'section'. | //*[starts-with(@id, 'section')] |
| [contains(@href, 'example.com')] | Selects nodes where the href attribute contains 'example.com'. | //a[contains(@href, 'example.com')] |
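To make the cheat sheet concrete, here is a small, self-contained sketch that runs a few of these predicates against a fragment of the sample page. It reuses the W3CDom conversion from Step 3 (with namespaceAware(false), a jsoup 1.14.3+ option, so unprefixed XPath names match); texts() is a hypothetical helper that joins the matched nodes' text with '|':

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class PredicateExamples {

    // Evaluates an XPath expression and joins the matched nodes' text with '|'.
    static String texts(Document w3cDoc, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expr, w3cDoc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (i > 0) sb.append('|');
                sb.append(nodes.item(i).getTextContent().trim());
            }
            return sb.toString();
        } catch (XPathExpressionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<ul id='links'>"
                + "<li><a href='https://example.com/page1'>Page 1</a></li>"
                + "<li><a href='https://example.com/page2'>Page 2</a></li>"
                + "<li><a href='https://other.org/page3'>Page 3</a></li></ul>";
        // namespaceAware(false) keeps unprefixed XPath names like //li working
        Document w3cDoc = new W3CDom().namespaceAware(false).fromJsoup(Jsoup.parse(html));

        System.out.println(texts(w3cDoc, "//li[position() < 3]"));              // first two items
        System.out.println(texts(w3cDoc, "//li[last()]"));                      // last item
        System.out.println(texts(w3cDoc, "//a[contains(@href, 'example.com')]"));
    }
}
```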
Step 5: Jsoup's Built-in CSS Selectors (A Simpler Alternative)
While XPath is extremely powerful, for many common tasks, Jsoup's built-in CSS selector engine is faster and often more readable. If you don't have a complex, absolute-path requirement, consider using CSS selectors.
Here's how you'd achieve the same results as the XPath examples above using Jsoup's select() method.
// Assuming 'jsoupDoc' is our parsed Jsoup Document from before

// 1. Get the page title
String title = jsoupDoc.title(); // Easiest way
System.out.println("1. Page Title (CSS): " + title);

// 2. Get all paragraph texts
Elements paragraphs = jsoupDoc.select("p");
System.out.println("\n2. All Paragraphs (CSS):");
for (Element p : paragraphs) {
    System.out.println("  - " + p.text());
}

// 3. Get the text of the first product's title
// first() grabs the first match; Jsoup also supports :nth-child(n) and :first-of-type
String firstProductTitle = jsoupDoc.select(".product > h2").first().text();
System.out.println("\n3. First Product Title (CSS): " + firstProductTitle);

// 4. Get the price of 'Product B'
// Standard CSS cannot match on text content; either use Jsoup's extended
// :has()/:contains() pseudo-classes, or filter in Java as shown here.
Element productB = jsoupDoc.select(".product").stream()
        .filter(el -> el.select("h2").text().equals("Product B"))
        .findFirst()
        .orElse(null);
if (productB != null) {
    String productBPrice = productB.select(".price").text();
    System.out.println("\n4. Price of Product B (CSS): " + productBPrice);
}

// 5. Get all links (href attributes)
Elements links = jsoupDoc.select("a[href]");
System.out.println("\n5. All Links (CSS):");
for (Element link : links) {
    System.out.println("  - Text: " + link.text() + ", Href: " + link.attr("href"));
}
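As an aside, case 4 can also be written as a single selector using Jsoup's extended pseudo-classes :has() and :contains() (these are Jsoup additions, not standard CSS; :contains matches text case-insensitively). A sketch, with priceOf as a hypothetical helper:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HasContainsExample {

    // Finds the price of the product whose <h2> contains the given name.
    // Building selectors by string concatenation is fine for a demo, but be
    // wary of special characters (quotes, parentheses) in real input.
    static String priceOf(Document doc, String productName) {
        return doc.select("div.product:has(h2:contains(" + productName + ")) span.price").text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div class='product'><h2>Product A</h2><span class='price'>$19.99</span></div>"
              + "<div class='product'><h2>Product B</h2><span class='price'>$24.50</span></div>");
        System.out.println(priceOf(doc, "Product B"));
    }
}
```

This keeps the whole lookup declarative, at the cost of relying on substring matching: :contains(Product) would match both products, so make the text distinctive.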
Summary: XPath vs. CSS Selectors in Jsoup
| Feature | XPath | CSS Selectors (Jsoup) |
|---|---|---|
| Power | Extremely powerful: can navigate up the tree (..) and do complex logic in predicates. | Very powerful for selecting down the tree. Cannot navigate up. |
| Readability | Can be complex and verbose for some cases. | Generally more concise and readable for layout-based selection. |
| Navigation | Bidirectional (up and down the tree). | Unidirectional (down the tree only). |
| Functions | Rich set of functions (string(), number(), boolean(), contains(), etc.). | Limited; relies on Jsoup's extended pseudo-classes such as :not(), :has(), and :contains(). |
| Use Case | Best for complex queries, finding data based on content relationships, or when you need to select a parent node. | Best for most scraping tasks. Faster and simpler for finding elements by tag, ID, class, or attribute. |
Recommendation:
- Use CSS Selectors by default. They are faster and easier to write for 90% of web scraping tasks.
- Use XPath when you need its unique power, such as selecting a parent element or performing complex conditional logic that is difficult with CSS.
