Of course! Using XPath with HTML in Java is a very common task for web scraping, data extraction, and automated testing. The most popular and recommended library for this is Jsoup.

Jsoup is a fantastic library that parses HTML into a Document Object Model (DOM), and then you can use XPath expressions to query that DOM.
Here's a complete guide, from setting up your project to practical examples.
The Core Concept: Jsoup + XPath
- Jsoup: Parses raw HTML into a structured Document object. It handles all the messy parts of HTML (malformed tags, encoding issues, etc.) and provides a clean API to navigate the DOM.
- XPath: A standard language for selecting nodes from an XML or HTML document. It's powerful and expressive for finding elements based on their structure, attributes, or text content.
Since version 1.16.1, Jsoup has built-in XPath support through the selectXpath() method on Document and Element, so no separate bridge library is needed for standard XPath 1.0 queries.
Project Setup (Maven)
You only need one dependency in your pom.xml: jsoup itself. Since version 1.16.1 it includes XPath support, so no bridge library is required.

<dependencies>
<!-- For parsing HTML and for XPath queries via selectXpath() -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.17.2</version> <!-- Use the latest version (1.16.1 or newer for selectXpath) -->
</dependency>
</dependencies>

(If you specifically want the third-party JsoupXpath helper library and its extended syntax, note that its Maven coordinates are cn.wanghaomiao:JsoupXpath; the artifactId is JsoupXpath, not jsoup-xpath.)
Step-by-Step Guide with Examples
Let's use a sample HTML string for our examples.
Sample HTML:
<!DOCTYPE html>
<html>
<head>
<title>My Web Page</title>
</head>
<body>
<h1>Welcome to the Page</h1>
<div id="main-content">
<p class="intro">This is the first paragraph.</p>
<p class="intro">This is the second paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</div>
<div id="sidebar">
<a href="/about.html">About Us</a>
<a href="/contact.html">Contact</a>
</div>
<p class="footer">Copyright 2025</p>
</body>
</html>
Step 1: Parse the HTML with Jsoup
First, you need to load your HTML into a Jsoup Document object. This is the foundation.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class XPathExample {
public static void main(String[] args) {
String html = "<!DOCTYPE html>..."; // Paste the sample HTML here
// Parse the HTML string into a Jsoup Document
Document doc = Jsoup.parse(html);
// Now you can use 'doc' with XPath
}
}
Step 2: Selecting Elements with XPath
Once parsed, call selectXpath() directly on the Document (or on any Element, to scope the search). It takes an XPath expression and returns the matching elements.

A. Selecting by ID
Match on the id attribute with [@id='id-value'].
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.List;
// ... inside main method ...
// Select the element with id 'main-content'
List<Element> mainContent = doc.selectXpath("//div[@id='main-content']");
if (!mainContent.isEmpty()) {
Element mainDiv = mainContent.get(0);
System.out.println("Found main content div: " + mainDiv.text());
}
// Output: Found main content div: This is the first paragraph. This is the second paragraph. Item 1 Item 2 Item 3
B. Selecting by Class
Match classes with contains(@class, 'class-name') or with an exact [@class='class-name']. contains() also matches elements that carry multiple classes (e.g. class="intro highlight"), but beware that it is plain substring matching: 'intro' would also match class="introduction".
// Select all <p> elements that have the class 'intro'
List<Element> introParagraphs = doc.selectXpath("//p[contains(@class, 'intro')]");
System.out.println("\nFound intro paragraphs:");
for (Element p : introParagraphs) {
System.out.println("- " + p.text());
}
/*
Output:
Found intro paragraphs:
- This is the first paragraph.
- This is the second paragraph.
*/
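Because contains() does plain substring matching, it can over-match. Here is a small standalone sketch of the pitfall using the JDK's built-in javax.xml.xpath engine on a well-formed fragment (the XPath expressions are exactly what you would pass to selectXpath; the class name and fragment are invented for illustration):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.NodeList;

public class ContainsPitfall {
    public static void main(String[] args) throws Exception {
        // A well-formed stand-in for the HTML: note 'introduction' vs 'intro'
        String xml = "<body><p class='intro'>A</p><p class='introduction'>B</p></body>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Substring match: catches 'introduction' too
        NodeList loose = (NodeList) xpath.evaluate(
                "//p[contains(@class,'intro')]", doc, XPathConstants.NODESET);
        // Exact match: only the literal class value 'intro'
        NodeList exact = (NodeList) xpath.evaluate(
                "//p[@class='intro']", doc, XPathConstants.NODESET);

        System.out.println("contains matched: " + loose.getLength()); // 2
        System.out.println("exact matched: " + exact.getLength());    // 1
    }
}
```

XPath 1.0 has no class-aware operator; when this over-matching matters, the classic defensive idiom is contains(concat(' ', normalize-space(@class), ' '), ' intro ').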
C. Selecting by Tag and Attribute
// Select all <li> elements inside a <ul> that is inside a <div>
List<Element> listItems = doc.selectXpath("//div//ul//li");
System.out.println("\nFound list items:");
for (Element li : listItems) {
System.out.println("- " + li.text());
}
/*
Output:
Found list items:
- Item 1
- Item 2
- Item 3
*/
D. Selecting by Text Content
Use text() to match an element's text. Note that exact comparisons are whitespace-sensitive; normalize-space() is a safer variant when the markup may contain stray whitespace.
// Select the <h1> element whose text is exactly 'Welcome to the Page'
List<Element> heading = doc.selectXpath("//h1[text()='Welcome to the Page']");
System.out.println("\nFound heading: " + heading.get(0).text());
// Output: Found heading: Welcome to the Page
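To see the whitespace sensitivity concretely, here is a minimal sketch using the JDK's XPath engine on an invented fragment whose heading carries stray spaces, as real markup often does:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.NodeList;

public class TextMatching {
    public static void main(String[] args) throws Exception {
        // The heading text has leading/trailing whitespace
        String xml = "<body><h1>  Welcome to the Page  </h1></body>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Exact comparison fails because of the surrounding spaces
        NodeList exact = (NodeList) xpath.evaluate(
                "//h1[text()='Welcome to the Page']", doc, XPathConstants.NODESET);
        // normalize-space() trims and collapses whitespace before comparing
        NodeList norm = (NodeList) xpath.evaluate(
                "//h1[normalize-space()='Welcome to the Page']", doc, XPathConstants.NODESET);

        System.out.println("exact matches: " + exact.getLength());     // 0
        System.out.println("normalized matches: " + norm.getLength()); // 1
    }
}
```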
E. Selecting by Attribute Value (other than class/id)
// Select all <a> elements that have an 'href' attribute
List<Element> links = doc.selectXpath("//a[@href]");
System.out.println("\nFound links:");
for (Element a : links) {
System.out.println("Text: " + a.text() + ", Href: " + a.attr("href"));
}
/*
Output:
Found links:
Text: About Us, Href: /about.html
Text: Contact, Href: /contact.html
*/
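Attribute predicates can also test the attribute's value, not just its presence, and XPath can even select attribute nodes directly. A hedged sketch with the JDK engine (with Jsoup's selectXpath(String), which returns elements, you would instead select the <a> elements and read attr("href") as shown above):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.NodeList;

public class AttributePredicates {
    public static void main(String[] args) throws Exception {
        String xml = "<div><a href='/about.html'>About Us</a>"
                   + "<a href='https://example.com/'>External</a></div>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Predicate on the attribute's value: site-relative links only
        NodeList internal = (NodeList) xpath.evaluate(
                "//a[starts-with(@href,'/')]", doc, XPathConstants.NODESET);
        System.out.println("internal links: " + internal.getLength()); // 1

        // Selecting the attribute nodes themselves
        NodeList hrefs = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);
        System.out.println("first href: " + hrefs.item(0).getNodeValue()); // /about.html
    }
}
```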
Step 3: Extracting Data (Text, Attributes, HTML)
Once you have a list of Element objects, you can easily extract information.
// Let's get the first intro paragraph again
Element firstP = doc.selectXpath("//p[contains(@class, 'intro')]").get(0);
// 1. Get the text of the element
System.out.println("\nElement text: " + firstP.text());
// 2. Get an attribute's value
// For example, if we had an <img src="...">:
// String imageUrl = imgElement.attr("src");
// 3. Get the element's own HTML
System.out.println("Element HTML: " + firstP.html());
// 4. Get the element's outer HTML (including the tag itself)
System.out.println("Element outer HTML: " + firstP.outerHtml());
Practical Example: Scraping a Real Website
Let's scrape all the headlines from the Hacker News homepage.
Goal: Get the title and the URL of each news item on the first page.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.util.List;
public class HackerNewsScraper {
public static void main(String[] args) {
try {
// 1. Fetch the HTML from the URL
String url = "https://news.ycombinator.com";
System.out.println("Fetching HTML from: " + url);
// Setting a User-Agent header is good practice to avoid being blocked
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.get();
// 2. Define the XPath to select all story links.
// On Hacker News, each story row is a <tr> with class 'athing', and
// (as of this writing) the title link sits inside a <span class="titleline">.
// Verify this in your browser's dev tools first; markup like this changes over time.
String xpath = "//tr[contains(@class,'athing')]//span[@class='titleline']/a";
// 3. Execute the XPath query
List<Element> storyLinks = doc.selectXpath(xpath);
// 4. Iterate and print the results
System.out.println("\n--- Found " + storyLinks.size() + " stories ---");
for (Element link : storyLinks) {
String title = link.text();
String href = link.attr("href");
// Handle relative URLs by making them absolute
if (href.startsWith("/")) {
href = "https://news.ycombinator.com" + href;
}
System.out.println("Title: " + title);
System.out.println("URL: " + href);
System.out.println("----------------------");
}
} catch (IOException e) {
System.err.println("Error fetching the URL: " + e.getMessage());
e.printStackTrace();
}
}
}
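The startsWith("/") check above only covers absolute-path links. java.net.URI implements the full RFC 3986 resolution rules, sketched below. (Jsoup can also resolve links itself via link.attr("abs:href") when the document has a base URI, which it does when fetched with Jsoup.connect.)

```java
import java.net.URI;

public class ResolveLinks {
    public static void main(String[] args) {
        // Trailing slash makes the base act as a directory for relative references
        URI base = URI.create("https://news.ycombinator.com/");

        // Absolute-path reference
        System.out.println(base.resolve("/about.html"));
        // -> https://news.ycombinator.com/about.html

        // Relative reference: a plain startsWith("/") check would miss this
        System.out.println(base.resolve("item?id=42"));
        // -> https://news.ycombinator.com/item?id=42

        // Already-absolute URLs pass through unchanged
        System.out.println(base.resolve("https://example.com/page"));
        // -> https://example.com/page
    }
}
```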
Important Considerations and Best Practices
- Robustness: Websites change, and your XPath can break when a site updates its HTML structure. It's often more resilient to select a stable parent element first, then use simpler relative selectors on its children (for example XPath's positional [n] predicate, or Jsoup's CSS :eq(index)).
- Performance: For very large HTML documents, complex XPath queries can be slower than simple Jsoup CSS selectors. For most web scraping tasks, though, the performance is more than adequate.
- Handling Dynamic Content: Jsoup and XPath can only see the initial HTML source. If a website loads content using JavaScript after the page loads (like many modern single-page apps), you will need a different tool, such as Selenium or Playwright, which can control a real web browser.
- Ethics and robots.txt: Always be respectful to website owners. Check the site's robots.txt file (e.g., https://news.ycombinator.com/robots.txt) to see which parts of the site they ask crawlers not to access, and don't send too many requests in a short period.
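The "stable parent" tactic from the Robustness bullet can be sketched with the JDK XPath engine, which lets you evaluate a relative expression from a context node (with Jsoup, selectXpath called on an Element is similarly evaluated relative to that element):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class StableParent {
    public static void main(String[] args) throws Exception {
        String xml = "<body><div id='main-content'><ul>"
                   + "<li>Item 1</li><li>Item 2</li><li>Item 3</li>"
                   + "</ul></div></body>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // 1. Anchor on a stable, id-bearing parent once...
        Node main = (Node) xpath.evaluate(
                "//div[@id='main-content']", doc, XPathConstants.NODE);

        // 2. ...then use short relative queries from that context node
        NodeList items = (NodeList) xpath.evaluate(
                "./ul/li", main, XPathConstants.NODESET);
        System.out.println("items: " + items.getLength()); // 3

        // Positional predicate, relative to the anchor
        Node second = (Node) xpath.evaluate("./ul/li[2]", main, XPathConstants.NODE);
        System.out.println("second item: " + second.getTextContent()); // Item 2
    }
}
```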
