Parsing XML and HTML is one of the most common uses of Python's lxml library. lxml is known for its speed, feature-rich API, and solid support for standards such as XPath and CSS selectors.

Here's a comprehensive guide on how to use lxml for parsing, starting from the basics and moving to more advanced examples.
Installation
First, you need to install the library. It's available on PyPI.
pip install lxml
lxml is a Pythonic binding for the C libraries libxml2 and libxslt, which makes it extremely fast.
Basic Parsing: etree
The core of lxml is the lxml.etree module. It's used to parse both XML and HTML documents.

Parsing from a String
You can parse a string of XML or HTML directly.
from lxml import etree
# A sample XML string
xml_string = """
<root>
<person id="1">
<name>John Doe</name>
<age>30</age>
<city>New York</city>
</person>
<person id="2">
<name>Jane Smith</name>
<age>25</age>
<city>London</city>
</person>
</root>
"""
# Parse the string. etree.fromstring() returns the root element of the tree.
# It uses the XML parser by default; pass parser=etree.HTMLParser() to parse HTML.
root = etree.fromstring(xml_string)
# The 'root' element is an Element object
print(f"Root tag: {root.tag}") # Output: Root tag: root
# You can iterate over child elements
for person in root:
    print(f"\nFound a person with tag: {person.tag}")
    print(f" ID attribute: {person.get('id')}")
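One caveat is worth illustrating here: etree.fromstring() uses the XML parser unless you hand it a different one. The sketch below uses a made-up snippet (not from the examples above) that is valid HTML but not well-formed XML, and shows the same call failing with the default parser and succeeding with etree.HTMLParser().

```python
from lxml import etree

# Hypothetical snippet: <br> is never closed, so this is not well-formed XML.
snippet = "<div><p>Hello<br>world</p></div>"

# The default XML parser rejects it.
try:
    etree.fromstring(snippet)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print(f"XML parser failed: {exc}")

# Passing an HTMLParser makes the same call tolerant of HTML.
html_root = etree.fromstring(snippet, parser=etree.HTMLParser())

# The HTML parser wraps fragments in <html>/<body>, so the root is <html>.
print(html_root.tag)
# .text stops at the first child (<br>), so this prints only "Hello".
print(html_root.find(".//p").text)
```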
Parsing from a File
It's more common to parse from a file. Use etree.parse().
from lxml import etree
# Assume 'data.xml' contains the XML string from the previous example
try:
    # etree.parse() returns an ElementTree object, which contains the root
    tree = etree.parse('data.xml')
    root = tree.getroot()
    print(f"Root tag from file: {root.tag}")
except OSError:
    # lxml reports a missing or unreadable file as OSError
    print("Error: 'data.xml' could not be read. Please create this file.")
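If you want to try etree.parse() without creating data.xml by hand, the following sketch writes a small sample to a temporary file first. The sample content and the use of tempfile are assumptions made for illustration only.

```python
import os
import tempfile
from lxml import etree

# A tiny stand-in for data.xml (hypothetical content).
xml_bytes = b"<root><person id='1'><name>John Doe</name></person></root>"

# Write the sample to a temporary file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    path = tmp.name

try:
    tree = etree.parse(path)      # ElementTree wrapping the whole document
    root = tree.getroot()         # the <root> Element
    names = [n.text for n in root.iter("name")]
    print(root.tag, names)
finally:
    os.remove(path)               # clean up the temporary file
```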
Navigating the Tree: The Element Object
When you parse a document, you get Element objects. These are the building blocks of the tree. Here are the most common ways to navigate them.
Let's use this HTML example for navigation:
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <div class="content">
      <p id="intro">Hello, world!</p>
      <p>This is a paragraph.</p>
    </div>
    <a href="https://example.com">Link</a>
  </body>
</html>
Accessing Children, Parent, and Siblings
from lxml import etree
html_string = """
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <div class="content">
      <p id="intro">Hello, world!</p>
      <p>This is a paragraph.</p>
    </div>
    <a href="https://example.com">Link</a>
  </body>
</html>
"""
# This sample is well-formed, so the default XML parser accepts it.
root = etree.fromstring(html_string)
# 1. Get children
# .iterchildren() is an efficient way to get child elements
print("--- Children of <body> ---")
for child in root.find('body').iterchildren():
    print(f"Child tag: {child.tag}")
# 2. Get parent
body = root.find('body')
html_parent = body.getparent()
print(f"\nParent of <body> is: <{html_parent.tag}>")
# 3. Get siblings
intro_p = root.find(".//p[@id='intro']")
next_sibling = intro_p.getnext() # Gets the next element at the same level
print(f"\nNext sibling of <p id='intro'> is: <{next_sibling.tag}>")
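The same navigation methods have counterparts for going backwards and upwards. Here is a small sketch on a made-up tree (the tag names are placeholders) showing getprevious(), iterancestors(), and list-style indexing of children.

```python
from lxml import etree

# Placeholder tree: <a> contains <b>, which contains <c/> and <d/>.
root = etree.fromstring("<a><b><c/><d/></b></a>")
d = root.find(".//d")

# Previous sibling at the same level.
previous_tag = d.getprevious().tag
# There is no element after <d>, so getnext() returns None.
next_el = d.getnext()
# Walk up the tree, nearest ancestor first.
ancestors = [anc.tag for anc in d.iterancestors()]
# Elements behave like lists of their children.
first_child = root[0]

print(previous_tag, next_el, ancestors, first_child.tag, len(first_child))
```

Because getnext(), getprevious(), and getparent() return None at the edges of the tree, it is worth checking the result before accessing .tag on it.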
Accessing Text and Attributes
# (Using the same 'root' element from above)
# 1. Get text
# .text gets the direct text content of an element
title_element = root.find('head/title')
print(f"Title text: {title_element.text}") # Output: Title text: My Page
# .text holds only the text before an element's first child; a <p> with no children returns all of its text.
intro_p = root.find(".//p[@id='intro']")
print(f"Intro paragraph text: {intro_p.text}") # Output: Intro paragraph text: Hello, world!
# 2. Get attributes
# .get() is the safe way to get an attribute
a_element = root.find('.//a')
print(f"Link href: {a_element.get('href')}") # Output: Link href: https://example.com
print(f"Link text: {a_element.text}") # Output: Link text: Link
# You can also get attributes as a dictionary
print(f"All attributes of <a>: {a_element.attrib}") # Output: All attributes of <a>: {'href': 'https://example.com'}
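Mixed content is where .text alone falls short, because text that follows a child element is stored on that child's .tail. A minimal sketch with a made-up paragraph:

```python
from lxml import etree

# Hypothetical paragraph mixing text and inline elements.
p = etree.fromstring("<p>Hello <b>bold</b> world<br/>!</p>")
b = p.find("b")

print(repr(p.text))   # text before the first child: 'Hello '
print(repr(b.text))   # text inside <b>: 'bold'
print(repr(b.tail))   # text after </b>, up to the next element: ' world'

# itertext() walks .text and .tail values in document order.
full_text = "".join(p.itertext())
print(full_text)      # 'Hello bold world!'
```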
Searching: XPath and CSS Selectors
This is where lxml truly shines. It has excellent support for XPath, a powerful query language for XML/HTML documents.
Using XPath
XPath expressions are strings that describe a path to the elements you want.
- // selects elements at any depth in the tree.
- / selects direct children.
- [@attribute='value'] filters by an attribute.
- /text() gets the text content of an element.
# (Using the same 'root' element from the HTML example)
# 1. Find all <p> elements anywhere in the document
all_p_elements = root.xpath('//p')
print(f"\nFound {len(all_p_elements)} <p> elements.")
# 2. Find the <p> element with a specific id
intro_p = root.xpath("//p[@id='intro']")[0] # xpath() returns a list
print(f"Found intro paragraph: {intro_p.text}")
# 3. Find text content directly
title_text = root.xpath('//title/text()')[0]
print(f"Found title text via XPath: {title_text}")
# 4. Find an element by attribute value
link_element = root.xpath("//a[@href='https://example.com']")[0]
print(f"Found link with specific href: {link_element.text}")
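xpath() does not always return elements. Depending on the expression, it can return attribute values, text, or numbers. The sketch below uses a shortened version of the person data from earlier; the variable name pid is just an illustration.

```python
from lxml import etree

doc = etree.fromstring(
    "<root>"
    "<person id='1'><name>John Doe</name><age>30</age></person>"
    "<person id='2'><name>Jane Smith</name><age>25</age></person>"
    "</root>"
)

# Selecting attributes returns a list of strings.
ids = doc.xpath("//person/@id")
# Predicates can compare numerically.
older = doc.xpath("//person[age > 26]/name/text()")
# XPath functions such as count() return a float.
count = doc.xpath("count(//person)")
# Keyword arguments become XPath variables, which avoids string formatting.
by_var = doc.xpath("//person[@id = $pid]/name/text()", pid="2")

print(ids, older, count, by_var)
```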
Using CSS Selectors
For those more familiar with web development, lxml also supports CSS selectors through the cssselect package. It is a separate, optional dependency, so install it first with pip install cssselect.
from lxml.cssselect import CSSSelector
# (Using the same 'root' element from the HTML example)
# 1. Create a CSS selector object
selector = CSSSelector('div.content p')
# 2. Apply the selector to the element
paragraphs_in_div = selector(root)
for p in paragraphs_in_div:
    print(f"Found paragraph in div: {p.text}")
# Combinators (>) and pseudo-classes (:first-child) work too
first_p_in_div = CSSSelector('div.content > p:first-child')(root)[0]
print(f"\nFirst child paragraph: {first_p_in_div.text}")
Modifying the Tree
You can easily add, remove, and change elements and attributes.
from lxml import etree
# Start with a simple tree
root = etree.Element("root")
child = etree.SubElement(root, "child")
child.text = "Original text"
print("Original tree:")
print(etree.tostring(root, pretty_print=True).decode())
# 1. Modify text
child.text = "Modified text"
# 2. Add a new attribute
child.set("new_attr", "new_value")
# 3. Add a new sub-element
new_child = etree.SubElement(root, "another_child")
new_child.set("id", "2")
new_child.text = "I'm new!"
# 4. Remove an element
root.remove(child) # Remove the 'child' element
print("\nModified tree:")
print(etree.tostring(root, pretty_print=True).decode())
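Two more editing operations come up often: inserting at a specific position and editing attributes through .attrib. A short sketch, with tag and attribute names chosen only for illustration:

```python
from lxml import etree

root = etree.Element("root")
for name in ("first", "third"):
    etree.SubElement(root, name)

# insert() places an element at a given index among the children.
second = etree.Element("second")
root.insert(1, second)

# .attrib behaves like a dictionary.
second.attrib["lang"] = "en"
second.attrib.update({"id": "2"})
del second.attrib["lang"]

tags = [child.tag for child in root]
serialized = etree.tostring(root)
print(tags)
print(serialized.decode())
```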
Cleaning Up: Pretty Printing and Serialization
Once you're done modifying, you'll want to convert the tree back to a string or write it to a file.
from lxml import etree
# Let's use the modified tree from the previous example
# root = ... (the element with 'another_child')
# 1. Convert to a string
# etree.tostring() returns bytes by default; encoding='unicode' returns a str instead.
# pretty_print=True adds indentation
xml_output = etree.tostring(root, pretty_print=True, encoding='unicode')
print(xml_output)
# 2. Write to a file
with open('modified_output.xml', 'w', encoding='utf-8') as f:
    # tostring with method='xml' and pretty_print is great for files
    f.write(etree.tostring(root, pretty_print=True, encoding='unicode', method='xml'))
print("\nTree written to modified_output.xml")
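The return type of etree.tostring() depends on the encoding argument, and ElementTree.write() can handle the file for you, including the XML declaration. The following sketch assumes a throwaway tree and a temporary directory; neither comes from the examples above.

```python
import os
import tempfile
from lxml import etree

root = etree.Element("root")
etree.SubElement(root, "item").text = "caf\u00e9"

# Default: bytes, with non-ASCII characters written as character references.
as_bytes = etree.tostring(root)
# encoding='unicode' returns a str.
as_text = etree.tostring(root, encoding="unicode")
# An explicit byte encoding plus xml_declaration=True adds the <?xml ...?> header.
with_decl = etree.tostring(root, xml_declaration=True, encoding="UTF-8")

# ElementTree.write() serializes straight to a file path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.xml")
    etree.ElementTree(root).write(
        path, pretty_print=True, xml_declaration=True, encoding="UTF-8"
    )
    reloaded = etree.parse(path).getroot()

print(type(as_bytes).__name__, type(as_text).__name__)
print(reloaded.find("item").text)
```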
HTML Parsing with html.fromstring()
By default, etree.fromstring() only accepts HTML that happens to be well-formed XML. For real-world pages, use lxml.html instead: it tolerates broken markup and provides extra HTML-specific features.
from lxml import html
# Malformed HTML - a browser would handle this fine
html_string = "<div><p>Hello</div><p>World</p>"
# Use html.fromstring() for better HTML parsing
doc = html.fromstring(html_string)
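# lxml repairs the structure: the unclosed <p> is closed when </div> is reached.
# The input has two top-level elements, so html.fromstring() wraps them in a parent <div>.
print(html.tostring(doc, pretty_print=True).decode())
# Resulting structure (exact whitespace may differ):
# <div>
#   <div><p>Hello</p></div>
#   <p>World</p>
# </div>
# It also makes getting text easier. text_content() joins text nodes without adding spaces.
print(f"All text: {doc.text_content().strip()}") # Output: All text: HelloWorld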
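As one example of those HTML-specific features, elements returned by lxml.html can rewrite relative links in place. This is a minimal sketch; the markup and the base URL https://example.org/ are made up for illustration.

```python
from lxml import html

# A single top-level <div>, so html.fromstring() returns that element directly.
page = html.fromstring(
    '<div><a href="/docs">Docs</a> and <a href="https://example.com/x">X</a></div>'
)

# Resolve relative hrefs against a (hypothetical) base URL.
page.make_links_absolute("https://example.org/")

hrefs = page.xpath(".//a/@href")
text = page.text_content()

print(hrefs)
print(text)
```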
Summary of Key Functions/Methods
| Task / Method | Description |
|---|---|
| etree.fromstring(string) | Parses a string/bytes into an Element (the root). Uses the XML parser unless another parser is passed. |
| etree.parse(file) | Parses a file into an ElementTree object. Use .getroot() to get the root Element. |
| element.tag | The tag name of an element (e.g., 'div'). |
| element.text | The text directly inside an element, before its first child. |
| element.tail | The text after an element's closing tag, up to the next sibling. |
| element.get('attr_name') | Gets the value of an attribute. Safe (returns None if not found). |
| element.set('attr', 'value') | Sets an attribute. |
| element.attrib | A dictionary-like view of all attributes. |
| element.find(path) | Finds the first matching subelement using ElementPath, a subset of XPath. Returns None if nothing matches. |
| element.findall(path) | Finds all matching subelements using ElementPath. Returns a list. |
| element.xpath(xpath) | Evaluates a full XPath 1.0 expression relative to the element. Most common search method. |
| etree.tostring(element) | Serializes an element (or tree). Returns bytes by default; pass encoding='unicode' for a str. |
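
The differences between find(), findall(), and xpath() are easiest to see side by side. The sketch below runs all three on a made-up two-item document and also shows findtext(), which accepts a default for missing elements.

```python
from lxml import etree

root = etree.fromstring("<root><item n='1'>a</item><item n='2'>b</item></root>")

# find() returns the first match, or None when nothing matches.
missing = root.find("nope")
first = root.find("item")

# findall() always returns a list (possibly empty).
all_items = root.findall("item")

# xpath() supports full XPath, such as the last() function and text() nodes.
last = root.xpath("item[last()]")
texts = root.xpath("item/text()")

# findtext() returns the text of the first match, or the default.
text_default = root.findtext("nope", default="n/a")

print(missing, first.get("n"), len(all_items), last[0].get("n"), texts, text_default)
```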
