How Do You Parse XML Efficiently with Python lxml?

Parsing XML and HTML is one of the most common uses of Python's lxml library. lxml is known for its speed, its feature-rich API, and its support for standards such as XPath and, through an add-on package, CSS selectors.

Here's a comprehensive guide on how to use lxml for parsing, starting from the basics and moving to more advanced examples.


Installation

First, you need to install the library. It's available on PyPI.

pip install lxml

lxml is a Pythonic binding for the C libraries libxml2 and libxslt, which makes it extremely fast.
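
A quick way to confirm the installation, and to see which versions of the underlying C libraries are in use, is to print the version tuples that lxml.etree exposes. This is a minimal check rather than part of the walkthrough; the actual numbers depend on your environment.

```python
from lxml import etree

# Version tuples exposed by lxml.etree
lxml_version = etree.LXML_VERSION        # lxml itself
libxml2_version = etree.LIBXML_VERSION   # libxml2 used at runtime
libxslt_version = etree.LIBXSLT_VERSION  # libxslt used at runtime

print("lxml:   ", ".".join(map(str, lxml_version)))
print("libxml2:", ".".join(map(str, libxml2_version)))
print("libxslt:", ".".join(map(str, libxslt_version)))
```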


Basic Parsing: etree

The core of lxml is the lxml.etree module. It's used to parse both XML and HTML documents.

Parsing from a String

You can parse a string of XML or HTML directly.

from lxml import etree
# A sample XML string
xml_string = """
<root>
    <person id="1">
        <name>John Doe</name>
        <age>30</age>
        <city>New York</city>
    </person>
    <person id="2">
        <name>Jane Smith</name>
        <age>25</age>
        <city>London</city>
    </person>
</root>
"""
# Parse the string. etree.fromstring() returns the root element of the tree.
# It uses the strict XML parser by default; it does not auto-detect HTML.
# For HTML, pass an etree.HTMLParser() or use lxml.html (covered later).
root = etree.fromstring(xml_string)
# The 'root' element is an Element object
print(f"Root tag: {root.tag}")  # Output: Root tag: root
# You can iterate over child elements
for person in root:
    print(f"\nFound a person with tag: {person.tag}")
    print(f"  ID attribute: {person.get('id')}")
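
Because the default parser is strict, input that is not well-formed raises etree.XMLSyntaxError. The sketch below shows one way to handle that, and how a parser created with recover=True can salvage what it can. The sample document is made up for illustration, and what a recovering parser keeps depends on how the input is broken, so treat the recovered tree with care.

```python
from lxml import etree

# Hypothetical broken input: the second <item> is never closed
broken = b"<root><item>ok</item><item>unclosed</root>"

try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError as exc:
    strict_ok = False
    print(f"Strict parser rejected the input: {exc}")

# A recovering parser tries to build a tree anyway
lenient = etree.XMLParser(recover=True)
root = etree.fromstring(broken, parser=lenient)
print([el.tag for el in root.iter()])
```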

Parsing from a File

It's more common to parse from a file. Use etree.parse().

from lxml import etree
# Assume 'data.xml' contains the XML string from the previous example
try:
    # etree.parse() returns an ElementTree object, which contains the root
    tree = etree.parse('data.xml')
    root = tree.getroot()
    print(f"Root tag from file: {root.tag}")
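except OSError:
    # lxml reports a missing or unreadable file as OSError
    # (FileNotFoundError is a subclass, so it is covered as well)
    print("Error: could not read 'data.xml'. Please create this file.")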
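
etree.parse() builds the whole tree in memory. For very large files, a streaming alternative is etree.iterparse(), which is not covered in the guide above. The sketch below is one common pattern, assuming a flat list of person records; it uses an in-memory BytesIO stream in place of a real file so that it runs as-is.

```python
import io
from lxml import etree

# Stand-in for a large file: five <person> records in memory
records = b"".join(
    b'<person id="%d"><name>Person %d</name></person>' % (i, i) for i in range(5)
)
stream = io.BytesIO(b"<root>" + records + b"</root>")

names = []
# Fire an event each time a </person> end tag has been parsed
for event, elem in etree.iterparse(stream, events=("end",), tag="person"):
    names.append(elem.findtext("name"))
    # Free memory: empty this element, then drop already-processed siblings
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(names)
```

Clearing each element after use keeps memory roughly constant regardless of file size, at the cost of no longer having the full tree available for later queries.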

Navigating the Tree: The Element Object

When you parse a document, you get Element objects. These are the building blocks of the tree. Here are the most common ways to navigate them.

Let's use this HTML example for navigation:

<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <div class="content">
      <p id="intro">Hello, world!</p>
      <p>This is a paragraph.</p>
    </div>
    <a href="https://example.com">Link</a>
  </body>
</html>

Accessing Children, Parent, and Siblings

from lxml import etree
html_string = """
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <div class="content">
      <p id="intro">Hello, world!</p>
      <p>This is a paragraph.</p>
    </div>
    <a href="https://example.com">Link</a>
  </body>
</html>
"""
root = etree.fromstring(html_string)
# 1. Get children
# .iterchildren() is an efficient way to get child elements
print("--- Children of <body> ---")
for child in root.find('body').iterchildren():
    print(f"Child tag: {child.tag}")
# 2. Get parent
body = root.find('body')
html_parent = body.getparent()
print(f"\nParent of <body> is: <{html_parent.tag}>")
# 3. Get siblings
intro_p = root.find(".//p[@id='intro']")
next_sibling = intro_p.getnext() # Gets the next element at the same level
print(f"\nNext sibling of <p id='intro'> is: <{next_sibling.tag}>")
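
Elements also behave much like Python lists of their children, which gives a few more navigation options. The sketch below uses a shortened version of the sample document (an assumption made to keep it self-contained) and shows indexing, len(), and the ancestor and descendant iterators.

```python
from lxml import etree

root = etree.fromstring(
    "<html><body>"
    "<div><p id='intro'>Hello</p><p>Second</p></div>"
    "<a href='#'>Link</a>"
    "</body></html>"
)

body = root[0]                       # index like a list: first child of <html>
child_tags = [c.tag for c in body]   # direct children only
n_children = len(body)               # number of direct children

intro = body[0][0]                   # first <p> inside the <div>
ancestors = [a.tag for a in intro.iterancestors()]     # nearest parent first
descendants = [d.tag for d in body.iterdescendants()]  # document order
previous = intro.getprevious()       # None: <p id='intro'> has no earlier sibling

print(child_tags, n_children, ancestors, descendants, previous)
```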

Accessing Text and Attributes

# (Using the same 'root' element from above)
# 1. Get text
# .text gets the direct text content of an element
title_element = root.find('head/title')
print(f"Title text: {title_element.text}") # Output: Title text: My Page
# For text within tags (like <p>), .text is still the direct text.
intro_p = root.find(".//p[@id='intro']")
print(f"Intro paragraph text: {intro_p.text}") # Output: Intro paragraph text: Hello, world!
# 2. Get attributes
# .get() is the safe way to get an attribute
a_element = root.find('.//a')
print(f"Link href: {a_element.get('href')}") # Output: Link href: https://example.com
print(f"Link text: {a_element.text}") # Output: Link text: Link
# You can also get attributes as a dictionary
print(f"All attributes of <a>: {a_element.attrib}") # Output: All attributes of <a>: {'href': 'https://example.com'}
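
One detail that often surprises newcomers is mixed content: text that follows a child element is stored on that child's .tail, not on the parent's .text. The small made-up fragment below illustrates this, along with itertext() for collecting all text in document order.

```python
from lxml import etree

p = etree.fromstring("<p>Hello <b>world</b>!</p>")
b = p[0]

parent_text = p.text             # text before the first child
bold_text = b.text               # text inside <b>
bold_tail = b.tail               # text after </b>, still inside <p>
full_text = "".join(p.itertext())  # all text pieces, in document order

print(repr(parent_text), repr(bold_text), repr(bold_tail), repr(full_text))
```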

Searching: XPath and CSS Selectors

This is where lxml truly shines. It has excellent support for XPath, a powerful query language for XML/HTML documents.

Using XPath

XPath expressions are strings that describe a path to the elements you want.

  • // selects elements at any depth in the tree.
  • / selects direct children.
  • [@attribute='value'] filters by an attribute.
  • /text() gets the text content of an element.

# (Using the same 'root' element from the HTML example)
# 1. Find all <p> elements anywhere in the document
all_p_elements = root.xpath('//p')
print(f"\nFound {len(all_p_elements)} <p> elements.")
# 2. Find the <p> element with a specific id
intro_p = root.xpath("//p[@id='intro']")[0] # xpath() returns a list
print(f"Found intro paragraph: {intro_p.text}")
# 3. Find text content directly
title_text = root.xpath('//title/text()')[0]
print(f"Found title text via XPath: {title_text}")
# 4. Find an element by attribute value
link_element = root.xpath("//a[@href='https://example.com']")[0]
print(f"Found link with specific href: {link_element.text}")
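
xpath() does not always return a list of elements: depending on the expression it can return strings, attribute values, or numbers. The sketch below reuses the person data from the first example and also shows XPath variables, which avoid building expressions through string formatting. The variable name pid is chosen here for illustration.

```python
from lxml import etree

root = etree.fromstring("""
<root>
    <person id="1"><name>John Doe</name><age>30</age><city>New York</city></person>
    <person id="2"><name>Jane Smith</name><age>25</age><city>London</city></person>
</root>
""")

# Numeric comparison inside a predicate
older_names = root.xpath("//person[age > 26]/name/text()")
# Attribute values come back as a list of strings
ids = root.xpath("//person/@id")
# XPath functions return plain values (count() gives a float)
person_count = root.xpath("count(//person)")
# $pid is an XPath variable, supplied as a keyword argument
city = root.xpath("//person[@id = $pid]/city/text()", pid="2")

print(older_names, ids, person_count, city)
```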

Using CSS Selectors

For those more familiar with web development, lxml also supports CSS selectors through the lxml.cssselect module. It relies on the separate cssselect package, which is not installed together with lxml; install it with pip install cssselect.

from lxml.cssselect import CSSSelector
# (Using the same 'root' element from the HTML example)
# 1. Create a CSS selector object
selector = CSSSelector('div.content p')
# 2. Apply the selector to the element
paragraphs_in_div = selector(root)
for p in paragraphs_in_div:
    print(f"Found paragraph in div: {p.text}")
# Combinators (>) and pseudo-classes (:first-child) work too
first_p_in_div = CSSSelector('div.content > p:first-child')(root)[0]
print(f"\nFirst child paragraph: {first_p_in_div.text}")
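
Importing lxml.cssselect raises ImportError when the cssselect package is missing. The sketch below guards against that and falls back to a hand-written XPath expression that is intended to match the same elements; the fallback expression is an illustration, not necessarily the exact XPath cssselect generates.

```python
from lxml import etree

root = etree.fromstring(
    '<div class="content"><p id="intro">Hello</p><p>Bye</p></div>'
)

try:
    from lxml.cssselect import CSSSelector
    paragraphs = CSSSelector("div.content > p")(root)
except ImportError:
    # cssselect is not installed: use an equivalent XPath written by hand
    paragraphs = root.xpath(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]/p"
    )

texts = [p.text for p in paragraphs]
print(texts)
```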

Modifying the Tree

You can easily add, remove, and change elements and attributes.

from lxml import etree
# Start with a simple tree
root = etree.Element("root")
child = etree.SubElement(root, "child")
child.text = "Original text"
print("Original tree:")
print(etree.tostring(root, pretty_print=True).decode())
# 1. Modify text
child.text = "Modified text"
# 2. Add a new attribute
child.set("new_attr", "new_value")
# 3. Add a new sub-element
new_child = etree.SubElement(root, "another_child")
new_child.set("id", "2")
new_child.text = "I'm new!"
# 4. Remove an element
root.remove(child) # Remove the 'child' element
print("\nModified tree:")
print(etree.tostring(root, pretty_print=True).decode())
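
SubElement() always appends at the end. When position matters, elements can also be inserted at an index or next to a sibling. The sketch below uses a made-up three-element tree to show insert(), addnext(), renaming a tag, and deleting an attribute.

```python
from lxml import etree

root = etree.fromstring("<root><a/><c/></root>")

# Insert <b> at index 1, between <a> and <c>
root.insert(1, etree.Element("b"))
# Add <d> as the sibling right after the last child
root[-1].addnext(etree.Element("d"))
# Rename the first element, then set and delete an attribute on it
root[0].tag = "first"
root[0].set("temp", "1")
del root[0].attrib["temp"]

tags = [child.tag for child in root]
serialized = etree.tostring(root).decode()
print(tags)
print(serialized)
```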

Cleaning Up: Pretty Printing and Serialization

Once you're done modifying, you'll want to convert the tree back to a string or write it to a file.

from lxml import etree
# Let's use the modified tree from the previous example
# root = ... (the element with 'another_child')
# 1. Convert to a string
# tostring() returns bytes by default; encoding='unicode' returns a str instead
# pretty_print=True adds indentation
xml_output = etree.tostring(root, pretty_print=True, encoding='unicode')
print(xml_output)
# 2. Write to a file
with open('modified_output.xml', 'w', encoding='utf-8') as f:
    # tostring with method='xml' and pretty_print is great for files
    f.write(etree.tostring(root, pretty_print=True, encoding='unicode', method='xml'))
print("\nTree written to modified_output.xml")
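
An alternative for files is to let lxml write the bytes itself through ElementTree.write(), which can also emit an XML declaration. The sketch below writes to a temporary directory so it runs anywhere; the file name output.xml is arbitrary. It also shows the difference between the default bytes output and encoding='unicode'.

```python
import os
import tempfile
from lxml import etree

root = etree.Element("root")
etree.SubElement(root, "child").text = "caf\u00e9"

as_bytes = etree.tostring(root)                    # bytes; non-ASCII is escaped
as_text = etree.tostring(root, encoding="unicode")  # str; characters kept as-is

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.xml")
    # Let lxml handle encoding and the XML declaration
    etree.ElementTree(root).write(
        path, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
    with open(path, "rb") as f:
        written = f.read()
    reparsed_text = etree.parse(path).getroot()[0].text

print(as_bytes)
print(as_text)
print(written.decode("utf-8"))
```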

HTML Parsing with html.fromstring()

etree.fromstring() only accepts HTML that happens to be well-formed XML. For real-world HTML, use lxml.html instead: its parser tolerates broken markup, and its elements provide extra HTML-specific features.

from lxml import html
# Malformed HTML - a browser would handle this fine
html_string = "<div><p>Hello</div><p>World</p>"
# Use html.fromstring() for better HTML parsing
doc = html.fromstring(html_string)
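# The parser repairs the structure: the unclosed <p> is closed when </div> is reached.
# The repaired fragment then has two top-level elements (<div> and <p>),
# so html.fromstring() returns them wrapped in a single container element.
print(html.tostring(doc, pretty_print=True).decode())
# text_content() returns all text in the subtree, concatenated without separators
print(f"All text: {doc.text_content().strip()}")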
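
As one example of the HTML-specific features, elements returned by lxml.html can rewrite relative links against a base URL. The sketch below uses a made-up fragment and https://example.com as an assumed base URL.

```python
from lxml import html

doc = html.fromstring(
    '<div><a href="/about">About</a> <a href="https://example.com/docs">Docs</a></div>'
)

# Resolve relative hrefs against the base URL, in place
doc.make_links_absolute("https://example.com")

hrefs = [a.get("href") for a in doc.iter("a")]
all_text = doc.text_content()

print(hrefs)
print(all_text)
```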

Summary of Key Functions/Methods

| Task / Method | Description |
|---|---|
| etree.fromstring(string) | Parses a string or bytes into an Element (the root). |
| etree.parse(file) | Parses a file into an ElementTree object. Use .getroot() to get the root Element. |
| element.tag | The tag name of an element (e.g., 'div'). |
| element.text | The text directly inside an element, before its first child. |
| element.tail | The text that follows an element's closing tag, up to the next sibling. |
| element.get('attr_name') | Gets the value of an attribute. Safe (returns None if not found). |
| element.set('attr', 'value') | Sets an attribute. |
| element.attrib | A dict-like mapping of all attributes. |
| element.find(path) | Returns the first element matching an ElementPath expression (a subset of XPath), or None. |
| element.findall(path) | Returns a list of all elements matching an ElementPath expression. |
| element.xpath(expr) | Evaluates a full XPath 1.0 expression relative to the element. Most common search method. |
| etree.tostring(element) | Serializes an element (or tree) to bytes, or to str with encoding='unicode'. |
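
The lookup methods differ mainly in what they return when nothing matches. The sketch below, using the person data from the first example, contrasts find(), findall(), and findtext(); the email element is deliberately absent to show the no-match behaviour.

```python
from lxml import etree

root = etree.fromstring("""
<root>
    <person id="1"><name>John Doe</name></person>
    <person id="2"><name>Jane Smith</name></person>
</root>
""")

first_person = root.find("person")          # first match, or None
missing = root.find("person/email")         # no such element
all_people = root.findall("person")         # always a list, possibly empty
first_name = root.findtext("person/name")   # text of the first match
email = root.findtext("person/email", default="n/a")  # fallback when absent

# A bare tag name only looks at direct children; use .// to search deeper
direct_only = root.find("name")
anywhere = root.find(".//name")

print(first_person.get("id"), missing, len(all_people), first_name, email)
```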