杰瑞科技汇

How Do You Parse XML Efficiently with Python ElementTree?

This guide covers using ElementTree in Python: its core concepts, common operations, and best practices.

What is ElementTree?

ElementTree is a simple and efficient API for parsing and manipulating XML data. It represents an XML document as a tree of Python objects, where each element is a node in the tree.

Python has two main implementations of ElementTree:

  1. xml.etree.ElementTree: The standard library module. It needs no installation and, in CPython, is backed by a C accelerator, so it is fast and memory-efficient enough for most use cases. This is the one you should use by default.
  2. lxml.etree: A third-party library with a largely compatible API and more features. It is often faster for parsing and serializing, and it supports full XPath 1.0, XSLT, XML Schema validation, and recovery from broken HTML. It's the go-to choice for complex applications or performance-critical tasks.

For this guide, we'll focus on the built-in xml.etree.ElementTree.


Parsing XML

You can parse XML from a file or directly from a string.

From a File (ET.parse)

This is the most common method. It reads the entire file into an ElementTree object, which represents the whole document.

import xml.etree.ElementTree as ET
try:
    # Parse the XML file
    tree = ET.parse('my_data.xml')
    # Get the root element of the tree
    root = tree.getroot()
    print(f"Root tag: {root.tag}")
    print(f"Root attributes: {root.attrib}")
except FileNotFoundError:
    print("Error: 'my_data.xml' not found.")
    # Create a dummy file for demonstration
    xml_content = """<?xml version="1.0"?>
<library location="Main Street">
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <price>44.95</price>
    </book>
    <book id="bk102">
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <price>5.95</price>
    </book>
</library>"""
    with open('my_data.xml', 'w') as f:
        f.write(xml_content)
    print("Created a dummy 'my_data.xml' file. Please re-run the script.")
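
Because ET.parse builds the whole tree in memory, very large files are often handled incrementally instead. The following is a minimal sketch using ET.iterparse, assuming the same <library>/<book> layout as above; the in-memory io.BytesIO source and the variable names are illustrative stand-ins for a real file path.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; for a real large file, pass its path instead.
xml_bytes = b"""<library>
    <book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
    <book id="bk102"><title>Midnight Rain</title><price>5.95</price></book>
</library>"""

titles = []
total_price = 0.0

# iterparse yields (event, element) pairs; "end" fires once an element is fully read.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "book":
        titles.append(elem.findtext("title"))
        total_price += float(elem.findtext("price"))
        # Drop the book's children and text so processed data does not pile up.
        elem.clear()

print(titles)
print(round(total_price, 2))
```

Calling elem.clear() after each book is what keeps memory use low; without it, iterparse still ends up holding the full tree.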

From a String (ET.fromstring)

If your XML is already in a string, you can parse it directly. This gives you the root Element object immediately.

import xml.etree.ElementTree as ET
xml_string = """<?xml version="1.0"?>
<library location="Main Street">
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <price>44.95</price>
    </book>
</library>"""
# Parse from a string
root = ET.fromstring(xml_string)
print(f"Root tag from string: {root.tag}")
print(f"Root attribute 'location': {root.get('location')}") # .get() is a safe way to get attributes
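
Strings that come from users or the network may not be well-formed, in which case ET.fromstring raises ET.ParseError. A small sketch of handling that case follows; the helper name parse_or_none is hypothetical, not part of the library.

```python
import xml.etree.ElementTree as ET

def parse_or_none(xml_text):
    """Return the root Element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # exc.position is a (line, column) tuple pointing at the problem.
        line, column = exc.position
        print(f"Malformed XML at line {line}, column {column}: {exc}")
        return None

good = parse_or_none("<library><book id='bk101'/></library>")
bad = parse_or_none("<library><book></library>")  # <book> is never closed

print(good.tag)
print(bad)
```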

Navigating the Tree

Once you have the root Element, you can navigate the tree using properties and methods.

Key Properties of an Element Object:

  • .tag: The tag name (e.g., 'book', 'author').
  • .text: The text content inside the element (e.g., 'Gambardella, Matthew').
  • .attrib: A dictionary of the element's attributes (e.g., {'id': 'bk101'}).
  • .tail: Text that follows the element's closing tag, up to the next tag. It matters mainly for mixed content, so it is less commonly used.
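
To make the difference between .text and .tail concrete, here is a short sketch on a mixed-content snippet. The <p>/<b> markup is an illustrative assumption and does not come from the library example.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring('<p class="note">Read <b>this</b> carefully.</p>')
bold = para.find("b")

print(para.tag)     # the tag name
print(para.attrib)  # the attribute dictionary
print(para.text)    # text before the first child: 'Read '
print(bold.text)    # text inside <b>: 'this'
print(bold.tail)    # text after </b>, still inside <p>: ' carefully.'
```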

Navigation Methods:

  • element.iter(): Iterates over the element itself and all of its descendants, in document order.
  • element.iter(tag): Does the same, but yields only elements with the given tag.
  • element.findall(tag): Finds all direct children with a specific tag. Returns a list.
  • element.find(tag): Finds the first direct child with a specific tag. Returns an Element or None.
  • element.text: Gets/sets the text content.
# Assume 'root' is the <library> element from the examples above
# --- Find all 'book' elements ---
all_books = root.findall('book')
print(f"\nFound {len(all_books)} books.")
# --- Iterate through the books ---
for book in all_books:
    print("\n--- Processing a Book ---")
    print(f"  Tag: {book.tag}")
    print(f"  Attributes: {book.attrib}")
    # Find the title and author text
    # .find() looks for the first child with that tag
    title_element = book.find('title')
    author_element = book.find('author')
    if title_element is not None:
        print(f"  Title: {title_element.text}")
    if author_element is not None:
        print(f"  Author: {author_element.text}")
# --- Find the first book ---
first_book = root.find('book')
if first_book is not None:
    print(f"\nFirst book ID: {first_book.get('id')}") # Use .get() for attributes
# --- Iterate over every single element in the document ---
print("\n--- Iterating over all elements ---")
for elem in root.iter():
    print(f"Tag: {elem.tag}, Text: {elem.text}")
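
Beyond plain tag names, find() and findall() accept a limited subset of XPath, and findtext() is a shortcut for reading a child's text. The sketch below rebuilds the two-book library from earlier so it runs on its own; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring("""<library location="Main Street">
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <price>44.95</price>
    </book>
    <book id="bk102">
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <price>5.95</price>
    </book>
</library>""")

# ".//title" searches all descendants, not only direct children.
all_titles = [t.text for t in library.findall(".//title")]

# "[@id='bk102']" keeps only elements whose id attribute has that value.
second = library.find("book[@id='bk102']")

# findtext() returns the text of the first match, or the default if nothing matches.
second_title = second.findtext("title", default="(untitled)")
missing_year = second.findtext("year", default="unknown")

print(all_titles, second_title, missing_year)
```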

Modifying XML

ElementTree makes it easy to create, modify, and delete elements.

Creating and Adding Elements

import xml.etree.ElementTree as ET
# Start with a new root element
new_root = ET.Element("inventory")
# Create a new product element
product = ET.Element("product")
product.set("id", "p123") # Add an attribute
product.set("category", "electronics")
# Create sub-elements and add text
name = ET.SubElement(product, "name")
name.text = "Super Widget"
price = ET.SubElement(product, "price")
price.text = "99.99"
# Add the product to the root
new_root.append(product)
# You can also create elements from strings
# ET.fromstring returns an element, so we append it
another_product_str = "<product id='p456'><name>Mega Gadget</name><price>149.50</price></product>"
new_root.append(ET.fromstring(another_product_str))
print(ET.tostring(new_root, encoding='unicode'))
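
ET.tostring emits everything on one line unless the tree already contains whitespace. On Python 3.9 or later, ET.indent can add indentation in place before serializing. This sketch assumes a small inventory tree similar to the one above.

```python
import xml.etree.ElementTree as ET

inventory = ET.Element("inventory")
# Extra keyword arguments to SubElement become attributes.
product = ET.SubElement(inventory, "product", id="p123")
ET.SubElement(product, "name").text = "Super Widget"

# ET.indent (Python 3.9+) inserts whitespace in place so each element gets its own line.
ET.indent(inventory, space="  ")
pretty = ET.tostring(inventory, encoding="unicode")
print(pretty)
```

Note that ET.indent modifies the .text and .tail values of the elements, so it is best applied just before output.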

Modifying Existing Elements

# Let's modify the first book from our original example
# Assume 'root' is the <library> element
first_book = root.find('book')
# Change an attribute
first_book.set('id', 'bk101-updated')
# Change text content
title_element = first_book.find('title')
if title_element is not None:
    title_element.text = "XML Developer's Guide (2nd Edition)"
# Add a new element
year_element = ET.SubElement(first_book, 'year')
year_element.text = "2005"
print("\n--- After Modification ---")
print(ET.tostring(root, encoding='unicode'))

Removing Elements

# Let's remove the <price> element from the first book
first_book = root.find('book')
price_to_remove = first_book.find('price')
if price_to_remove is not None:
    first_book.remove(price_to_remove) # The remove() method is called on the parent
print("\n--- After Removal ---")
print(ET.tostring(root, encoding='unicode'))
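
When several elements have to be removed, looping directly over the parent while calling remove() on it can skip items. A minimal sketch of a safer pattern follows, assuming a hypothetical three-book library and an arbitrary price threshold of 20.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring("""<library>
    <book id="bk101"><price>44.95</price></book>
    <book id="bk102"><price>5.95</price></book>
    <book id="bk103"><price>12.00</price></book>
</library>""")

# findall() returns a list (a snapshot), so removing while looping over it is safe.
for book in library.findall("book"):
    if float(book.findtext("price")) < 20:
        library.remove(book)  # remove() is still called on the parent

remaining_ids = [book.get("id") for book in library.findall("book")]
print(remaining_ids)
```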

Writing XML to a File

After modifying the tree, you'll want to save it. Use tree.write().

# If you modified a tree object (from ET.parse)
tree.write('my_data_modified.xml', encoding='utf-8', xml_declaration=True)
# If you only have an Element object (like our new_root)
# You need to wrap it in an ElementTree first
new_tree = ET.ElementTree(new_root)
new_tree.write('new_inventory.xml', encoding='utf-8', xml_declaration=True)
print("\nSaved modified and new XML files.")
  • encoding='utf-8': Highly recommended for compatibility.
  • xml_declaration=True: Adds the <?xml version='1.0' encoding='utf-8'?> line at the top.
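
To see what these two options produce without touching the disk, the sketch below writes into an in-memory buffer and parses the result back. The io.BytesIO buffer stands in for a file opened in binary mode, and the element names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

root_elem = ET.Element("inventory")
ET.SubElement(root_elem, "name").text = "Café Widget"  # non-ASCII text

buffer = io.BytesIO()
ET.ElementTree(root_elem).write(buffer, encoding="utf-8", xml_declaration=True)
data = buffer.getvalue()

print(data.decode("utf-8"))

# Round trip: the bytes parse back to the same text.
reparsed = ET.fromstring(data)
print(reparsed.findtext("name"))
```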

Namespaces

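XML namespaces can complicate things. A document may write tags with a prefix, like <ns0:book>, but ElementTree stores each tag together with its full namespace URI, in the form {uri}localname. When searching with find() or findall(), you must therefore either spell out the URI in that form or pass a mapping from prefixes to URIs.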

A common pattern is to define a dictionary of prefixes and URIs.

import xml.etree.ElementTree as ET
xml_with_namespace = """<?xml version="1.0"?>
<root xmlns:ns0="http://example.com/books" xmlns:ns1="http://example.com/price">
    <ns0:book id="bk101">
        <ns0:author>Gambardella, Matthew</ns0:author>
        <ns0:title>XML Developer's Guide</ns0:title>
        <ns1:amount>44.95</ns1:amount>
    </ns0:book>
</root>"""
root = ET.fromstring(xml_with_namespace)
# Define the namespace map
namespaces = {
    'b': 'http://example.com/books',  # 'b' is our chosen prefix
    'p': 'http://example.com/price'   # 'p' is our chosen prefix
}
# Now you can use the prefix in your find calls
# The format is prefix:localname; the prefix is looked up in the dictionary
book_element = root.find('b:book', namespaces)
if book_element is not None:
    print(f"Found book with ID: {book_element.get('id')}")
    # Only search inside the book once we know it exists
    author_element = book_element.find('b:author', namespaces)
    price_element = book_element.find('p:amount', namespaces)
    if author_element is not None:
        print(f"Author: {author_element.text}")
    if price_element is not None:
        print(f"Price: {price_element.text}")
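
The prefix dictionary is a convenience. Underneath, each tag carries its URI in braces, and that form can be used directly in searches. The sketch below also shows ET.register_namespace, which controls the prefix used when serializing; the prefix 'bk' is an arbitrary choice for illustration.

```python
import xml.etree.ElementTree as ET

BOOKS_NS = "http://example.com/books"

doc = ET.fromstring(
    '<root xmlns:ns0="http://example.com/books">'
    '<ns0:book id="bk101"><ns0:title>XML Developer\'s Guide</ns0:title></ns0:book>'
    '</root>'
)

# The tag is stored with the URI in braces, not with the document's prefix.
book = doc.find(f"{{{BOOKS_NS}}}book")
print(book.tag)

# register_namespace sets the prefix used on output (a global, process-wide setting).
ET.register_namespace("bk", BOOKS_NS)
serialized = ET.tostring(doc, encoding="unicode")
print(serialized)
```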

Best Practices and Summary

Task | Method / Function | Notes
Parse from File | ET.parse('file.xml') | Returns an ElementTree object. Use .getroot() to get the root Element.
Parse from String | ET.fromstring(xml_string) | Returns the root Element directly.
Find Child Element | parent.find('tag') | Finds the first direct child. Returns Element or None.
Find Child Elements | parent.findall('tag') | Finds all direct children. Returns a list.
Iterate Recursively | root.iter('tag') | Iterates over all matching elements in the entire tree.
Get Attribute | element.get('attr_name') | Safe way to get an attribute (returns None if not found).
Set Attribute | element.set('attr_name', 'value') | Adds or updates an attribute.
Get Text | element.text | Returns the text content of an element.
Set Text | element.text = 'new text' |
Add Sub-Element | ET.SubElement(parent, 'tag') | Creates and adds a new element to a parent.
Remove Element | parent.remove(child) | Removes an element. The child must be a direct child of the parent.
Write to File | tree.write('file.xml', ...) | For an ElementTree object.
Write Element to File | ET.ElementTree(element).write(...) | For a single Element object.
Handle Namespaces | root.find('prefix:tag', namespaces) | Define a {'prefix': 'uri'} dictionary and pass it as the second argument.