杰瑞科技汇

python elementree

Of course! Here is a comprehensive guide to using xml.etree.ElementTree in Python, which is the standard library for working with XML data.

python elementree-图1
(图片来源网络,侵删)

What is ElementTree?

ElementTree is a built-in Python library that provides a simple and efficient API for parsing and creating XML data. It treats an XML document as a tree of elements, where each element can have:

  • A tag (e.g., <book>)
  • Attributes (e.g., category="fiction")
  • Text content (e.g., The Hobbit)
  • Child elements

There are two main implementations you might encounter:

  1. xml.etree.ElementTree: The standard, built-in version that comes with Python. It's fast and good for most use cases.
  2. lxml.etree: A third-party library that is much faster and has more features (like better XPath support and error handling), but you need to install it separately (pip install lxml).

This guide will focus on the built-in xml.etree.ElementTree.


Parsing XML from a String

This is the most common starting point. You have an XML string and you want to convert it into a Python object tree.

python elementree-图2
(图片来源网络,侵删)
import xml.etree.ElementTree as ET
# A sample XML string
xml_string = """
<library>
    <book category="fiction">
        <title lang="en">The Hobbit</title>
        <author>J.R.R. Tolkien</author>
        <year>1937</year>
    </book>
    <book category="science">
        <title lang="en">A Brief History of Time</title>
        <author>Stephen Hawking</author>
        <year>1988</year>
    </book>
</library>
"""
# Parse the string
# ET.fromstring() parses a string and returns the root element of the tree
root = ET.fromstring(xml_string)
# The root element is an Element object
print(f"Root tag: {root.tag}")  # Output: Root tag: library
# Accessing attributes
print(f"Root attributes: {root.attrib}") # Output: Root attributes: {}
# Find a direct child element
first_book = root.find('book')
print(f"First book category: {first_book.attrib['category']}") # Output: First book category: fiction

Parsing XML from a File

If your XML data is in a file (e.g., library.xml), you should use ET.parse().

File: library.xml

<library>
    <book category="fiction">
        <title lang="en">The Hobbit</title>
        <author>J.R.R. Tolkien</author>
        <year>1937</year>
    </book>
    <book category="science">
        <title lang="en">A Brief History of Time</title>
        <author>Stephen Hawking</author>
        <year>1988</year>
    </book>
</library>

Python Code:

import xml.etree.ElementTree as ET
# Parse the file
# ET.parse() returns an ElementTree object, not the root element directly
tree = ET.parse('library.xml')
root = tree.getroot() # Get the root element from the tree
print(f"Root tag from file: {root.tag}") # Output: Root tag from file: library

Navigating the Element Tree

Once you have the root, you can navigate the tree.

Finding Elements

  • find(path): Finds the first child element that matches the path.
  • findall(path): Finds all child elements that match the path.
  • iter(tag): Creates a tree iterator to loop over all elements with a given tag.
import xml.etree.ElementTree as ET
xml_string = """
<library>
    <book category="fiction">
        <title lang="en">The Hobbit</title>
        <author>J.R.R. Tolkien</author>
    </book>
    <book category="science">
        <title lang="en">A Brief History of Time</title>
        <author>Stephen Hawking</author>
    </book>
</library>
"""
root = ET.fromstring(xml_string)
# --- find() ---
# Find the first 'title' element
first_title = root.find('book/title')
print(f"First title: {first_title.text}") # Output: First title: The Hobbit
# --- findall() ---
# Find all 'author' elements
all_authors = root.findall('.//author') # .// means search anywhere in the tree
print("\nAll authors:")
for author in all_authors:
    print(f"- {author.text}") # Output: - J.R.R. Tolkien, - Stephen Hawking
# Find all 'book' elements
all_books = root.findall('book')
print(f"\nFound {len(all_books)} books.")
# --- iter() ---
# Loop over all 'title' elements in the entire document
print("\nAll titles using iter():")in root.iter('title'):
    print(f"- Language: {title.attrib['lang']}, Text: {title.text}")

XPath-like Expressions: The path argument in find() and findall() supports a simplified subset of XPath:

  • tag: Direct child tag.
  • parent/child: A specific child of a parent.
  • .//tag: Any descendant tag at any depth (very useful!).
  • [@attrib]: Elements with a specific attribute.
  • [@attrib='value']: Elements with an attribute matching a specific value.
# Find the author of the book in the 'science' category
science_author = root.find('.//book[@category="science"]/author')
print(f"\nScience author: {science_author.text}") # Output: Science author: Stephen Hawking

Accessing Element Data

  • element.tag: The tag name (e.g., 'title').
  • element.text: The text content inside the tag (e.g., 'The Hobbit').
  • element.attrib: A dictionary of the element's attributes (e.g., {'lang': 'en'}).
book = root.find('book[@category="fiction"]')
print(f"Tag: {book.tag}")
print(f"Attributes: {book.attrib}")
print(f"Text content of the first child (title): {book.find('title').text}")

Modifying XML

You can easily modify the element tree in memory.

import xml.etree.ElementTree as ET
# Let's use the tree from the file example
tree = ET.parse('library.xml')
root = tree.getroot()
# --- Add a new book ---
new_book = ET.Element('book')
new_book.set('category', 'fantasy') # Add an attribute
new_book_title = ET.SubElement(new_book, 'title')
new_book_title.text = "Dune"
new_book_author = ET.SubElement(new_book, 'author')
new_book_author.text = "Frank Herbert"
new_book_year = ET.SubElement(new_book, 'year')
new_book_year.text = "1965"
# Add the new book to the root
root.append(new_book)
# --- Modify an existing element ---
# Change the year of the first book
first_book_year = root.find('book/year')
first_book_year.text = "1937" # It was already 1937, but you get the idea
# --- Remove an element ---
# Remove the 'category' attribute from the first book
first_book = root.find('book')
first_book.attrib.pop('category', None) # The second argument prevents an error if key doesn't exist
# --- Save the changes to a new file ---
tree.write('library_modified.xml', encoding='utf-8', xml_declaration=True)
print("Modified XML saved to library_modified.xml")

Output (library_modified.xml):

<?xml version='1.0' encoding='utf-8'?>
<library>
    <book>
        <title lang="en">The Hobbit</title>
        <author>J.R.R. Tolkien</author>
        <year>1937</year>
    </book>
    <book category="science">
        <title lang="en">A Brief History of Time</title>
        <author>Stephen Hawking</author>
        <year>1988</year>
    </book>
    <book category="fantasy">
        <title>Dune</title>
        <author>Frank Herbert</author>
        <year>1965</year>
    </book>
</library>

Creating XML from Scratch

You can build an XML tree programmatically using Element and SubElement.

import xml.etree.ElementTree as ET
# Create the root element
root = ET.Element('catalog')
# Create sub-elements and add them to the root
product1 = ET.SubElement(root, 'product')
product1.set('id', 'p1')
name1 = ET.SubElement(product1, 'name')
name1.text = 'Super Widget'
price1 = ET.SubElement(product1, 'price')
price1.text = '19.99'
product2 = ET.SubElement(root, 'product')
product2.set('id', 'p2')
name2 = ET.SubElement(product2, 'name')
name2.text = 'Mega Gadget'
price2 = ET.SubElement(product2, 'price')
price2.text = '49.50'
# Create an ElementTree object
tree = ET.ElementTree(root)
# Write to a file
tree.write('products.xml', encoding='utf-8', xml_declaration=True)
print("New XML file 'products.xml' created.")

Output (products.xml):

<?xml version='1.0' encoding='utf-8'?>
<catalog>
    <product id="p1">
        <name>Super Widget</name>
        <price>19.99</price>
    </product>
    <product id="p2">
        <name>Mega Gadget</name>
        <price>49.50</price>
    </product>
</catalog>

Best Practices and Important Notes

  1. Namespaces: XML namespaces can be tricky. If your XML uses them (e.g., <ns0:book>), you need to handle them by defining a mapping.

    # For an XML like: <ns0:library xmlns:ns0="http://example.com/library">
    ns = {'ns0': 'http://example.com/library'}
    # You must use the prefix in your find calls
    root.find('ns0:book', namespaces=ns)
  2. Pretty Printing: The standard ElementTree does not pretty-print (add indentation). The output will be a single line. For pretty printing, you often need a helper function or use the lxml library.

  3. Error Handling: Always wrap your XML parsing in a try...except block, as malformed XML will raise a ParseError.

    try:
        tree = ET.parse('bad_file.xml')
    except ET.ParseError as e:
        print(f"Error parsing XML: {e}")
  4. Performance: For very large XML files (many gigabytes), ElementTree might load the entire file into memory. In such cases, consider using iterparse which is an event-driven parser that can process files incrementally.

分享:
扫描分享到社交APP
上一篇
下一篇