How to Parse XML Efficiently with Python lxml

Parsing XML and HTML with Python's lxml library is a common task. lxml is known for its speed, its feature-rich API, and its support for standards such as XPath and CSS selectors.

Here's a comprehensive guide on how to use lxml for parsing, starting from the basics and moving to more advanced examples.

Installation

First, you need to install the library. It's available on PyPI.

pip install lxml

lxml is a Pythonic binding for the C libraries libxml2 and libxslt, which makes it extremely fast.
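
To confirm that the installation worked and to see which versions of the underlying C libraries are in use, a quick check like the following can help. It is only a sketch; the version tuples printed depend on your environment.

```python
from lxml import etree

# Version tuples exposed by lxml.etree; the values depend on the installed build
print("lxml:", etree.LXML_VERSION)
print("libxml2 (runtime):", etree.LIBXML_VERSION)
print("libxslt (runtime):", etree.LIBXSLT_VERSION)
```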

Basic Parsing: `etree`

The core of lxml is the lxml.etree module. It's used to parse both XML and HTML documents.

Parsing from a String

You can parse a string (or bytes) of XML directly with etree.fromstring(). For HTML, use the lxml.html module covered in the HTML parsing section.

from lxml import etree
# A sample XML string
xml_string = """
<root>
    <person id="1">
        <name>John Doe</name>
        <age>30</age>
        <city>New York</city>
    </person>
    <person id="2">
        <name>Jane Smith</name>
        <age>25</age>
        <city>London</city>
    </person>
</root>
"""
# Parse the string. etree.fromstring() returns the root element of the tree.
# It uses the XML parser, so the input must be well-formed XML.
root = etree.fromstring(xml_string)
# The 'root' element is an Element object
print(f"Root tag: {root.tag}")  # Output: Root tag: root
# You can iterate over child elements
for person in root:
    print(f"\nFound a person with tag: {person.tag}")
    print(f"  ID attribute: {person.get('id')}")
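
Because etree.fromstring() expects well-formed XML, malformed input raises etree.XMLSyntaxError. The sketch below wraps the call in a small helper; the name parse_xml and the sample inputs are illustrative, not part of lxml. Passing bytes lets the parser honor an encoding declaration in the XML prolog.

```python
from lxml import etree


def parse_xml(data: bytes):
    """Return the root element, or None if the input is not well-formed XML."""
    try:
        return etree.fromstring(data)
    except etree.XMLSyntaxError as exc:
        print(f"Could not parse XML: {exc}")
        return None


good = b'<?xml version="1.0" encoding="UTF-8"?><root><item>1</item></root>'
bad = b"<root><item>1</root>"  # <item> is never closed

root = parse_xml(good)
broken = parse_xml(bad)

print(root.tag)         # root
print(broken is None)   # True
```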

Parsing from a File

It's more common to parse from a file. Use etree.parse().

from lxml import etree
# Assume 'data.xml' contains the XML string from the previous example
try:
    # etree.parse() returns an ElementTree object, which contains the root
    tree = etree.parse('data.xml')
    root = tree.getroot()
    print(f"Root tag from file: {root.tag}")
except FileNotFoundError:
    print("Error: 'data.xml' not found. Please create this file.")
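
To make the file example self-contained, the sketch below first writes a small data.xml into a temporary directory and then parses it. The file contents and the people list are illustrative assumptions; findtext() returns the text of the first matching child.

```python
import os
import tempfile

from lxml import etree

xml_bytes = b"""<root>
    <person id="1"><name>John Doe</name><age>30</age></person>
    <person id="2"><name>Jane Smith</name><age>25</age></person>
</root>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.xml")
    with open(path, "wb") as f:
        f.write(xml_bytes)
    # The whole file is read here, so the tree stays usable after cleanup
    tree = etree.parse(path)

root = tree.getroot()
people = [
    {"id": p.get("id"), "name": p.findtext("name"), "age": int(p.findtext("age"))}
    for p in root.findall("person")
]
print(people)
```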

Navigating the Tree: The Element Object

When you parse a document, you get Element objects. These are the building blocks of the tree. Here are the most common ways to navigate them.

Let's use this HTML example for navigation:

<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <div class="content">
      <p id="intro">Hello, world!</p>
      <p>This is a paragraph.</p>
    </div>
    <a href="https://example.com">Link</a>
  </body>
</html>

Accessing Children, Parent, and Siblings

from lxml import etree
html_string = """
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <div class="content">
      <p id="intro">Hello, world!</p>
      <p>This is a paragraph.</p>
    </div>
    <a href="https://example.com">Link</a>
  </body>
</html>
"""
root = etree.fromstring(html_string)
# 1. Get children
# .iterchildren() is an efficient way to get child elements
print("--- Children of <body> ---")
for child in root.find('body').iterchildren():
    print(f"Child tag: {child.tag}")
# 2. Get parent
body = root.find('body')
html_parent = body.getparent()
print(f"\nParent of <body> is: <{html_parent.tag}>")
# 3. Get siblings
intro_p = root.find(".//p[@id='intro']")
next_sibling = intro_p.getnext() # Gets the next element at the same level
print(f"\nNext sibling of <p id='intro'> is: <{next_sibling.tag}>")
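
A few more navigation helpers are worth knowing: elements support indexing and len() like lists, getprevious() mirrors getnext(), and iterancestors() walks upward toward the root. The compact document below is an illustrative stand-in for the example above.

```python
from lxml import etree

root = etree.fromstring(
    "<html><body><div class='content'>"
    "<p id='intro'>Hello</p><p>Second</p>"
    "</div><a href='https://example.com'>Link</a></body></html>"
)

div = root.find("body/div")
child_count = len(div)               # number of direct children
first_p, second_p = div[0], div[1]   # children are accessible by index

previous_id = second_p.getprevious().get("id")    # sibling before the second <p>
no_previous = first_p.getprevious()               # None: nothing comes before it
ancestors = [el.tag for el in second_p.iterancestors()]

print(child_count, previous_id, no_previous, ancestors)
```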

Accessing Text and Attributes

# (Using the same 'root' element from above)
# 1. Get text
# .text gets the direct text content of an element
title_element = root.find('head/title')
print(f"Title text: {title_element.text}") # Output: Title text: My Page
# For text within tags (like <p>), .text is still the direct text.
intro_p = root.find(".//p[@id='intro']")
print(f"Intro paragraph text: {intro_p.text}") # Output: Intro paragraph text: Hello, world!
# 2. Get attributes
# .get() is the safe way to get an attribute
a_element = root.find('.//a')
print(f"Link href: {a_element.get('href')}") # Output: Link href: https://example.com
print(f"Link text: {a_element.text}") # Output: Link text: Link
# You can also get attributes as a dictionary
print(f"All attributes of <a>: {a_element.attrib}") # Output: All attributes of <a>: {'href': 'https://example.com'}
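
With mixed content, .text only covers the text before the first child element; text that follows a child's closing tag lives in that child's .tail. The following sketch, using a made-up paragraph, shows how the pieces fit together and how itertext() collects them all.

```python
from lxml import etree

p = etree.fromstring('<p class="note">Hello <b>bold</b> world</p>')
b = p[0]

before_child = p.text       # text before <b>
inside_child = b.text       # text inside <b>
after_child = b.tail        # text after </b>, still inside <p>
full_text = "".join(p.itertext())

# .get() accepts a default for attributes that are missing
css_class = p.get("class")
language = p.get("lang", "en")

print(repr(before_child), repr(inside_child), repr(after_child))
print(full_text, css_class, language)
```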

Searching: XPath and CSS Selectors

This is where lxml truly shines. It has excellent support for XPath, a powerful query language for XML/HTML documents.

Using XPath

XPath expressions are strings that describe a path to the elements you want.

// selects elements at any depth in the tree.
/ selects direct children.
[@attribute='value'] filters by an attribute.
/text() gets the text content of an element.

# (Using the same 'root' element from the HTML example)
# 1. Find all <p> elements anywhere in the document
all_p_elements = root.xpath('//p')
print(f"\nFound {len(all_p_elements)} <p> elements.")
# 2. Find the <p> element with a specific id
intro_p = root.xpath("//p[@id='intro']")[0] # xpath() returns a list
print(f"Found intro paragraph: {intro_p.text}")
# 3. Find text content directly
title_text = root.xpath('//title/text()')[0]
print(f"Found title text via XPath: {title_text}")
# 4. Find an element by attribute value
link_element = root.xpath("//a[@href='https://example.com']")[0]
print(f"Found link with specific href: {link_element.text}")
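
xpath() does not always return elements: depending on the expression it can return attribute values, strings, or numbers, and it accepts keyword arguments as XPath variables. A short sketch on a compact copy of the HTML example (the variable name pid is arbitrary):

```python
from lxml import etree

root = etree.fromstring(
    "<html><head><title>My Page</title></head><body>"
    "<div class='content'><p id='intro'>Hello, world!</p>"
    "<p>This is a paragraph.</p></div>"
    "<a href='https://example.com'>Link</a></body></html>"
)

hrefs = root.xpath("//a/@href")            # list of attribute values
p_count = root.xpath("count(//p)")         # XPath numbers come back as float
title = root.xpath("string(//title)")      # a single string, not a list
texts = root.xpath("//div[@class='content']/p/text()")

# $pid is an XPath variable supplied as a keyword argument
intro = root.xpath("//p[@id=$pid]", pid="intro")[0]

print(hrefs, p_count, title, texts, intro.text)
```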

Using CSS Selectors

For those more familiar with web development, lxml also supports CSS selectors through the lxml.cssselect module. It relies on the separate cssselect package, which is not installed with lxml by default (pip install cssselect).

from lxml.cssselect import CSSSelector
# (Using the same 'root' element from the HTML example)
# 1. Create a CSS selector object
selector = CSSSelector('div.content p')
# 2. Apply the selector to the element
paragraphs_in_div = selector(root)
for p in paragraphs_in_div:
    print(f"Found paragraph in div: {p.text}")
# Selectors can also use combinators and pseudo-classes
first_p_in_div = CSSSelector('div.content > p:first-child')(root)[0]
print(f"\nFirst child paragraph: {first_p_in_div.text}")
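
If the cssselect package is not available, the same selections can be written in plain XPath. The expressions below are one possible hand-written equivalent of the two selectors above, not the exact translation cssselect generates; the class test is written so that it also matches elements carrying several classes.

```python
from lxml import etree

root = etree.fromstring(
    "<html><body><div class='content'>"
    "<p id='intro'>Hello, world!</p><p>This is a paragraph.</p>"
    "</div></body></html>"
)

# Matches elements whose class attribute contains the token "content"
has_content_class = "contains(concat(' ', normalize-space(@class), ' '), ' content ')"

# Roughly equivalent to: div.content p
paragraphs = root.xpath(f"//div[{has_content_class}]//p")
paragraph_texts = [p.text for p in paragraphs]

# Roughly equivalent to: div.content > p:first-child
first_child = root.xpath(f"//div[{has_content_class}]/*[1][self::p]")[0]

print(paragraph_texts)
print(first_child.text)
```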

Modifying the Tree

You can easily add, remove, and change elements and attributes.

from lxml import etree
# Start with a simple tree
root = etree.Element("root")
child = etree.SubElement(root, "child")
child.text = "Original text"
print("Original tree:")
print(etree.tostring(root, pretty_print=True).decode())
# 1. Modify text
child.text = "Modified text"
# 2. Add a new attribute
child.set("new_attr", "new_value")
# 3. Add a new sub-element
new_child = etree.SubElement(root, "another_child")
new_child.set("id", "2")
new_child.text = "I'm new!"
# 4. Remove an element
root.remove(child) # Remove the 'child' element
print("\nModified tree:")
print(etree.tostring(root, pretty_print=True).decode())
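
Beyond appending with SubElement(), elements can be inserted at a specific position, attributes can be deleted through .attrib, and elements found by a query can be removed via their parent. A minimal sketch with made-up catalog and book tags:

```python
from lxml import etree

root = etree.Element("catalog")
first = etree.SubElement(root, "book", id="1")
third = etree.SubElement(root, "book", id="3")

# Insert a new element between the two existing children
second = etree.Element("book", id="2")
root.insert(1, second)

# Add an attribute, then delete it again through the attrib mapping
third.set("draft", "yes")
del third.attrib["draft"]

# Remove every element matched by a query, using its parent
for book in root.xpath("book[@id='1']"):
    book.getparent().remove(book)

ids = [book.get("id") for book in root]
result = etree.tostring(root)
print(ids)
print(result.decode())
```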

Cleaning Up: Pretty Printing and Serialization

Once you're done modifying, you'll want to convert the tree back to a string or write it to a file.

from lxml import etree
# Let's use the modified tree from the previous example
# root = ... (the element with 'another_child')
# 1. Convert to a string
# tostring() returns bytes by default; encoding='unicode' returns a str instead
# pretty_print=True adds indentation
xml_output = etree.tostring(root, pretty_print=True, encoding='unicode')
print(xml_output)
# 2. Write to a file
with open('modified_output.xml', 'w', encoding='utf-8') as f:
    # tostring with method='xml' and pretty_print is great for files
    f.write(etree.tostring(root, pretty_print=True, encoding='unicode', method='xml'))
print("\nTree written to modified_output.xml")
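
The difference between bytes and str output, and the option of letting lxml write the file itself, can be sketched as follows. An in-memory io.BytesIO buffer stands in for a real file here so the example leaves nothing on disk; ElementTree.write() also accepts a filename.

```python
import io

from lxml import etree

root = etree.Element("root")
etree.SubElement(root, "item", id="1").text = "café"

as_bytes = etree.tostring(root)                     # bytes by default
as_text = etree.tostring(root, encoding="unicode")  # str

# ElementTree.write() handles encoding and the XML declaration in one call
buffer = io.BytesIO()
etree.ElementTree(root).write(
    buffer, encoding="UTF-8", xml_declaration=True, pretty_print=True
)
data = buffer.getvalue()

# Round trip: the serialized bytes parse back to the same content
reparsed = etree.fromstring(data)
print(type(as_bytes).__name__, type(as_text).__name__)
print(data.decode("utf-8"))
```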

HTML Parsing with `html.fromstring()`

etree.fromstring() only accepts HTML that happens to be well-formed XML. For real-world pages, use lxml.html, which tolerates broken markup and provides extra HTML-specific features.

from lxml import html
# Malformed HTML - a browser would handle this fine
html_string = "<div><p>Hello</div><p>World</p>"
# Use html.fromstring() for better HTML parsing
doc = html.fromstring(html_string)
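# The parser repairs the structure: the unclosed <p> is closed inside the <div>.
# Because the fragment has two top-level elements, html.fromstring() wraps
# them in an extra <div> and returns that wrapper.
print(html.tostring(doc, pretty_print=True).decode())
# text_content() concatenates every text node without inserting separators
print(f"All text: {doc.text_content().strip()}") # Output: All text: HelloWorld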
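
As a sketch of the HTML-specific features, the example below parses a small page with an unclosed paragraph, resolves relative links with make_links_absolute(), and collects link and paragraph text. The page content and the base URL https://example.com are illustrative assumptions.

```python
from lxml import html

page = (
    "<html><body><h1>Title</h1>"
    "<p>See <a href='/about'>About</a> page</p>"
    "<p>Unclosed</body></html>"
)

doc = html.fromstring(page)   # a full document, so the <html> element is returned

# Rewrite relative links against a base URL (modifies the tree in place)
doc.make_links_absolute("https://example.com")

links = [(a.text_content(), a.get("href")) for a in doc.xpath("//a")]
paragraphs = [p.text_content() for p in doc.xpath("//p")]

print(links)
print(paragraphs)
```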

Summary of Key Functions/Methods

Task / Method	Description
`etree.fromstring(string)`	Parses a string/bytes into an `Element` (the root).
`etree.parse(file)`	Parses a file into an `ElementTree` object. Use `.getroot()` to get the root `Element`.
`element.tag`	The tag name of an element (e.g., `'div'`).
`element.text`	The text content directly inside an element.
`element.tail`	The text content after an element's closing tag.
`element.get('attr_name')`	Gets the value of an attribute. Safe (returns `None` if not found).
`element.set('attr', 'value')`	Sets an attribute.
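`element.attrib`	A dict-like mapping of all attributes.
`element.find(path)`	Finds the first matching subelement using an ElementPath expression (a limited XPath subset). Returns `None` if nothing matches.
`element.findall(path)`	Finds all matching subelements using an ElementPath expression. Returns a list.
`element.xpath(xpath)`	Evaluates a full XPath 1.0 expression with the element as context. Returns a list for node sets, or a string, number, or boolean for other expressions.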
`etree.tostring(element)`	Serializes an element (or tree) back to a string/bytes.
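
The differences between find(), findall(), findtext(), and xpath() from the table can be seen side by side in a small sketch. The sample document is made up; note that a bare tag name in find() and findall() only looks at direct children.

```python
from lxml import etree

root = etree.fromstring(
    "<root><item>1</item><group><item>2</item></group></root>"
)

first_item = root.find("item").text                       # first direct child
missing = root.find("missing")                            # None when nothing matches
direct = [e.text for e in root.findall("item")]           # direct children only
nested = [e.text for e in root.findall(".//item")]        # any depth
filtered = [e.text for e in root.xpath("//item[. > 1]")]  # full XPath predicate
fallback = root.findtext("missing", default="n/a")        # default for a missing child

print(first_item, missing, direct, nested, filtered, fallback)
```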
