
Python HTML Parser: How to Parse HTML Efficiently?

Parsing HTML is a common task in Python, especially for web scraping. Here's a comprehensive guide covering the most popular methods, from simple to advanced.


The Big Picture: Why Different Parsers?

HTML on the web is often messy, not perfectly formatted, and can be invalid. Different parsers handle this messiness in different ways, which affects their speed and reliability.

  • html.parser — Slow; high leniency; built into Python (no installation). Best for quick scripts and avoiding external dependencies.
  • lxml — Fast; high leniency; pip install lxml. Best for performance-critical applications.
  • html5lib — Very slow; very high leniency (parses exactly like a browser); pip install html5lib. The most robust option; builds a valid HTML5 DOM.

Recommendation: For 95% of cases, use lxml. It's fast and very forgiving. If you can't install external libraries, use the built-in html.parser.
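
To see the differences for yourself, here is a minimal sketch that feeds the same invalid snippet to each parser (the snippet is contrived for illustration; the loop skips any parser that isn't installed):

from bs4 import BeautifulSoup

broken_html = "<p>Unclosed paragraph<li>stray list item"
# Each parser repairs invalid HTML differently, so the resulting
# tree (and therefore your selectors) can vary slightly.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print(f"--- {parser} ---")
        print(BeautifulSoup(broken_html, parser).prettify())
    except Exception as exc:  # raised if the parser isn't installed
        print(f"{parser} skipped: {exc}")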


Method 1: Using BeautifulSoup (The Easiest & Most Common Way)

BeautifulSoup is not a parser itself, but a library that makes parsing any HTML or XML document easy. It provides a convenient, Pythonic way to navigate, search, and modify the parse tree.

You'll need to install it, along with a parser:

pip install beautifulsoup4
pip install lxml  # Recommended parser

Basic Example

Let's parse a simple HTML string.

from bs4 import BeautifulSoup
# The HTML content you want to parse
html_doc = """
<html>
<head><title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
</body>
</html>
"""
# 1. Create a BeautifulSoup object
# 'lxml' is the parser we are using.
soup = BeautifulSoup(html_doc, 'lxml')
# 2. Navigating the Parse Tree
# Get the title tag
print("--- Title Tag ---")
print(soup.title)
#> <title>The Dormouse's story</title>
# Get the name of the title tag
print("\n--- Tag Name ---")
print(soup.title.name)
#> title
# Get the text content of the title tag
print("\n--- Tag Text ---")
print(soup.title.string)
#> The Dormouse's story
# Find the first <p> tag
print("\n--- First Paragraph ---")
print(soup.p)
#> <p class="title"><b>The Dormouse's story</b></p>
# Get the 'class' attribute of the first <p> tag
print("\n--- Class Attribute of First Paragraph ---")
print(soup.p['class'])
#> ['title']
# Find all <a> tags
print("\n--- All Links ---")
for link in soup.find_all('a'):
    print(link)
    #> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    #> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    #> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# Find a specific tag with a specific ID
print("\n--- Link with ID 'link2' ---")
print(soup.find(id='link2'))
#> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Find all tags with a specific CSS class
print("\n--- All Tags with Class 'sister' ---")
for sister in soup.find_all(class_='sister'): # Note: 'class' is a reserved word, so use class_
    print(sister)
    #> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    #> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    #> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# Get the text from all the links
print("\n--- Text from all Links ---")
for link in soup.find_all('a'):
    print(link.get_text())
    #> Elsie
    #> Lacie
    #> Tillie
# Get the 'href' attribute from a link
print("\n--- HREF from Link 1 ---")
print(soup.find('a', id='link1')['href'])
#> http://example.com/elsie

Key BeautifulSoup Methods

  • soup.find('tag_name'): Finds the first occurrence of a tag.
  • soup.find_all('tag_name'): Finds all occurrences of a tag and returns a list.
  • soup.find(id='some_id'): Finds a tag by its id attribute.
  • soup.find(class_='some_class'): Finds the first tag with the given class (use find_all for every match).
  • .get_text(): A method on a tag to extract all its text content.
  • ['attribute_name']: Access an attribute of a tag (e.g., ['href'], ['src']).
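
A minimal sketch combining these methods, reusing the soup object built from html_doc above:

# Find all <a> tags with class "sister", then read attributes and text.
for link in soup.find_all('a', class_='sister'):
    print(link['id'], link['href'], link.get_text())
#> link1 http://example.com/elsie Elsie
#> link2 http://example.com/lacie Lacie
#> link3 http://example.com/tillie Tillie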

Method 2: Using lxml Directly (For Advanced Users)

lxml is a powerful and fast library for processing XML and HTML. It has its own API that is more direct and less "Pythonic" than BeautifulSoup's, but offers more control and better performance.

First, install it:

pip install lxml

Basic Example

from lxml import html
# The HTML content
html_doc = """
<html>
<head><title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> ...</p>
</body>
</html>
"""
# 1. Parse the HTML string
# The 'html' parser specifically handles HTML documents
tree = html.fromstring(html_doc)
# 2. Using XPath to find elements
# XPath is a language for selecting nodes in an XML/HTML document.
# Get the text content of the title tag
# The text() function in XPath gets the text of a node.
title = tree.xpath('//title/text()')[0]
print(f"Title: {title}")
#> Title: The Dormouse's story
# Get the text of all <a> tags
all_links_text = tree.xpath('//a/text()')
print(f"All link text: {all_links_text}")
#> All link text: ['Elsie']
# Get the href attribute of the link with id 'link1'
specific_link_href = tree.xpath('//a[@id="link1"]/@href')[0]
print(f"Link 1 href: {specific_link_href}")
#> Link 1 href: http://example.com/elsie
# Find all <p> tags with class 'story'
story_paragraphs = tree.xpath('//p[@class="story"]')
print(f"Found {len(story_paragraphs)} story paragraphs.")
#> Found 1 story paragraphs.

Key lxml Concepts

  • XPath: The core of lxml's power. You use XPath expressions to select elements.
    • //title: Finds any <title> tag in the document.
    • //a[@id="link1"]: Finds any <a> tag with an id attribute equal to "link1".
    • /text(): Selects the text content of a node.
    • /@href: Selects the value of the href attribute.
  • html.fromstring(): Parses a string into an Element object, which represents the root of the HTML tree.
  • tree.xpath(): The main method for querying the tree using XPath expressions.
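
Queries ending in text() or @attr return plain strings, while element queries return Element objects you can inspect directly. A minimal sketch, reusing the tree built above:

# Elements returned by xpath() expose attributes and text directly.
for a in tree.xpath('//a'):
    # .get() reads an attribute; .text_content() gathers all nested text.
    print(a.get('href'), a.text_content())
#> http://example.com/elsie Elsie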

Method 3: Using html.parser (The Built-in Option)

This is the default parser that ships with Python. It's slower than lxml but requires no installation, and you use it through BeautifulSoup by passing 'html.parser' as the parser name.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>
"""
# Just pass 'html.parser' as the second argument
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)
#> The Dormouse's story

Practical Example: Scraping a Real Web Page

Let's scrape the titles of the latest questions from Stack Overflow's questions page.

Note: Always check a website's robots.txt (e.g., https://stackoverflow.com/robots.txt) and Terms of Service before scraping. Be respectful and don't send too many requests in a short period.
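
Python's standard library can perform the robots.txt check for you. A minimal sketch using urllib.robotparser (the user-agent string matches the one used in the scraper below):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://stackoverflow.com/robots.txt')
rp.read()  # download and parse the robots.txt rules
# can_fetch() reports whether the rules allow this user agent to fetch the URL
print(rp.can_fetch('My Web Scraper 1.0', 'https://stackoverflow.com/questions'))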

import requests
from bs4 import BeautifulSoup
# The URL of the page we want to scrape
url = 'https://stackoverflow.com/questions'
# Set a User-Agent to mimic a browser
headers = {'User-Agent': 'My Web Scraper 1.0'}
try:
    # 1. Fetch the HTML content of the page
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    # 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'lxml')
    # 3. Find the elements containing the question titles
    # We can use the browser's "Inspect" tool to find the right CSS selector.
    # At the time of writing, question titles on Stack Overflow are in <h3> tags
    # with the class "s-post-summary--content-title" (verify, as class names change).
    question_titles = soup.find_all('h3', class_='s-post-summary--content-title')
    # 4. Extract and print the text
    print("Latest Stack Overflow Questions:")
    for title in question_titles:
        # .strip() removes leading/trailing whitespace
        print(title.get_text(strip=True))
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")

Summary and Best Practices

  1. Start with BeautifulSoup: It's the most user-friendly and powerful tool for general-purpose scraping.
  2. Choose a Parser:
    • lxml: Use it for speed and robustness. It's the best default choice.
    • html.parser: Use it if you can't install external dependencies or for very simple, quick tasks.
  3. Use Developer Tools: The single most important tool for web scraping is your browser's Developer Tools (usually opened with F12). Use it to inspect the HTML structure and find the exact tags, classes, and IDs you need to target.
  4. Be Ethical:
    • Check robots.txt: It tells you which parts of a site you are not allowed to scrape.
    • Don't Overload Servers: Add delays between your requests (e.g., time.sleep(1)) to avoid hammering the website; see the sketch after this list.
    • Identify Yourself: Set a User-Agent header in your requests so the site owner knows who is accessing their data.
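
On the last point, here is a minimal sketch of a polite fetch loop (the URL list is hypothetical):

import time
import requests

headers = {'User-Agent': 'My Web Scraper 1.0'}  # identify yourself
urls = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical URLs
for url in urls:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    time.sleep(1)  # pause between requests so the server isn't hammered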