
How do you use CSS selectors in Python?

"Python CSS selectors" refers to using Python libraries to parse HTML and locate elements on a web page with CSS selectors. This is a fundamental skill in web scraping and automated testing.


Here’s a comprehensive guide covering the most popular libraries and how to use them.

The Core Concept

Imagine the HTML of a web page as a tree structure. CSS selectors are a powerful way to "select" or "target" specific nodes (elements) in that tree. In Python, we use libraries to load the HTML into a tree-like structure and then apply our CSS selectors to find the elements we need.

The two most popular libraries for this are:

  1. Beautiful Soup: Excellent for beginners and general-purpose web scraping. It's very forgiving with "messy" HTML.
  2. lxml: Extremely fast and feature-rich. It's built on the C libraries libxml2 and libxslt, making it the performance champion. It's also the engine behind Beautiful Soup when you install the lxml parser.
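To make the core concept concrete, here is a minimal sketch before the full walkthroughs below: load a tiny snippet into a tree and pull one node out with a selector (the snippet and class name are invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny HTML tree: <div> is the parent node, <p> a child node
snippet = '<div><p class="note">hello</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

# The selector 'p.note' targets <p> elements carrying the class "note"
node = soup.select_one('p.note')
print(node.text)  # hello
```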

Using Beautiful Soup (The Easiest Way)

Beautiful Soup is a fantastic choice for scraping most websites because it handles broken and poorly formatted HTML gracefully.


Step 1: Installation

First, you need to install the library. It's highly recommended to also install lxml as its parser because it's much faster than the default one.

pip install beautifulsoup4
pip install lxml

Step 2: Basic Usage

Let's say we have the following HTML content saved in a string or file.

Sample HTML (sample.html):

<!DOCTYPE html>
<html>
<head><title>A Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="main-content">
        <p class="intro">This is the first paragraph.</p>
        <p class="intro">This is the second paragraph.</p>
        <ul class="item-list">
            <li class="item">Item 1</li>
            <li class="item active">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </div>
    <a href="https://example.com" class="link">Example Link</a>
</body>
</html>

Step 3: Python Code with Beautiful Soup

Here’s how you would use Beautiful Soup with CSS selectors.

from bs4 import BeautifulSoup
# Let's assume the HTML is in a string
html_doc = """
<!DOCTYPE html>
<html>
<head><title>A Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="main-content">
        <p class="intro">This is the first paragraph.</p>
        <p class="intro">This is the second paragraph.</p>
        <ul class="item-list">
            <li class="item">Item 1</li>
            <li class="item active">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </div>
    <a href="https://example.com" class="link">Example Link</a>
</body>
</html>
"""
# 1. Create a BeautifulSoup object
# 'lxml' is the parser we installed. It's fast and recommended.
soup = BeautifulSoup(html_doc, 'lxml')
# --- CSS SELECTOR EXAMPLES ---
# 2. Select by Tag Name
# Finds the first <h1> tag
h1_tag = soup.select_one('h1')
print(f"First H1: {h1_tag.text.strip()}")
# 3. Select by ID
# The '#' prefix is for ID
main_div = soup.select_one('#main-content')
print(f"\nFound div by ID: {main_div.name}")
# 4. Select by Class Name
# The '.' prefix is for Class
# select() returns a list of all matching elements
intro_paragraphs = soup.select('p.intro')
print("\nFound paragraphs by class 'intro':")
for p in intro_paragraphs:
    print(f"- {p.text.strip()}")
# 5. Select by Attribute
# Use [attribute_name="attribute_value"]
# This finds the link with href="https://example.com"
example_link = soup.select_one('a[href="https://example.com"]')
print(f"\nFound link by href: {example_link.text.strip()}")
# 6. Combine Selectors (Descendant Selector)
# This finds all <li> elements inside a <ul> with class 'item-list'
list_items = soup.select('ul.item-list li')
print("\nFound all list items:")
for item in list_items:
    print(f"- {item.text.strip()}")
# 7. Select by Class and Tag
# Finds <li> elements that also have the class 'active'
active_item = soup.select_one('li.active')
print(f"\nFound active item: {active_item.text.strip()}")
# 8. Select by Multiple Classes (AND logic)
# An element can have multiple classes. To select an element that has BOTH,
# just put the class selectors together without a space.
# (This is less common for scraping, but good to know)
# active_item_alt = soup.select_one('li.item.active') # This also works
# 9. Select Child Elements
# The '>' selector is for direct children.
# This finds all <p> tags that are direct children of <div id="main-content">
direct_p_tags = soup.select('#main-content > p')
print("\nFound direct <p> children of #main-content:")
for p in direct_p_tags:
    print(f"- {p.text.strip()}")
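One thing the examples above skip is reading attribute values rather than text. Beautiful Soup tags support dictionary-style access for attributes; a short sketch with its own minimal HTML:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="https://example.com" class="link">Example Link</a>'
soup = BeautifulSoup(html_doc, 'html.parser')

link = soup.select_one('a.link')
# Dictionary-style access; raises KeyError if the attribute is absent
print(link['href'])  # https://example.com
# .get() returns a default instead of raising
print(link.get('title', 'no title'))  # no title
```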

Using lxml Directly (The Fastest Way)

lxml is more powerful and faster but can be less forgiving with malformed HTML. It's the engine that powers Beautiful Soup when you use the lxml parser.

Step 1: Installation

pip install lxml cssselect

Step 2: Basic Usage

The syntax is slightly different: parsed elements expose a cssselect() method, which depends on the separate cssselect package (pip install cssselect).

from lxml import html
# Sample HTML content
html_doc = """
<!DOCTYPE html>
<html>
<head><title>A Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="main-content">
        <p class="intro">This is the first paragraph.</p>
        <p class="intro">This is the second paragraph.</p>
        <ul class="item-list">
            <li class="item">Item 1</li>
            <li class="item active">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </div>
    <a href="https://example.com" class="link">Example Link</a>
</body>
</html>
"""
# 1. Parse the HTML string into an lxml tree
tree = html.fromstring(html_doc)
# --- CSS SELECTOR EXAMPLES ---
# 2. Select elements using cssselect()
# It returns a list of elements
h1_tags = tree.cssselect('h1')
print(f"First H1: {h1_tags[0].text.strip()}")
# 3. Select by ID
main_div = tree.cssselect('#main-content')[0]
print(f"\nFound div by ID: {main_div.tag}")
# 4. Select by Class Name
intro_paragraphs = tree.cssselect('p.intro')
print("\nFound paragraphs by class 'intro':")
for p in intro_paragraphs:
    print(f"- {p.text.strip()}")
# 5. Select by Attribute
example_link = tree.cssselect('a[href="https://example.com"]')[0]
print(f"\nFound link by href: {example_link.text.strip()}")
# 6. Combine Selectors
list_items = tree.cssselect('ul.item-list li')
print("\nFound all list items:")
for item in list_items:
    print(f"- {item.text.strip()}")
# 7. Select by Class and Tag
active_item = tree.cssselect('li.active')[0]
print(f"\nFound active item: {active_item.text.strip()}")
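One caveat when reading text from lxml elements: .text holds only the text before the element's first child, so for nested markup text_content() is usually what you want. Attributes are read with .get(). A small sketch:

```python
from lxml import html

# A paragraph with nested markup
para = html.fromstring('<p class="intro">Hello <b>world</b>!</p>')

# .text stops at the first child element
print(repr(para.text))      # 'Hello '
# text_content() joins all descendant text
print(para.text_content())  # Hello world!
# Attributes are read with .get()
print(para.get('class'))    # intro
```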

Comparison: Beautiful Soup vs. lxml

| Feature | Beautiful Soup | lxml |
| --- | --- | --- |
| Ease of use | Winner. Very intuitive, Pythonic API; great for beginners. | More complex API; you work with Element objects directly. |
| Performance | Good, but slower than lxml. With the lxml parser it is fast, with a thin layer of overhead. | Winner. Extremely fast, as it is implemented in C. |
| HTML tolerance | Winner. Very forgiving of broken, messy, or non-standard HTML. | Less forgiving; can fail on malformed HTML. |
| Features | Excellent for parsing and searching; lacks advanced features like XPath, XSLT, and XML Schema validation. | Winner. A full-featured XML toolkit: XPath, XSLT, validation, and more. |
| Use case | General-purpose web scraping, quick data extraction. | High-performance scraping, parsing large XML files, complex document validation. |

Summary and Recommendation

  • For most web scraping tasks, especially if you're a beginner or dealing with unpredictable websites, start with Beautiful Soup and the lxml parser. It's the perfect balance of power, speed, and ease of use.
  • Use lxml directly when you need maximum performance, are working with well-formed XML, or need advanced features like XPath that Beautiful Soup doesn't expose directly.
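Since XPath is the feature that most often pushes people to raw lxml, here is a minimal sketch of an XPath query equivalent to the CSS selector li.active from the earlier examples (cssselect itself works by translating CSS selectors into XPath):

```python
from lxml import html

doc = html.fromstring("""
<ul class="item-list">
    <li class="item">Item 1</li>
    <li class="item active">Item 2</li>
    <li class="item">Item 3</li>
</ul>
""")

# XPath equivalent of the CSS selector 'li.active':
# match <li> whose space-separated class list contains "active"
active = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " active ")]'
)
print(active[0].text)  # Item 2
```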

Common CSS Selectors Cheat Sheet

| Selector | Example | Description |
| --- | --- | --- |
| Element | p | Selects all <p> elements. |
| ID | #main | Selects the element with id="main". |
| Class | .intro | Selects all elements with class="intro". |
| Attribute | [target="_blank"] | Selects all elements with target="_blank". |
| Descendant | div p | Selects all <p> elements inside a <div>. |
| Child | div > p | Selects all <p> elements that are direct children of a <div>. |
| Adjacent sibling | h1 + p | Selects the <p> element immediately after an <h1>. |
| General sibling | h1 ~ p | Selects all <p> siblings that follow an <h1>. |
| Pseudo-class | li:first-child | Selects the first <li> in its list. |
| Pseudo-class | li:nth-child(2) | Selects the second <li>. |
| Pseudo-class | a:hover | Matches <a> elements on hover (rarely useful for scraping). |
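All of these except dynamic pseudo-classes like :hover also work in Beautiful Soup's select() (which hands selector parsing to the soupsieve package). A quick sketch of the sibling combinators and structural pseudo-classes, using a made-up fragment:

```python
from bs4 import BeautifulSoup

html_doc = """
<h1>Title</h1>
<p>First sibling</p>
<p>Second sibling</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# h1 + p: the <p> immediately following the <h1>
print(soup.select_one('h1 + p').text)           # First sibling
# h1 ~ p: every later <p> sibling of the <h1>
print(len(soup.select('h1 ~ p')))               # 2
# Structural pseudo-classes
print(soup.select_one('li:first-child').text)   # Item 1
print(soup.select_one('li:nth-child(2)').text)  # Item 2
```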