
How do you use CSS selectors in Python?

"Python CSS selectors" refers to using Python libraries to parse HTML and locate elements on a web page with CSS selectors. This is a fundamental skill in web scraping and automated testing.


Here’s a comprehensive guide covering the most popular libraries and how to use them.

The Core Concept

Imagine the HTML of a web page as a tree structure. CSS selectors are a powerful way to "select" or "target" specific nodes (elements) in that tree. In Python, we use libraries to load the HTML into a tree-like structure and then apply our CSS selectors to find the elements we need.

The two most popular libraries for this are:

  1. Beautiful Soup: Excellent for beginners and general-purpose web scraping. It's very forgiving with "messy" HTML.
  2. lxml: Extremely fast and feature-rich. It's built on the C libraries libxml2 and libxslt, making it the performance champion. It's also the engine behind Beautiful Soup when you install the lxml parser.
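To make the core concept concrete, here is a minimal sketch before the full walkthroughs below: load a tiny snippet into a tree and pull one node out with a selector (the snippet and class name are invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny HTML tree: <div> is the parent node, <p> a child node
snippet = '<div><p class="note">hello</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

# The selector 'p.note' targets <p> elements carrying the class "note"
node = soup.select_one('p.note')
print(node.text)  # hello
```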

Using Beautiful Soup (The Easiest Way)

Beautiful Soup is a fantastic choice for scraping most websites because it handles broken and poorly formatted HTML gracefully.


Step 1: Installation

First, you need to install the library. It's highly recommended to also install lxml as its parser because it's much faster than the default one.

pip install beautifulsoup4
pip install lxml

Step 2: Basic Usage

Let's say we have the following HTML content saved in a string or file.

Sample HTML (sample.html):

<!DOCTYPE html>
<html>
<head><title>A Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="main-content">
        <p class="intro">This is the first paragraph.</p>
        <p class="intro">This is the second paragraph.</p>
        <ul class="item-list">
            <li class="item">Item 1</li>
            <li class="item active">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </div>
    <a href="https://example.com" class="link">Example Link</a>
</body>
</html>

Step 3: Python Code with Beautiful Soup

Here’s how you would use Beautiful Soup with CSS selectors.

from bs4 import BeautifulSoup
# Let's assume the HTML is in a string
html_doc = """
<!DOCTYPE html>
<html>
<head><title>A Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="main-content">
        <p class="intro">This is the first paragraph.</p>
        <p class="intro">This is the second paragraph.</p>
        <ul class="item-list">
            <li class="item">Item 1</li>
            <li class="item active">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </div>
    <a href="https://example.com" class="link">Example Link</a>
</body>
</html>
"""
# 1. Create a BeautifulSoup object
# 'lxml' is the parser we installed. It's fast and recommended.
soup = BeautifulSoup(html_doc, 'lxml')
# --- CSS SELECTOR EXAMPLES ---
# 2. Select by Tag Name
# Finds the first <h1> tag
h1_tag = soup.select_one('h1')
print(f"First H1: {h1_tag.text.strip()}")
# 3. Select by ID
# The '#' prefix is for ID
main_div = soup.select_one('#main-content')
print(f"\nFound div by ID: {main_div.name}")
# 4. Select by Class Name
# The '.' prefix is for Class
# select() returns a list of all matching elements
intro_paragraphs = soup.select('p.intro')
print("\nFound paragraphs by class 'intro':")
for p in intro_paragraphs:
    print(f"- {p.text.strip()}")
# 5. Select by Attribute
# Use [attribute_name="attribute_value"]
# This finds the link with href="https://example.com"
example_link = soup.select_one('a[href="https://example.com"]')
print(f"\nFound link by href: {example_link.text.strip()}")
# 6. Combine Selectors (Descendant Selector)
# This finds all <li> elements inside a <ul> with class 'item-list'
list_items = soup.select('ul.item-list li')
print("\nFound all list items:")
for item in list_items:
    print(f"- {item.text.strip()}")
# 7. Select by Class and Tag
# Finds <li> elements that also have the class 'active'
active_item = soup.select_one('li.active')
print(f"\nFound active item: {active_item.text.strip()}")
# 8. Select by Multiple Classes (AND logic)
# An element can have multiple classes. To select an element that has BOTH,
# just put the class selectors together without a space.
# (This is less common for scraping, but good to know)
# active_item_alt = soup.select_one('li.item.active') # This also works
# 9. Select Child Elements
# The '>' selector is for direct children.
# This finds all <p> tags that are direct children of <div id="main-content">
direct_p_tags = soup.select('#main-content > p')
print("\nFound direct <p> children of #main-content:")
for p in direct_p_tags:
    print(f"- {p.text.strip()}")
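One thing the examples above skip is reading attribute values rather than text. Beautiful Soup tags support dictionary-style access for attributes; a short sketch with its own minimal HTML:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="https://example.com" class="link">Example Link</a>'
soup = BeautifulSoup(html_doc, 'html.parser')

link = soup.select_one('a.link')
# Dictionary-style access; raises KeyError if the attribute is absent
print(link['href'])  # https://example.com
# .get() returns a default instead of raising
print(link.get('title', 'no title'))  # no title
```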

Using lxml Directly (The Fastest Way)

lxml is more powerful and faster but can be less forgiving with malformed HTML. It's the engine that powers Beautiful Soup when you use the lxml parser.

Step 1: Installation

pip install lxml cssselect

Step 2: Basic Usage

The syntax is slightly different: parsed elements expose a cssselect() method, which depends on the separate cssselect package (pip install cssselect).

from lxml import html
# Sample HTML content
html_doc = """
<!DOCTYPE html>
<html>
<head><title>A Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="main-content">
        <p class="intro">This is the first paragraph.</p>
        <p class="intro">This is the second paragraph.</p>
        <ul class="item-list">
            <li class="item">Item 1</li>
            <li class="item active">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </div>
    <a href="https://example.com" class="link">Example Link</a>
</body>
</html>
"""
# 1. Parse the HTML string into an lxml tree
tree = html.fromstring(html_doc)
# --- CSS SELECTOR EXAMPLES ---
# 2. Select elements using cssselect()
# It returns a list of elements
h1_tags = tree.cssselect('h1')
print(f"First H1: {h1_tags[0].text.strip()}")
# 3. Select by ID
main_div = tree.cssselect('#main-content')[0]
print(f"\nFound div by ID: {main_div.tag}")
# 4. Select by Class Name
intro_paragraphs = tree.cssselect('p.intro')
print("\nFound paragraphs by class 'intro':")
for p in intro_paragraphs:
    print(f"- {p.text.strip()}")
# 5. Select by Attribute
example_link = tree.cssselect('a[href="https://example.com"]')[0]
print(f"\nFound link by href: {example_link.text.strip()}")
# 6. Combine Selectors
list_items = tree.cssselect('ul.item-list li')
print("\nFound all list items:")
for item in list_items:
    print(f"- {item.text.strip()}")
# 7. Select by Class and Tag
active_item = tree.cssselect('li.active')[0]
print(f"\nFound active item: {active_item.text.strip()}")
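One caveat when reading text from lxml elements: .text holds only the text before the element's first child, so for nested markup text_content() is usually what you want. Attributes are read with .get(). A small sketch:

```python
from lxml import html

# A paragraph with nested markup
para = html.fromstring('<p class="intro">Hello <b>world</b>!</p>')

# .text stops at the first child element
print(repr(para.text))      # 'Hello '
# text_content() joins all descendant text
print(para.text_content())  # Hello world!
# Attributes are read with .get()
print(para.get('class'))    # intro
```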

Comparison: Beautiful Soup vs. lxml

| Feature | Beautiful Soup | lxml |
| --- | --- | --- |
| Ease of use | Winner. Very intuitive, Pythonic API; great for beginners. | More complex API; you work with Element objects directly. |
| Performance | Good, but slower than lxml. With the lxml parser it is fast, with a thin layer of overhead. | Winner. Extremely fast, as it is implemented in C. |
| HTML tolerance | Winner. Very forgiving of broken, messy, or non-standard HTML. | Less forgiving; can fail on malformed HTML. |
| Features | Excellent for parsing and searching; lacks advanced features like XPath, XSLT, and XML Schema validation. | Winner. A full-featured XML toolkit: XPath, XSLT, validation, and more. |
| Use case | General-purpose web scraping, quick data extraction. | High-performance scraping, parsing large XML files, complex document validation. |

Summary and Recommendation

  • For most web scraping tasks, especially if you're a beginner or dealing with unpredictable websites, start with Beautiful Soup and the lxml parser. It's the perfect balance of power, speed, and ease of use.
  • Use lxml directly when you need maximum performance, are working with well-formed XML, or need advanced features like XPath that Beautiful Soup doesn't expose directly.
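Since XPath is the feature that most often pushes people to raw lxml, here is a minimal sketch of an XPath query equivalent to the CSS selector li.active from the earlier examples (cssselect itself works by translating CSS selectors into XPath):

```python
from lxml import html

doc = html.fromstring("""
<ul class="item-list">
    <li class="item">Item 1</li>
    <li class="item active">Item 2</li>
    <li class="item">Item 3</li>
</ul>
""")

# XPath equivalent of the CSS selector 'li.active':
# match <li> whose space-separated class list contains "active"
active = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " active ")]'
)
print(active[0].text)  # Item 2
```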

Common CSS Selectors Cheat Sheet

| Selector | Example | Description |
| --- | --- | --- |
| Element | p | Selects all <p> elements. |
| ID | #main | Selects the element with id="main". |
| Class | .intro | Selects all elements with class="intro". |
| Attribute | [target="_blank"] | Selects all elements with target="_blank". |
| Descendant | div p | Selects all <p> elements inside a <div>. |
| Child | div > p | Selects all <p> elements that are direct children of a <div>. |
| Adjacent sibling | h1 + p | Selects the <p> element immediately after an <h1>. |
| General sibling | h1 ~ p | Selects all <p> siblings that follow an <h1>. |
| Pseudo-class | li:first-child | Selects the first <li> in its list. |
| Pseudo-class | li:nth-child(2) | Selects the second <li>. |
| Pseudo-class | a:hover | Matches <a> elements on hover (rarely useful for scraping). |
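All of these except dynamic pseudo-classes like :hover also work in Beautiful Soup's select() (which hands selector parsing to the soupsieve package). A quick sketch of the sibling combinators and structural pseudo-classes, using a made-up fragment:

```python
from bs4 import BeautifulSoup

html_doc = """
<h1>Title</h1>
<p>First sibling</p>
<p>Second sibling</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# h1 + p: the <p> immediately following the <h1>
print(soup.select_one('h1 + p').text)           # First sibling
# h1 ~ p: every later <p> sibling of the <h1>
print(len(soup.select('h1 ~ p')))               # 2
# Structural pseudo-classes
print(soup.select_one('li:first-child').text)   # Item 1
print(soup.select_one('li:nth-child(2)').text)  # Item 2
```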