How can Python's BeautifulSoup parse HTML efficiently?

Let's dive into Beautiful Soup, a fantastic Python library for parsing HTML in web-scraping projects.


What is Beautiful Soup?

Think of Beautiful Soup as a parsing toolkit. Its job is to take messy, unstructured HTML (or XML) code from a webpage and turn it into a structured Python object that you can easily navigate and search.

It's not a web scraper by itself (it doesn't fetch web pages). Instead, it pairs with an HTTP library like requests: requests downloads the HTML, and Beautiful Soup parses it.

Analogy: Imagine a giant, messy closet full of clothes (the raw HTML). Beautiful Soup is like a personal assistant who comes in, folds everything, and organizes it neatly by shirts, pants, and socks, so you can easily find what you need.
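
To make this concrete, here's a minimal, self-contained sketch (the HTML string is made up for the example, so no network access is needed):

from bs4 import BeautifulSoup
# A tiny inline HTML document
html_doc = "<html><body><p class='intro'>Hello, <b>world</b>!</p></body></html>"
# Beautiful Soup hands the markup to a parser and builds a navigable tree
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p.get_text())    # Hello, world!
print(soup.p['class'])      # ['intro'] -- attributes behave like a dict
print(soup.find('b').name)  # b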


The Core Workflow: 3 Simple Steps

  1. Fetch the HTML: Use the requests library to get the content of a webpage.
  2. Parse the HTML: Feed the HTML content into Beautiful Soup to create a parse tree.
  3. Search the Tree: Use Beautiful Soup's methods to find and extract the data you want.

Step 1: Installation

You'll need to install two libraries: requests to get the web page and beautifulsoup4 to parse it.

Python BeautifulSoup如何高效解析HTML?-图2
(图片来源网络,侵删)
pip install requests
pip install beautifulsoup4
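
If you want to confirm that both libraries installed correctly, a quick one-liner prints their versions (your version numbers will differ):

python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"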

Step 2: A Simple Example

Let's scrape the titles and links from the Wikipedia page for "Python (programming language)".

The HTML Structure (Simplified)

The page has a section called "Getting started". We want to find all the links (<a> tags) within that section.

<div id="Getting_started">
  <h2>Getting started</h2>
  <p>
    <a href="/wiki/Python_Software_Foundation" title="Python Software Foundation">
      Python Software Foundation
    </a>
  </p>
  <p>
    <a href="/wiki/Hello_world_program" title="Hello world program">
      "Hello, world!" program
    </a>
  </p>
</div>

The Python Code

Here is the complete script to fetch the page and extract the links.

import requests
from bs4 import BeautifulSoup
# 1. Fetch the HTML content
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
try:
    # A timeout stops the script from hanging indefinitely on a slow server
    response = requests.get(url, timeout=10)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
    exit()
# 2. Parse the HTML with Beautiful Soup
# We use 'html.parser' which is Python's built-in parser.
soup = BeautifulSoup(response.text, 'html.parser')
# 3. Search the tree and extract data
# --- Method 1: Find by ID (most reliable) ---
# Find the main content div, which has the id 'mw-content-text'
main_content = soup.find('div', id='mw-content-text')
# Within that div, find the section with the id 'Getting_started'.
# Guard against None: page structures change, and calling .find_all()
# on a failed lookup would raise AttributeError.
getting_started_section = main_content.find('div', id='Getting_started') if main_content else None
if getting_started_section is None:
    print("Could not find the 'Getting started' section; the page structure may have changed.")
    exit()
# Find all <a> tags within that specific section
links_in_section = getting_started_section.find_all('a')
print("--- Links found in the 'Getting started' section ---")
for link in links_in_section:
    # .get('href') is the safe way to get an attribute
    href = link.get('href') 
    # .get_text() gets the visible text of the tag
    text = link.get_text(strip=True) # strip=True removes leading/trailing whitespace
    # We only want valid links that start with /wiki/
    if href and href.startswith('/wiki/'):
        print(f"Text: {text.ljust(40)} | Link: https://en.wikipedia.org{href}")
# --- Method 2: Find by CSS Selector (more advanced, very powerful) ---
# This does the same thing as the block above, but in one line.
print("\n--- Using CSS Selector for the same result ---")
css_selector_links = soup.select('div#Getting_started a')
for link in css_selector_links:
    href = link.get('href')
    text = link.get_text(strip=True)
    if href and href.startswith('/wiki/'):
        print(f"Text: {text.ljust(40)} | Link: https://en.wikipedia.org{href}")

Key Beautiful Soup Objects and Methods

Understanding the objects and methods below covers about 90% of what you'll need.

| Object/Method | Description | Example |
| --- | --- | --- |
| soup | The main object created from BeautifulSoup(html, 'parser'); it represents the entire document. | soup = BeautifulSoup(html_doc, 'html.parser') |
| .find() | Finds the first matching element. Returns a single Tag object or None. | first_div = soup.find('div') |
| .find_all() | Finds all matching elements. Returns a list of Tag objects. | all_links = soup.find_all('a') |
| Tag object | Represents an HTML tag (e.g., <p>, <div>, <a>). | link_tag = soup.find('a') |
| .name | The name of the tag. | link_tag.name -> 'a' |
| .attrs | A dictionary of the tag's attributes. | link_tag.attrs -> {'href': '/wiki/Python', 'title': 'Python'} |
| .get('attr_name') | The recommended way to get the value of a specific attribute. | link_tag.get('href') -> '/wiki/Python' |
| .get_text() | Gets all the human-readable text from a tag and its children. | link_tag.get_text() -> 'Python' |
| .string | Gets the text content of a tag if it has exactly one child string. | link_tag.string -> 'Python' |
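
Here is a short sketch exercising each of these on an inline snippet (the markup and variable names are just illustrative):

from bs4 import BeautifulSoup
html_doc = '<div><a href="/wiki/Python" title="Python">Python</a> <a href="/wiki/PHP">PHP</a></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
link_tag = soup.find('a')        # first match: a single Tag object
all_links = soup.find_all('a')   # list of every <a> tag
print(link_tag.name)             # a
print(link_tag.attrs)            # {'href': '/wiki/Python', 'title': 'Python'}
print(link_tag.get('href'))      # /wiki/Python
print(link_tag.get_text())       # Python
print(link_tag.string)           # Python (exactly one child string)
print(len(all_links))            # 2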

Advanced Searching (CSS Selectors)

Beautiful Soup has excellent support for CSS selectors, which is often more intuitive than using find and find_all.

| CSS Selector | Method | Description |
| --- | --- | --- |
| tag_name | soup.select('a') | Select all <a> tags. |
| .class_name | soup.select('.infobox') | Select all elements with class="infobox". |
| #id_name | soup.select('#content') | Select the element with id="content". |
| parent > child | soup.select('div > p') | Select all <p> tags that are direct children of a <div>. |
| tag.class | soup.select('a.external') | Select all <a> tags with class="external". |
| [attribute] | soup.select('a[href]') | Select all <a> tags that have an href attribute. |
| [attribute=value] | soup.select('a[href="/wiki/Python"]') | Select all <a> tags whose href is exactly /wiki/Python. |
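
The sketch below runs a few of these selectors against a small inline document (the markup is invented for the example):

from bs4 import BeautifulSoup
html_doc = """
<div id="content">
  <p>Intro paragraph</p>
  <a class="external" href="/wiki/Python">Python</a>
  <a href="#top">Back to top</a>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select('#content'))     # the element with id="content"
print(soup.select('div > p'))      # <p> tags directly inside a <div>
print(soup.select('a.external'))   # <a> tags with class="external"
print(soup.select('a[href]'))      # <a> tags that have an href attribute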

Handling Different Parsers

The parser you use is important. html.parser is built-in but a bit slow. lxml is much faster and more lenient with broken HTML, but you have to install it separately.

| Parser | How to Install | Pros | Cons |
| --- | --- | --- | --- |
| html.parser | Built-in | No extra installation; secure. | Slower; less forgiving of bad HTML. |
| lxml | pip install lxml | Very fast; handles broken HTML well. | Requires an external C library. |
| html5lib | pip install html5lib | Extremely lenient; creates valid HTML5. | Slowest; requires an external library. |

Example with lxml:

from bs4 import BeautifulSoup
import requests
response = requests.get('http://example.com', timeout=10)
soup = BeautifulSoup(response.text, 'lxml')  # Just change the parser name here!
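
To see the leniency differences in practice, feed the same broken fragment to each parser and compare the trees they build. (Exact output depends on the versions you have installed, so treat this as an illustration, not a guaranteed result.)

from bs4 import BeautifulSoup
broken_html = "<p>Unclosed paragraph<li>Stray list item"
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        print(parser, '->', BeautifulSoup(broken_html, parser))
    except Exception as e:  # e.g., the parser isn't installed
        print(parser, '-> unavailable:', e)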

Best Practices and Ethics

  1. Check robots.txt: Before scraping any site, check its robots.txt file (e.g., https://example.com/robots.txt). This file tells you which parts of the site the owner does not want bots to access.
  2. Be a Good Citizen:
    • Set a User-Agent: Some sites block the default requests User-Agent. Set a custom header that identifies your script (or mimics a browser).
      headers = {'User-Agent': 'MyCoolScraper/1.0 (myemail@example.com)'}
      response = requests.get(url, headers=headers)
    • Add Delays: Don't hammer the server with requests. Add a small delay between requests.
      import time
      time.sleep(2) # Wait for 2 seconds
  3. Handle Errors: Network connections can fail and webpages can change their structure, so wrap requests in try/except and check lookup results for None. The sketch below ties these practices together.
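
A minimal sketch combining these practices: it checks robots.txt with Python's standard urllib.robotparser, sends an identified request, handles failures, and pauses between pages. The site URL, page list, and User-Agent string are placeholders:

import time
import requests
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup
USER_AGENT = 'MyCoolScraper/1.0 (myemail@example.com)'  # placeholder: identify yourself honestly
BASE_URL = 'https://example.com'                        # placeholder site
# 1. Read robots.txt once before crawling
robots = RobotFileParser()
robots.set_url(f'{BASE_URL}/robots.txt')
robots.read()
for url in (f'{BASE_URL}/', f'{BASE_URL}/about'):       # hypothetical pages
    if not robots.can_fetch(USER_AGENT, url):
        print(f'Skipping {url} (disallowed by robots.txt)')
        continue
    # 2. Send a polite, identified request and handle network errors
    try:
        response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f'Error fetching {url}: {e}')
        continue
    # 3. Parse defensively: the page structure may have changed
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find('title')
    print(title.get_text(strip=True) if title else f'{url}: no <title> found')
    # 4. Be gentle on the server
    time.sleep(2)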