How can Python's BeautifulSoup parse HTML efficiently?

Let's dive into Beautiful Soup, a fantastic Python library for parsing HTML in web-scraping projects.


What is Beautiful Soup?

Think of Beautiful Soup as a parsing toolkit. Its job is to take messy, unstructured HTML (or XML) code from a webpage and turn it into a structured Python object that you can easily navigate and search.

It's not a web scraper by itself (it doesn't fetch web pages). Instead, it pairs with an HTTP library like requests: requests downloads the HTML, and Beautiful Soup parses it.

Analogy: Imagine a giant, messy closet full of clothes (the raw HTML). Beautiful Soup is like a personal assistant who comes in, folds everything, and organizes it neatly by shirts, pants, and socks, so you can easily find what you need.
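
To make this concrete, here's a minimal, self-contained sketch (the HTML string is made up for the example, so no network access is needed):

from bs4 import BeautifulSoup
# A tiny inline HTML document
html_doc = "<html><body><p class='intro'>Hello, <b>world</b>!</p></body></html>"
# Beautiful Soup hands the markup to a parser and builds a navigable tree
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p.get_text())    # Hello, world!
print(soup.p['class'])      # ['intro'] -- attributes behave like a dict
print(soup.find('b').name)  # b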


The Core Workflow: 3 Simple Steps

  1. Fetch the HTML: Use the requests library to get the content of a webpage.
  2. Parse the HTML: Feed the HTML content into Beautiful Soup to create a parse tree.
  3. Search the Tree: Use Beautiful Soup's methods to find and extract the data you want.

Step 1: Installation

You'll need to install two libraries: requests to get the web page and beautifulsoup4 to parse it.

Python BeautifulSoup如何高效解析HTML?-图2
(图片来源网络,侵删)
pip install requests
pip install beautifulsoup4
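
If you want to confirm that both libraries installed correctly, a quick one-liner prints their versions (your version numbers will differ):

python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"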

Step 2: A Simple Example

Let's scrape the titles and links from the Wikipedia page for "Python (programming language)".

The HTML Structure (Simplified)

The page has a section called "Getting started". We want to find all the links (<a> tags) within that section.

<div id="Getting_started">
  <h2>Getting started</h2>
  <p>
    <a href="/wiki/Python_Software_Foundation" title="Python Software Foundation">
      Python Software Foundation
    </a>
  </p>
  <p>
    <a href="/wiki/Hello_world_program" title="Hello world program">
      "Hello, world!" program
    </a>
  </p>
</div>

The Python Code

Here is the complete script to fetch the page and extract the links.

import requests
from bs4 import BeautifulSoup
# 1. Fetch the HTML content
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
try:
    # A timeout stops the script from hanging indefinitely on a slow server
    response = requests.get(url, timeout=10)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
    exit()
# 2. Parse the HTML with Beautiful Soup
# We use 'html.parser' which is Python's built-in parser.
soup = BeautifulSoup(response.text, 'html.parser')
# 3. Search the tree and extract data
# --- Method 1: Find by ID (most reliable) ---
# Find the main content div, which has the id 'mw-content-text'
main_content = soup.find('div', id='mw-content-text')
# Within that div, find the section with the id 'Getting_started'.
# Guard against None: page structures change, and calling .find_all()
# on a failed lookup would raise AttributeError.
getting_started_section = main_content.find('div', id='Getting_started') if main_content else None
if getting_started_section is None:
    print("Could not find the 'Getting started' section; the page structure may have changed.")
    exit()
# Find all <a> tags within that specific section
links_in_section = getting_started_section.find_all('a')
print("--- Links found in the 'Getting started' section ---")
for link in links_in_section:
    # .get('href') is the safe way to get an attribute
    href = link.get('href') 
    # .get_text() gets the visible text of the tag
    text = link.get_text(strip=True) # strip=True removes leading/trailing whitespace
    # We only want valid links that start with /wiki/
    if href and href.startswith('/wiki/'):
        print(f"Text: {text.ljust(40)} | Link: https://en.wikipedia.org{href}")
# --- Method 2: Find by CSS Selector (more advanced, very powerful) ---
# This does the same thing as the block above, but in one line.
print("\n--- Using CSS Selector for the same result ---")
css_selector_links = soup.select('div#Getting_started a')
for link in css_selector_links:
    href = link.get('href')
    text = link.get_text(strip=True)
    if href and href.startswith('/wiki/'):
        print(f"Text: {text.ljust(40)} | Link: https://en.wikipedia.org{href}")

Key Beautiful Soup Objects and Methods

Understanding the objects and methods below covers about 90% of what you'll need.

| Object/Method | Description | Example |
| --- | --- | --- |
| soup | The main object created from BeautifulSoup(html, 'parser'); it represents the entire document. | soup = BeautifulSoup(html_doc, 'html.parser') |
| .find() | Finds the first matching element. Returns a single Tag object or None. | first_div = soup.find('div') |
| .find_all() | Finds all matching elements. Returns a list of Tag objects. | all_links = soup.find_all('a') |
| Tag object | Represents an HTML tag (e.g., <p>, <div>, <a>). | link_tag = soup.find('a') |
| .name | The name of the tag. | link_tag.name -> 'a' |
| .attrs | A dictionary of the tag's attributes. | link_tag.attrs -> {'href': '/wiki/Python', 'title': 'Python'} |
| .get('attr_name') | The recommended way to get the value of a specific attribute. | link_tag.get('href') -> '/wiki/Python' |
| .get_text() | Gets all the human-readable text from a tag and its children. | link_tag.get_text() -> 'Python' |
| .string | Gets the text content of a tag if it has exactly one child string. | link_tag.string -> 'Python' |
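
Here is a short sketch exercising each of these on an inline snippet (the markup and variable names are just illustrative):

from bs4 import BeautifulSoup
html_doc = '<div><a href="/wiki/Python" title="Python">Python</a> <a href="/wiki/PHP">PHP</a></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
link_tag = soup.find('a')        # first match: a single Tag object
all_links = soup.find_all('a')   # list of every <a> tag
print(link_tag.name)             # a
print(link_tag.attrs)            # {'href': '/wiki/Python', 'title': 'Python'}
print(link_tag.get('href'))      # /wiki/Python
print(link_tag.get_text())       # Python
print(link_tag.string)           # Python (exactly one child string)
print(len(all_links))            # 2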

Advanced Searching (CSS Selectors)

Beautiful Soup has excellent support for CSS selectors, which is often more intuitive than using find and find_all.

| CSS Selector | Method | Description |
| --- | --- | --- |
| tag_name | soup.select('a') | Select all <a> tags. |
| .class_name | soup.select('.infobox') | Select all elements with class="infobox". |
| #id_name | soup.select('#content') | Select the element with id="content". |
| parent > child | soup.select('div > p') | Select all <p> tags that are direct children of a <div>. |
| tag.class | soup.select('a.external') | Select all <a> tags with class="external". |
| [attribute] | soup.select('a[href]') | Select all <a> tags that have an href attribute. |
| [attribute=value] | soup.select('a[href="/wiki/Python"]') | Select all <a> tags whose href is exactly /wiki/Python. |
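
The sketch below runs a few of these selectors against a small inline document (the markup is invented for the example):

from bs4 import BeautifulSoup
html_doc = """
<div id="content">
  <p>Intro paragraph</p>
  <a class="external" href="/wiki/Python">Python</a>
  <a href="#top">Back to top</a>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select('#content'))     # the element with id="content"
print(soup.select('div > p'))      # <p> tags directly inside a <div>
print(soup.select('a.external'))   # <a> tags with class="external"
print(soup.select('a[href]'))      # <a> tags that have an href attribute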

Handling Different Parsers

The parser you use is important. html.parser is built-in but a bit slow. lxml is much faster and more lenient with broken HTML, but you have to install it separately.

| Parser | How to Install | Pros | Cons |
| --- | --- | --- | --- |
| html.parser | Built-in | No extra installation; secure. | Slower; less forgiving of bad HTML. |
| lxml | pip install lxml | Very fast; handles broken HTML well. | Requires an external C library. |
| html5lib | pip install html5lib | Extremely lenient; creates valid HTML5. | Slowest; requires an external library. |

Example with lxml:

from bs4 import BeautifulSoup
import requests
response = requests.get('http://example.com', timeout=10)
soup = BeautifulSoup(response.text, 'lxml')  # Just change the parser name here!
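
To see the leniency differences in practice, feed the same broken fragment to each parser and compare the trees they build. (Exact output depends on the versions you have installed, so treat this as an illustration, not a guaranteed result.)

from bs4 import BeautifulSoup
broken_html = "<p>Unclosed paragraph<li>Stray list item"
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        print(parser, '->', BeautifulSoup(broken_html, parser))
    except Exception as e:  # e.g., the parser isn't installed
        print(parser, '-> unavailable:', e)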

Best Practices and Ethics

  1. Check robots.txt: Before scraping any site, check its robots.txt file (e.g., https://example.com/robots.txt). This file tells you which parts of the site the owner does not want bots to access.
  2. Be a Good Citizen:
    • Set a User-Agent: Some sites block the default requests User-Agent. Set a custom header that identifies your script (or mimics a browser).
      headers = {'User-Agent': 'MyCoolScraper/1.0 (myemail@example.com)'}
      response = requests.get(url, headers=headers)
    • Add Delays: Don't hammer the server with requests. Add a small delay between requests.
      import time
      time.sleep(2) # Wait for 2 seconds
  3. Handle Errors: Network connections can fail and webpages can change their structure, so wrap requests in try/except and check lookup results for None. The sketch below ties these practices together.
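
A minimal sketch combining these practices: it checks robots.txt with Python's standard urllib.robotparser, sends an identified request, handles failures, and pauses between pages. The site URL, page list, and User-Agent string are placeholders:

import time
import requests
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup
USER_AGENT = 'MyCoolScraper/1.0 (myemail@example.com)'  # placeholder: identify yourself honestly
BASE_URL = 'https://example.com'                        # placeholder site
# 1. Read robots.txt once before crawling
robots = RobotFileParser()
robots.set_url(f'{BASE_URL}/robots.txt')
robots.read()
for url in (f'{BASE_URL}/', f'{BASE_URL}/about'):       # hypothetical pages
    if not robots.can_fetch(USER_AGENT, url):
        print(f'Skipping {url} (disallowed by robots.txt)')
        continue
    # 2. Send a polite, identified request and handle network errors
    try:
        response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f'Error fetching {url}: {e}')
        continue
    # 3. Parse defensively: the page structure may have changed
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find('title')
    print(title.get_text(strip=True) if title else f'{url}: no <title> found')
    # 4. Be gentle on the server
    time.sleep(2)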