
How do you parse HTML efficiently with Beautiful Soup?

Here is a comprehensive guide to using Beautiful Soup, one of the most popular Python libraries for web scraping.


What is Beautiful Soup?

Beautiful Soup is a Python library designed for pulling data out of HTML and XML files. It creates a parsed tree of the document's contents, making it incredibly easy to navigate, search, and modify the parse tree.

Think of it as giving your Python program the ability to "read" and "understand" the structure of a webpage.

Key Features:

  • Parses HTML and XML documents.
  • Provides simple, Pythonic ways to navigate the parsed tree (e.g., find tags by name, find tags by CSS class).
  • Converts incoming documents to Unicode and outgoing documents to UTF-8.
  • Tolerates "badly-formed" markup, which is common on the real web.
  • Works with various parsers (like Python's built-in html.parser, lxml, and html5lib).
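To make the last two points concrete, here is a minimal sketch that parses deliberately broken markup with the built-in html.parser (runnable once the library is installed, as shown in the next section):

from bs4 import BeautifulSoup

# Deliberately broken markup: the <p> and <b> tags are never closed
messy = "<html><body><p class='intro'>Hello <b>world"
soup = BeautifulSoup(messy, 'html.parser')

print(soup.find('p', class_='intro').get_text())  # -> Hello world
print(soup.b.get_text())                          # -> world

Despite the missing closing tags, Beautiful Soup still builds a usable tree you can search and navigate.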

Installation

First, you need to install Beautiful Soup. It's highly recommended to also install a parser like lxml, as it's much faster than the built-in html.parser.

pip install beautifulsoup4
pip install lxml
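You can verify that both packages installed correctly with a quick one-liner:

python -c "import bs4, lxml; print(bs4.__version__)"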

Basic Workflow: A Step-by-Step Example

Let's scrape the titles and links from the Wikipedia page for "Python (programming language)".

Step 1: Fetch the Web Page

You need to get the HTML content of the page. The de facto standard tool for this in Python is the third-party requests library.

pip install requests

Now, let's write the code to fetch the page.

import requests
from bs4 import BeautifulSoup
# The URL of the page we want to scrape
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
# Send an HTTP GET request to the URL
try:
    response = requests.get(url)
    # Raise an exception if the request was unsuccessful (e.g., 404 Not Found)
    response.raise_for_status() 
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
else:
    # The HTML content of the page is in the .text attribute
    html_content = response.text
    print("Successfully fetched the page!")

Step 2: Create a BeautifulSoup Object

Now, we'll parse the HTML content using Beautiful Soup and the lxml parser.

# Create a BeautifulSoup object
# The first argument is the HTML content
# The second argument is the parser to use ('lxml', 'html.parser', etc.)
soup = BeautifulSoup(html_content, 'lxml')
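A quick way to confirm the parse worked is to inspect a well-known element (this assumes the html_content variable from Step 1):

# Sanity checks on the parsed tree
print(soup.title)             # the page's <title> element
print(soup.title.get_text())  # just its text, e.g. the Wikipedia page title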

Step 3: Find Elements in the Parsed Tree

This is the core of Beautiful Soup. We'll use methods like .find() and .find_all() to locate the data we need.

Goal: Find all the main section headings (e.g., "History", "Design philosophy and features", etc.). On Wikipedia, these are inside <h2> tags.

# Find all <h2> tags on the page
section_headings = soup.find_all('h2')
print(f"Found {len(section_headings)} <h2> tags.")

Goal: Extract the text from these headings.

On Wikipedia, each <h2> typically wraps its label in a <span> with a class; .get_text() lets us pull the readable text of the <h2> without worrying about that nested markup.

for heading in section_headings:
    # .get_text() strips all tags and returns just the human-readable text
    heading_text = heading.get_text(strip=True)
    print(heading_text)
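If you want to target the heading label directly instead of the whole <h2>, a CSS selector works too. Note that Wikipedia's markup changes over time, so the mw-headline class used here is an assumption based on its historical markup:

# Target the heading label span directly (class name may vary by Wikipedia version)
for span in soup.select('h2 .mw-headline'):
    print(span.get_text(strip=True))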

Step 4: Refine Your Search with Attributes and CSS Selectors

Often, you need to be more specific. Let's find the main infobox table on the right side of the page. It has a class called infobox.

# Find the first element with the class 'infobox'
infobox = soup.find('table', class_='infobox')
if infobox:
    print("Found the infobox table!")
    # Find all rows (<tr>) in the infobox
    rows = infobox.find_all('tr')
    for row in rows:
        # Find the header cell (<th>) in the row
        header_cell = row.find('th')
        # Find the data cell (<td>) in the row
        data_cell = row.find('td')
        if header_cell and data_cell:
            # .get_text(strip=True) cleans up whitespace
            header = header_cell.get_text(strip=True)
            data = data_cell.get_text(strip=True)
            print(f"{header}: {data}")

Note: Beautiful Soup uses class_ (with a trailing underscore) for this keyword argument because class is a reserved keyword in Python.
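If you prefer to avoid the trailing underscore, you can pass attributes as a dictionary instead; the two infobox lookups below are equivalent (the id value in the last line is a made-up example):

# Equivalent ways to match on the 'class' attribute
infobox = soup.find('table', class_='infobox')
infobox = soup.find('table', attrs={'class': 'infobox'})

# Other attribute names work directly as keyword arguments
main_div = soup.find('div', id='main-content')  # hypothetical id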


Core Methods and Concepts

  • soup.find(tag, attrs, ...): finds the first matching element, e.g. soup.find('div', id='main-content')
  • soup.find_all(tag, attrs, ...): finds all matching elements and returns a list, e.g. soup.find_all('a', class_='link')
  • .get_text(): returns all the text inside an element and its children, e.g. my_element.get_text()
  • .text: similar to .get_text(), but accessed as a property, e.g. my_element.text
  • .string: returns the text inside an element only if it has a single string child; returns None otherwise, e.g. my_element.string
  • .parent: returns the parent of a tag, e.g. my_element.parent
  • .find_next_sibling(): finds the next sibling element, e.g. my_element.find_next_sibling('p')
  • .select(CSS_SELECTOR): finds elements using CSS selectors, e.g. soup.select('div.content p a')
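Here is a small, self-contained sketch exercising several of these methods on a toy snippet (the HTML and names are made up for illustration):

from bs4 import BeautifulSoup

html = """
<div class="content">
  <p>First <a href="/one">one</a></p>
  <p>Second <a href="/two">two</a></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')                   # first matching element
all_links = soup.find_all('a')             # list of all matches
print(first_p.get_text(' ', strip=True))   # -> First one
print(first_p.string)                      # -> None (two child nodes)
print(all_links[0].string)                 # -> one (single string child)
print(first_p.find_next_sibling('p').get_text(' ', strip=True))  # -> Second two
print(soup.select('div.content p a'))      # CSS selector: both <a> tags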

Navigating the Tree (The "Dot" Notation)

You can navigate the tree using dot notation for tags. Note that dot notation only ever returns the first matching descendant, and yields None when the tag is absent (so chained lookups can raise AttributeError).

# Find the <body> tag
body = soup.body
# Find the first <div> inside the body
first_div = body.div
# Find the first <p> inside that first <div>
first_paragraph = first_div.p
print(first_paragraph)

Handling Different Parsers

Beautiful Soup supports several parsers. The choice can affect speed and how it handles "broken" HTML.

  • lxml: very fast and lenient with broken HTML, but requires an external C library. Usage: BeautifulSoup(html, 'lxml')
  • html.parser: built into Python, no extra installation needed, but slower than lxml and less lenient. Usage: BeautifulSoup(html, 'html.parser')
  • html5lib: extremely lenient and parses pages exactly like a web browser, but very slow and requires an external library. Usage: BeautifulSoup(html, 'html5lib')

Recommendation: Use lxml if you can install it. It's the best balance of speed and robustness. Use html.parser as a fallback if you can't install anything else.
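To see the difference in leniency for yourself, feed the same broken fragment to each parser (the exact output varies by parser and installed version, so the sketch just prints whatever each one produces):

from bs4 import BeautifulSoup

broken = "<p>Unclosed paragraph<b>bold text"
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        print(parser, '->', BeautifulSoup(broken, parser))
    except Exception as e:  # raised if the parser isn't installed
        print(parser, 'unavailable:', e)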


A Complete, Runnable Example

Let's combine everything into a final script that scrapes the titles of the latest news from Hacker News.

import requests
from bs4 import BeautifulSoup
def get_hacker_news_titles():
    """
    Fetches the Hacker News homepage and prints the titles of the top 30 news links.
    """
    url = 'https://news.ycombinator.com'
    headers = {
        'User-Agent': 'My Web Scraper 1.0 (https://example.com/my-scraper)'
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching Hacker News: {e}")
        return
    soup = BeautifulSoup(response.text, 'lxml')
    # On Hacker News, story titles are in <span> tags with the class 'titleline'.
    # We use a CSS selector here, which is very powerful.
    title_spans = soup.select('span.titleline')
    print("--- Top Hacker News Stories ---")
    for i, span in enumerate(title_spans[:30]):  # Get the top 30
        # The actual title is in an <a> tag inside the <span>
        title_link = span.find('a')
        if title_link:
            title = title_link.get_text(strip=True)
            print(f"{i+1}. {title}")
if __name__ == "__main__":
    get_hacker_news_titles()

Important Considerations & Best Practices

  1. Be Respectful (Check robots.txt): Before scraping a site, check its robots.txt file (e.g., https://example.com/robots.txt). This file tells bots which parts of the site they are not allowed to access. Always respect these rules (see the robotparser sketch after this list).
  2. Set a User-Agent: Some websites block the default requests User-Agent because it is associated with simple scrapers. Set a custom one to make your scraper identifiable, as in the Hacker News example above.
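The standard library's urllib.robotparser can check robots.txt rules programmatically; a minimal sketch (the URL and user-agent string are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # fetches and parses robots.txt

# Can our bot fetch this page? (user-agent and URL are placeholders)
print(rp.can_fetch('MyScraperBot', 'https://example.com/some/page'))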