
How Can a Python Crawler Scrape Data Efficiently?

Of course! Building a web crawler (or spider) in Python is a fantastic project. I'll guide you through the entire process, from the basic concepts to a complete, production-ready code example.


We'll cover:

  1. Core Concepts: What a crawler does and the challenges it faces.
  2. Essential Libraries: The best tools for the job in Python.
  3. A Simple Crawler: A step-by-step guide to building a basic, single-threaded crawler.
  4. A Robust, Production-Ready Crawler: A more advanced, multi-threaded crawler with rate-limiting, politeness, and data extraction.
  5. Legal and Ethical Considerations: The most important part!

Core Concepts: What is a Web Crawler?

A web crawler is a program that automatically browses the web, typically for the purpose of web indexing (like Google or Bing) or web scraping (extracting data).

The basic workflow is a cycle (a minimal code sketch follows the list):

  1. Start with a list of URLs called the "frontier."
  2. Take a URL from the frontier.
  3. Fetch the content of that web page (the HTML).
  4. Parse the HTML to find new links and extract any desired data.
  5. Add the new, unique links back to the frontier.
  6. Save the extracted data.
  7. Repeat until the frontier is empty or a stopping condition is met.
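To make the cycle concrete, here is a minimal sketch of that loop in Python. `fetch_page` and `extract_links` are hypothetical callables you supply, not real library functions; they stand in for the requests / Beautiful Soup code shown later in this guide.

from collections import deque

def crawl(start_url, fetch_page, extract_links, max_pages=100):
    """Minimal sketch of the crawl cycle described above.

    `fetch_page` and `extract_links` are hypothetical placeholders for the
    fetching and parsing code shown later in this guide.
    """
    frontier = deque([start_url])                 # 1. seed the frontier
    visited = set()
    results = []
    while frontier and len(visited) < max_pages:  # 7. stopping condition
        url = frontier.popleft()                  # 2. take a URL
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)                    # 3. fetch the HTML
        links, data = extract_links(html, url)    # 4. parse links and data
        frontier.extend(l for l in links if l not in visited)  # 5. grow frontier
        results.append(data)                      # 6. save the extracted data
    return results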

Key Challenges:

  • Politeness: Don't overload a server with too many requests. This is crucial to avoid getting your IP address blocked.
  • Scalability: How do you handle millions of URLs? A simple list won't work.
  • Data Extraction: How do you reliably find the data you want within the messy structure of HTML?
  • Handling Dynamic Content: Modern websites often load content using JavaScript, which basic crawlers can't see (see the browser-automation sketch after this list).
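For the last challenge, the usual workaround is to drive a real browser so the JavaScript runs before you read the HTML. Here is a minimal sketch with Selenium, assuming `pip install selenium` and a Chrome installation; Playwright is a popular alternative.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without opening a visible window
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")     # JavaScript executes in the browser
    rendered_html = driver.page_source    # HTML *after* scripts have run
    print(len(rendered_html))
    # rendered_html can now be handed to Beautiful Soup as usual
finally:
    driver.quit()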

Essential Python Libraries

You don't need to reinvent the wheel. Here are the most popular and effective libraries:

  • requests — The standard for making HTTP requests; simple and powerful. Install with pip install requests.
  • Beautiful Soup — Parses HTML and XML into a tree you can navigate and search. Install with pip install beautifulsoup4.
  • lxml — A high-performance parser that can replace Beautiful Soup's default parser; it's much faster. Install with pip install lxml.
  • Scrapy — A full-fledged web crawling framework that handles asynchronous requests, data pipelines, scheduling, and more. It's overkill for simple tasks but excellent for large-scale projects. Install with pip install scrapy.

For this guide, we'll use requests for fetching and Beautiful Soup for parsing, as they are the perfect combination for learning and building most custom crawlers.
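For contrast, the Scrapy entry above refers to a framework where the crawler is a spider class and Scrapy handles scheduling and concurrency for you. A minimal sketch (quotes.toscrape.com is a public practice site, used here purely for illustration):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" pagination link, if present
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as, say, quotes_spider.py, this could be run with scrapy runspider quotes_spider.py -o quotes.json.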


A Simple Crawler: Step-by-Step

Let's build a crawler that starts on a Wikipedia page and follows all the links to other Wikipedia pages, printing the titles as it goes.

Step 1: Fetch and Parse a Page

First, let's fetch the content of a page and parse it.

import requests
from bs4 import BeautifulSoup
# The URL we want to start with
url = 'https://en.wikipedia.org/wiki/Web_crawler'
try:
    # 1. Fetch the content
    response = requests.get(url)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status() 
    # 2. Parse the HTML content
    soup = BeautifulSoup(response.text, 'lxml')
    # 3. Extract the page title
    title = soup.find('title').text
    print(f"Successfully fetched: {title}")
    # 4. Find all links on the page
    # We look for <a> tags with an 'href' attribute
    for link in soup.find_all('a', href=True):
        href = link['href']
        print(f"Found link: {href}")
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")

Step 2: Add Logic to Follow Links

Now, let's make it a real crawler by adding a "frontier" and a way to track visited URLs to avoid infinite loops.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
# --- Configuration ---
START_URL = 'https://en.wikipedia.org/wiki/Web_crawler'
MAX_PAGES = 10  # Limit the number of pages to crawl for this example
# --- Frontier and Visited Sets ---
frontier = [START_URL]
visited_urls = set()
# --- Crawler Loop ---
while frontier and len(visited_urls) < MAX_PAGES:
    # Get the next URL from the frontier
    current_url = frontier.pop(0)
    # Skip if we've already visited this URL
    if current_url in visited_urls:
        continue
    print(f"Crawling: {current_url}")
    visited_urls.add(current_url)
    try:
        response = requests.get(current_url, timeout=5)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all links that point to other Wikipedia pages
        for link in soup.find_all('a', href=True):
            href = link['href']
            # Convert relative URLs to absolute URLs
            absolute_url = urljoin(current_url, href)
            # Check if the link is another Wikipedia article
            # We use a simple domain check to keep it focused
            if 'en.wikipedia.org' in absolute_url and '/wiki/' in absolute_url:
                # Normalize the URL to avoid duplicates (e.g., /wiki/Page and /wiki/Page#section)
                parsed_url = urlparse(absolute_url)
                clean_url = parsed_url._replace(fragment="", params="").geturl()
                if clean_url not in visited_urls and clean_url not in frontier:
                    frontier.append(clean_url)
    except requests.exceptions.RequestException as e:
        print(f"Failed to crawl {current_url}: {e}")
print("\nCrawling finished.")
print(f"Total pages visited: {len(visited_urls)}")

This simple crawler has a major flaw: it's slow because it processes one URL at a time. For a real project, you'd use multi-threading.
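A smaller inefficiency is the frontier itself: frontier.pop(0) on a Python list is O(n), because every remaining element shifts. Swapping in collections.deque gives O(1) pops from the left and is a drop-in change for the loop above; a minimal sketch under that assumption:

from collections import deque

# Drop-in replacement for the list-based frontier:
# deque.popleft() is O(1), while list.pop(0) shifts every remaining element.
frontier = deque(['https://en.wikipedia.org/wiki/Web_crawler'])
visited_urls = set()

while frontier:
    current_url = frontier.popleft()   # was: frontier.pop(0)
    if current_url in visited_urls:
        continue
    visited_urls.add(current_url)
    print(f"Would crawl: {current_url}")
    # fetch, parse, and frontier.append(clean_url) exactly as in the loop above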


A Robust, Production-Ready Crawler

This version is more advanced. It uses a thread pool for concurrency, implements rate limiting, and separates data extraction from the crawling logic. (For brevity, the frontier list and visited set are still shared directly between threads, which is acceptable for a small demo; a truly production-grade crawler would coordinate them through a thread-safe structure such as queue.Queue.)

Key Features:

  • Concurrency: Uses concurrent.futures.ThreadPoolExecutor to crawl multiple pages simultaneously.
  • Rate Limiting: A time.sleep() call prevents overwhelming servers.
  • Politeness: Respects robots.txt (a file that tells crawlers which parts of a site they can't access).
  • Modular Code: Clear separation between the crawler and the data extractor.
  • Robust Error Handling: Catches and logs various network and parsing errors.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
import concurrent.futures
import threading
# --- Configuration ---
START_URL = 'https://en.wikipedia.org/wiki/Web_crawler'
MAX_THREADS = 5  # Number of concurrent threads
MAX_PAGES = 50  # Total number of pages to crawl
REQUEST_DELAY = 1  # Seconds to wait between requests (politeness)
OUTPUT_FILE = 'crawled_data.txt'
# --- Frontier and Visited Sets ---
frontier = [START_URL]
visited_urls = set()
# --- Data Storage ---
crawled_data = []
# --- Robots.txt Check ---
def is_allowed(url):
    try:
        # robots.txt is usually at the root of the domain
        base_url = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
        robots_url = urljoin(base_url, 'robots.txt')
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            # Very rough check: a bare "Disallow: /" line blocks the whole site.
            # A real crawler should use `urllib.robotparser` instead
            # (see the sketch after this script).
            for line in response.text.splitlines():
                if line.strip().lower() == 'disallow: /':
                    return False
        return True
    except requests.exceptions.RequestException:
        # If we can't fetch robots.txt, err on the side of caution and disallow
        return False
# --- Data Extraction Logic ---
def extract_data(soup, url):
    """Extracts the title and all paragraph text from a page."""tag = soup.find('title')= title_tag.text if title_tag else "No Title Found"
    paragraphs = []
    for p in soup.find_all('p'):
        # Clean up the text a bit
        text = p.get_text(strip=True)
        if text:
            paragraphs.append(text)
    return {
        'url': url,
        'title': title,
        'paragraphs': paragraphs
    }
# --- The Crawler Worker Function ---
def crawl_worker(url):
    """A single worker that fetches, parses, and extracts data from a URL."""
    if url in visited_urls:
        return
    print(f"Thread {os.getpid()}: Crawling {url}")
    visited_urls.add(url)
    try:
        # Be polite!
        time.sleep(REQUEST_DELAY)
        response = requests.get(url, timeout=10, headers={'User-Agent': 'MySimpleCrawler/1.0'})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')
        # 1. Extract data
        data = extract_data(soup, url)
        crawled_data.append(data)
        print(f"Thread {os.getpid()}: Extracted data from {data['title']}")
        # 2. Find new links
        for link in soup.find_all('a', href=True):
            href = link['href']
            absolute_url = urljoin(url, href)
            # Filter for Wikipedia articles
            if 'en.wikipedia.org' in absolute_url and '/wiki/' in absolute_url:
                parsed_url = urlparse(absolute_url)
                clean_url = parsed_url._replace(fragment="", params="").geturl()
                if clean_url not in visited_urls and clean_url not in frontier:
                    frontier.append(clean_url)
    except requests.exceptions.RequestException as e:
        print(f"Thread {os.getpid()}: Error crawling {url}: {e}")
    except Exception as e:
        print(f"Thread {os.getpid()}: An unexpected error occurred with {url}: {e}")
# --- Main Execution ---
if __name__ == "__main__":
    print("Starting web crawler...")
    # Check if we are allowed to crawl the start URL
    if not is_allowed(START_URL):
        print(f"Crawling {START_URL} is disallowed by robots.txt. Exiting.")
    else:
        with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
            while frontier and len(visited_urls) < MAX_PAGES:
                # Submit a batch of URLs to the thread pool
                urls_to_crawl = []
                for _ in range(min(MAX_THREADS, len(frontier))):
                    if frontier:
                        urls_to_crawl.append(frontier.pop(0))
                # Run the batch and wait for it to finish before refilling the frontier
                # (wrapping in list() forces the lazy map iterator to complete)
                list(executor.map(crawl_worker, urls_to_crawl))
                # Optional: Add a small delay between batches
                # time.sleep(0.1)
    print("\nCrawling finished.")
    print(f"Total pages visited: {len(visited_urls)}")
    # Save the extracted data
    with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
        for data in crawled_data:
            f.write(f"URL: {data['url']}\n")
            f.write(f"Title: {data['title']}\n")
            f.write("Content:\n")
            f.write("\n".join(data['paragraphs']))
            f.write("\n" + "="*80 + "\n")
    print(f"Data saved to {OUTPUT_FILE}")

Legal and Ethical Considerations (CRITICAL!)

Before you crawl any website, you must follow these rules to avoid legal trouble and be a good internet citizen.

  1. Check robots.txt: Every website should have a file at www.example.com/robots.txt. This file tells you which parts of the site the owner does not want you to crawl. Always respect it. Our robust crawler has a basic check for this.

  2. Read the Terms of Service (ToS): The website's ToS will almost certainly have a clause about automated access. Violating it can get you banned or lead to legal action.

  3. Be Polite:

    • Identify Your Bot: Use a descriptive User-Agent string in your request headers (e.g., MyCoolProject-Crawler/1.0; +http://mywebsite.com). This allows site administrators to contact you if there's a problem.
    • Limit Your Request Rate: Do not send requests too quickly. Our crawler uses time.sleep() for this. A good rule of thumb is one request per second, but check the site's robots.txt for specific Crawl-delay instructions.
    • Don't Hit Sensitive Endpoints: Avoid logging in, submitting forms, or accessing pages that are clearly for user interaction.
  4. Don't Re-scrape Unnecessarily: If you only need data that doesn't change often, download it once and store it (a small caching sketch follows this list). Don't re-crawl the same site every day.

  5. Respect Copyright: The data you scrape is likely owned by someone. Use it responsibly and don't republish it in a way that infringes on copyright.
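Following on from point 4, a tiny on-disk cache keyed by the URL avoids repeat downloads. A minimal sketch using only the standard library plus requests (the page_cache/ directory name is just an example):

import hashlib
import os

import requests

CACHE_DIR = 'page_cache'  # example directory name

def fetch_cached(url):
    """Return the page HTML, downloading it only if we haven't already."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Hash the URL into a safe, fixed-length filename
    key = hashlib.sha256(url.encode('utf-8')).hexdigest()
    path = os.path.join(CACHE_DIR, key + '.html')
    if os.path.exists(path):
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    response = requests.get(url, timeout=10,
                            headers={'User-Agent': 'MySimpleCrawler/1.0'})
    response.raise_for_status()
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text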

For large-scale, commercial crawling, consider using official APIs if they are available, as they are the most reliable and legal way to get data.
