How Can Python's CrawlSpider Scrape Web Data Efficiently?

Let's take a deep dive into Scrapy's CrawlSpider and how to use it to crawl a site efficiently.

What is a CrawlSpider?

A CrawlSpider is a high-level web scraping spider class provided by the Scrapy framework. It's designed specifically for crawling websites where you want to follow links from page to page to discover new items to scrape.

Think of it as a powerful, rule-based web crawler. Instead of manually writing code to parse a page, find all the links, and decide which ones to follow, you define rules for the CrawlSpider, and it handles the entire process for you.


Key Differences: CrawlSpider vs. Spider

| Feature | scrapy.Spider (Base Spider) | scrapy.spiders.CrawlSpider (Rule-based Spider) |
| --- | --- | --- |
| Primary Use Case | Scraping a single page or a list of known URLs. | Crawling an entire site, or a large section of it, by following links. |
| Link Discovery | Manual. You must explicitly write code to find and yield new requests in your parse method. | Automatic. You define rules with LinkExtractors to find links. |
| Code Structure | Simpler, more direct. You only have a parse method. | More structured. You define item-parsing callbacks (e.g. parse_item) plus separate rules for link following. |
| Flexibility | More flexible for complex, one-off scraping tasks. | Less flexible for complex logic, but much more efficient for standard crawling. |
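
To make the "manual vs. automatic" distinction concrete, here is a minimal base-Spider sketch that follows pagination by hand with response.follow. It is only a sketch: the selectors assume the books.toscrape.com markup used later in this article, and the spider name is made up. A CrawlSpider replaces this manual bookkeeping with declarative rules.

import scrapy

class ManualBooksSpider(scrapy.Spider):
    """Base Spider: link discovery is written by hand inside parse()."""
    name = 'manual_books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Scrape something from the current listing page.
        for title in response.css('article.product_pod h3 a::attr(title)').getall():
            yield {'title': title}

        # Then explicitly find and follow the "next" pagination link ourselves.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)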

Core Components of a CrawlSpider

  1. CrawlSpider Class: The main class you inherit from.
  2. rules Attribute: A list of Rule objects. This is the heart of the spider. Each rule defines how to behave when a link is found.
  3. Rule Class: Defines a specific crawling behavior.
  4. LinkExtractor Class: Used inside a Rule to find links on a page, based on regex patterns (allow/deny) and/or CSS or XPath restrictions.
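
In code, all four components come from two imports. The skeleton below is a minimal, hypothetical example (the spider name, start URL, and allow pattern are placeholders) showing how they fit together:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkeletonSpider(CrawlSpider):              # 1. inherit from CrawlSpider
    name = 'skeleton'
    start_urls = ['http://example.com/']

    rules = (                                   # 2. rules: a tuple of Rule objects
        Rule(                                   # 3. each Rule defines one crawling behaviour
            LinkExtractor(allow=r'/detail/'),   # 4. LinkExtractor finds matching links
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        yield {'url': response.url}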

The Rule Object: The Brain of the Crawler

A Rule object tells the spider what to do when it encounters a link matching certain criteria. It has several important parameters:

  • link_extractor=LinkExtractor(...): (Required) This is the engine that finds all the links on a page.
    • allow: A list of regex patterns. Only links matching these patterns will be followed. This is the most common parameter.
    • deny: A list of regex patterns. Links matching these will be ignored.
    • restrict_xpaths: An XPath selector. Links will only be extracted from the parts of the page matching this XPath. This is very efficient.
    • restrict_css: A CSS selector. Similar to restrict_xpaths.
  • callback: The name of the spider method that will parse the response of each followed link. Important: never name this callback parse (use something like parse_item instead), because CrawlSpider uses the parse method internally to implement its rule-following logic.
  • follow: A boolean (True or False). If True, the spider keeps applying the rules to the pages matched by this rule, so it can discover further links there. If False, it does not. The default is True when no callback is given, and False when a callback is provided.
  • process_links: A method name that will be called with the list of extracted links, before they are scheduled for crawling. Useful for filtering links further.
  • process_request: A method name that will be called with each Request before it's scheduled. Useful for modifying requests (e.g., adding headers).
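
Putting these parameters together, a single Rule might look like the sketch below. The URL patterns, the filter_links and add_header helpers, and the header value are hypothetical; they only show where each parameter plugs in. Note that in recent Scrapy versions process_request receives both the request and the response it originated from.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RuleDemoSpider(CrawlSpider):
    name = 'rule_demo'
    start_urls = ['http://example.com/']

    rules = (
        Rule(
            LinkExtractor(
                allow=r'/product/\d+',         # only extract product URLs (regex)
                deny=r'/product/\d+/reviews',  # ...but skip their review sub-pages
                restrict_css='div.listing',    # only search for links inside this element
            ),
            callback='parse_item',             # must not be named 'parse'
            follow=False,                      # the default when a callback is set
            process_links='filter_links',      # post-filter the extracted links
            process_request='add_header',      # tweak each Request before scheduling
        ),
    )

    def filter_links(self, links):
        # Hypothetical filter: drop links that carry a query string.
        return [link for link in links if '?' not in link.url]

    def add_header(self, request, response):
        # Hypothetical tweak: attach a custom header to every scheduled request.
        request.headers['X-Example'] = 'demo'
        return request

    def parse_item(self, response):
        yield {'url': response.url}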

Step-by-Step Example: Crawling a Bookstore

Let's create a CrawlSpider to scrape book data from the http://books.toscrape.com/ website. This site is designed for scraping practice.

Goal:

  1. Start on the main page (http://books.toscrape.com/).
  2. Follow the "Next" button to go to subsequent pages.
  3. On each page, find all the book links and go to each book's detail page.
  4. On the detail page, scrape the book title, price, and availability.

Step 1: Create the Scrapy Project

If you haven't already, create a new Scrapy project.

scrapy startproject book_crawler
cd book_crawler
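
The command creates a standard Scrapy project skeleton, roughly like the following (the exact set of files can vary slightly between Scrapy versions):

book_crawler/
├── scrapy.cfg                # deploy configuration
└── book_crawler/
    ├── __init__.py
    ├── items.py              # item definitions (we will add BookItem here)
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py           # project settings (DOWNLOAD_DELAY, etc.)
    └── spiders/
        └── __init__.py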

Step 2: Generate the Spider

Use the Scrapy command to generate a new spider. We'll call it books.

scrapy genspider -t crawl books books.toscrape.com
  • -t crawl: This tells Scrapy to generate a CrawlSpider template.
  • books: The name of our spider.
  • books.toscrape.com: The domain the spider is allowed to crawl.

This will create a file book_crawler/spiders/books.py with a basic CrawlSpider template.
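
Before you edit it, the generated books.py contains a stub roughly like the one below; the exact template text differs between Scrapy versions, so treat this as an approximation rather than the literal output:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item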

Step 3: Define the Rules and Logic

Now, edit book_crawler/spiders/books.py. We will define two rules:

  1. Rule 1: Follow the "Next" button to paginate through the list of books.
  2. Rule 2: For each book link found on the list page, go to its detail page and parse the book's information.

Here is the complete, commented code for books.py:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# We need an item to store our scraped data
# Let's define it in items.py first
# book_crawler/items.py:
# class BookItem(scrapy.Item):
#     title = scrapy.Field()
#     price = scrapy.Field()
#     availability = scrapy.Field()
from book_crawler.items import BookItem
class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    # --- Define the crawling rules ---
    rules = (
        # Rule 1: Follow the "Next" button for pagination.
        # - LinkExtractor only looks inside the <li class="next"> element for the pagination link.
        # - No callback: the response is only used to apply the rules again and discover more links.
        # - 'follow=True' (the default when no callback is given) keeps the pagination going.
        Rule(
            LinkExtractor(restrict_xpaths='//li[@class="next"]/a'),
            follow=True
        ),
        # Rule 2: Go to each book's detail page and parse the item.
        # - LinkExtractor finds all links to book detail pages.
        # - The callback 'parse_item' will be called for each book page.
        # - 'follow=False' (default) means we won't look for more links on the book detail page.
        Rule(
            LinkExtractor(restrict_xpaths='//article[@class="product_pod"]/h3/a'),
            callback='parse_item',
            follow=False
        ),
    )
    # --- The callback method for parsing book detail pages ---
    def parse_item(self, response):
        """This method is called for every book detail page."""
        self.logger.info(f'Parsing book page: {response.url}')
        # Create a BookItem instance
        item = BookItem()
        # Extract data using XPath selectors
        item['title'] = response.xpath('//h1/text()').get()
        item['price'] = response.xpath('//p[@class="price_color"]/text()').get()
        # The first text node inside this <p> is just whitespace (an <i> icon comes first),
        # so join all text nodes and strip, rather than calling .get() on the first one.
        item['availability'] = ''.join(
            response.xpath('//p[@class="instock availability"]/text()').getall()
        ).strip()
        yield item

Step 4: Define the Item

Before running, make sure you've defined the BookItem in book_crawler/items.py.

book_crawler/items.py:

import scrapy
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()

Step 5: Run the Spider

You can now run the spider from your project's root directory.

scrapy crawl books

You will see the spider start on the first page, extract all book links, follow them to parse the details, and then follow the "Next" button to repeat the process until it reaches the last page.

To save the output to a JSON file, use the -o flag:

scrapy crawl books -o books.json

This will create a books.json file with all the scraped book data.
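
If you would rather configure the export once instead of passing -o on every run, recent Scrapy versions (2.1 and later) also support a FEEDS setting in settings.py. A minimal sketch, assuming the project layout created above:

# book_crawler/settings.py (excerpt)
FEEDS = {
    'books.json': {
        'format': 'json',
        'overwrite': True,   # needs a reasonably recent Scrapy; replaces the file on each run
    },
}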


Best Practices and Common Pitfalls

  1. Don't Override parse: CrawlSpider implements its rule-following logic in its own parse method. If you define a parse method yourself, that logic is replaced and the rules stop working, so always use a different name for your item-parsing callback (like parse_item).
  2. Use restrict_xpaths or restrict_css: Relying only on allow=r'...' means the LinkExtractor has to examine every link on the page, and you risk matching links you did not intend. Narrowing the search area with restrict_xpaths or restrict_css is faster and far less error-prone.
  3. Be Respectful: Always check a website's robots.txt file (e.g., http://books.toscrape.com/robots.txt) and respect its rules. Set a reasonable DOWNLOAD_DELAY in your settings.py so you don't overwhelm the server (see the settings sketch after this list).
  4. Handle Relative URLs: The LinkExtractor automatically converts relative URLs (like /catalogue/page-2.html) to absolute URLs (like http://books.toscrape.com/catalogue/page-2.html). You don't need to handle this manually.
  5. Debugging: If a rule isn't working as expected, use the scrapy shell to test your XPath and CSS selectors interactively.
    scrapy shell "http://books.toscrape.com/"

    Then you can test your selectors in the shell:

    # In the shell
    response.xpath('//li[@class="next"]/a')
    response.css('li.next a')
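
For point 3 (Be Respectful), the politeness knobs live in settings.py. The setting names below are standard Scrapy settings; the values are only illustrative starting points, not recommendations from the original tutorial:

# book_crawler/settings.py (excerpt)
ROBOTSTXT_OBEY = True               # respect robots.txt (enabled by default in new projects)
DOWNLOAD_DELAY = 1.0                # wait about one second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # limit parallelism per domain

# Optionally let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0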