How Can Python's CrawlSpider Scrape Web Data Efficiently?

Let's take a deep dive into Scrapy's CrawlSpider and how to use it to crawl a site efficiently.

What is a CrawlSpider?

A CrawlSpider is a high-level web scraping spider class provided by the Scrapy framework. It's designed specifically for crawling websites where you want to follow links from page to page to discover new items to scrape.

Think of it as a powerful, rule-based web crawler. Instead of manually writing code to parse a page, find all the links, and decide which ones to follow, you define rules for the CrawlSpider, and it handles the entire process for you.


Key Differences: CrawlSpider vs. Spider

| Feature | scrapy.Spider (Base Spider) | scrapy.spiders.CrawlSpider (Rule-based Spider) |
| --- | --- | --- |
| Primary Use Case | Scraping a single page or a list of known URLs. | Crawling an entire site, or a large section of it, by following links. |
| Link Discovery | Manual. You must explicitly write code to find and yield new requests in your parse method. | Automatic. You define rules with LinkExtractors to find links. |
| Code Structure | Simpler, more direct. You only have a parse method. | More structured. You define item-parsing callbacks (e.g. parse_item) plus separate rules for link following. |
| Flexibility | More flexible for complex, one-off scraping tasks. | Less flexible for complex logic, but much more efficient for standard crawling. |
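
To make the "manual vs. automatic" distinction concrete, here is a minimal base-Spider sketch that follows pagination by hand with response.follow. It is only a sketch: the selectors assume the books.toscrape.com markup used later in this article, and the spider name is made up. A CrawlSpider replaces this manual bookkeeping with declarative rules.

import scrapy

class ManualBooksSpider(scrapy.Spider):
    """Base Spider: link discovery is written by hand inside parse()."""
    name = 'manual_books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Scrape something from the current listing page.
        for title in response.css('article.product_pod h3 a::attr(title)').getall():
            yield {'title': title}

        # Then explicitly find and follow the "next" pagination link ourselves.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)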

Core Components of a CrawlSpider

  1. CrawlSpider Class: The main class you inherit from.
  2. rules Attribute: A list of Rule objects. This is the heart of the spider. Each rule defines how to behave when a link is found.
  3. Rule Class: Defines a specific crawling behavior.
  4. LinkExtractor Class: Used inside a Rule to find links on a page, based on regex patterns (allow/deny) and/or CSS or XPath restrictions.
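
In code, all four components come from two imports. The skeleton below is a minimal, hypothetical example (the spider name, start URL, and allow pattern are placeholders) showing how they fit together:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkeletonSpider(CrawlSpider):              # 1. inherit from CrawlSpider
    name = 'skeleton'
    start_urls = ['http://example.com/']

    rules = (                                   # 2. rules: a tuple of Rule objects
        Rule(                                   # 3. each Rule defines one crawling behaviour
            LinkExtractor(allow=r'/detail/'),   # 4. LinkExtractor finds matching links
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        yield {'url': response.url}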

The Rule Object: The Brain of the Crawler

A Rule object tells the spider what to do when it encounters a link matching certain criteria. It has several important parameters:

  • link_extractor=LinkExtractor(...): (Required) This is the engine that finds all the links on a page.
    • allow: A list of regex patterns. Only links matching these patterns will be followed. This is the most common parameter.
    • deny: A list of regex patterns. Links matching these will be ignored.
    • restrict_xpaths: An XPath selector. Links will only be extracted from the parts of the page matching this XPath. This is very efficient.
    • restrict_css: A CSS selector. Similar to restrict_xpaths.
  • callback: The name of the spider method that will parse the response of each followed link. Important: never name this callback parse (use something like parse_item instead), because CrawlSpider uses the parse method internally to implement its rule-following logic.
  • follow: A boolean (True or False). If True, the spider keeps applying the rules to the pages matched by this rule, so it can discover further links there. If False, it does not. The default is True when no callback is given, and False when a callback is provided.
  • process_links: A method name that will be called with the list of extracted links, before they are scheduled for crawling. Useful for filtering links further.
  • process_request: A method name that will be called with each Request before it's scheduled. Useful for modifying requests (e.g., adding headers).
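
Putting these parameters together, a single Rule might look like the sketch below. The URL patterns, the filter_links and add_header helpers, and the header value are hypothetical; they only show where each parameter plugs in. Note that in recent Scrapy versions process_request receives both the request and the response it originated from.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RuleDemoSpider(CrawlSpider):
    name = 'rule_demo'
    start_urls = ['http://example.com/']

    rules = (
        Rule(
            LinkExtractor(
                allow=r'/product/\d+',         # only extract product URLs (regex)
                deny=r'/product/\d+/reviews',  # ...but skip their review sub-pages
                restrict_css='div.listing',    # only search for links inside this element
            ),
            callback='parse_item',             # must not be named 'parse'
            follow=False,                      # the default when a callback is set
            process_links='filter_links',      # post-filter the extracted links
            process_request='add_header',      # tweak each Request before scheduling
        ),
    )

    def filter_links(self, links):
        # Hypothetical filter: drop links that carry a query string.
        return [link for link in links if '?' not in link.url]

    def add_header(self, request, response):
        # Hypothetical tweak: attach a custom header to every scheduled request.
        request.headers['X-Example'] = 'demo'
        return request

    def parse_item(self, response):
        yield {'url': response.url}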

Step-by-Step Example: Crawling a Bookstore

Let's create a CrawlSpider to scrape book data from the http://books.toscrape.com/ website. This site is designed for scraping practice.

Goal:

  1. Start on the main page (http://books.toscrape.com/).
  2. Follow the "Next" button to go to subsequent pages.
  3. On each page, find all the book links and go to each book's detail page.
  4. On the detail page, scrape the book title, price, and availability.

Step 1: Create the Scrapy Project

If you haven't already, create a new Scrapy project.

scrapy startproject book_crawler
cd book_crawler
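
The command creates a standard Scrapy project skeleton, roughly like the following (the exact set of files can vary slightly between Scrapy versions):

book_crawler/
├── scrapy.cfg                # deploy configuration
└── book_crawler/
    ├── __init__.py
    ├── items.py              # item definitions (we will add BookItem here)
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py           # project settings (DOWNLOAD_DELAY, etc.)
    └── spiders/
        └── __init__.py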

Step 2: Generate the Spider

Use the Scrapy command to generate a new spider. We'll call it books.

scrapy genspider -t crawl books books.toscrape.com
  • -t crawl: This tells Scrapy to generate a CrawlSpider template.
  • books: The name of our spider.
  • books.toscrape.com: The domain the spider is allowed to crawl.

This will create a file book_crawler/spiders/books.py with a basic CrawlSpider template.
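
Before you edit it, the generated books.py contains a stub roughly like the one below; the exact template text differs between Scrapy versions, so treat this as an approximation rather than the literal output:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item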

Step 3: Define the Rules and Logic

Now, edit book_crawler/spiders/books.py. We will define two rules:

  1. Rule 1: Follow the "Next" button to paginate through the list of books.
  2. Rule 2: For each book link found on the list page, go to its detail page and parse the book's information.

Here is the complete, commented code for books.py:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# We need an item to store our scraped data
# Let's define it in items.py first
# book_crawler/items.py:
# class BookItem(scrapy.Item):
#     title = scrapy.Field()
#     price = scrapy.Field()
#     availability = scrapy.Field()
from book_crawler.items import BookItem
class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    # --- Define the crawling rules ---
    rules = (
        # Rule 1: Follow the "Next" button for pagination.
        # - LinkExtractor only looks inside the <li class="next"> element for the pagination link.
        # - No callback: the response is only used to apply the rules again and discover more links.
        # - 'follow=True' (the default when no callback is given) keeps the pagination going.
        Rule(
            LinkExtractor(restrict_xpaths='//li[@class="next"]/a'),
            follow=True
        ),
        # Rule 2: Go to each book's detail page and parse the item.
        # - LinkExtractor finds all links to book detail pages.
        # - The callback 'parse_item' will be called for each book page.
        # - 'follow=False' (default) means we won't look for more links on the book detail page.
        Rule(
            LinkExtractor(restrict_xpaths='//article[@class="product_pod"]/h3/a'),
            callback='parse_item',
            follow=False
        ),
    )
    # --- The callback method for parsing book detail pages ---
    def parse_item(self, response):
        """This method is called for every book detail page."""
        self.logger.info(f'Parsing book page: {response.url}')
        # Create a BookItem instance
        item = BookItem()
        # Extract data using XPath selectors
        item['title'] = response.xpath('//h1/text()').get()
        item['price'] = response.xpath('//p[@class="price_color"]/text()').get()
        # The first text node inside this <p> is just whitespace (an <i> icon comes first),
        # so join all text nodes and strip, rather than calling .get() on the first one.
        item['availability'] = ''.join(
            response.xpath('//p[@class="instock availability"]/text()').getall()
        ).strip()
        yield item

Step 4: Define the Item

Before running, make sure you've defined the BookItem in book_crawler/items.py.

book_crawler/items.py:

import scrapy
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()

Step 5: Run the Spider

You can now run the spider from your project's root directory.

scrapy crawl books

You will see the spider start on the first page, extract all book links, follow them to parse the details, and then follow the "Next" button to repeat the process until it reaches the last page.

To save the output to a JSON file, use the -o flag:

scrapy crawl books -o books.json

This will create a books.json file with all the scraped book data.
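
If you would rather configure the export once instead of passing -o on every run, recent Scrapy versions (2.1 and later) also support a FEEDS setting in settings.py. A minimal sketch, assuming the project layout created above:

# book_crawler/settings.py (excerpt)
FEEDS = {
    'books.json': {
        'format': 'json',
        'overwrite': True,   # needs a reasonably recent Scrapy; replaces the file on each run
    },
}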


Best Practices and Common Pitfalls

  1. Don't Override parse: CrawlSpider implements its rule-following logic in its own parse method. If you define a parse method yourself, that logic is replaced and the rules stop working, so always use a different name for your item-parsing callback (like parse_item).
  2. Use restrict_xpaths or restrict_css: Relying only on allow=r'...' means the LinkExtractor has to examine every link on the page, and you risk matching links you did not intend. Narrowing the search area with restrict_xpaths or restrict_css is faster and far less error-prone.
  3. Be Respectful: Always check a website's robots.txt file (e.g., http://books.toscrape.com/robots.txt) and respect its rules. Set a reasonable DOWNLOAD_DELAY in your settings.py so you don't overwhelm the server (see the settings sketch after this list).
  4. Handle Relative URLs: The LinkExtractor automatically converts relative URLs (like /catalogue/page-2.html) to absolute URLs (like http://books.toscrape.com/catalogue/page-2.html). You don't need to handle this manually.
  5. Debugging: If a rule isn't working as expected, use the scrapy shell to test your XPath and CSS selectors interactively.
    scrapy shell "http://books.toscrape.com/"

    Then you can test your selectors in the shell:

    # In the shell
    response.xpath('//li[@class="next"]/a')
    response.css('li.next a')
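
For point 3 (Be Respectful), the politeness knobs live in settings.py. The setting names below are standard Scrapy settings; the values are only illustrative starting points, not recommendations from the original tutorial:

# book_crawler/settings.py (excerpt)
ROBOTSTXT_OBEY = True               # respect robots.txt (enabled by default in new projects)
DOWNLOAD_DELAY = 1.0                # wait about one second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # limit parallelism per domain

# Optionally let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0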