
How to Configure Scrapy Rules Flexibly for Precise Crawling?

Scrapy's rules are a powerful feature for defining how a spider crawls a website. This guide covers what rules are, how to configure them precisely, and the pitfalls to avoid.


What are Scrapy Rules?

In Scrapy, rules are a list of objects that define a spider's crawling behavior. They are part of the CrawlSpider class, which is a more advanced spider designed for sites that follow a common pattern (e.g., a list of links on a page leads to detail pages).

A Rule tells your spider:

  1. How to find links: Which links on a page should be followed?
  2. What to do with the pages those links lead to: Should you parse the content of those pages using a specific method?

Instead of writing complex logic inside the parse method to find and follow links, you can declaratively define this behavior using rules. This makes your spider cleaner, more maintainable, and less prone to errors.
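For contrast, here is a rough sketch of the manual approach that rules replace (the selectors and URL check below are placeholders): with the base Spider class you have to locate, filter, and follow links by hand inside your callbacks.

import scrapy
class ManualSpider(scrapy.Spider):
    """Manual link-following without rules (placeholder selectors)."""
    name = 'manual_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category']
    def parse(self, response):
        # Find candidate links by hand...
        for href in response.css('div.listing a::attr(href)').getall():
            # ...filter them by hand...
            if '/products/' in href:
                # ...and schedule each one with an explicit callback.
                yield response.follow(href, callback=self.parse_product)
    def parse_product(self, response):
        yield {'url': response.url}

With CrawlSpider and rules, the link finding, filtering, and scheduling above collapse into a single declarative Rule.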


The CrawlSpider Class

To use rules, your spider must inherit from scrapy.spiders.CrawlSpider.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
    name = 'my_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    # Rules will be defined here
    rules = (
        # ... rule objects go here ...
    )
    # You can still define other parsing methods
    def parse_start_url(self, response):
        # Optional method to parse the start_url(s)
        pass

The Rule Object

A Rule is an instance of the scrapy.spiders.Rule class. Its constructor takes several arguments to define its behavior.

Key Arguments for a Rule:

  • link_extractor (required): An instance of LinkExtractor that defines which links the rule should match and follow. This is the "how to find links" part. Example: LinkExtractor(allow=r'/products/.*')
  • callback: The spider method used to parse the response of each followed link. This is the "what to do" part. Example: callback='parse_product'
  • follow: A boolean. If True, links are also extracted from the responses matched by this rule and the crawl continues from them; if False, they are not. It defaults to True when callback is None and to False otherwise. Example: follow=True
  • process_links: A method that filters or modifies the list of links extracted by the link_extractor before they are scheduled for crawling. Example: process_links='filter_product_links'
  • process_request: A method that processes each Request generated by the rule before it is scheduled. Useful for modifying headers, request.meta, or setting dont_filter. Example: process_request='add_custom_header'
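
Putting these arguments together, a single Rule might look like the sketch below (the URL pattern and method names are placeholders; the two optional hooks are demonstrated in Example 3 further down). In a real spider this object would sit inside the rules tuple of a CrawlSpider.

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
# This Rule would normally live inside a CrawlSpider's rules tuple.
product_rule = Rule(
    # link_extractor: which links to pick up on each page
    LinkExtractor(allow=r'/products/.*'),
    # callback: the spider method that parses each followed page
    callback='parse_product',
    # follow: don't keep extracting links once a product page is reached
    follow=False,
    # optional hooks, shown in Example 3 below
    process_links='filter_product_links',
    process_request='add_custom_header',
)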

The LinkExtractor

The LinkExtractor is the engine that finds links. It's the most critical part of a Rule.

You import it like this: from scrapy.linkextractors import LinkExtractor

Key Arguments for LinkExtractor:

  • allow: A regular expression (or list of regexes). Only links matching this pattern are extracted. This is the most common argument. Example: allow=(r'/item-\d+\.html', r'/category/\d+')
  • deny: A regular expression (or list of regexes). Links matching this pattern are ignored, even if they also match allow. Example: deny=r'/logout'
  • allow_domains: A list of domains. Only links from these domains are extracted. Example: allow_domains=['example.com', 'shop.example.com']
  • deny_domains: A list of domains. Links from these domains are ignored. Example: deny_domains=['ads.example.com']
  • restrict_xpaths: An XPath expression. Links are only extracted from the parts of the page that match this XPath, which is very efficient for narrowing down the search area. Example: restrict_xpaths='//div[@class="pagination"]//a'
  • tags: A tuple of tags to consider for link extraction (default: ('a', 'area')). Example: tags=('a', 'link')
  • attrs: A tuple of attributes to consider (default: ('href',)). Example: attrs=('href', 'src')
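
A LinkExtractor can also be used on its own, which is a handy way to test your patterns before wiring them into a Rule. The sketch below (with placeholder patterns) can be pasted into scrapy shell <url>, where response is already defined; extract_links() returns a list of Link objects with .url and .text attributes.

from scrapy.linkextractors import LinkExtractor
# Placeholder patterns for illustration; adjust them to the target site.
le = LinkExtractor(
    allow=r'/item-\d+\.html',                # keep only item detail pages
    deny=r'/logout',                         # never touch the logout link
    restrict_xpaths='//div[@id="content"]',  # search only the main content area
)
# In scrapy shell, `response` is the downloaded page; in a spider callback
# you would pass the callback's `response` instead.
for link in le.extract_links(response):
    print(link.url, link.text)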

Practical Examples

Let's build a spider for a hypothetical e-commerce site. The site has:

  1. A list of products on a category page.
  2. Individual product detail pages.

Example 1: Simple Rule to Follow Product Links

This spider starts on a category page, finds all links to product detail pages, and parses them.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class ProductSpider(CrawlSpider):
    name = 'product_spider'
    allowed_domains = ['shop.example.com']
    start_urls = ['http://shop.example.com/electronics']
    # Rule: Find all links that look like a product page and follow them.
    # When a product page is reached, call the parse_product method.
    rules = (
        Rule(
            LinkExtractor(allow=r'/products/[a-z0-9-]+\.html'),
            callback='parse_product',
            follow=False  # Don't follow links on the product page itself
        ),
    )
    def parse_product(self, response):
        """Parses a single product page."""
        print(f"Parsing product page: {response.url}")
        product_name = response.css('h1.product-title::text').get()
        price = response.css('span.price::text').get()
        yield {
            'product_name': product_name,
            'price': price,
            'url': response.url
        }
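
Assuming this spider lives inside a Scrapy project, you can run it and export the scraped items with a command such as scrapy crawl product_spider -o products.json.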

Example 2: Following Pagination and Product Links

This spider is more advanced. It starts on a category page, follows links to other category pages (pagination) and also to product detail pages.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class AdvancedProductSpider(CrawlSpider):
    name = 'advanced_product_spider'
    allowed_domains = ['shop.example.com']
    start_urls = ['http://shop.example.com/electronics']
    rules = (
        # Rule 1: Follow pagination links (e.g., "Next Page")
        # We use 'follow=True' because we want to continue applying rules
        # to the newly loaded category page.
        Rule(
            LinkExtractor(restrict_xpaths='//a[contains(text(), "Next Page")]'),
            follow=True
        ),
        # Rule 2: Find and follow product links.
        # We use 'follow=False' because we don't want to follow links
        # from the product detail page.
        Rule(
            LinkExtractor(allow=r'/products/[a-z0-9-]+\.html'),
            callback='parse_product',
            follow=False
        ),
    )
    def parse_product(self, response):
        """Parses a single product page."""
        print(f"Parsing product page: {response.url}")
        product_name = response.css('h1.product-title::text').get()
        price = response.css('span.price::text').get()
        yield {
            'product_name': product_name,
            'price': price,
            'url': response.url
        }
    def parse_start_url(self, response):
        """
        Called for the responses to the start_urls. Note that we deliberately
        do NOT define a method named parse here: CrawlSpider uses parse
        internally to drive the rules (see the Best Practices section below).
        """
        print(f"Processing start page: {response.url}")
        # The rules are applied automatically, so there is no need to
        # extract or follow links manually here.

Example 3: Using process_links and process_request

Let's say you want to filter out duplicate product links or add a custom header to every request.

class CustomRuleSpider(CrawlSpider):
    name = 'custom_rule_spider'
    allowed_domains = ['shop.example.com']
    start_urls = ['http://shop.example.com/electronics']
    rules = (
        Rule(
            LinkExtractor(allow=r'/products/[a-z0-9-]+\.html'),
            callback='parse_product',
            follow=False,
            # Filter links before they are scheduled
            process_links='filter_links',
            # Process requests before they are sent
            process_request='process_request_callback'
        ),
    )
    def filter_links(self, links):
        """Removes duplicate links from the extracted list."""
        seen = set()
        for link in links:
            if link.url not in seen:
                seen.add(link.url)
                yield link
    def process_request_callback(self, request, response):
        """Adds a custom header to every request.
        Since Scrapy 2.0, process_request also receives the response
        the request originated from as a second argument.
        """
        request.headers['X-Custom-Header'] = 'MyValue'
        return request
    def parse_product(self, response):
        # ... same as before ...
        pass

Best Practices and Common Pitfalls

  1. CrawlSpider vs. Spider:

    • Use CrawlSpider when the site structure is predictable and you can define rules.
    • Use the base Spider class for more complex, "script-like" crawling logic where you need fine-grained control over request generation and parsing flow (e.g., sites that require JavaScript rendering for navigation, or sites with very unique, non-standard link structures).
  2. Callback Naming: It's a good practice to name your callback methods descriptively (e.g., parse_product, parse_category, parse_list).

  3. follow=True can be dangerous: If you set follow=True on a rule that extracts a lot of links (like a site-wide footer or navigation bar), you can end up crawling the entire website, which might be against the robots.txt or the site owner's wishes. Be specific with your LinkExtractor to avoid this.

  4. Order of Rules Matters: Scrapy applies rules in the order they are defined in the rules tuple. If a link matches multiple rules, only the first matching rule in the list is used, so a very general rule placed first can catch links that a more specific rule later in the list was intended for. See the sketch after this list.

  5. parse Method in CrawlSpider: You should not name your main parsing method parse if you are using CrawlSpider and rules. The CrawlSpider uses parse internally to orchestrate the rule-based crawling. If you define a parse method, it will override this internal logic and your rules will not be applied. Use a different name like parse_start_url if you need to handle the initial response.
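
To make point 4 concrete, here is a minimal sketch (the URL patterns, spider name, and callbacks are hypothetical). The more specific sale-page rule is listed first; if the general product rule came first, it would match the sale URLs too and parse_sale_product would never be called.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class RuleOrderSpider(CrawlSpider):
    name = 'rule_order_demo'
    allowed_domains = ['shop.example.com']
    start_urls = ['http://shop.example.com/electronics']
    rules = (
        # Specific rule first: a URL like /products/sale-phone-1.html matches
        # both patterns, and only the first matching rule is applied.
        Rule(
            LinkExtractor(allow=r'/products/sale-[a-z0-9-]+\.html'),
            callback='parse_sale_product',
        ),
        # General rule second: catches the remaining product pages.
        Rule(
            LinkExtractor(allow=r'/products/[a-z0-9-]+\.html'),
            callback='parse_product',
        ),
    )
    def parse_sale_product(self, response):
        yield {'url': response.url, 'on_sale': True}
    def parse_product(self, response):
        yield {'url': response.url, 'on_sale': False}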
