杰瑞科技汇

Python MapCompose

Let's dive deep into MapCompose, a powerful and commonly used data-cleaning component in the Scrapy ecosystem.


What is MapCompose?

At its core, MapCompose is a processing pipeline for data. It's a class from the itemloaders package (importable as itemloaders.processors.MapCompose, or scrapy.loader.processors.MapCompose in older Scrapy versions) that takes a sequence of functions and chains them together to process input values.

Think of it as an assembly line on a factory floor. An item (the raw data) enters the first station, the first worker (function) does its job, and passes the result to the next worker, and so on, until the final, polished item comes out the end.

Its primary use case in Scrapy is for cleaning and transforming data extracted from web pages using Scrapy's selectors (like XPath or CSS).


The Core Concept: Chaining Functions

The key idea behind MapCompose is that it takes multiple functions as arguments and applies them in sequence to an input value.


The flow is: Input Value → Function 1 → Intermediate Result 1 → Function 2 → Intermediate Result 2 → ... → Final Result

Let's look at a simple, non-Scrapy example first to understand the mechanics.

Simple Example

Imagine you have a raw string extracted from a website: " Price: $19.99 ". You want to clean it to get a float: 19.99.

You could write a function to do this, but MapCompose lets you break it down into small, reusable, single-purpose functions.

from itemloaders.processors import MapCompose
# (In older Scrapy versions: from scrapy.loader.processors import MapCompose)
# --- Our Processing Functions (Single Purpose) ---
def extract_price(text):
    """Extracts the price part from a string like 'Price: $19.99'."""
    if "Price:" in text:
        return text.split("Price:")[1].strip()
    return text.strip()
def remove_dollar_sign(text):
    """Removes the dollar sign."""
    return text.replace("$", "")
def to_float(text):
    """Converts a string to a float."""
    try:
        return float(text)
    except ValueError:
        return 0.0 # Or handle the error as you see fit
# --- Create the MapCompose Pipeline ---
price_pipeline = MapCompose(extract_price, remove_dollar_sign, to_float)
# --- Use the pipeline ---
raw_data = "  Price: $19.99  "
processed_price = price_pipeline(raw_data)
print(f"Raw Data: '{raw_data}'")
print(f"Processed Price: {processed_price}")
print(f"Type: {type(processed_price)}")

Output:

Raw Data: '  Price: $19.99  '
Processed Price: [19.99]
Type: <class 'list'>

As you can see, MapCompose took the input string and passed it through extract_price, then the result of that to remove_dollar_sign, and finally that result to to_float. Note that the result comes back wrapped in a list: MapCompose always returns a list, because it is built to process multiple extracted values at once (more on that next).


How MapCompose Works with Iterables

This is where MapCompose truly shines in Scrapy. If you pass it an iterable (like a list of strings from a response.css() or response.xpath() call), it will apply the entire pipeline to each item in the iterable.

Let's extend our example. Imagine a product page with multiple prices listed.

from itemloaders.processors import MapCompose
# (Same functions as before)
def extract_price(text):
    if "Price:" in text:
        return text.split("Price:")[1].strip()
    return text.strip()
def remove_dollar_sign(text):
    return text.replace("$", "")
def to_float(text):
    try:
        return float(text)
    except ValueError:
        return 0.0
# --- Create the pipeline ---
price_pipeline = MapCompose(extract_price, remove_dollar_sign, to_float)
# --- Simulate extracting a list of prices from a page ---
# This is what you might get from response.css('.price::text').getall()
raw_prices = [
    "  Price: $19.99  ",
    "On Sale for $9.50",
    "Free", # This one will cause an error in to_float
    "  Price: $25.00  "
]
# --- Use the pipeline on the list ---
processed_prices = price_pipeline(raw_prices)
print(f"Raw Prices: {raw_prices}")
print(f"Processed Prices: {processed_prices}")

Output:

Raw Prices: ['  Price: $19.99  ', 'On Sale for $9.50', 'Free', '  Price: $25.00  ']
Processed Prices: [19.99, 9.5, 0.0, 25.0]

MapCompose iterated through the raw_prices list, applying the function chain to each element, and returned a new list of processed values.


Practical Scrapy Example: A Spider Item

This is the most common place you'll see MapCompose. Let's build a Scrapy Item and a spider to scrape book titles and prices from the scraping practice site books.toscrape.com.

Define the Item (items.py)

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
# (In older Scrapy versions: from scrapy.loader.processors import MapCompose, TakeFirst)
# --- Our cleaning functions ---
def clean_price(text):
    """Removes currency symbols and whitespace, converts to float."""
    # Handle cases where price might not be a number
    if text.strip().lower() in ['out of stock', 'na']:
        return None
    try:
        return float(text.replace('£', '').replace('$', '').strip())
    except ValueError:
        return None
def clean_title(text):
    """Strips whitespace and normalizes title case."""
    return text.strip().title()
class BookItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(clean_title),
        output_processor=TakeFirst()
    )
    price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst()
    )
    # We don't need a processor for author if we just want the raw string
    author = scrapy.Field()

Explanation:

  • We import MapCompose (plus TakeFirst, which unwraps the one-element list that MapCompose returns).
  • We define our small, focused cleaning functions.
  • In the BookItem class, we assign MapCompose(...) to the input_processor of a field. These processors run when the item is populated through an ItemLoader.
    • For the title, the pipeline will be MapCompose(clean_title).
    • For the price, the pipeline will be MapCompose(clean_price).

The Spider (my_spider.py)

import scrapy
from scrapy.loader import ItemLoader
from myproject.items import BookItem # Assuming items.py is in myproject
class BookSpider(scrapy.Spider):
    name = 'book_spider'
    start_urls = ['https://books.toscrape.com/'] # A real website for scraping practice
    def parse(self, response):
        for book in response.css('article.product_pod'):
            # Input processors only run when fields are populated through
            # an ItemLoader; assigning to item['price'] directly would
            # bypass them.
            loader = ItemLoader(item=BookItem(), selector=book)
            # add_css() extracts a list of values; MapCompose processes each one.
            loader.add_css('title', 'h3 a::text')
            loader.add_css('price', 'p.price_color::text')
            # No processor for author; this site doesn't list authors here.
            loader.add_value('author', 'Unknown')
            yield loader.load_item()

How it All Connects

When the ItemLoader extracts 'p.price_color::text', it gets a string like '£51.77'.

  1. The loader sees that the price field in BookItem has an input_processor.
  2. It takes the extracted value ('£51.77') and passes it to the MapCompose instance.
  3. MapCompose calls the first function in its chain: clean_price('£51.77').
  4. clean_price returns 51.77.
  5. Since that was the only function in the chain, MapCompose returns the list [51.77]; an output processor such as TakeFirst then unwraps it, so the final value 51.77 is assigned to item['price'].

Note that input processors only run when fields are populated through an ItemLoader; assigning extracted values directly to item fields bypasses them.

If you had MapCompose(func1, func2, func3), it would be func3(func2(func1(value))).
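To make that order of application concrete, here is a minimal, simplified model of what MapCompose does internally (the real class lives in itemloaders.processors and also handles loader contexts; this sketch only captures the chaining, flattening, and None-dropping behavior):

```python
def map_compose(*functions):
    """Simplified model of MapCompose: apply each function in order to
    every value, flattening iterable results and dropping None."""
    def process(value):
        # Strings count as single values, not iterables of characters.
        if isinstance(value, (str, bytes)) or not hasattr(value, "__iter__"):
            values = [value]
        else:
            values = list(value)
        for func in functions:
            next_values = []
            for v in values:
                result = func(v)
                if result is None:
                    continue  # None results are dropped
                if isinstance(result, (str, bytes)) or not hasattr(result, "__iter__"):
                    next_values.append(result)
                else:
                    next_values.extend(result)  # iterable results are flattened
            values = next_values
        return values
    return process

pipeline = map_compose(str.strip, str.upper)
print(pipeline("  hello  "))        # → ['HELLO']
print(pipeline(["  a ", "  b  "]))  # → ['A', 'B']
```

The nested-call intuition func3(func2(func1(value))) holds per value; the list wrapping is what lets the same pipeline handle both single strings and lists of strings.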


Key Advantages of MapCompose

  1. Readability and Maintainability: Code is broken into small, single-purpose functions. It's much easier to read MapCompose(clean, strip, to_int) than one large, complex function.
  2. Reusability: You can reuse the same cleaning functions across different items or even different spiders.
  3. Testability: You can test each small function (clean_price, clean_title) in isolation, which is much simpler than testing one giant "mega-function".
  4. Declarative: The Item definition clearly states how its fields should be processed. It's self-documenting.
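The testability point is easy to demonstrate: each cleaning function is plain Python and can be exercised with simple assertions, no Scrapy machinery required (reusing the clean_price function from the item example above):

```python
def clean_price(text):
    """Removes currency symbols and whitespace, converts to float."""
    if text.strip().lower() in ['out of stock', 'na']:
        return None
    try:
        return float(text.replace('£', '').replace('$', '').strip())
    except ValueError:
        return None

# Each function can be tested in isolation.
assert clean_price('£51.77') == 51.77
assert clean_price('  $9.50 ') == 9.5
assert clean_price('Out of Stock') is None
assert clean_price('not a price') is None
print("all clean_price tests pass")
```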

MapCompose vs. Pipeline

It's crucial not to confuse MapCompose with Scrapy's Item Pipelines.

Feature | MapCompose | Item Pipeline
--- | --- | ---
Purpose | Field-level processing: cleans/transforms data as it is being extracted. | Item-level processing: acts on a complete item after it has been extracted.
When it runs | During the parse method, as scrapy.Item fields are populated (via an ItemLoader). | After the parse method has yielded an item, and before the item is saved or exported.
Scope | A single field value (or a list of values for that field). | The entire scrapy.Item object, with access to all of its fields.
Example uses | Convert a string to a number; strip whitespace from a title; extract a substring from a URL. | Validate an item (e.g. price is not None); deduplicate items by a unique ID; enrich an item with data from an API; save the item to a database.

Rule of Thumb:

  • Use MapCompose for simple, direct transformations on the raw data you just pulled from a page.
  • Use an Item Pipeline for more complex logic that requires context from the entire item or for actions like saving to a database.