杰瑞科技汇

Python rasterstats如何实现栅格统计?

Of course! rasterstats is an incredibly useful Python library for calculating descriptive statistics of raster data based on vector geometries. It's perfect for answering questions like:

  • "What is the average elevation within each city boundary?"
  • "What is the maximum temperature in each forest stand?"
  • "What is the most common land cover type for each watershed?"

Here’s a comprehensive guide covering its installation, core functions, and practical examples.


Installation

First, you need to install the library. It has a few core dependencies like numpy and fiona. It's highly recommended to also install rasterio, which is the modern standard for raster I/O in Python and is used by rasterstats for better performance.

# Install the core library
pip install rasterstats
# Recommended: Install rasterio for better performance and compatibility
pip install rasterio

Core Concepts and Main Functions

rasterstats revolves around two main functions:

  1. zonal_stats(): This is the workhorse. It calculates statistics (mean, median, sum, etc.) for a raster, aggregated over the zones defined by a vector layer (polygons or multipolygons).
  2. point_query(): This function extracts the exact raster value at a set of point locations.

The Data

For our examples, we'll need two files:

  1. A Vector File (Zones): A GeoPackage (.gpkg) or Shapefile containing polygons. Let's call it zones.gpkg.
  2. A Raster File (Values): A GeoTIFF (.tif) file containing the data we want to summarize. Let's call it elevation.tif.

You can easily create sample data with geopandas and rasterio if you don't have any handy.


Example 1: zonal_stats() - Calculating Statistics for Polygons

This is the most common use case. We'll calculate the mean and maximum elevation for each zone in our zones.gpkg file from the elevation.tif raster.

Basic Usage

Let's start with the simplest call, which calculates the default statistics (count, min, max, mean).

from rasterstats import zonal_stats
import geopandas as gpd
# Define file paths
zones_vector_path = 'zones.gpkg'
raster_path = 'elevation.tif'
# The vector data can be a file path...
# stats = zonal_stats(zones_vector_path, raster_path)
# ...or a list of geometries from a GeoDataFrame
gdf = gpd.read_file(zones_vector_path)
geometries = gdf.geometry.tolist()
# Calculate default statistics
stats = zonal_stats(geometries, raster_path)
# Print the results for the first zone
print("Default statistics for the first zone:")
print(stats[0])
# Expected output: {'count': 150, 'min': 124.5, 'max': 245.8, 'mean': 185.2}

Specifying Custom Statistics

You can easily specify which statistics you want to calculate using the stats argument.

# Calculate only the mean and sum
stats_mean_sum = zonal_stats(geometries, raster_path, stats=['mean', 'sum'])
print("\nMean and Sum for the first zone:")
print(stats_mean_sum[0])
# Expected output: {'mean': 185.2, 'sum': 27780.0}

Other Useful Arguments

  • nodata: Explicitly tell zonal_stats what the raster's NoData value is. This is crucial for accurate results.
  • all_touched: A boolean. If True, every raster cell that touches a polygon boundary is included in the calculation. If False (default), only cells whose center is within the polygon are included. This can have a big impact on results, especially for small polygons.
  • affine: If your raster isn't a GeoTIFF (i.e., it lacks spatial reference info), you can provide its affine transformation matrix. rasterio can help you get this.
import rasterio
# A more robust example with nodata and all_touched
with rasterio.open(raster_path) as src:
    nodata_val = src.nodata
    affine_transform = src.transform
stats_robust = zonal_stats(
    geometries,
    raster_path,
    stats=['mean', 'std', 'median'],
    nodata=nodata_val,
    all_touched=True,
    affine=affine_transform
)
print("\nRobust statistics (mean, std, median) for the first zone:")
print(stats_robust[0])
# Expected output: {'mean': 186.1, 'std': 25.4, 'median': 188.0}

Adding Statistics Back to the GeoDataFrame

The real power comes from attaching these statistics to your original vector data.

import pandas as pd
# Create a DataFrame from the stats list
stats_df = pd.DataFrame(stats_robust)
# Join the stats DataFrame with the original GeoDataFrame
gdf_with_stats = gdf.join(stats_df)
print("\nGeoDataFrame with added statistics:")
print(gdf_with_stats.head())
# Now you can save this enriched GeoDataFrame
gdf_with_stats.to_file("zones_with_elevation_stats.gpkg", driver="GPKG")

Example 2: point_query() - Extracting Values at Points

This function is simpler. You provide a list of point coordinates and it returns the corresponding raster values.

from rasterstats import point_query
import geopandas as gpd
# Let's get some points from our zones file
points_gdf = gdf.sample(5) # Get 5 random points from our zones
points_coords = [(point.x, point.y) for point in points_gdf.geometry]
# Extract elevation values at these points
# By default, it returns a list of values
values = point_query(points_coords, raster_path)
print("\nExtracted raster values at 5 points:")
print(values)
# Expected output: [210.5, 195.2, 188.7, 201.1, 179.9]
# You can also specify the band number (e.g., for a multi-band raster)
# values_band1 = point_query(points_coords, raster_path, band=1)

Advanced Tips & Common Pitfalls

a. Handling Different Vector Formats

rasterstats uses fiona under the hood, so it supports any vector format that fiona can read, including Shapefile, GeoPackage, GeoJSON, etc. You can pass either a file path or a list of geometries.

b. Performance with Large Datasets

For very large vector files or high-resolution rasters, zonal_stats can be slow. Here are some tips:

  1. Use rasterio: As mentioned, passing the affine transform from rasterio.open() is much faster than letting rasterstats figure it out.
  2. Simplify Geometries: If your vector polygons have very complex boundaries, consider simplifying them first using gdf.simplify(tolerance). This can drastically reduce processing time without a significant loss in accuracy.
  3. Reproject to Raster CRS: Ensure your vector layer is in the same Coordinate Reference System (CRS) as your raster. rasterstats will do this for you, but doing it beforehand with gdf.to_crs() can be more efficient.
# Example of reprojecting the vector to match the raster
with rasterio.open(raster_path) as src:
    raster_crs = src.crs
gdf = gdf.to_crs(raster_crs)
geometries = gdf.geometry.tolist()
# Now run zonal_stats
stats = zonal_stats(geometries, raster_path)

c. NaN vs. NoData

Be aware of the difference. nodata is a value defined in the raster's metadata (often -9999 or 0) that means "no data here". NaN (Not a Number) is a special floating-point value. rasterstats correctly handles nodata values by excluding them from calculations. If your raster uses NaN as its NoData representation, rasterstats should also handle it correctly.


Summary

Function Use Case Key Arguments
zonal_stats() Calculate statistics (mean, sum, etc.) for raster data aggregated over vector polygons. geometries, raster_path, stats, nodata, all_touched, affine
point_query() Extract the exact raster value at a specific set of point coordinates. points, raster_path, band

rasterstats is a fantastic, lightweight, and powerful tool that bridges the gap between the vector world (GIS, databases) and the raster world (imagery, model outputs). Combined with geopandas and rasterio, it forms a core part of the modern geospatial data science toolkit in Python.

分享:
扫描分享到社交APP
上一篇
下一篇