Of course! rasterstats is an incredibly useful Python library for calculating descriptive statistics of raster data based on vector geometries. It's perfect for answering questions like:
- "What is the average elevation within each city boundary?"
- "What is the maximum temperature in each forest stand?"
- "What is the most common land cover type for each watershed?"
Here’s a comprehensive guide covering its installation, core functions, and practical examples.
Installation
First, you need to install the library. It has a few core dependencies like numpy and fiona. It's highly recommended to also install rasterio, which is the modern standard for raster I/O in Python and is used by rasterstats for better performance.
# Install the core library pip install rasterstats # Recommended: Install rasterio for better performance and compatibility pip install rasterio
Core Concepts and Main Functions
rasterstats revolves around two main functions:
zonal_stats(): This is the workhorse. It calculates statistics (mean, median, sum, etc.) for a raster, aggregated over the zones defined by a vector layer (polygons or multipolygons).point_query(): This function extracts the exact raster value at a set of point locations.
The Data
For our examples, we'll need two files:
- A Vector File (Zones): A GeoPackage (
.gpkg) or Shapefile containing polygons. Let's call itzones.gpkg. - A Raster File (Values): A GeoTIFF (
.tif) file containing the data we want to summarize. Let's call itelevation.tif.
You can easily create sample data with geopandas and rasterio if you don't have any handy.
Example 1: zonal_stats() - Calculating Statistics for Polygons
This is the most common use case. We'll calculate the mean and maximum elevation for each zone in our zones.gpkg file from the elevation.tif raster.
Basic Usage
Let's start with the simplest call, which calculates the default statistics (count, min, max, mean).
from rasterstats import zonal_stats
import geopandas as gpd
# Define file paths
zones_vector_path = 'zones.gpkg'
raster_path = 'elevation.tif'
# The vector data can be a file path...
# stats = zonal_stats(zones_vector_path, raster_path)
# ...or a list of geometries from a GeoDataFrame
gdf = gpd.read_file(zones_vector_path)
geometries = gdf.geometry.tolist()
# Calculate default statistics
stats = zonal_stats(geometries, raster_path)
# Print the results for the first zone
print("Default statistics for the first zone:")
print(stats[0])
# Expected output: {'count': 150, 'min': 124.5, 'max': 245.8, 'mean': 185.2}
Specifying Custom Statistics
You can easily specify which statistics you want to calculate using the stats argument.
# Calculate only the mean and sum
stats_mean_sum = zonal_stats(geometries, raster_path, stats=['mean', 'sum'])
print("\nMean and Sum for the first zone:")
print(stats_mean_sum[0])
# Expected output: {'mean': 185.2, 'sum': 27780.0}
Other Useful Arguments
nodata: Explicitly tellzonal_statswhat the raster's NoData value is. This is crucial for accurate results.all_touched: A boolean. IfTrue, every raster cell that touches a polygon boundary is included in the calculation. IfFalse(default), only cells whose center is within the polygon are included. This can have a big impact on results, especially for small polygons.affine: If your raster isn't a GeoTIFF (i.e., it lacks spatial reference info), you can provide its affine transformation matrix.rasteriocan help you get this.
import rasterio
# A more robust example with nodata and all_touched
with rasterio.open(raster_path) as src:
nodata_val = src.nodata
affine_transform = src.transform
stats_robust = zonal_stats(
geometries,
raster_path,
stats=['mean', 'std', 'median'],
nodata=nodata_val,
all_touched=True,
affine=affine_transform
)
print("\nRobust statistics (mean, std, median) for the first zone:")
print(stats_robust[0])
# Expected output: {'mean': 186.1, 'std': 25.4, 'median': 188.0}
Adding Statistics Back to the GeoDataFrame
The real power comes from attaching these statistics to your original vector data.
import pandas as pd
# Create a DataFrame from the stats list
stats_df = pd.DataFrame(stats_robust)
# Join the stats DataFrame with the original GeoDataFrame
gdf_with_stats = gdf.join(stats_df)
print("\nGeoDataFrame with added statistics:")
print(gdf_with_stats.head())
# Now you can save this enriched GeoDataFrame
gdf_with_stats.to_file("zones_with_elevation_stats.gpkg", driver="GPKG")
Example 2: point_query() - Extracting Values at Points
This function is simpler. You provide a list of point coordinates and it returns the corresponding raster values.
from rasterstats import point_query
import geopandas as gpd
# Let's get some points from our zones file
points_gdf = gdf.sample(5) # Get 5 random points from our zones
points_coords = [(point.x, point.y) for point in points_gdf.geometry]
# Extract elevation values at these points
# By default, it returns a list of values
values = point_query(points_coords, raster_path)
print("\nExtracted raster values at 5 points:")
print(values)
# Expected output: [210.5, 195.2, 188.7, 201.1, 179.9]
# You can also specify the band number (e.g., for a multi-band raster)
# values_band1 = point_query(points_coords, raster_path, band=1)
Advanced Tips & Common Pitfalls
a. Handling Different Vector Formats
rasterstats uses fiona under the hood, so it supports any vector format that fiona can read, including Shapefile, GeoPackage, GeoJSON, etc. You can pass either a file path or a list of geometries.
b. Performance with Large Datasets
For very large vector files or high-resolution rasters, zonal_stats can be slow. Here are some tips:
- Use
rasterio: As mentioned, passing theaffinetransform fromrasterio.open()is much faster than lettingrasterstatsfigure it out. - Simplify Geometries: If your vector polygons have very complex boundaries, consider simplifying them first using
gdf.simplify(tolerance). This can drastically reduce processing time without a significant loss in accuracy. - Reproject to Raster CRS: Ensure your vector layer is in the same Coordinate Reference System (CRS) as your raster.
rasterstatswill do this for you, but doing it beforehand withgdf.to_crs()can be more efficient.
# Example of reprojecting the vector to match the raster
with rasterio.open(raster_path) as src:
raster_crs = src.crs
gdf = gdf.to_crs(raster_crs)
geometries = gdf.geometry.tolist()
# Now run zonal_stats
stats = zonal_stats(geometries, raster_path)
c. NaN vs. NoData
Be aware of the difference. nodata is a value defined in the raster's metadata (often -9999 or 0) that means "no data here". NaN (Not a Number) is a special floating-point value. rasterstats correctly handles nodata values by excluding them from calculations. If your raster uses NaN as its NoData representation, rasterstats should also handle it correctly.
Summary
| Function | Use Case | Key Arguments |
|---|---|---|
zonal_stats() |
Calculate statistics (mean, sum, etc.) for raster data aggregated over vector polygons. | geometries, raster_path, stats, nodata, all_touched, affine |
point_query() |
Extract the exact raster value at a specific set of point coordinates. | points, raster_path, band |
rasterstats is a fantastic, lightweight, and powerful tool that bridges the gap between the vector world (GIS, databases) and the raster world (imagery, model outputs). Combined with geopandas and rasterio, it forms a core part of the modern geospatial data science toolkit in Python.
