杰瑞科技汇

How do the bins and labels parameters work together in Python?

The terms "bins" and "labels" in Python are most commonly associated with grouping numerical data into intervals (bins) and then assigning a descriptive name (label) to each interval.


This is a fundamental task in data analysis and visualization. The most popular library for this is Pandas, which has a powerful function called pd.cut().

Let's break it down with clear examples.


The Core Concept: Bins and Labels

Imagine you have a list of people's ages and you want to group them into categories like "Child", "Teen", "Adult", etc.

  • Data: [5, 17, 25, 32, 45, 8, 91]
  • Bins (The Intervals): We need to define the age ranges.
    • 0-12 (Child)
    • 13-19 (Teen)
    • 20-64 (Adult)
    • 65+ (Senior)
  • Labels (The Names): The descriptive names for those intervals.
    • ['Child', 'Teen', 'Adult', 'Senior']

The goal is to convert the raw age data into categorical data based on these rules.
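
As a minimal sketch of the rules above, the age groups can be written as bin edges plus one label per bin. The edge list and variable names here are illustrative assumptions: each range is expressed as a left-inclusive interval, and `np.inf` stands in for the open-ended "65+" group.

```python
import numpy as np
import pandas as pd

ages = [5, 17, 25, 32, 45, 8, 91]

# Edges for 0-12, 13-19, 20-64, 65+ (assumed encoding of the ranges above).
# With right=False each bin is [left, right), so 13 starts the "Teen" bin.
age_edges = [0, 13, 20, 65, np.inf]
age_names = ["Child", "Teen", "Adult", "Senior"]  # one label per bin

groups = pd.cut(ages, bins=age_edges, labels=age_names, right=False)
print(list(groups))
# ['Child', 'Teen', 'Adult', 'Adult', 'Adult', 'Child', 'Senior']
```

Five edges define four bins, which is why exactly four labels are supplied.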


Using pandas.cut()

This is the most direct and flexible way to achieve this. pd.cut() takes an array of values and divides it into discrete intervals.

A. Simple Example: Equal Width Bins

Let's start by creating some sample data and dividing it into a set number of bins of equal width.

import pandas as pd
import numpy as np
# 1. Sample Data
data = np.random.randint(0, 101, size=20) # 20 random numbers between 0 and 100
print("Original Data:")
print(data)
# Example output: [85 12 57 91  3 49 61 33 78 50 42  9 29 70 54 44 67 25 19 98]
# 2. Create Bins and Labels
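# Let's create 3 equal-width bins. With an integer `bins`, pandas splits the
# observed min-max range of the data (not a fixed 0-100 range) into equal parts.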
num_bins = 3
bin_labels = ['Low', 'Medium', 'High']
# 3. Use pd.cut()
# `bins` can be an integer (for equal-width bins) or a list of cut-offs.
# `labels` assigns a name to each bin.
# `right=False` means the intervals are [left, right) (left-inclusive, right-exclusive).
# By default, it's (left, right] (right-inclusive).
binned_data = pd.cut(data, bins=num_bins, labels=bin_labels, right=False)
print("\nBinned Data (as a Categorical object):")
print(binned_data)
# Example output:
# [High, Low, Medium, High, Low, Medium, Medium, Low, High, Medium, ...]
# Categories (3, object): [Low < Medium < High]
# 4. Create a DataFrame to see it clearly
df = pd.DataFrame({'Value': data, 'Category': binned_data})
print("\nDataFrame with Categories:")
print(df)
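
To see which edges pandas actually chose for an integer `bins`, one option is the `retbins=True` argument of `pd.cut()`, which returns the computed edges alongside the categories. The sketch below uses a small fixed list (an assumption, chosen so the result is reproducible) rather than random data.

```python
import pandas as pd

values = [0, 10, 50, 90, 100]

# retbins=True also returns the edges pandas computed for bins=3.
categories, edges = pd.cut(
    values, bins=3, labels=["Low", "Medium", "High"], retbins=True
)

print(edges)             # 4 edges for 3 bins, spanning the data's min-max range
print(list(categories))  # one label per input value
```

With the default `right=True`, the lowest edge is nudged slightly below the minimum so that the smallest value still falls inside the first bin.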

B. Example: Custom Bin Edges

Often, you want to define the exact boundaries for your bins, especially for real-world data like ages or income.

import pandas as pd
# 1. Sample Data (ages)
ages = [8, 15, 22, 35, 48, 60, 70, 5, 18, 25, 99]
# 2. Define Custom Bin Edges
# With right=False (used below), these edges define the intervals
# [0, 18), [18, 36), [36, 66), [66, 100)
bin_edges = [0, 18, 36, 66, 100]
# 3. Define Corresponding Labels
age_labels = ['Child', 'Young Adult', 'Middle-Aged', 'Senior']
# 4. Use pd.cut()
# We don't need to specify `num_bins` here, we use the `bin_edges` list.
# `right=False` makes each bin left-inclusive: 0-17.999... is 'Child' and 18-35.999... is 'Young Adult'.
age_categories = pd.cut(ages, bins=bin_edges, labels=age_labels, right=False)
# 5. Display in a DataFrame
df_ages = pd.DataFrame({'Age': ages, 'Age Group': age_categories})
print(df_ages)

Output:

    Age    Age Group
0     8        Child
1    15        Child
2    22  Young Adult
3    35  Young Adult
4    48  Middle-Aged
5    60  Middle-Aged
6    70       Senior
7     5        Child
8    18  Young Adult
9    25  Young Adult
10   99       Senior
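
The link between `bins` and `labels` is a counting rule: a list of N edges defines N-1 bins, so `labels` needs exactly N-1 entries. The following sketch, with made-up ages, shows what happens when the counts disagree and what happens to values that fall outside every bin.

```python
import pandas as pd

ages = [8, 18, 70, 120]            # 120 lies outside the edges below
bin_edges = [0, 18, 36, 66, 100]   # 5 edges -> 4 bins

# Too few labels (3 labels for 4 bins): pandas rejects the call.
try:
    pd.cut(ages, bins=bin_edges, labels=["Child", "Adult", "Senior"], right=False)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# Matching count: 4 labels for 4 bins. Out-of-range values become NaN.
groups = pd.cut(
    ages,
    bins=bin_edges,
    labels=["Child", "Young Adult", "Middle-Aged", "Senior"],
    right=False,
)
print(list(groups))
print(pd.isna(groups).tolist())
```

If out-of-range values should be kept, one option is to widen the outer edges, for example by using `float("inf")` as the last edge.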

Using numpy.histogram()

Sometimes, you just need the counts for each bin without creating categorical labels. numpy.histogram() is perfect for this. It returns the counts and the bin edges.

You can then manually assign labels if you wish.

import numpy as np
# 1. Sample Data
data = np.random.randn(100) # 100 random numbers from a standard normal distribution
# 2. Define number of bins or bin edges
num_bins = 5
# Or, define edges manually:
# bin_edges = [-3, -2, -1, 0, 1, 2, 3]
# 3. Use np.histogram()
counts, bin_edges = np.histogram(data, bins=num_bins)
print("Counts for each bin:")
print(counts)
# Example output: [ 2 15 38 32 13]
print("\nEdges of the bins:")
print(bin_edges)
# Example output (num_bins + 1 equally spaced edges from data.min() to data.max(),
# rounded here for readability): [-3.14 -1.89 -0.63  0.63  1.88  3.14]
# 4. (Optional) Create labels from the edges
# A common way is to take the average of the edges for each bin
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
bin_labels = [f"Bin {i+1} ({bin_centers[i]:.2f})" for i in range(len(bin_centers))]
print("\nGenerated Labels:")
print(bin_labels)
# Example output: ['Bin 1 (-2.51)', 'Bin 2 (-1.26)', 'Bin 3 (0.00)', 'Bin 4 (1.25)', 'Bin 5 (2.51)']
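
If the goal is a labeled count per bin, the counts from `np.histogram()` can be paired with labels built from the edges. This is a sketch on a small fixed array; the `"[left, right)"` label format is an illustrative choice, not something NumPy produces. The last lines cross-check the counts against `pd.cut()` with `labels=False`, which returns each value's bin index.

```python
import numpy as np
import pandas as pd

data = np.array([-2.5, -1.5, -0.5, 0.2, 0.4, 1.5, 2.5])
bin_edges = [-3, -2, -1, 0, 1, 2, 3]

counts, _ = np.histogram(data, bins=bin_edges)

# Hypothetical label format built from consecutive edges.
# Note: np.histogram closes the last bin on the right, i.e. [2, 3].
interval_labels = [f"[{lo}, {hi})" for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
labelled_counts = dict(zip(interval_labels, counts.tolist()))
print(labelled_counts)

# Cross-check: bin index per value, then count how often each index occurs.
codes = pd.cut(data, bins=bin_edges, labels=False, right=False)
cut_counts = np.bincount(codes, minlength=len(counts)).tolist()
print(cut_counts)
```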

Using pandas.qcut() for Quantile Bins

What if your data is not uniformly distributed? For example, income data. Most people are in a lower income bracket, with a few outliers earning much more. Using pd.cut() with equal-width bins would result in most people being in the first bin.

In this case, pd.qcut() is your friend. It divides the data into bins based on quantiles (e.g., percentiles), so that each bin has (approximately) the same number of data points.

import pandas as pd
import numpy as np
# 1. Sample Skewed Data (e.g., incomes)
# Most values are small, with a few very large ones.
incomes = np.random.lognormal(mean=4, sigma=0.5, size=1000)
# 2. Define Quantiles
# Let's create 4 quartiles (0-25%, 25-50%, 50-75%, 75-100%)
quantiles = [0, 0.25, 0.5, 0.75, 1.0]
# 3. Define Labels for the Quantiles
income_labels = ['Low', 'Lower-Middle', 'Upper-Middle', 'High']
# 4. Use pd.qcut()
# This will ensure each group has roughly 250 people (1000 / 4).
income_groups = pd.qcut(incomes, q=quantiles, labels=income_labels)
# 5. Display in a DataFrame
df_income = pd.DataFrame({'Income': incomes, 'Income Group': income_groups})
# Verify the counts are roughly equal
print("Counts for each income group:")
print(df_income['Income Group'].value_counts())

Output:

Counts for each income group:
Upper-Middle    250
Low             250
Lower-Middle    250
High            250
Name: Income Group, dtype: int64

Notice how the counts are exactly equal (or as close as possible), which is the key feature of qcut.
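
The contrast with `pd.cut()` is easiest to see on a tiny skewed dataset. The values and the "Q1".."Q4" label names below are made up for illustration; the same four labels are passed to both functions.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])  # one large outlier
names = ["Q1", "Q2", "Q3", "Q4"]

# Equal-width bins: the outlier stretches the range, so most values share a bin.
equal_width = pd.cut(values, bins=4, labels=names)

# Quantile bins: edges are placed so each bin holds the same number of values.
equal_count, q_edges = pd.qcut(values, q=4, labels=names, retbins=True)

width_counts = [int((equal_width == name).sum()) for name in names]
quantile_counts = [int((equal_count == name).sum()) for name in names]

print(width_counts)     # [7, 0, 0, 1]
print(quantile_counts)  # [2, 2, 2, 2]
print(q_edges)          # quantile edges; note the unequal widths
```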


Summary: cut() vs. qcut()

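| Feature | pandas.cut() | pandas.qcut() |
| --- | --- | --- |
| Division Method | By values (e.g., 0-10, 10-20). | By rank/quantiles (e.g., 0-25th percentile, 25-50th). |
| Bin Width | Equal width when bins is an integer; whatever the edges specify when bins is a list. | Varies; each bin holds roughly the same number of items. |
| Best For | Data that is (or can be) uniformly distributed, or fixed real-world cut-offs such as age groups. | Skewed data (e.g., income, city populations). |
| Key Parameter | bins (integer or list of edges). | q (integer or list of quantiles). |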

Key Parameters for pd.cut() and pd.qcut()

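  • x: The input array or Series of data.
  • bins / q: For cut, the number of equal-width bins or a list of bin edges; for qcut, the number of quantiles or a list of quantile points (e.g., [0, 0.25, 0.5, 0.75, 1.0]).
  • labels: (Optional) A list of labels to name the bins. It must contain exactly one label per bin, i.e., one fewer than the number of edges. If not provided, each bin is labeled with its interval, e.g., (0, 18]. Pass labels=False to get integer bin codes (0, 1, 2...) instead.
  • right: (cut only; default: True) Defines whether the bins are closed on the right edge (left, right] or left edge [left, right). This is crucial for handling edge cases (e.g., is a person who is exactly 18 a "Child" or a "Young Adult"?).
  • include_lowest: (cut only; default: False) Whether the first interval should also include its left edge. Useful with the default right=True, where a value equal to the lowest edge would otherwise fall outside every bin and become NaN.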
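
The effect of these parameters on boundary values can be checked with a short sketch. The ages are chosen (as an assumption, for illustration) to sit exactly on the edges, and `.codes` is used to show the bin index of each value, with -1 meaning "no bin" (NaN).

```python
import pandas as pd

ages = [0, 18, 36]   # every value sits exactly on an edge
edges = [0, 18, 36]
names = ["Child", "Young Adult"]

right_closed = pd.cut(ages, bins=edges, labels=names)               # (0, 18], (18, 36]
left_closed = pd.cut(ages, bins=edges, labels=names, right=False)   # [0, 18), [18, 36)
with_lowest = pd.cut(ages, bins=edges, labels=names, include_lowest=True)

print(right_closed.codes.tolist())  # [-1, 0, 1]: 0 is excluded, 18 is "Child"
print(left_closed.codes.tolist())   # [0, 1, -1]: 18 is "Young Adult", 36 is excluded
print(with_lowest.codes.tolist())   # [0, 0, 1]: 0 now joins the first bin

# labels=None (the default) keeps interval labels; labels=False returns bin indices.
intervals = pd.cut(ages, bins=edges)
codes_only = pd.cut(ages, bins=edges, labels=False, include_lowest=True)
print(intervals.categories)
print(codes_only.tolist())          # [0, 0, 1]
```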