杰瑞科技汇

How Do You Handle Categorical Data in Python?

"Categorical" in Python most commonly refers to the pandas categorical data type (the category dtype, whose values are held in a pandas.Categorical array). It's a memory-efficient way to handle data that has a fixed, limited number of possible values (categories).

Let's break it down.

What is a Categorical Data Type?

Think about a column in your data that represents something like:

  • gender (with values: 'Male', 'Female', 'Other')
  • day_of_week (with values: 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
  • satisfaction_rating (with values: 'Low', 'Medium', 'High')
  • size (with values: 'S', 'M', 'L', 'XL')

These are all categorical variables. They have a specific set of categories, and the data is limited to those categories.

Instead of storing these as strings (which are memory-intensive and slow to sort), pandas.Categorical stores them more efficiently using integer codes under the hood.
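
To make the integer-code idea concrete, here is a minimal sketch. The sample sizes are made up for illustration; the point is only that the labels are stored once and each row keeps a small integer.

```python
import pandas as pd

# Hypothetical sample data: four T-shirt sizes, one of them repeated
sizes = pd.Categorical(["S", "M", "L", "M"])

# The distinct labels are stored once (inferred in sorted order by default)
size_labels = list(sizes.categories)

# Each row holds only an integer position into that list of labels
size_codes = sizes.codes.tolist()

print(size_labels)
print(size_codes)
```

Here "S" maps to the last position because the inferred categories are sorted alphabetically.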

Why Use Categorical Data?

Using pandas.Categorical offers three main advantages:

  1. Memory Efficiency: If you have a column with many repeated strings (e.g., a column of 'USA', 'USA', 'Canada', 'USA', 'Mexico'), storing it as a category can use significantly less memory than storing it as a string (object) type. The categories are stored once, and the column holds integer references to them.

  2. Performance: Operations like sorting and grouping are often faster on categorical data than on strings, because pandas can work on the integer codes. Sorting also follows the order of the categories rather than alphabetical order (which is often not what you want).

  3. Semantic Meaning: It explicitly tells you and your data analysis tools that this column is not just text; it represents a distinct set of categories. This prevents accidental operations that don't make sense (e.g., trying to add two genders together).
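
The sorting point in the list above can be sketched as follows. The size labels and variable names are illustrative assumptions, not part of any dataset used elsewhere in this article.

```python
import pandas as pd

raw_sizes = pd.Series(["M", "XL", "S", "L"])

# Plain strings sort alphabetically, which scrambles the real size order
alphabetical = raw_sizes.sort_values().tolist()

# A categorical with declared categories sorts by that declared order
size_dtype = pd.CategoricalDtype(["S", "M", "L", "XL"], ordered=True)
logical = raw_sizes.astype(size_dtype).sort_values().tolist()

print(alphabetical)
print(logical)
```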


How to Use pandas.Categorical

Let's walk through the main concepts with code examples.

Creating a Categorical Series

You can create a categorical Series in a few ways.

Method A: Convert from an existing Series

import pandas as pd
# Create a Series of string data
data = ['Low', 'High', 'Medium', 'Low', 'Medium', 'High']
s = pd.Series(data)
# Convert it to a categorical type
s_cat = s.astype('category')
print("Original Series:")
print(s)
print("\nOriginal Series dtype:", s.dtype)
print("\nCategorical Series:")
print(s_cat)
print("\nCategorical Series dtype:", s_cat.dtype)

Output:

Original Series:
0       Low
1      High
2    Medium
3       Low
4    Medium
5      High
dtype: object

Original Series dtype: object

Categorical Series:
0       Low
1      High
2    Medium
3       Low
4    Medium
5      High
dtype: category
Categories (3, object): ['High', 'Low', 'Medium']

Categorical Series dtype: category

Notice that pandas automatically sorted the categories alphabetically by default.
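
Two other common routes to a categorical Series are sketched below: passing dtype="category" at construction time, and wrapping a pd.Categorical that declares its own category order. The variable names are illustrative.

```python
import pandas as pd

data = ["Low", "High", "Medium", "Low", "Medium", "High"]

# Option 1: request the category dtype when building the Series
s_direct = pd.Series(data, dtype="category")

# Option 2: build a pd.Categorical first, stating the category order explicitly
cat_values = pd.Categorical(data, categories=["Low", "Medium", "High"], ordered=True)
s_from_cat = pd.Series(cat_values)

print(s_direct.dtype)
print(list(s_from_cat.cat.categories))
```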

Specifying Categories and Order

This is one of the most powerful features. You can define the categories yourself and specify an order. This is crucial for things like 'Low', 'Medium', 'High'.

# Define the categories and their logical order
my_categories = ['Low', 'Medium', 'High']
my_ordered_categories = pd.CategoricalDtype(categories=my_categories, ordered=True)
# Create the Series with the new dtype
s_ordered = s.astype(my_ordered_categories)
print(s_ordered)
print("\nCategorical Series dtype:", s_ordered.dtype)

Output:

0       Low
1      High
2    Medium
3       Low
4    Medium
5      High
dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']

Categorical Series dtype: category

The ordered=True flag is key. It enables logical comparisons.
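
If a Series is already categorical, the order can also be declared after the fact. The sketch below assumes a small made-up ratings Series and uses .cat.reorder_categories; once the categories are ordered, min() and max() follow the declared order.

```python
import pandas as pd

ratings = pd.Series(["Low", "High", "Medium", "Low"]).astype("category")

# Reorder the alphabetically inferred categories and mark them as ordered
ratings = ratings.cat.reorder_categories(["Low", "Medium", "High"], ordered=True)

# min() and max() now respect Low < Medium < High
lowest = ratings.min()
highest = ratings.max()

print(lowest, highest)
```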

Logical Comparisons (with ordered=True)

Because our s_ordered is now ordered, we can use comparison operators like <, >, <=, >=.

# Comparisons like >= only work if the Categorical is ordered.
# The unordered s_cat from earlier raises a TypeError:
try:
    print(s_cat >= 'Medium')
except TypeError as e:
    print(f"Error: {e}")
    print("This happens because the Categorical was not ordered.")
# Try again with the ordered version: find all ratings that are 'Medium' or higher
print("\nRatings >= 'Medium':")
print(s_ordered >= 'Medium')

Output:

Error: Unordered Categoricals can only compare equality or not
This happens because the Categorical was not ordered.

Ratings >= 'Medium':
0    False
1     True
2     True
3    False
4     True
5     True
dtype: bool

If you tried this with a standard string Series, the comparison would run but compare alphabetically, so 'High' >= 'Medium' would come out False. With an ordered Categorical, the comparison follows the order you declared.
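
A short side-by-side sketch shows how a plain string Series gives misleading results for the same comparison. The three-value sample is illustrative.

```python
import pandas as pd

labels = ["Low", "High", "Medium"]
plain = pd.Series(labels)

rating_dtype = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
ordered = plain.astype(rating_dtype)

# Strings compare character by character, so "High" sorts before "Medium"
string_result = (plain >= "Medium").tolist()

# The ordered categorical compares by position in Low < Medium < High
ordered_result = (ordered >= "Medium").tolist()

print(string_result)
print(ordered_result)
```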

Memory Usage Comparison

Let's see the memory savings.

# Create a large Series of repeated strings
large_string_series = pd.Series(['USA'] * 1_000_000 + ['Canada'] * 500_000 + ['Mexico'] * 500_000)
# Convert it to a categorical Series
large_cat_series = large_string_series.astype('category')
print(f"Memory usage of string Series: {large_string_series.memory_usage(deep=True) / 1024**2:.2f} MB")
print(f"Memory usage of categorical Series: {large_cat_series.memory_usage(deep=True) / 1024**2:.2f} MB")

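The exact figures depend on your Python and pandas versions, so run the snippet to see them on your machine. With the classic object dtype, every row of the string Series holds an 8-byte pointer to a Python string object, which puts the total on the order of 100 MB. The categorical Series stores one int8 code per row (about 1.9 MB for 2 million rows) plus the three category labels.

As you can see, the categorical version uses a small fraction of the memory!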
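
The saving comes from the width of the integer codes. The sketch below uses a much smaller, made-up Series so it runs instantly, and checks the codes dtype rather than quoting byte counts, since those vary by environment.

```python
import pandas as pd

countries = pd.Series(["USA"] * 1_000 + ["Canada"] * 500 + ["Mexico"] * 500)
countries_cat = countries.astype("category")

# With only three categories, each code fits in a single byte
codes_dtype = str(countries_cat.cat.codes.dtype)

string_bytes = countries.memory_usage(deep=True)
category_bytes = countries_cat.memory_usage(deep=True)

print(codes_dtype)
print(category_bytes < string_bytes)
```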


Key Methods and Attributes

When you have a Categorical Series, you get access to some useful attributes and methods:

  • .cat.categories: The list of categories.
  • .cat.ordered: A boolean indicating if the categories are ordered.
  • .cat.codes: The integer codes for each value (useful for machine learning models).

print("Categories:", s_ordered.cat.categories)
print("Is ordered?", s_ordered.cat.ordered)
# .cat.codes is itself a Series; convert it to a list for compact printing
print("Codes:", s_ordered.cat.codes.tolist())

Output:

Categories: Index(['Low', 'Medium', 'High'], dtype='object')
Is ordered? True
Codes: [0, 2, 1, 0, 1, 2]
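
When feeding codes to a model, two details are worth knowing: missing values get the code -1, and the codes can be mapped back to labels through the categories. A minimal sketch with made-up ratings:

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
ratings = pd.Series(["Low", None, "High"], dtype=rating_dtype)

# Missing values are encoded as -1, not as a regular category position
codes = ratings.cat.codes.tolist()

# Build a lookup from code back to label
code_to_label = dict(enumerate(ratings.cat.categories))

print(codes)
print(code_to_label)
```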

Common Operations

Adding New Categories

What if you encounter a category that wasn't in your original list?

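# By default, assigning a value that is not an existing category raises an error.
# Recent pandas versions raise TypeError; older ones raised ValueError.
try:
    s_ordered[0] = 'Very High'
except (TypeError, ValueError) as e:
    print(f"Error: {e}")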
# You can add new categories first
s_ordered = s_ordered.cat.add_categories(['Very High'])
print("\nAfter adding 'Very High':")
print(s_ordered.cat.categories)
# Now you can assign it
s_ordered[0] = 'Very High'
print("\nSeries after assignment:")
print(s_ordered)
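
One caveat for ordered data: add_categories appends the new label at the end, so it becomes the highest value. If the new category belongs elsewhere in the order, a sketch like the following (using .cat.set_categories with a hypothetical 'Very Low' label) states the full order explicitly.

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
ratings = pd.Series(["Low", "High", "Medium"], dtype=rating_dtype)

# Appending puts "Very Low" after "High", which is the wrong rank
appended = ratings.cat.add_categories(["Very Low"])

# Listing every category in the intended order puts it first instead
placed = ratings.cat.set_categories(
    ["Very Low", "Low", "Medium", "High"], ordered=True
)

print(list(appended.cat.categories))
print(list(placed.cat.categories))
```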

Removing Categories

# Remove a category
s_ordered = s_ordered.cat.remove_categories(['Very High'])
print("\nAfter removing 'Very High':")
print(s_ordered.cat.categories)
# The value that was 'Very High' becomes NaN
print("\nSeries after removal:")
print(s_ordered)
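
To drop only the categories that no row uses, without turning any values into NaN, .cat.remove_unused_categories is the gentler option. A minimal sketch with made-up data, contrasted with remove_categories:

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
ratings = pd.Series(["Low", "High", "Low"], dtype=rating_dtype)

# "Medium" never appears in the data, so it is dropped; values are untouched
trimmed = ratings.cat.remove_unused_categories()

# Removing a category that is in use turns the matching rows into NaN
without_low = ratings.cat.remove_categories(["Low"])

print(list(trimmed.cat.categories))
print(without_low.isna().tolist())
```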

When to Use Categorical Data?

DO use it for:

  • Low-cardinality string columns: Columns with a small, fixed number of unique values (e.g., country codes, state names, product colors).
  • Ordinal data: Data with a clear, logical order (e.g., 'bad', 'good', 'excellent'; 'small', 'large', 'x-large').

DON'T use it for:

  • High-cardinality strings: Columns with thousands or millions of unique values (e.g., user IDs, full names, product SKUs). The memory savings will be negligible or non-existent, and it can even slow things down.
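  • Text data: Free-form text rarely repeats, so there is nothing to gain from encoding it. The .str accessor (for example .str.contains()) does work on a categorical Series whose categories are strings, but the result is an ordinary Series rather than a categorical one.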
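
One rough way to apply this guidance is to look at the share of distinct values in a column. The helper name unique_ratio and the 0.5 cut-off below are hypothetical choices for illustration, not a pandas rule.

```python
import pandas as pd


def unique_ratio(column: pd.Series) -> float:
    """Share of distinct non-null values among all rows (hypothetical helper)."""
    return column.nunique() / len(column) if len(column) else 0.0


country = pd.Series(["USA", "Canada", "USA", "Mexico"] * 250)
user_id = pd.Series([f"user_{i}" for i in range(1_000)])

# Arbitrary illustrative threshold: below it, category is likely worthwhile
THRESHOLD = 0.5
country_ok = unique_ratio(country) < THRESHOLD
user_id_ok = unique_ratio(user_id) < THRESHOLD

print(country_ok, user_id_ok)
```

A low ratio suggests many repeats, which is where the integer codes pay off; a ratio near 1 means almost every row is its own category.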

Summary

Feature | Description | Benefit
Data Type | pandas.Categorical | Represents fixed, discrete categories.
Memory | Stores integer codes, not full strings. | Massive memory savings for low-cardinality data.
Performance | Faster sorting and grouping. | Quicker data analysis.
Ordering | Can be ordered or unordered. | Enables logical comparisons (>, <) for ordinal data.
Use Cases | Gender, days of the week, satisfaction levels, sizes. | Makes data analysis more robust and efficient.