How Do You Handle Categorical Data in Python? - 杰瑞科技汇

"Categorical" in Python most commonly refers to the pandas.Categorical data type. It is a memory-efficient way to handle data that has a fixed, limited number of possible values (categories).

Let's break it down.

What is a Categorical Data Type?

Think about a column in your data that represents something like:

gender (with values: 'Male', 'Female', 'Other')
day_of_week (with values: 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
satisfaction_rating (with values: 'Low', 'Medium', 'High')
size (with values: 'S', 'M', 'L', 'XL')

These are all categorical variables. They have a specific set of categories, and the data is limited to those categories.

Instead of storing these as strings (which are memory-intensive and slow to sort), pandas.Categorical stores them more efficiently using integer codes under the hood.
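
To make the integer-code idea concrete, here is a minimal sketch on a small, made-up list of sizes (the variable name sizes and the data are illustrative assumptions). When no category list is given, pandas sorts the inferred categories, so the codes index into that sorted list.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
sizes = pd.Categorical(['S', 'M', 'L', 'M', 'S'])

# The distinct labels are stored once, in sorted order
print(sizes.categories)

# Each element is stored as a small integer index into the categories
print(sizes.codes)
```

Here 'L' gets code 0, 'M' code 1, and 'S' code 2, because the inferred categories are sorted alphabetically.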

Why Use Categorical Data?

Using pandas.Categorical offers three main advantages:

Memory Efficiency: If you have a column with many repeated strings (e.g., a column of 'USA', 'USA', 'Canada', 'USA', 'Mexico'), storing it as a category can use significantly less memory than storing it as a string (object) type. The categories are stored once, and the column holds integer references to them.
Performance: Operations like sorting and grouping are often faster on categorical data than on strings, because they work on the integer codes. Sorting also follows the declared order of the categories rather than alphabetical order (which is often not what you want).
Semantic Meaning: It explicitly tells you and your data analysis tools that this column is not just text; it represents a distinct set of categories. This prevents accidental operations that don't make sense (e.g., trying to add two genders together).
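
The sorting point can be checked with a short sketch. The sample ratings and the name rating_dtype are assumptions made for this illustration.

```python
import pandas as pd

ratings = pd.Series(['Medium', 'Low', 'High', 'Low'])

# Plain strings sort alphabetically
alphabetical = ratings.sort_values().tolist()

# A categorical sorts by the declared category order instead
rating_dtype = pd.CategoricalDtype(['Low', 'Medium', 'High'], ordered=True)
logical = ratings.astype(rating_dtype).sort_values().tolist()

print(alphabetical)
print(logical)
```

The string sort puts 'High' first, while the categorical sort runs from 'Low' to 'High'.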

How to Use `pandas.Categorical`

Let's walk through the main concepts with code examples.

Creating a Categorical Series

You can create a categorical Series in a few ways.

Method A: Convert from an existing Series

import pandas as pd
import numpy as np
# Create a Series of string data
data = ['Low', 'High', 'Medium', 'Low', 'Medium', 'High']
s = pd.Series(data)
# Convert it to a categorical type
s_cat = s.astype('category')
print("Original Series:")
print(s)
print("\nOriginal Series dtype:", s.dtype)
print("\nCategorical Series:")
print(s_cat)
print("\nCategorical Series dtype:", s_cat.dtype)

Output:

Original Series:
0       Low
1      High
2    Medium
3       Low
4    Medium
5      High
dtype: object
Original Series dtype: object
Categorical Series:
0       Low
1      High
2    Medium
3       Low
4    Medium
5      High
Categories (3, object): ['High', 'Low', 'Medium']  # Note the sorting!
Categorical Series dtype: category

Notice that, because no categories were specified, pandas inferred them from the data and sorted them (alphabetically, for strings).
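
Converting an existing Series is only one route. As a rough sketch, three other common ways to build a categorical Series are shown here; the labels "Method B", "Method C", and "Method D" are ours, chosen to continue the numbering above.

```python
import pandas as pd

data = ['Low', 'High', 'Medium', 'Low', 'Medium', 'High']

# Method B: request the category dtype at construction time
s_direct = pd.Series(data, dtype='category')

# Method C: build a pd.Categorical with explicit categories, then wrap it
cat = pd.Categorical(data, categories=['Low', 'Medium', 'High'], ordered=True)
s_explicit = pd.Series(cat)

# Method D: rebuild from integer codes plus a category list
s_from_codes = pd.Series(
    pd.Categorical.from_codes([0, 2, 1], categories=['Low', 'Medium', 'High'])
)

print(s_direct.dtype)
print(s_explicit.cat.ordered)
print(s_from_codes.tolist())
```

Method D is mainly useful when the data already arrives as integer codes, for example from a file that stores a lookup table separately.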

Specifying Categories and Order

This is one of the most powerful features. You can define the categories yourself and specify an order. This is crucial for things like 'Low', 'Medium', 'High'.

# Define the categories and their logical order
my_categories = ['Low', 'Medium', 'High']
my_ordered_categories = pd.CategoricalDtype(categories=my_categories, ordered=True)
# Create the Series with the new dtype
s_ordered = s.astype(my_ordered_categories)
print(s_ordered)
print("\nCategorical Series dtype:", s_ordered.dtype)

Output:

0       Low
1      High
2    Medium
3       Low
4    Medium
5      High
Categories (3, object): ['Low' < 'Medium' < 'High']  # Now it's in the correct order!
Categorical Series dtype: category

The ordered=True flag is key. It enables logical comparisons.
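
Two practical details are worth a quick sketch, using made-up data: values that are not in the declared categories silently become NaN during conversion, and an existing unordered categorical can be given an order with .cat.reorder_categories(). The misspelled value 'Hihg' is a deliberate, hypothetical typo.

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(['Low', 'Medium', 'High'], ordered=True)

# 'Hihg' is not a declared category, so it becomes NaN rather than raising
raw = pd.Series(['Low', 'Hihg', 'Medium'])
converted = raw.astype(rating_dtype)
print(converted.isna().tolist())

# Give an order to a categorical that was created without one
unordered = pd.Series(['Low', 'High', 'Medium'], dtype='category')
reordered = unordered.cat.reorder_categories(
    ['Low', 'Medium', 'High'], ordered=True
)
print(reordered.cat.categories.tolist(), reordered.cat.ordered)
```

Because the conversion does not raise, it is worth checking for unexpected NaN values after applying a fixed category list to raw data.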

Logical Comparisons (with `ordered=True`)

Because our s_ordered is now ordered, we can use comparison operators like <, >, <=, >=.

# Comparisons such as >= are only defined for ordered categoricals.
# s_cat (created earlier with astype('category')) is unordered, so this raises.
try:
    print(s_cat >= 'Medium')
except TypeError as e:
    print(f"Error: {e}")
    print("This happens because the Categorical was not ordered.")
# Try again with the ordered version: find all ratings that are 'Medium' or higher
print("\nRatings >= 'Medium':")
print(s_ordered >= 'Medium')

Output:

Error: Unordered Categoricals can only compare equality or not
This happens because the Categorical was not ordered.

Ratings >= 'Medium':
0    False
1     True
2     True
3    False
4     True
5     True
dtype: bool

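If you tried this with a standard string Series, the comparison would run but compare the strings lexicographically, so 'High' >= 'Medium' would be False. An unordered Categorical raises a TypeError instead. With an ordered Categorical, the comparison follows the order you declared.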
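
As a small sketch with made-up ratings, the same ordering also drives filtering and min()/max(), and the contrast with plain strings is easy to see:

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(['Low', 'Medium', 'High'], ordered=True)
ratings = pd.Series(['Low', 'High', 'Medium', 'Low'], dtype=rating_dtype)

# Boolean mask from an ordered comparison, used to filter rows
at_least_medium = ratings[ratings >= 'Medium'].tolist()

# min() and max() follow the category order, not the alphabet
lowest, highest = ratings.min(), ratings.max()

# The same comparison on plain strings is lexicographic: 'High' < 'Medium'
plain = pd.Series(['Low', 'High', 'Medium', 'Low'])
lexicographic = (plain >= 'Medium').tolist()

print(at_least_medium, lowest, highest)
print(lexicographic)
```

On the plain strings, only 'Medium' itself passes the >= 'Medium' test, which is why string comparison is a poor fit for ordinal data.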

Memory Usage Comparison

Let's see the memory savings.

# Create a large Series of repeated strings
large_string_series = pd.Series(['USA'] * 1_000_000 + ['Canada'] * 500_000 + ['Mexico'] * 500_000)
# Convert it to a categorical Series
large_cat_series = large_string_series.astype('category')
print(f"Memory usage of string Series: {large_string_series.memory_usage(deep=True) / 1024**2:.2f} MB")
print(f"Memory usage of categorical Series: {large_cat_series.memory_usage(deep=True) / 1024**2:.2f} MB")

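Output (approximate; these figures assume object-dtype strings on 64-bit CPython and will differ with the Python version, the pandas version, and the string storage backend):

Memory usage of string Series: roughly 100-120 MB
Memory usage of categorical Series: roughly 1.9 MB

With only three categories, each row needs just a one-byte integer code, so the categorical version uses a small fraction of the memory.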
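
The reason for the saving can be inspected directly. This sketch uses a smaller, made-up Series so it runs instantly; it looks at the integer type pandas picks for the codes and how many bytes the codes occupy.

```python
import pandas as pd

countries = pd.Series(['USA', 'Canada', 'Mexico'] * 1000, dtype='category')

codes = countries.cat.codes

# pandas picks the smallest signed integer type that can hold all codes
print(codes.dtype)

# Bytes used by the codes alone: one byte per row with int8
print(codes.to_numpy().nbytes)

# The labels themselves are stored only once
print(len(countries.cat.categories))
```

With three categories the codes fit in int8, so 3,000 rows need 3,000 bytes of codes plus one copy of each label.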

Key Methods and Attributes

When you have a Categorical Series, you get access to some useful attributes and methods:

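.cat.categories: An Index holding the categories, in their declared order.
.cat.ordered: A boolean indicating if the categories are ordered.
.cat.codes: The integer code for each value, with -1 marking a missing value. The codes can feed machine learning models, although for unordered categories they imply an order that does not exist.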

print("Categories:", s_ordered.cat.categories)
print("Is ordered?", s_ordered.cat.ordered)
print("Codes:", s_ordered.cat.codes.to_numpy())  # .cat.codes is a Series; convert to an array for compact output

Output:

Categories: Index(['Low', 'Medium', 'High'], dtype='object')
Is ordered? True
Codes: [0 2 1 0 1 2]
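
One detail the codes hide is how missing values are represented. In this sketch on made-up data, a missing entry gets the code -1, and pd.Categorical.from_codes() maps codes back to labels.

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(['Low', 'Medium', 'High'], ordered=True)
ratings = pd.Series(['Low', None, 'High'], dtype=rating_dtype)

# Missing values are encoded as -1, not as a regular category
codes = ratings.cat.codes.tolist()
print(codes)

# Codes plus the dtype are enough to rebuild the labels
restored = pd.Categorical.from_codes(codes, dtype=rating_dtype)
print(restored.tolist())
```

If the codes are passed to a model, the -1 entries usually need explicit handling, since they do not correspond to any category.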

Common Operations

Adding New Categories

What if you encounter a category that wasn't in your original list?

# Assigning a value that is not a declared category raises an error.
# Recent pandas releases raise TypeError; older ones raised ValueError.
try:
    s_ordered[0] = 'Very High'
except (TypeError, ValueError) as e:
    print(f"Error: {e}")
# You can add new categories first
s_ordered = s_ordered.cat.add_categories(['Very High'])
print("\nAfter adding 'Very High':")
print(s_ordered.cat.categories)
# Now you can assign it
s_ordered[0] = 'Very High'
print("\nSeries after assignment:")
print(s_ordered)
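
Note that add_categories() appends the new category at the end, so in an ordered categorical it becomes the highest value. A minimal sketch follows, with made-up data; 'Upper Medium' is a hypothetical category used to show how .cat.set_categories() can restate the full order when the new category belongs in the middle.

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(['Low', 'Medium', 'High'], ordered=True)
ratings = pd.Series(['Low', 'High', 'Medium'], dtype=rating_dtype)

# New categories are appended, so 'Very High' ranks above 'High'
extended = ratings.cat.add_categories(['Very High'])
print(extended.cat.categories.tolist())

# To place a category in the middle, restate the whole order
with_mid = ratings.cat.set_categories(
    ['Low', 'Medium', 'Upper Medium', 'High'], ordered=True
)
print(with_mid.cat.categories.tolist())
```

Existing values are unchanged in both cases; only the set of allowed categories, and their order, differs.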

Removing Categories

# Remove a category
s_ordered = s_ordered.cat.remove_categories(['Very High'])
print("\nAfter removing 'Very High':")
print(s_ordered.cat.categories)
# The value that was 'Very High' becomes NaN
print("\nSeries after removal:")
print(s_ordered)
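
A related situation is a category that no longer appears in the data, for example after filtering rows. This sketch, on made-up sizes, shows that the category list is not shrunk automatically and that .cat.remove_unused_categories() drops the leftovers.

```python
import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'M'], dtype='category')

# Filtering rows does not shrink the category list
small_only = sizes[sizes != 'L']
print(small_only.cat.categories.tolist())

# value_counts still reports the unused category, with a count of 0
counts = small_only.value_counts()
print(counts['L'])

# Drop categories that no longer occur in the data
trimmed = small_only.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())
```

Unlike remove_categories(), this never turns existing values into NaN, because it only removes categories that have no rows.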

When to Use Categorical Data?

DO use it for:

Low-cardinality string columns: Columns with a small, fixed number of unique values (e.g., country codes, state names, product colors).
Ordinal data: Data with a clear, logical order (e.g., 'bad', 'good', 'excellent'; 'small', 'large', 'x-large').

DON'T use it for:

High-cardinality strings: Columns with thousands or millions of unique values (e.g., user IDs, full names, product SKUs). The memory savings will be negligible or non-existent, and it can even slow things down.
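Text data: Free-form text with mostly unique values gains nothing from the category dtype. String methods such as .str.contains() do still work when the categories are strings, but methods that return strings, such as .str.upper(), give back an ordinary (non-categorical) Series.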
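
Both cautions can be checked with a short sketch. The user IDs and colors are made-up sample data; the comparison relies on memory_usage(deep=True), the same measure used earlier.

```python
import pandas as pd

# Hypothetical user IDs: every value is unique
user_ids = pd.Series([f'user_{i:06d}' for i in range(10_000)])
as_category = user_ids.astype('category')

plain_bytes = user_ids.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# With all-unique values, the categorical stores every label plus the codes
print(category_bytes > plain_bytes)

# The .str accessor works when the categories are strings
colors = pd.Series(['red', 'green', 'red'], dtype='category')
has_ee = colors.str.contains('ee')
print(has_ee.tolist())
```

In the all-unique case the categorical has to keep every label once and an integer code per row on top, so nothing is saved.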

Summary

Feature	Description	Benefit
Data Type	`pandas.Categorical`	Represents fixed, discrete categories.
Memory	Stores integer codes, not full strings.	Massive memory savings for low-cardinality data.
Performance	Faster sorting and grouping.	Quicker data analysis.
Ordering	Can be `ordered` or `unordered`.	Enables logical comparisons (`>`, `<`) for ordinal data.
Use Cases	Gender, days of the week, satisfaction levels, sizes.	Makes data analysis more robust and efficient.
