How to Handle Categorical Data Efficiently in Python

In Python data analysis, "categorical" data refers to a variable that can take on one of a limited, and usually fixed, number of possible values. Examples include:

Gender: Male, Female, Non-binary
Country: USA, Canada, UK, Germany
Education Level: High School, Bachelor's, Master's, PhD
Product Size: S, M, L, XL

These are different from numerical data, which represents quantities.

Handling categorical data correctly is crucial for most data analysis and machine learning tasks. Here’s a comprehensive guide covering the main ways to work with categorical data in Python.

The `pandas.Categorical` Type

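Pandas provides a dedicated data type, Categorical, for storing and manipulating categorical data. It stores each distinct value once and represents the column as small integer codes, so it typically uses less memory than plain strings (the object dtype) and can speed up operations such as grouping, provided the number of distinct values is small relative to the number of rows.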

Creating a `Categorical`

You can create a Categorical object in several ways.

From a list or array:

import pandas as pd
import numpy as np
# Create a Series with string data
data = ['apple', 'orange', 'apple', 'banana', 'orange', 'apple']
s = pd.Series(data, dtype='category')
print(s)
# 0     apple
# 1    orange
# 2     apple
# 3    banana
# 4    orange
# 5     apple
# dtype: category
# Categories (3, object): ['apple', 'banana', 'orange']
# You can also create the Categorical object directly and declare
# categories that do not occur in the data yet (here, 'grape')
cat_data = pd.Categorical(data, categories=['apple', 'banana', 'orange', 'grape'])
print(cat_data)
# ['apple', 'orange', 'apple', 'banana', 'orange', 'apple']
# Categories (4, object): ['apple', 'banana', 'orange', 'grape']

Key Properties of a Categorical Object:

categories: The unique possible values.
ordered: A boolean indicating if the categories have a meaningful order. By default, it's False.

print("Categories:", s.cat.categories)
print("Is ordered?", s.cat.ordered)
# Output:
# Categories: Index(['apple', 'banana', 'orange'], dtype='object')
# Is ordered? False
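
Besides categories and ordered, a Categorical also exposes the integer codes it stores internally. The sketch below uses a made-up fruit list (not from the examples above) to show how each value maps to a position in categories, and how a value outside the declared categories ends up as a missing value with code -1.

```python
import pandas as pd

# 'kiwi' is not in the declared categories, so it is stored as missing.
fruits = pd.Categorical(
    ['apple', 'kiwi', 'banana', 'apple'],
    categories=['apple', 'banana', 'orange'],
)

# Integer codes index into `categories`; -1 marks a missing value.
codes = fruits.codes.tolist()
missing = pd.isna(fruits).tolist()

print(codes)    # [0, -1, 1, 0]
print(missing)  # [False, True, False, False]
```

Because pandas does this silently, it is worth checking for unexpected missing values after casting a column to a fixed set of categories.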

Setting Order

If your categories have a natural order (e.g., sizes, education levels), set ordered=True. Sorting, min/max, and comparison operators then follow the category order you declare rather than alphabetical order.

# Create an ordered categorical
size_data = ['S', 'M', 'L', 'S', 'XL', 'M']
size_cat = pd.Categorical(size_data, categories=['S', 'M', 'L', 'XL'], ordered=True)
print(size_cat)
# ['S', 'M', 'L', 'S', 'XL', 'M']
# Categories (4, object): ['S' < 'M' < 'L' < 'XL']
# Now you can perform comparisons; the result is a NumPy boolean array
print(size_cat > 'M')
# [False False  True False  True False]
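
To illustrate what the declared order changes in practice, here is a small sketch comparing how the same size values sort as plain strings and as an ordered categorical. The variable names are illustrative only.

```python
import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'S', 'XL', 'M'])

# Plain strings sort alphabetically, which puts 'L' before 'M' and 'S'.
alphabetical = sizes.sort_values().tolist()

# An ordered categorical sorts by the declared category order instead.
size_dtype = pd.CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)
ordered_sizes = sizes.astype(size_dtype)
logical = ordered_sizes.sort_values().tolist()

print(alphabetical)  # ['L', 'M', 'M', 'S', 'S', 'XL']
print(logical)       # ['S', 'S', 'M', 'M', 'L', 'XL']
print(ordered_sizes.min(), ordered_sizes.max())  # S XL
```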

Converting Data to Categorical

The most common use case is converting existing columns in a DataFrame.

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'city': ['New York', 'Paris', 'London', 'New York'],
    'age': [25, 30, 35, 28]
})
# Convert the 'city' column to categorical
df['city'] = df['city'].astype('category')
# info() prints its report itself and returns None, so no print() is needed
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 3 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   name    4 non-null      object
#  1   city    4 non-null      category  <-- Now it's a categorical type
#  2   age     4 non-null      int64
# dtypes: category(1), int64(1), object(1)
# memory usage: 240.0+ bytes

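The memory savings grow with the size of the dataset and the number of repeated strings. On a tiny frame like this one the difference is negligible, and a column of mostly unique values may not shrink at all.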
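
One way to check the effect on your own data is to compare memory_usage(deep=True) before and after the conversion. The sketch below assumes a hypothetical column with three city names repeated many times; the exact byte counts depend on your pandas and Python versions, so none are shown.

```python
import pandas as pd

# Hypothetical column: 3 distinct city names repeated across 300,000 rows.
cities = pd.Series(['New York', 'Paris', 'London'] * 100_000)

as_text = cities.astype(object)
as_category = cities.astype('category')

# deep=True counts the Python string objects, not just the pointers to them.
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"text:     {text_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"ratio:    {text_bytes / category_bytes:.1f}x")
```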

Why and How to Encode Categorical Data for Machine Learning

Most machine learning algorithms require numerical input, and most library implementations will not accept raw strings. You therefore need to convert categorical data into a numerical format. This process is called encoding.

Here are the two most common methods:

A) One-Hot Encoding

This is the most widely used method and the safest default, because it imposes no order on the categories. It creates a new binary (0/1) column for each category.

When to use it: For nominal data (categories with no intrinsic order), like City, Gender, or Product Type.
How it works:
- City: New York -> [1, 0, 0]
- City: Paris -> [0, 1, 0]
- City: London -> [0, 0, 1]

Using pandas.get_dummies()

This is the easiest way to perform one-hot encoding.

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue']
})
# Perform one-hot encoding (dtype=int gives 0/1; recent pandas versions default to True/False)
df_encoded = pd.get_dummies(df['color'], prefix='color', dtype=int)
print(df_encoded)
#    color_Blue  color_Green  color_Red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           0            0          1
# 4           1            0          0
# You can combine it back with the original DataFrame
df = pd.concat([df, df_encoded], axis=1)
print(df)
#    color  color_Blue  color_Green  color_Red
# 0    Red           0            0          1
# 1   Blue           1            0          0
# 2  Green           0            1          0
# 3    Red           0            0          1
# 4   Blue           1            0          0
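
One practical caveat: get_dummies only creates columns for the values it sees, so training data and new data can end up with different columns. A possible workaround, sketched here with hypothetical train and new frames, is to fix the categories once with a CategoricalDtype and cast both frames before encoding.

```python
import pandas as pd

train = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red']})
new = pd.DataFrame({'color': ['Blue', 'Blue']})  # 'Red' and 'Green' never appear

# Fix the set of categories once, based on the training data.
color_dtype = pd.CategoricalDtype(categories=sorted(train['color'].unique()))

train_encoded = pd.get_dummies(train['color'].astype(color_dtype), prefix='color', dtype=int)
new_encoded = pd.get_dummies(new['color'].astype(color_dtype), prefix='color', dtype=int)

print(list(new_encoded.columns))    # ['color_Blue', 'color_Green', 'color_Red']
print(new_encoded.values.tolist())  # [[1, 0, 0], [1, 0, 0]]
```

Values in the new data that were not in the training categories become missing and get all-zero rows, which is usually preferable to a column mismatch at prediction time.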

B) Label Encoding

This method assigns a unique integer to each category.

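When to use it: Primarily for ordinal data (categories with a meaningful order), like Education Level or Size, and only when the assigned integers follow that order.
Warning: Do not use this for nominal data. The model might incorrectly interpret the integer values as having an order (e.g., Paris=1 is "less than" London=2), which can lead to poor performance. Note also that LabelEncoder is designed for target labels and numbers the classes in sorted (alphabetical) order, so in the example below L=0, M=1, S=2, XL=3, which does not match S < M < L < XL.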

Using sklearn.preprocessing.LabelEncoder

from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    'size': ['S', 'M', 'L', 'XL', 'M', 'S']
})
# Create a LabelEncoder object
le = LabelEncoder()
# Fit and transform the data
df['size_encoded'] = le.fit_transform(df['size'])
print(df)
#   size  size_encoded
# 0    S             2  # S -> 2
# 1    M             1  # M -> 1
# 2    L             0  # L -> 0
# 3   XL             3  # XL -> 3
# 4    M             1
# 5    S             2
# To see the mapping (classes_ is sorted, and each class's position is its code)
print({label: code for code, label in enumerate(le.classes_)})
# {'L': 0, 'M': 1, 'S': 2, 'XL': 3}
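
Since the alphabetical mapping above does not reflect the real size order, a minimal alternative is to declare the order yourself and take the codes of an ordered categorical. This is a sketch using pandas only; scikit-learn's OrdinalEncoder with an explicit categories argument is another option for feature columns.

```python
import pandas as pd

df = pd.DataFrame({'size': ['S', 'M', 'L', 'XL', 'M', 'S']})

# Declare the real-world order explicitly instead of relying on alphabetical order.
size_order = ['S', 'M', 'L', 'XL']
size_dtype = pd.CategoricalDtype(categories=size_order, ordered=True)

# .cat.codes returns each value's position in `size_order`.
df['size_encoded'] = df['size'].astype(size_dtype).cat.codes

print(df['size_encoded'].tolist())  # [0, 1, 2, 3, 1, 0]
```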

Advanced Encoding Techniques

For high-cardinality categorical features (many unique values), one-hot encoding can create a huge number of columns, which is inefficient.

Target Encoding (Mean Encoding): Replace each category with the mean of the target variable for that category. This is very powerful but can lead to overfitting if not done carefully (e.g., using cross-validation).
Frequency Encoding: Replace each category with its frequency (count) in the dataset.
Embeddings: A deep learning technique where categories are mapped to a low-dimensional dense vector. This is state-of-the-art for handling very high-cardinality features in neural networks.
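
The first two techniques can be sketched in a few lines of pandas. The toy data and column names below are hypothetical, and the target means are computed on the full data only for brevity; in practice they should come from training folds so the target does not leak into the feature.

```python
import pandas as pd

# Hypothetical toy data: `city` is the feature, `bought` is a binary target.
df = pd.DataFrame({
    'city': ['Paris', 'Paris', 'London', 'Paris', 'London', 'Rome'],
    'bought': [1, 0, 1, 1, 0, 1],
})

# Frequency encoding: replace each category with how often it occurs.
counts = df['city'].value_counts()
df['city_freq'] = df['city'].map(counts)

# Target encoding: replace each category with the mean target for that category.
# NOTE: computed on all rows here for brevity, which leaks the target.
means = df.groupby('city')['bought'].mean()
df['city_target'] = df['city'].map(means)

print(df['city_freq'].tolist())             # [3, 3, 2, 3, 2, 1]
print(df['city_target'].round(3).tolist())  # [0.667, 0.667, 0.5, 0.667, 0.5, 1.0]
```

Note how Rome, seen only once, gets a target value of 1.0. Rare categories like this are where target encoding overfits most easily, which is why smoothing or cross-validated means are commonly used.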

Summary: Which Method to Use?

Method	Best For	How to Do It	Pros	Cons
`pandas.Categorical`	Storage, memory efficiency, and sorting in Pandas.	`df['col'].astype('category')`	Saves memory, enables sorting/comparisons.	Not a direct input for ML models.
One-Hot Encoding	Nominal data (no order).	`pd.get_dummies()`	No artificial ordering, widely understood.	Can create many columns (curse of dimensionality).
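Label Encoding	Ordinal data (has a clear order).	An explicit order mapping, e.g. an ordered categorical with `.cat.codes`; `sklearn.preprocessing.LabelEncoder` numbers classes alphabetically.	Simple, keeps the feature in one column.	Can mislead models if used on nominal data or if the integers do not follow the real order.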
Target Encoding	High-cardinality nominal features.	Manual or with libraries like `category_encoders`.	Handles many categories well, captures target relationship.	High risk of overfitting.

Key Takeaway

Identify your categorical data.
Use pandas.Categorical for efficient storage and operations within Pandas.
Encode your data for machine learning:
- Use One-Hot Encoding for categories with no order.
- Use Label Encoding for categories with a clear order, making sure the integers follow that order.
- Consider more advanced methods for features with a very large number of unique values.
