杰瑞科技汇

What are the common methods of Python preprocessing?

"Python preprocessing" is a broad topic, but it almost always refers to preparing and cleaning data before feeding it into a machine learning model, an analysis algorithm, or another application.


Let's break down the concept into:

  1. What is Preprocessing? (The "Why")
  2. Common Preprocessing Tasks (The "What")
  3. A Practical, End-to-End Example (The "How")
  4. Popular Python Libraries for Preprocessing

What is Preprocessing? (The "Why")

Raw data is often messy, incomplete, and in a format that algorithms can't understand. Preprocessing is the crucial step of transforming this raw data into a clean, consistent, and usable format.

Analogy: Think of preprocessing as preparing ingredients before cooking. You don't just throw raw carrots, a whole onion, and a dusty potato into a pot. You wash them, peel them, chop them, and maybe even blanch them. Preprocessing is the "chopping and washing" for your data.

Goals of Preprocessing:

  • Handle Missing Values: Deal with empty cells or NaN values.
  • Correct Data Types: Ensure numbers are numeric, dates are date objects, etc.
  • Handle Categorical Data: Convert text labels (like "Red", "Green", "Blue") into numbers.
  • Normalize/Scale Features: Bring different numerical features onto a similar scale.
  • Remove Outliers: Identify and handle extreme values that can skew results.
  • Feature Engineering: Create new, more informative features from existing ones.

Common Preprocessing Tasks (The "What")

Here are the most common tasks you'll perform, with explanations and Python code snippets.

A. Handling Missing Values

Real-world datasets often have missing values. You can't just ignore them.

Methods:

  1. Remove: Drop rows or columns with missing values. (Use with caution, as you can lose a lot of data).
  2. Impute: Fill in the missing values with a statistic (mean, median, mode) or a constant.
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
data = {'Age': [25, 30, 22, np.nan, 35],
        'Salary': [50000, 60000, 45000, np.nan, 80000],
        'Gender': ['M', 'F', 'F', 'M', np.nan]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# --- Method 1: Drop rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows:")
print(df_dropped)
# --- Method 2: Impute missing values
# Impute 'Age' with the mean
# (assignment instead of inplace=True, which is deprecated for column-level fillna)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Impute 'Salary' with the median (more robust to outliers)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Impute 'Gender' with the mode (most frequent value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
print("\nDataFrame after imputation:")
print(df)
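The same imputation can also be done with scikit-learn's SimpleImputer, which is useful when you want to reuse the fitted statistics later (e.g. inside a pipeline). A minimal sketch on the numeric columns of the toy DataFrame above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Age': [25, 30, 22, np.nan, 35],
                   'Salary': [50000, 60000, 45000, np.nan, 80000]})

# fit() learns each column's mean; transform() fills the NaNs with it
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)
```

Because the statistics are stored on the fitted imputer (`imputer.statistics_`), the exact same values can later be used to fill gaps in new data via `imputer.transform()`.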

B. Handling Categorical Data

Machine learning models need numbers, not text strings. You need to encode categorical features.


Methods:

  1. Label Encoding: Assigns a unique integer to each category (e.g., Red=0, Green=1, Blue=2). Use this for ordinal data (where order matters, like "Low", "Medium", "High").
  2. One-Hot Encoding: Creates a new binary (0/1) column for each category. This is the standard approach for nominal data (where order doesn't matter).
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Sample data
df_cat = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# --- Method 1: Label Encoding
le = LabelEncoder()
df_cat['Color_Label'] = le.fit_transform(df_cat['Color'])
print("DataFrame with Label Encoding:")
print(df_cat)
# --- Method 2: One-Hot Encoding (using pandas)
df_one_hot = pd.get_dummies(df_cat['Color'], prefix='Color')
print("\nOne-Hot Encoded DataFrame (using pandas):")
print(df_one_hot)
# --- Method 2: One-Hot Encoding (using scikit-learn)
# Note: scikit-learn's OneHotEncoder is more powerful for complex pipelines
ohe = OneHotEncoder(sparse_output=False)
color_encoded = ohe.fit_transform(df_cat[['Color']])
df_ohe_sklearn = pd.DataFrame(color_encoded, columns=ohe.get_feature_names_out(['Color']))
print("\nOne-Hot Encoded DataFrame (using scikit-learn):")
print(df_ohe_sklearn)
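One caveat for ordinal features: LabelEncoder assigns integers in alphabetical order, which usually does not match the intended ranking (e.g. "High" < "Low" < "Medium" alphabetically). scikit-learn's OrdinalEncoder lets you spell the order out explicitly. A small sketch with a made-up 'Priority' column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_ord = pd.DataFrame({'Priority': ['Low', 'High', 'Medium', 'Low']})

# categories= fixes the mapping explicitly: Low -> 0, Medium -> 1, High -> 2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df_ord['Priority_Code'] = encoder.fit_transform(df_ord[['Priority']]).ravel()
print(df_ord)
```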

C. Feature Scaling

Many algorithms (like SVM, K-Nearest Neighbors, and Neural Networks) are sensitive to the scale of features. A feature with a large range (e.g., salary from 30k to 150k) can dominate a feature with a small range (e.g., age from 20 to 60).

Methods:

  1. Normalization (Min-Max Scaling): Rescales features to a range of [0, 1]. Formula: (x - min) / (max - min).
  2. Standardization (Z-score Scaling): Rescales features to have a mean of 0 and a standard deviation of 1. Formula: (x - mean) / std. It is generally preferred over min-max scaling because it is not pinned to the extreme (min/max) values, making it less sensitive to outliers.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample data with different scales
df_scale = pd.DataFrame({'Age': [25, 30, 22, 35],
                         'Salary': [50000, 60000, 45000, 80000]})
# --- Method 1: Normalization (Min-Max Scaler)
scaler_norm = MinMaxScaler()
df_scaled_norm = pd.DataFrame(scaler_norm.fit_transform(df_scale), columns=df_scale.columns)
print("Normalized Data:")
print(df_scaled_norm)
# --- Method 2: Standardization (Standard Scaler)
scaler_std = StandardScaler()
df_scaled_std = pd.DataFrame(scaler_std.fit_transform(df_scale), columns=df_scale.columns)
print("\nStandardized Data:")
print(df_scaled_std)
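When the data contains strong outliers, even the mean and standard deviation get pulled toward them. A common alternative in that case is scikit-learn's RobustScaler, which centers on the median and scales by the interquartile range. A brief sketch with one extreme salary (a hypothetical toy column, not from the example above):

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# One extreme value skews the mean/std, but barely moves median/IQR
df_out = pd.DataFrame({'Salary': [45000, 50000, 60000, 80000, 500000]})

scaler = RobustScaler()  # per column: (x - median) / IQR
df_out['Salary_Robust'] = scaler.fit_transform(df_out[['Salary']]).ravel()
print(df_out)
```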

A Practical, End-to-End Example

Let's combine these concepts to preprocess a sample dataset.

Scenario: We have data about customers and want to predict if they will churn (leave the service).

# Step 1: Load the data
import pandas as pd
import numpy as np
# Create a realistic-looking dataset
data = {
    'customer_id': range(1, 11),
    'age': [23, 45, 56, 78, 32, 21, 19, 41, 29, 55],
    'tenure': [12, 24, 8, np.nan, 36, 5, 2, 18, 30, 22],
    'monthly_charges': [50.0, 100.0, 75.0, 120.0, 60.0, np.nan, 30.0, 90.0, 85.0, 110.0],
    'total_charges': [600.0, 2400.0, 600.0, np.nan, 2160.0, 150.0, 60.0, 1620.0, 2550.0, 2420.0],
    'contract_type': ['Month-to-month', 'One year', 'Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year'],
    'churn': ['No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No']
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
print("\nData Info:")
df.info()
# Step 2: Handle Missing Values
# For 'tenure', let's fill with the median
df['tenure'] = df['tenure'].fillna(df['tenure'].median())
# For 'monthly_charges', let's fill with the mean
df['monthly_charges'] = df['monthly_charges'].fillna(df['monthly_charges'].mean())
# For 'total_charges', we can reconstruct only the missing entries
# from tenure * monthly_charges (don't overwrite the observed values)
df['total_charges'] = df['total_charges'].fillna(df['tenure'] * df['monthly_charges'])
print("\nData after handling missing values:")
print(df)
# Step 3: Encode Categorical Data
# 'churn' is our target variable (Label Encode)
df['churn'] = df['churn'].apply(lambda x: 1 if x == 'Yes' else 0)
# 'contract_type' is a feature (One-Hot Encode)
df = pd.get_dummies(df, columns=['contract_type'], prefix='contract', drop_first=True) # drop_first to avoid multicollinearity
print("\nData after encoding categorical variables:")
print(df)
# Step 4: Scale Numerical Features
# We'll scale 'age', 'tenure', 'monthly_charges'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_cols = ['age', 'tenure', 'monthly_charges']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
print("\nFinal Preprocessed Data:")
print(df)
# The 'customer_id' is just an identifier and should be dropped for modeling
# The final data is now ready for a machine learning model!

Popular Python Libraries for Preprocessing

  • Pandas: The foundation for data manipulation in Python. Key use cases: loading data (CSV, Excel), handling missing values, basic filtering, encoding with get_dummies.
  • NumPy: The fundamental package for numerical computation in Python; provides the array structure that Pandas is built on. Key use cases: numerical operations.
  • Scikit-learn (sklearn): The go-to library for machine learning in Python. Key use cases: all essential preprocessing tools: SimpleImputer, StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, and powerful Pipeline objects.
  • Feature-engine: A library dedicated to feature engineering. Key use cases: a wide range of transformers for imputation, encoding, scaling, and feature creation, all designed to work seamlessly within Scikit-learn pipelines.
  • PySpark (Spark ML): The library for big data processing with Apache Spark. Key use cases: preprocessing massive datasets that don't fit into a single machine's memory; the API is similar to Scikit-learn but designed for distributed computing.

Best Practices

  1. Never Fit on Test Data: Always split your data into training and testing sets first. Then, fit your preprocessing transformers (like scaler.fit()) only on the training set. Use that fitted transformer to transform both the training and testing sets (scaler.transform()). This prevents "data leakage" from your test set into your model.
  2. Use Pipelines: A Scikit-learn Pipeline chains together preprocessing steps and the model. This makes your code cleaner, prevents mistakes, and ensures that your preprocessing is applied correctly during cross-validation.
  3. Document Your Steps: Keep track of the transformations you apply. This is crucial for reproducibility and for understanding your model's behavior later.
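Best practices 1 and 2 combine naturally in code: split first, then let a Pipeline with a ColumnTransformer fit the scaler and encoder on the training rows only. A hedged sketch on synthetic churn-like data (column names mirror the earlier example; the data itself is made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data standing in for the customer table
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'age': rng.integers(18, 70, 40),
    'monthly_charges': rng.uniform(30, 120, 40),
    'contract_type': rng.choice(['Month-to-month', 'One year', 'Two year'], 40),
})
y = rng.integers(0, 2, 40)

# Split FIRST, so nothing is ever fitted on the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'monthly_charges']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['contract_type']),
])

# fit() runs fit_transform on the training data only;
# score()/predict() apply the already-fitted transformers to the test data
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))
```

Because the transformers live inside the Pipeline, cross-validation (e.g. `cross_val_score(model, X, y)`) automatically re-fits them per fold, which is exactly what prevents leakage.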