杰瑞科技汇

What are the common methods of Python preprocessing?

"Python preprocessing" is a broad topic, but it almost always refers to preparing and cleaning data before feeding it into a machine learning model, an analysis algorithm, or another application.


Let's break down the concept into:

  1. What is Preprocessing? (The "Why")
  2. Common Preprocessing Tasks (The "What")
  3. A Practical, End-to-End Example (The "How")
  4. Popular Python Libraries for Preprocessing

What is Preprocessing? (The "Why")

Raw data is often messy, incomplete, and in a format that algorithms can't understand. Preprocessing is the crucial step of transforming this raw data into a clean, consistent, and usable format.

Analogy: Think of preprocessing as preparing ingredients before cooking. You don't just throw raw carrots, a whole onion, and a dusty potato into a pot. You wash them, peel them, chop them, and maybe even blanch them. Preprocessing is the "chopping and washing" for your data.

Goals of Preprocessing:

  • Handle Missing Values: Deal with empty cells or NaN values.
  • Correct Data Types: Ensure numbers are numeric, dates are date objects, etc.
  • Handle Categorical Data: Convert text labels (like "Red", "Green", "Blue") into numbers.
  • Normalize/Scale Features: Bring different numerical features onto a similar scale.
  • Remove Outliers: Identify and handle extreme values that can skew results.
  • Feature Engineering: Create new, more informative features from existing ones.

Common Preprocessing Tasks (The "What")

Here are the most common tasks you'll perform, with explanations and Python code snippets.

A. Handling Missing Values

Real-world datasets often have missing values. You can't just ignore them.

Methods:

  1. Remove: Drop rows or columns with missing values. (Use with caution, as you can lose a lot of data).
  2. Impute: Fill in the missing values with a statistic (mean, median, mode) or a constant.
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
data = {'Age': [25, 30, 22, np.nan, 35],
        'Salary': [50000, 60000, 45000, np.nan, 80000],
        'Gender': ['M', 'F', 'F', 'M', np.nan]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# --- Method 1: Drop rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows:")
print(df_dropped)
# --- Method 2: Impute missing values
# Impute 'Age' with the mean
# (assignment instead of inplace=True, which is deprecated for column-level fillna)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Impute 'Salary' with the median (more robust to outliers)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Impute 'Gender' with the mode (most frequent value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
print("\nDataFrame after imputation:")
print(df)
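The same imputation can also be done with scikit-learn's SimpleImputer, which is useful when you want to reuse the fitted statistics later (e.g. inside a pipeline). A minimal sketch on the numeric columns of the toy DataFrame above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Age': [25, 30, 22, np.nan, 35],
                   'Salary': [50000, 60000, 45000, np.nan, 80000]})

# fit() learns each column's mean; transform() fills the NaNs with it
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)
```

Because the statistics are stored on the fitted imputer (`imputer.statistics_`), the exact same values can later be used to fill gaps in new data via `imputer.transform()`.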

B. Handling Categorical Data

Machine learning models need numbers, not text strings. You need to encode categorical features.


Methods:

  1. Label Encoding: Assigns a unique integer to each category (e.g., Red=0, Green=1, Blue=2). Use this for ordinal data (where order matters, like "Low", "Medium", "High").
  2. One-Hot Encoding: Creates a new binary (0/1) column for each category. This is the standard approach for nominal data (where order doesn't matter).
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Sample data
df_cat = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# --- Method 1: Label Encoding
le = LabelEncoder()
df_cat['Color_Label'] = le.fit_transform(df_cat['Color'])
print("DataFrame with Label Encoding:")
print(df_cat)
# --- Method 2: One-Hot Encoding (using pandas)
df_one_hot = pd.get_dummies(df_cat['Color'], prefix='Color')
print("\nOne-Hot Encoded DataFrame (using pandas):")
print(df_one_hot)
# --- Method 2: One-Hot Encoding (using scikit-learn)
# Note: scikit-learn's OneHotEncoder is more powerful for complex pipelines
ohe = OneHotEncoder(sparse_output=False)
color_encoded = ohe.fit_transform(df_cat[['Color']])
df_ohe_sklearn = pd.DataFrame(color_encoded, columns=ohe.get_feature_names_out(['Color']))
print("\nOne-Hot Encoded DataFrame (using scikit-learn):")
print(df_ohe_sklearn)
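One caveat for ordinal features: LabelEncoder assigns integers in alphabetical order, which usually does not match the intended ranking (e.g. "High" < "Low" < "Medium" alphabetically). scikit-learn's OrdinalEncoder lets you spell the order out explicitly. A small sketch with a made-up 'Priority' column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_ord = pd.DataFrame({'Priority': ['Low', 'High', 'Medium', 'Low']})

# categories= fixes the mapping explicitly: Low -> 0, Medium -> 1, High -> 2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df_ord['Priority_Code'] = encoder.fit_transform(df_ord[['Priority']]).ravel()
print(df_ord)
```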

C. Feature Scaling

Many algorithms (like SVM, K-Nearest Neighbors, and Neural Networks) are sensitive to the scale of features. A feature with a large range (e.g., salary from 30k to 150k) can dominate a feature with a small range (e.g., age from 20 to 60).

Methods:

  1. Normalization (Min-Max Scaling): Rescales features to a range of [0, 1]. Formula: (x - min) / (max - min).
  2. Standardization (Z-score Scaling): Rescales features to have a mean of 0 and a standard deviation of 1. Formula: (x - mean) / std. It is generally preferred over min-max scaling because it is not pinned to the extreme (min/max) values, making it less sensitive to outliers.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample data with different scales
df_scale = pd.DataFrame({'Age': [25, 30, 22, 35],
                         'Salary': [50000, 60000, 45000, 80000]})
# --- Method 1: Normalization (Min-Max Scaler)
scaler_norm = MinMaxScaler()
df_scaled_norm = pd.DataFrame(scaler_norm.fit_transform(df_scale), columns=df_scale.columns)
print("Normalized Data:")
print(df_scaled_norm)
# --- Method 2: Standardization (Standard Scaler)
scaler_std = StandardScaler()
df_scaled_std = pd.DataFrame(scaler_std.fit_transform(df_scale), columns=df_scale.columns)
print("\nStandardized Data:")
print(df_scaled_std)
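When the data contains strong outliers, even the mean and standard deviation get pulled toward them. A common alternative in that case is scikit-learn's RobustScaler, which centers on the median and scales by the interquartile range. A brief sketch with one extreme salary (a hypothetical toy column, not from the example above):

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# One extreme value skews the mean/std, but barely moves median/IQR
df_out = pd.DataFrame({'Salary': [45000, 50000, 60000, 80000, 500000]})

scaler = RobustScaler()  # per column: (x - median) / IQR
df_out['Salary_Robust'] = scaler.fit_transform(df_out[['Salary']]).ravel()
print(df_out)
```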

A Practical, End-to-End Example

Let's combine these concepts to preprocess a sample dataset.

Scenario: We have data about customers and want to predict if they will churn (leave the service).

# Step 1: Load the data
import pandas as pd
import numpy as np
# Create a realistic-looking dataset
data = {
    'customer_id': range(1, 11),
    'age': [23, 45, 56, 78, 32, 21, 19, 41, 29, 55],
    'tenure': [12, 24, 8, np.nan, 36, 5, 2, 18, 30, 22],
    'monthly_charges': [50.0, 100.0, 75.0, 120.0, 60.0, np.nan, 30.0, 90.0, 85.0, 110.0],
    'total_charges': [600.0, 2400.0, 600.0, np.nan, 2160.0, 150.0, 60.0, 1620.0, 2550.0, 2420.0],
    'contract_type': ['Month-to-month', 'One year', 'Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year'],
    'churn': ['No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No']
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
print("\nData Info:")
df.info()
# Step 2: Handle Missing Values
# For 'tenure', let's fill with the median
df['tenure'] = df['tenure'].fillna(df['tenure'].median())
# For 'monthly_charges', let's fill with the mean
df['monthly_charges'] = df['monthly_charges'].fillna(df['monthly_charges'].mean())
# For 'total_charges', we can reconstruct only the missing entries
# from tenure * monthly_charges (don't overwrite the observed values)
df['total_charges'] = df['total_charges'].fillna(df['tenure'] * df['monthly_charges'])
print("\nData after handling missing values:")
print(df)
# Step 3: Encode Categorical Data
# 'churn' is our target variable (Label Encode)
df['churn'] = df['churn'].apply(lambda x: 1 if x == 'Yes' else 0)
# 'contract_type' is a feature (One-Hot Encode)
df = pd.get_dummies(df, columns=['contract_type'], prefix='contract', drop_first=True) # drop_first to avoid multicollinearity
print("\nData after encoding categorical variables:")
print(df)
# Step 4: Scale Numerical Features
# We'll scale 'age', 'tenure', 'monthly_charges'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_cols = ['age', 'tenure', 'monthly_charges']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
print("\nFinal Preprocessed Data:")
print(df)
# The 'customer_id' is just an identifier and should be dropped for modeling
# The final data is now ready for a machine learning model!

Popular Python Libraries for Preprocessing

  • Pandas: The foundation for data manipulation in Python. Key use cases: loading data (CSV, Excel), handling missing values, basic filtering, encoding with get_dummies.
  • NumPy: The fundamental package for numerical computation in Python; provides the array structure that Pandas is built on. Key use cases: numerical operations.
  • Scikit-learn (sklearn): The go-to library for machine learning in Python. Key use cases: all essential preprocessing tools: SimpleImputer, StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, and powerful Pipeline objects.
  • Feature-engine: A library dedicated to feature engineering. Key use cases: a wide range of transformers for imputation, encoding, scaling, and feature creation, all designed to work seamlessly within Scikit-learn pipelines.
  • PySpark (Spark ML): The library for big data processing with Apache Spark. Key use cases: preprocessing massive datasets that don't fit into a single machine's memory; the API is similar to Scikit-learn but designed for distributed computing.

Best Practices

  1. Never Fit on Test Data: Always split your data into training and testing sets first. Then, fit your preprocessing transformers (like scaler.fit()) only on the training set. Use that fitted transformer to transform both the training and testing sets (scaler.transform()). This prevents "data leakage" from your test set into your model.
  2. Use Pipelines: A Scikit-learn Pipeline chains together preprocessing steps and the model. This makes your code cleaner, prevents mistakes, and ensures that your preprocessing is applied correctly during cross-validation.
  3. Document Your Steps: Keep track of the transformations you apply. This is crucial for reproducibility and for understanding your model's behavior later.
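Best practices 1 and 2 combine naturally in code: split first, then let a Pipeline with a ColumnTransformer fit the scaler and encoder on the training rows only. A hedged sketch on synthetic churn-like data (column names mirror the earlier example; the data itself is made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data standing in for the customer table
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'age': rng.integers(18, 70, 40),
    'monthly_charges': rng.uniform(30, 120, 40),
    'contract_type': rng.choice(['Month-to-month', 'One year', 'Two year'], 40),
})
y = rng.integers(0, 2, 40)

# Split FIRST, so nothing is ever fitted on the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'monthly_charges']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['contract_type']),
])

# fit() runs fit_transform on the training data only;
# score()/predict() apply the already-fitted transformers to the test data
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))
```

Because the transformers live inside the Pipeline, cross-validation (e.g. `cross_val_score(model, X, y)`) automatically re-fits them per fold, which is exactly what prevents leakage.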