
What are the core methods of data preprocessing in Python?

"Data preprocessing in Python" is a broad topic, but it usually refers to preparing data for a machine learning model or a data analysis task. This is a critical step: the quality of your data directly impacts the performance of your models.


Here’s a comprehensive guide to data preprocessing in Python, broken down into key concepts, essential libraries, and a practical step-by-step example.


The Goal of Data Preprocessing

The main objectives are to:

  • Clean Data: Handle missing values, errors, and inconsistencies.
  • Transform Data: Convert data into a format that algorithms can understand (e.g., numbers instead of text).
  • Reduce Data: Remove irrelevant features or reduce dimensionality.
  • Scale Data: Normalize or standardize numerical features to ensure they contribute equally to the model.

Essential Python Libraries

You'll primarily use a combination of these libraries:

  • Pandas: The core library for loading, manipulating, and exploring data in a DataFrame format.
  • NumPy: The fundamental package for numerical computation in Python; Pandas is built on top of it.
  • Scikit-learn (sklearn): The go-to library for machine learning in Python, with tools for almost every preprocessing task.
  • Seaborn / Matplotlib: Data visualization libraries, crucial for understanding your data before and after preprocessing.

The Preprocessing Workflow: A Step-by-Step Guide

Let's walk through a typical preprocessing pipeline. We'll use a sample dataset.


Step 0: Setup and Data Loading

First, install the necessary libraries if you haven't already:

pip install pandas numpy scikit-learn seaborn matplotlib

Now, let's import them and create a sample DataFrame with common data quality issues.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Create a sample DataFrame with real-world data issues
data = {
    'age': [25, 45, 35, 50, 23, 32, np.nan, 40],
    'salary': [50000, 80000, 60000, 120000, 45000, 55000, 90000, np.nan],
    'city': ['New York', 'London', 'New York', 'Paris', 'London', 'Paris', 'New York', 'Tokyo'],
    'purchased': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
    age   salary     city purchased
0  25.0  50000.0  New York       Yes
1  45.0  80000.0    London        No
2  35.0  60000.0  New York       Yes
3  50.0 120000.0     Paris        No
4  23.0  45000.0    London       Yes
5  32.0  55000.0     Paris        No
6   NaN  90000.0  New York       Yes
7  40.0      NaN     Tokyo        No

Step 1: Handling Missing Data

Real-world datasets often have missing values (NaN, None, empty strings). You can either remove them or fill them in (imputation).
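Before deciding between dropping and imputing, it helps to quantify how much is actually missing. A quick sketch on the same numerical columns as the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 45, 35, 50, 23, 32, np.nan, 40],
    'salary': [50000, 80000, 60000, 120000, 45000, 55000, 90000, np.nan],
})

# Count of missing values per column
print(df.isna().sum())
# Fraction of missing values per column (0.125 = 1 of 8 rows)
print(df.isna().mean())
```

If only a tiny fraction of rows is affected, dropping can be fine; with larger fractions, imputation usually preserves more information.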


Option A: Removing Rows/Columns

# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_rows)
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:")
print(df_dropped_cols)

Option B: Imputation (Filling Missing Values)

This is often preferred, since dropping data discards information. You can fill with the mean, median, mode, or a constant.

# Impute numerical columns (assignment avoids pandas' chained-assignment
# pitfalls with inplace=True on a column)
df['age'] = df['age'].fillna(df['age'].mean())
df['salary'] = df['salary'].fillna(df['salary'].median())  # median is less sensitive to outliers
print("\nDataFrame after imputing missing values:")
print(df)

Output:

DataFrame after imputing missing values:
         age    salary      city purchased
0  25.000000   50000.0  New York       Yes
1  45.000000   80000.0    London        No
2  35.000000   60000.0  New York       Yes
3  50.000000  120000.0     Paris        No
4  23.000000   45000.0    London       Yes
5  32.000000   55000.0     Paris        No
6  35.714286   90000.0  New York       Yes
7  40.000000   60000.0     Tokyo        No

Note: The sklearn.impute.SimpleImputer is a more robust, scikit-learn compatible way to do this, which we'll see in the pipeline.
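As a standalone illustration (a sketch, not part of the original example), SimpleImputer learns per-column statistics in fit() and applies them in transform():

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [45.0, 80000.0],
              [np.nan, 60000.0],
              [40.0, np.nan]])

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Each NaN is replaced by its column's median, learned during fit
print(X_imputed)
```

Because the statistics are learned in fit(), the same imputer can later be applied to test data without recomputing them, which is exactly what prevents leakage inside a pipeline.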


Step 2: Encoding Categorical Variables

Machine learning models work with numbers, not text. We need to convert categorical columns (city, purchased) into numerical format.

A) Label Encoding: For Ordinal Data

Use this when the categories have a natural order (e.g., 'Low', 'Medium', 'High'). Note that scikit-learn's LabelEncoder is designed for target labels and assigns integers in alphabetical order; for ordinal feature columns, sklearn.preprocessing.OrdinalEncoder lets you specify the order explicitly.

from sklearn.preprocessing import LabelEncoder
# For the 'purchased' column (Yes/No)
le = LabelEncoder()
df['purchased_encoded'] = le.fit_transform(df['purchased'])
print("\nDataFrame after Label Encoding 'purchased':")
print(df[['purchased', 'purchased_encoded']])

Output:

DataFrame after Label Encoding 'purchased':
  purchased  purchased_encoded
0       Yes                  1
1        No                  0
2       Yes                  1
3        No                  0
4       Yes                  1
5        No                  0
6       Yes                  1
7        No                  0

(Note: Yes -> 1, No -> 0)
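For genuinely ordinal features, OrdinalEncoder lets you pin the category order rather than relying on alphabetical sorting. A short sketch using a hypothetical 'size' column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['Low', 'High', 'Medium', 'Low']})

# categories= pins the mapping: Low -> 0, Medium -> 1, High -> 2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
sizes['size_encoded'] = encoder.fit_transform(sizes[['size']])
print(sizes)
```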

B) One-Hot Encoding: For Nominal Data

Use this when there is no order among categories (e.g., 'New York', 'London', 'Paris'). It creates a new binary column for each category.

# One-hot encode 'city'; drop the original text column 'purchased',
# which is already captured by 'purchased_encoded'
df_encoded = pd.get_dummies(df.drop(columns=['purchased']), columns=['city'],
                            prefix='city', dtype=int)
print("\nDataFrame after One-Hot Encoding 'city':")
print(df_encoded)

Output:

DataFrame after One-Hot Encoding 'city':
         age    salary  purchased_encoded  city_London  city_New York  city_Paris  city_Tokyo
0  25.000000   50000.0                  1            0              1           0           0
1  45.000000   80000.0                  0            1              0           0           0
2  35.000000   60000.0                  1            0              1           0           0
3  50.000000  120000.0                  0            0              0           1           0
4  23.000000   45000.0                  1            1              0           0           0
5  32.000000   55000.0                  0            0              0           1           0
6  35.714286   90000.0                  1            0              1           0           0
7  40.000000   60000.0                  0            0              0           0           1

Step 3: Feature Scaling

Many algorithms (like SVM, K-Nearest Neighbors, and Neural Networks) are sensitive to the scale of features. If one feature has a much larger range than others, it will dominate the model's learning process.

A) Standardization (Z-score Normalization)

Rescales features to have a mean of 0 and a standard deviation of 1. It works well when the data follows a roughly Gaussian distribution.

from sklearn.preprocessing import StandardScaler
# Select numerical features to scale
numerical_features = ['age', 'salary']
scaler = StandardScaler()
# Fit and transform the data
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
print("\nDataFrame after Standardization:")
print(df_encoded)

Output:

DataFrame after Standardization:
        age    salary  purchased_encoded  city_London  city_New York  city_Paris  city_Tokyo
0 -1.237769 -0.847998                  1            0              1           0           0
1  1.072733  0.423999                  0            1              0           0           0
2 -0.082518 -0.423999                  1            0              1           0           0
3  1.650358  2.119996                  0            0              0           1           0
4 -1.468819 -1.059998                  1            1              0           0           0
5 -0.429093 -0.635999                  0            0              0           1           0
6  0.000000  0.847998                  1            0              1           0           0
7  0.495107 -0.423999                  0            0              0           0           1

B) Normalization (Min-Max Scaling)

Rescales features to a range between 0 and 1. Useful when the data has no extreme outliers and you want features bounded to a fixed range.

from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
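A standalone sketch of min-max scaling on fresh data (df_encoded above has already been standardized, so re-scaling it in place would be misleading):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[25.0], [45.0], [35.0], [50.0]])

scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scaler.fit_transform(ages)

# Each value becomes (x - min) / (max - min): 25 -> 0.0, 50 -> 1.0
print(scaled.ravel())
```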

The Best Practice: Using a Pipeline

Doing all these steps manually is error-prone and can lead to data leakage (using information from the test set to preprocess the training set). The best practice is to use Scikit-learn's Pipeline and ColumnTransformer.

This ensures that transformations (like imputation and scaling) are learned from the training data and applied consistently to the test data.

Here's how to build a complete preprocessing pipeline:

# Let's start with the original, messy data again
df_original = pd.DataFrame(data)
# Define features and target
X = df_original.drop('purchased', axis=1)
y = df_original['purchased']
# Identify column types
numerical_features = ['age', 'salary']
categorical_features = ['city']
# Create preprocessing pipelines for each type of data
# 1. Numerical Pipeline: Impute missing values with median, then standardize
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# 2. Categorical Pipeline: Impute missing values with most frequent, then one-hot encode
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])
# Now, you can use this preprocessor on your data
X_processed = preprocessor.fit_transform(X)
# The result is a NumPy array. You can convert it back to a DataFrame if needed.
# Get feature names from the preprocessor
feature_names = preprocessor.get_feature_names_out()
X_processed_df = pd.DataFrame(X_processed, columns=feature_names)
print("\nFinal Processed DataFrame using a Pipeline:")
print(X_processed_df)

Output of the Pipeline:

Final Processed DataFrame using a Pipeline:
   num__age  num__salary  cat__city_London  cat__city_New York  cat__city_Paris  cat__city_Tokyo
0 -1.226997    -0.847998               0.0                 1.0              0.0             0.0
1  1.082645     0.423999               1.0                 0.0              0.0             0.0
2 -0.072176    -0.423999               0.0                 1.0              0.0             0.0
3  1.660055     2.119996               0.0                 0.0              1.0             0.0
4 -1.457961    -1.059998               1.0                 0.0              0.0             0.0
5 -0.418623    -0.635999               0.0                 0.0              1.0             0.0
6 -0.072176     0.847998               0.0                 1.0              0.0             0.0
7  0.505234    -0.423999               0.0                 0.0              0.0             1.0

Notice how the pipeline automatically handled the missing age and salary values and the categorical city column in one clean, integrated step.
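In practice, the preprocessor is usually combined with an estimator in one Pipeline and fit only on the training split, so the test split never leaks into the learned statistics. A sketch using a logistic regression (the model choice here is illustrative, not part of the original example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age': [25, 45, 35, 50, 23, 32, np.nan, 40],
    'salary': [50000, 80000, 60000, 120000, 45000, 55000, 90000, np.nan],
    'city': ['New York', 'London', 'New York', 'Paris', 'London', 'Paris', 'New York', 'Tokyo'],
    'purchased': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
})
X, y = df.drop('purchased', axis=1), df['purchased']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

# Preprocessing + model in one object: fit() learns imputation and
# scaling statistics from the training split only
model = Pipeline(steps=[('prep', preprocessor), ('clf', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Because fit() is called only on X_train, the imputation medians and scaling parameters applied to X_test come exclusively from the training data.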


Summary of Key Concepts

  • Handling Missing Data: Prevents errors and model bias. Tools: df.dropna(), df.fillna(), sklearn.impute.SimpleImputer
  • Encoding Categorical Data: Converts text to numbers for algorithms. Tools: sklearn.preprocessing.LabelEncoder, pd.get_dummies(), sklearn.preprocessing.OneHotEncoder
  • Feature Scaling: Ensures features contribute equally to the model. Tools: sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler
  • Creating a Pipeline: Prevents data leakage and streamlines the workflow. Tools: sklearn.pipeline.Pipeline, sklearn.compose.ColumnTransformer

By following this structured approach, you can robustly and efficiently preprocess any dataset in Python, preparing it for high-performance machine learning.
