Of course! "Preprocess Python" is a broad topic, but it usually refers to preparing data for use in a machine learning model or a data analysis task. This is a critical step, as the quality of your data directly impacts the performance of your models.

Here’s a comprehensive guide to data preprocessing in Python, broken down into key concepts, essential libraries, and a practical step-by-step example.
The Goal of Data Preprocessing
The main objectives are to:
- Clean Data: Handle missing values, errors, and inconsistencies.
- Transform Data: Convert data into a format that algorithms can understand (e.g., numbers instead of text).
- Reduce Data: Remove irrelevant features or reduce dimensionality.
- Scale Data: Normalize or standardize numerical features to ensure they contribute equally to the model.
Essential Python Libraries
You'll primarily use a combination of these libraries:
| Library | Purpose |
|---|---|
| Pandas | The core library for loading, manipulating, and exploring data in a DataFrame format. |
| NumPy | The fundamental package for numerical computation in Python. Pandas is built on top of it. |
| Scikit-learn (sklearn) | The go-to library for machine learning in Python. It provides powerful tools for almost every preprocessing task. |
| Seaborn / Matplotlib | Used for data visualization, which is crucial for understanding your data before and after preprocessing. |
The Preprocessing Workflow: A Step-by-Step Guide
Let's walk through a typical preprocessing pipeline. We'll use a sample dataset.

Step 0: Setup and Data Loading
First, install the necessary libraries if you haven't already:
pip install pandas numpy scikit-learn seaborn matplotlib
Now, let's import them and create a sample DataFrame with common data quality issues.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Create a sample DataFrame with real-world data issues
data = {
'age': [25, 45, 35, 50, 23, 32, np.nan, 40],
'salary': [50000, 80000, 60000, 120000, 45000, 55000, 90000, np.nan],
'city': ['New York', 'London', 'New York', 'Paris', 'London', 'Paris', 'New York', 'Tokyo'],
'purchased': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
age salary city purchased
0 25.0 50000.0 New York Yes
1 45.0 80000.0 London No
2 35.0 60000.0 New York Yes
3 50.0 120000.0 Paris No
4 23.0 45000.0 London Yes
5 32.0 55000.0 Paris No
6 NaN 90000.0 New York Yes
7 40.0 NaN Tokyo No
Step 1: Handling Missing Data
Real-world datasets often have missing values (NaN, None, empty strings). You can either remove them or fill them in (imputation).
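Before deciding between removal and imputation, it helps to quantify how much data is actually missing. A quick, standalone sketch using the same two numeric columns from the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 45, 35, 50, 23, 32, np.nan, 40],
    'salary': [50000, 80000, 60000, 120000, 45000, 55000, 90000, np.nan],
})

# Count of missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Fraction of missing values per column (mean of the boolean mask)
missing_frac = df.isna().mean()
print(missing_frac)
```

If a column is mostly missing, dropping it is often better than imputing; if only a few rows are affected, imputation usually preserves more information.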

Option A: Removing Rows/Columns
# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_rows)
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:")
print(df_dropped_cols)
Option B: Imputation (Filling Missing Values)
Imputation is often preferred, since dropping rows throws away information. Common strategies are filling with the mean, median, mode, or a constant.
# Impute age with the mean and salary with the median
# (assign back instead of using inplace=True on a column slice,
# which triggers pandas' chained-assignment warning)
df['age'] = df['age'].fillna(df['age'].mean())
df['salary'] = df['salary'].fillna(df['salary'].median())  # the median is less sensitive to outliers
print("\nDataFrame after imputing missing values:")
print(df)
Output:
DataFrame after imputing missing values:
age salary city purchased
0 25.0 50000.0 New York Yes
1 45.0 80000.0 London No
2 35.0 60000.0 New York Yes
3 50.0 120000.0 Paris No
4 23.0 45000.0 London Yes
5 32.0 55000.0 Paris No
6 35.714286 90000.0 New York Yes
7 40.0 60000.0 Tokyo No
Note: The sklearn.impute.SimpleImputer is a more robust, scikit-learn compatible way to do this, which we'll see in the pipeline.
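As a preview, here is a minimal standalone sketch of SimpleImputer on the same two numeric columns. Unlike fillna, the fitted imputer stores the statistics it learned, so the exact same values can later be applied to new data with transform:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25, 45, 35, 50, 23, 32, np.nan, 40],
    'salary': [50000, 80000, 60000, 120000, 45000, 55000, 90000, np.nan],
})

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='median')
imputed = imputer.fit_transform(df[['age', 'salary']])

# The per-column fill values learned during fit
print(imputer.statistics_)
```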
Step 2: Encoding Categorical Variables
Machine learning models work with numbers, not text. We need to convert categorical columns (city, purchased) into numerical format.
A) Label Encoding
Label encoding maps each category to an integer. Note that scikit-learn's LabelEncoder is intended for target labels and assigns integers in alphabetical order; for ordinal features with a genuine order (e.g., 'Low' < 'Medium' < 'High'), use OrdinalEncoder with an explicit category order instead.
from sklearn.preprocessing import LabelEncoder
# For the 'purchased' column (Yes/No)
le = LabelEncoder()
df['purchased_encoded'] = le.fit_transform(df['purchased'])
print("\nDataFrame after Label Encoding 'purchased':")
print(df[['purchased', 'purchased_encoded']])
Output:
DataFrame after Label Encoding 'purchased':
purchased purchased_encoded
0 Yes 1
1 No 0
2 Yes 1
3 No 0
4 Yes 1
5 No 0
6 Yes 1
7 No 0
(Note: LabelEncoder sorts classes alphabetically, so No -> 0 and Yes -> 1.)
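A short standalone sketch showing how to inspect the learned mapping and reverse it, which is handy when you need to turn predictions back into labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['Yes', 'No', 'Yes', 'No'])

# Classes are stored sorted alphabetically: 'No' -> 0, 'Yes' -> 1
print(le.classes_)

# inverse_transform recovers the original string labels
print(le.inverse_transform(encoded))
```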
B) One-Hot Encoding: For Nominal Data
Use when the categories have no inherent order (e.g., 'New York', 'London', 'Paris'). It creates a new binary column for each category.
# For the 'city' column (drop the original text 'purchased' column, which we already encoded)
# dtype=int gives 0/1 columns; recent pandas versions default to True/False
df_encoded = pd.get_dummies(df.drop(columns=['purchased']), columns=['city'], prefix='city', dtype=int)
print("\nDataFrame after One-Hot Encoding 'city':")
print(df_encoded)
Output:
DataFrame after One-Hot Encoding 'city':
age salary purchased_encoded city_London city_New York city_Paris city_Tokyo
0 25.0 50000.0 1 0 1 0 0
1 45.0 80000.0 0 1 0 0 0
2 35.0 60000.0 1 0 1 0 0
3 50.0 120000.0 0 0 0 1 0
4 23.0 45000.0 1 1 0 0 0
5 32.0 55000.0 0 0 0 1 0
6 35.714286 90000.0 1 0 1 0 0
7 40.0 60000.0 0 0 0 0 1
Step 3: Feature Scaling
Many algorithms (like SVM, K-Nearest Neighbors, and Neural Networks) are sensitive to the scale of features. If one feature has a much larger range than others, it will dominate the model's learning process.
A) Standardization (Z-score Normalization)
Rescales features to have a mean of 0 and a standard deviation of 1. It works well when the data is roughly Gaussian, and it does not bound values to a fixed range.
from sklearn.preprocessing import StandardScaler
# Select numerical features to scale
numerical_features = ['age', 'salary']
scaler = StandardScaler()
# Fit and transform the data
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
print("\nDataFrame after Standardization:")
print(df_encoded)
Output:
DataFrame after Standardization:
age salary purchased_encoded city_London city_New York city_Paris city_Tokyo
0 -1.237769 -0.847998 1 0 1 0 0
1 1.072733 0.423999 0 1 0 0 0
2 -0.082518 -0.423999 1 0 1 0 0
3 1.650358 2.119996 0 0 0 1 0
4 -1.468819 -1.059998 1 1 0 0 0
5 -0.429093 -0.635999 0 0 0 1 0
6 0.000000 0.847998 1 0 1 0 0
7 0.495107 -0.423999 0 0 0 0 1
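As a sanity check on what StandardScaler actually computes, this small sketch verifies the z-score formula by hand (StandardScaler uses the population standard deviation, i.e. ddof=0):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single numeric column, shaped (n_samples, 1) as sklearn expects
x = np.array([[25.], [45.], [35.], [50.], [23.], [32.], [36.], [40.]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score: (x - mean) / population std
manual = (x - x.mean()) / x.std(ddof=0)

print(np.allclose(scaled, manual))  # the two agree
```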
B) Normalization (Min-Max Scaling) Rescales features to a range between 0 and 1. Useful when you don't have outliers and want to bound your features to a specific range.
from sklearn.preprocessing import MinMaxScaler # scaler = MinMaxScaler() # df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
The Best Practice: Using a Pipeline
Doing all these steps manually is error-prone and can lead to data leakage (using information from the test set to preprocess the training set). The best practice is to use Scikit-learn's Pipeline and ColumnTransformer.
This ensures that transformations (like imputation and scaling) are learned from the training data and applied consistently to the test data.
Here's how to build a complete preprocessing pipeline:
# Let's start with the original, messy data again
df_original = pd.DataFrame(data)
# Define features and target
X = df_original.drop('purchased', axis=1)
y = df_original['purchased']
# Identify column types
numerical_features = ['age', 'salary']
categorical_features = ['city']
# Create preprocessing pipelines for each type of data
# 1. Numerical Pipeline: Impute missing values with median, then standardize
numerical_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# 2. Categorical Pipeline: Impute missing values with most frequent, then one-hot encode
categorical_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_pipeline, numerical_features),
('cat', categorical_pipeline, categorical_features)
])
# Now, you can use this preprocessor on your data
X_processed = preprocessor.fit_transform(X)
# The result is a NumPy array. You can convert it back to a DataFrame if needed.
# Get feature names from the preprocessor
feature_names = preprocessor.get_feature_names_out()
X_processed_df = pd.DataFrame(X_processed, columns=feature_names)
print("\nFinal Processed DataFrame using a Pipeline:")
print(X_processed_df)
Output of the Pipeline:
Final Processed DataFrame using a Pipeline:
num__age num__salary cat__city_London cat__city_New York cat__city_Paris cat__city_Tokyo
0 -1.226997 -0.847998 0.0 1.0 0.0 0.0
1 1.082645 0.423999 1.0 0.0 0.0 0.0
2 -0.072176 -0.423999 0.0 1.0 0.0 0.0
3 1.660055 2.119996 0.0 0.0 1.0 0.0
4 -1.457961 -1.059998 1.0 0.0 0.0 0.0
5 -0.418623 -0.635999 0.0 0.0 1.0 0.0
6 -0.072176 0.847998 0.0 1.0 0.0 0.0
7 0.505234 -0.423999 0.0 0.0 0.0 1.0
Notice how the pipeline automatically handled the missing age and salary values and the categorical city column in one clean, integrated step.
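To see the leakage-prevention benefit in action, the preprocessor can be chained with an estimator and fit on a training split only; all statistics (medians, scaling parameters, category lists) are then learned from training data and merely applied to the test data. A sketch using the same sample data (the choice of LogisticRegression here is illustrative, not part of the guide above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Same sample data as before
data = {
    'age': [25, 45, 35, 50, 23, 32, np.nan, 40],
    'salary': [50000, 80000, 60000, 120000, 45000, 55000, 90000, np.nan],
    'city': ['New York', 'London', 'New York', 'Paris', 'London', 'Paris', 'New York', 'Tokyo'],
    'purchased': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
}
df = pd.DataFrame(data)
X = df.drop('purchased', axis=1)
y = df['purchased']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['age', 'salary']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['city']),
])

# Preprocessing + model in one object: fit() learns everything from the
# training split only, so no test-set information leaks into preprocessing
model = Pipeline(steps=[('preprocess', preprocessor),
                        ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Because the whole chain is one estimator, it also works directly with cross_val_score and GridSearchCV, each fold refitting the preprocessing from scratch.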
Summary of Key Concepts
| Task | Why? | Common Python Tools |
|---|---|---|
| Handling Missing Data | Prevents errors and model bias. | df.dropna(), df.fillna(), sklearn.impute.SimpleImputer |
| Encoding Categorical Data | Converts text to numbers for algorithms. | sklearn.preprocessing.LabelEncoder, pd.get_dummies(), sklearn.preprocessing.OneHotEncoder |
| Feature Scaling | Ensures features contribute equally to the model. | sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler |
| Creating a Pipeline | Prevents data leakage and streamlines the workflow. | sklearn.pipeline.Pipeline, sklearn.compose.ColumnTransformer |
By following this structured approach, you can robustly and efficiently preprocess any dataset in Python, preparing it for high-performance machine learning.
