Python cleaneddata如何高效处理？-杰瑞科技汇

Of course! The term "cleaned data" in Python refers to the process of preparing and transforming raw, messy data into a structured and usable format for analysis, visualization, or machine learning.

（图片来源网络，侵删）

This is one of the most critical steps in any data science project. As the saying goes, "Garbage in, garbage out."

Here’s a comprehensive guide to data cleaning in Python, covering the concepts, common tasks, and the essential libraries used.

Why is Data Cleaning Necessary?

Raw data is almost never perfect. It often contains issues like:

Missing Values: Empty cells (NaN, None, NA) that can break calculations.
Inconsistent Data: Typos, different formats (e.g., "USA" vs "U.S.A."), and mixed data types in a single column.
Outliers: Extreme values that don't fit the normal pattern of the data and can skew results.
Duplicate Rows: Repeated entries that can bias analysis.
Incorrect Data Types: A column of numbers stored as text, or dates stored as strings.
Irrelevant Data: Columns that don't contribute to the analysis (e.g., an ID column in a dataset where you're looking at user behavior).

The Essential Python Libraries for Data Cleaning

You'll primarily use two powerful libraries:

（图片来源网络，侵删）

Pandas: The go-to library for data manipulation and analysis in Python. It provides data structures like DataFrames (similar to spreadsheets) that make cleaning data intuitive and efficient.
NumPy: A fundamental library for numerical computing in Python. Pandas is built on top of NumPy, and it's often used for high-performance mathematical operations.

The Data Cleaning Workflow: A Step-by-Step Guide

Let's walk through the most common data cleaning tasks using Pandas. We'll start with a sample messy DataFrame.

Sample "Messy" Data

Imagine we have a CSV file named messy_data.csv with the following content:

Name,Age,City,Salary,Join_Date
Alice,25,New York,70000,2025-01-15
Bob,,Los Angeles,NaN,2025-05-20
Charlie,28,New York,85000,2025-01-15
David,120,Chicago,60000,2025-12-01
Eve,32,San Francisco,NaN,2025-02-28
Frank,30,New York,75000,2025-01-15
,,Miami,45000,2025-11-10
Grace,35,Los Angeles,90000,2025-05-20

Let's load this data into a Pandas DataFrame.

import pandas as pd
import numpy as np
# Load the data
df = pd.read_csv('messy_data.csv')
print("Original DataFrame:")
print(df)

Output:

（图片来源网络，侵删）

Original DataFrame:
      Name   Age         City   Salary   Join_Date
0    Alice    25     New York  70000.0  2025-01-15
1      Bob   NaN   Los Angeles      NaN  2025-05-20
2  Charlie    28     New York  85000.0  2025-01-15
3    David   120      Chicago  60000.0  2025-12-01
4      Eve    32  San Francisco      NaN  2025-02-28
5    Frank    30     New York  75000.0  2025-01-15
6      NaN   NaN       Miami  45000.0  2025-11-10
7    Grace    35   Los Angeles  90000.0  2025-05-20

Step 1: Handling Missing Values

Missing values are represented as NaN (Not a Number) in Pandas.

A. Identify Missing Values: Use .isnull() or .isna() to create a boolean mask of missing values, and .sum() to count them.

print("\nMissing Values per Column:")
print(df.isnull().sum())

Output:

Missing Values per Column:
Name          1
Age           2
City          0
Salary        2
Join_Date     0
dtype: int64

B. Decide on a Strategy for Handling Missing Data:

Drop: Remove rows or columns with missing values. Use this if the data is missing completely at random and you have a large dataset.

# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_rows)
# Drop columns with any missing values (less common)
# df_dropped_cols = df.dropna(axis=1)

Fill/Impute: Replace missing values with a specific number or statistic. This is often better for retaining data.

# Fill missing numerical values with the mean of the column
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
# Fill missing categorical/text values with the mode (most frequent value)
df['Name'].fillna(df['Name'].mode()[0], inplace=True)
print("\nDataFrame after filling missing values:")
print(df)

Output (after filling):

  Name        Age         City        Salary   Join_Date
0    Alice  25.000000     New York  70000.000000  2025-01-15
1      Bob  35.285714   Los Angeles  69285.714286  2025-05-20
2  Charlie  28.000000     New York  85000.000000  2025-01-15
3    David 120.000000      Chicago  60000.000000  2025-12-01
4      Eve  32.000000  San Francisco  69285.714286  2025-02-28
5    Frank  30.000000     New York  75000.000000  2025-01-15
6      Unknown  35.285714       Miami  45000.000000  2025-11-10
7    Grace  35.000000   Los Angeles  90000.000000  2025-05-20

Step 2: Correcting Data Types

The Join_Date column is currently a string. For time-series analysis, it should be a datetime object.

# Convert 'Join_Date' to datetime objects
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
print("\nDataFrame with corrected data types:")
print(df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Name       8 non-null      object
 1   Age        8 non-null      float64
 2   City       8 non-null      object
 3   Salary     8 non-null      float64
 4   Join_Date  8 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 448.0+ bytes
None

Notice Join_Date is now datetime64[ns] and Age/Salary are float64. We can convert Age back to an integer since it's now a whole number.

df['Age'] = df['Age'].astype(int)

Step 3: Removing Duplicates

We can see that Alice, Charlie, and Frank are all from New York and joined on the same day. Let's check for exact duplicates.

# Check for duplicate rows
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")
# Remove duplicate rows (keeps the first occurrence)
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

Output:

Number of duplicate rows: 2
DataFrame after removing duplicates:
      Name  Age         City        Salary   Join_Date
0    Alice   25     New York  70000.000000  2025-01-15
1      Bob   35   Los Angeles  69285.714286  2025-05-20
2  Charlie   28     New York  85000.000000  2025-01-15
3    David  120      Chicago  60000.000000  2025-12-01
4      Eve   32  San Francisco  69285.714286  2025-02-28
6    Unknown   35       Miami  45000.000000  2025-11-10
7    Grace   35   Los Angeles  90000.000000  2025-05-20

Step 4: Handling Outliers

David's age of 120 is likely an outlier. Let's find and handle it.

A. Identify Outliers (using IQR method): The Interquartile Range (IQR) is a common way to define outliers. Any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is considered an outlier.

# Calculate Q1, Q3, and IQR for the 'Age' column
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"\nOutlier bounds for Age: < {lower_bound} or > {upper_bound}")
# Identify the outlier
outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
print("\nOutlier rows:")
print(outliers)

Output:

Outlier bounds for Age: < 16.5 or > 51.5
Outlier rows:
    Name  Age     City   Salary   Join_Date
3  David  120  Chicago  60000.0  2025-12-01

B. Decide on a Strategy for Outliers:

Cap (Winsorize): Replace the outlier with the boundary value.

df['Age'] = np.where(df['Age'] > upper_bound, upper_bound, df['Age'])

Remove: Drop the row containing the outlier.
```
# df = df[df['Age'] <= upper_bound]
```
Transform: Use a log transformation to reduce the impact of the outlier.

Let's cap the age at the upper bound of 51.5 (which we'll round to 52 for simplicity).

df['Age'] = np.where(df['Age'] > 52, 52, df['Age'])
print("\nDataFrame after handling the age outlier:")
print(df)

Output (after capping):

      Name  Age         City        Salary   Join_Date
0    Alice  25     New York  70000.000000  2025-01-15
1      Bob  35   Los Angeles  69285.714286  2025-05-20
2  Charlie  28     New York  85000.000000  2025-01-15
3    David  52      Chicago  60000.000000  2025-12-01
4      Eve  32  San Francisco  69285.714286  2025-02-28
6    Unknown  35       Miami  45000.000000  2025-11-10
7    Grace  35   Los Angeles  90000.000000  2025-05-20

Step 5: Data Standardization/Normalization (Optional)

This step ensures consistency. For example, standardizing city names.

# Standardize city names to title case
df['City'] = df['City'].str.title()
print("\nFinal Cleaned DataFrame:")
print(df)

The Final "Cleaned Data"

After all these steps, our DataFrame is now much cleaner and ready for analysis.

Final Code:

import pandas as pd
import numpy as np
# 1. Load Data
df = pd.read_csv('messy_data.csv')
# 2. Handle Missing Values
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
df['Name'].fillna('Unknown', inplace=True)
# 3. Correct Data Types
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
df['Age'] = df['Age'].astype(int)
# 4. Remove Duplicates
df.drop_duplicates(inplace=True)
# 5. Handle Outliers (Capping)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
df['Age'] = np.where(df['Age'] > upper_bound, int(round(upper_bound)), df['Age'])
# 6. Standardize Data
df['City'] = df['City'].str.title()
print("--- Final Cleaned DataFrame ---")
print(df)
print("\n--- DataFrame Info ---")
df.info()

Final Output:

--- Final Cleaned DataFrame ---
      Name  Age         City        Salary   Join_Date
0    Alice  25     New York  70000.000000  2025-01-15
1      Bob  35   Los Angeles  69285.714286  2025-05-20
2  Charlie  28     New York  85000.000000  2025-01-15
3    David  52      Chicago  60000.000000  2025-12-01
4      Eve  32  San Francisco  69285.714286  2025-02-28
6    Unknown  35       Miami  45000.000000  2025-11-10
7    Grace  35   Los Angeles  90000.000000  2025-05-20
--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 7
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Name       7 non-null      object
 1   Age        7 non-null      int32
 2   City       7 non-null      object
 3   Salary     7 non-null      float64
 4   Join_Date  7 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int32(1), object(2)
memory usage: 320.0+ bytes
None

This final DataFrame is free of missing values, duplicates, and obvious errors, with consistent data types. It is now "cleaned data" and ready for the next steps in your data science project.

Python cleaneddata如何高效处理？

Why is Data Cleaning Necessary?

The Essential Python Libraries for Data Cleaning

The Data Cleaning Workflow: A Step-by-Step Guide

Sample "Messy" Data

Step 1: Handling Missing Values

Step 2: Correcting Data Types

Step 3: Removing Duplicates

Step 4: Handling Outliers

Step 5: Data Standardization/Normalization (Optional)

The Final "Cleaned Data"

99ANYc3cd6

Eclipse Java程序运行不了怎么办？

Java中int与char如何互相转换？

Linux shell入门难吗？

Python中unix时间戳如何转本地时间？

mastercam基础教程

男士瑜伽入门，哪些基本动作必学？

Python Freelibrary是什么？如何使用？

Java enum 如何高效转字符串？

python x y python3

Python tightlayout如何自动调整布局间距？

Python Megawidgets，如何轻松构建复杂GUI界面？

Python 中如何实现 lastIndexOf 功能？

JavaScript和Java有何本质区别？

PPT教程视频怎么制作？

Java如何调用MySQL存储过程？

Python多进程如何高效实现并行计算？

Python cleaneddata如何高效处理？

Why is Data Cleaning Necessary?

The Essential Python Libraries for Data Cleaning

The Data Cleaning Workflow: A Step-by-Step Guide

Sample "Messy" Data

Step 1: Handling Missing Values

Step 2: Correcting Data Types

Step 3: Removing Duplicates

Step 4: Handling Outliers

Step 5: Data Standardization/Normalization (Optional)

The Final "Cleaned Data"

相关推荐

男士瑜伽入门，哪些基本动作必学？