杰瑞科技汇

Python cleaneddata如何高效处理?

Of course! The term "cleaned data" in Python refers to the process of preparing and transforming raw, messy data into a structured and usable format for analysis, visualization, or machine learning.

Python cleaneddata如何高效处理?-图1
(图片来源网络,侵删)

This is one of the most critical steps in any data science project. As the saying goes, "Garbage in, garbage out."

Here’s a comprehensive guide to data cleaning in Python, covering the concepts, common tasks, and the essential libraries used.


Why is Data Cleaning Necessary?

Raw data is almost never perfect. It often contains issues like:

  • Missing Values: Empty cells (NaN, None, NA) that can break calculations.
  • Inconsistent Data: Typos, different formats (e.g., "USA" vs "U.S.A."), and mixed data types in a single column.
  • Outliers: Extreme values that don't fit the normal pattern of the data and can skew results.
  • Duplicate Rows: Repeated entries that can bias analysis.
  • Incorrect Data Types: A column of numbers stored as text, or dates stored as strings.
  • Irrelevant Data: Columns that don't contribute to the analysis (e.g., an ID column in a dataset where you're looking at user behavior).

The Essential Python Libraries for Data Cleaning

You'll primarily use two powerful libraries:

Python cleaneddata如何高效处理?-图2
(图片来源网络,侵删)
  1. Pandas: The go-to library for data manipulation and analysis in Python. It provides data structures like DataFrames (similar to spreadsheets) that make cleaning data intuitive and efficient.
  2. NumPy: A fundamental library for numerical computing in Python. Pandas is built on top of NumPy, and it's often used for high-performance mathematical operations.

The Data Cleaning Workflow: A Step-by-Step Guide

Let's walk through the most common data cleaning tasks using Pandas. We'll start with a sample messy DataFrame.

Sample "Messy" Data

Imagine we have a CSV file named messy_data.csv with the following content:

Name,Age,City,Salary,Join_Date
Alice,25,New York,70000,2025-01-15
Bob,,Los Angeles,NaN,2025-05-20
Charlie,28,New York,85000,2025-01-15
David,120,Chicago,60000,2025-12-01
Eve,32,San Francisco,NaN,2025-02-28
Frank,30,New York,75000,2025-01-15
,,Miami,45000,2025-11-10
Grace,35,Los Angeles,90000,2025-05-20

Let's load this data into a Pandas DataFrame.

import pandas as pd
import numpy as np
# Load the data
df = pd.read_csv('messy_data.csv')
print("Original DataFrame:")
print(df)

Output:

Python cleaneddata如何高效处理?-图3
(图片来源网络,侵删)
Original DataFrame:
      Name   Age         City   Salary   Join_Date
0    Alice    25     New York  70000.0  2025-01-15
1      Bob   NaN   Los Angeles      NaN  2025-05-20
2  Charlie    28     New York  85000.0  2025-01-15
3    David   120      Chicago  60000.0  2025-12-01
4      Eve    32  San Francisco      NaN  2025-02-28
5    Frank    30     New York  75000.0  2025-01-15
6      NaN   NaN       Miami  45000.0  2025-11-10
7    Grace    35   Los Angeles  90000.0  2025-05-20

Step 1: Handling Missing Values

Missing values are represented as NaN (Not a Number) in Pandas.

A. Identify Missing Values: Use .isnull() or .isna() to create a boolean mask of missing values, and .sum() to count them.

print("\nMissing Values per Column:")
print(df.isnull().sum())

Output:

Missing Values per Column:
Name          1
Age           2
City          0
Salary        2
Join_Date     0
dtype: int64

B. Decide on a Strategy for Handling Missing Data:

  • Drop: Remove rows or columns with missing values. Use this if the data is missing completely at random and you have a large dataset.

    # Drop rows with any missing values
    df_dropped_rows = df.dropna()
    print("\nDataFrame after dropping rows with missing values:")
    print(df_dropped_rows)
    # Drop columns with any missing values (less common)
    # df_dropped_cols = df.dropna(axis=1)
  • Fill/Impute: Replace missing values with a specific number or statistic. This is often better for retaining data.

    # Fill missing numerical values with the mean of the column
    df['Age'].fillna(df['Age'].mean(), inplace=True)
    df['Salary'].fillna(df['Salary'].mean(), inplace=True)
    # Fill missing categorical/text values with the mode (most frequent value)
    df['Name'].fillna(df['Name'].mode()[0], inplace=True)
    print("\nDataFrame after filling missing values:")
    print(df)

    Output (after filling):

      Name        Age         City        Salary   Join_Date
    0    Alice  25.000000     New York  70000.000000  2025-01-15
    1      Bob  35.285714   Los Angeles  69285.714286  2025-05-20
    2  Charlie  28.000000     New York  85000.000000  2025-01-15
    3    David 120.000000      Chicago  60000.000000  2025-12-01
    4      Eve  32.000000  San Francisco  69285.714286  2025-02-28
    5    Frank  30.000000     New York  75000.000000  2025-01-15
    6      Unknown  35.285714       Miami  45000.000000  2025-11-10
    7    Grace  35.000000   Los Angeles  90000.000000  2025-05-20

Step 2: Correcting Data Types

The Join_Date column is currently a string. For time-series analysis, it should be a datetime object.

# Convert 'Join_Date' to datetime objects
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
print("\nDataFrame with corrected data types:")
print(df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Name       8 non-null      object
 1   Age        8 non-null      float64
 2   City       8 non-null      object
 3   Salary     8 non-null      float64
 4   Join_Date  8 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 448.0+ bytes
None

Notice Join_Date is now datetime64[ns] and Age/Salary are float64. We can convert Age back to an integer since it's now a whole number.

df['Age'] = df['Age'].astype(int)

Step 3: Removing Duplicates

We can see that Alice, Charlie, and Frank are all from New York and joined on the same day. Let's check for exact duplicates.

# Check for duplicate rows
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")
# Remove duplicate rows (keeps the first occurrence)
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

Output:

Number of duplicate rows: 2
DataFrame after removing duplicates:
      Name  Age         City        Salary   Join_Date
0    Alice   25     New York  70000.000000  2025-01-15
1      Bob   35   Los Angeles  69285.714286  2025-05-20
2  Charlie   28     New York  85000.000000  2025-01-15
3    David  120      Chicago  60000.000000  2025-12-01
4      Eve   32  San Francisco  69285.714286  2025-02-28
6    Unknown   35       Miami  45000.000000  2025-11-10
7    Grace   35   Los Angeles  90000.000000  2025-05-20

Step 4: Handling Outliers

David's age of 120 is likely an outlier. Let's find and handle it.

A. Identify Outliers (using IQR method): The Interquartile Range (IQR) is a common way to define outliers. Any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is considered an outlier.

# Calculate Q1, Q3, and IQR for the 'Age' column
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"\nOutlier bounds for Age: < {lower_bound} or > {upper_bound}")
# Identify the outlier
outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
print("\nOutlier rows:")
print(outliers)

Output:

Outlier bounds for Age: < 16.5 or > 51.5
Outlier rows:
    Name  Age     City   Salary   Join_Date
3  David  120  Chicago  60000.0  2025-12-01

B. Decide on a Strategy for Outliers:

  • Cap (Winsorize): Replace the outlier with the boundary value.
    df['Age'] = np.where(df['Age'] > upper_bound, upper_bound, df['Age'])
  • Remove: Drop the row containing the outlier.
    # df = df[df['Age'] <= upper_bound]
  • Transform: Use a log transformation to reduce the impact of the outlier.

Let's cap the age at the upper bound of 51.5 (which we'll round to 52 for simplicity).

df['Age'] = np.where(df['Age'] > 52, 52, df['Age'])
print("\nDataFrame after handling the age outlier:")
print(df)

Output (after capping):

      Name  Age         City        Salary   Join_Date
0    Alice  25     New York  70000.000000  2025-01-15
1      Bob  35   Los Angeles  69285.714286  2025-05-20
2  Charlie  28     New York  85000.000000  2025-01-15
3    David  52      Chicago  60000.000000  2025-12-01
4      Eve  32  San Francisco  69285.714286  2025-02-28
6    Unknown  35       Miami  45000.000000  2025-11-10
7    Grace  35   Los Angeles  90000.000000  2025-05-20

Step 5: Data Standardization/Normalization (Optional)

This step ensures consistency. For example, standardizing city names.

# Standardize city names to title case
df['City'] = df['City'].str.title()
print("\nFinal Cleaned DataFrame:")
print(df)

The Final "Cleaned Data"

After all these steps, our DataFrame is now much cleaner and ready for analysis.

Final Code:

import pandas as pd
import numpy as np
# 1. Load Data
df = pd.read_csv('messy_data.csv')
# 2. Handle Missing Values
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
df['Name'].fillna('Unknown', inplace=True)
# 3. Correct Data Types
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
df['Age'] = df['Age'].astype(int)
# 4. Remove Duplicates
df.drop_duplicates(inplace=True)
# 5. Handle Outliers (Capping)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
df['Age'] = np.where(df['Age'] > upper_bound, int(round(upper_bound)), df['Age'])
# 6. Standardize Data
df['City'] = df['City'].str.title()
print("--- Final Cleaned DataFrame ---")
print(df)
print("\n--- DataFrame Info ---")
df.info()

Final Output:

--- Final Cleaned DataFrame ---
      Name  Age         City        Salary   Join_Date
0    Alice  25     New York  70000.000000  2025-01-15
1      Bob  35   Los Angeles  69285.714286  2025-05-20
2  Charlie  28     New York  85000.000000  2025-01-15
3    David  52      Chicago  60000.000000  2025-12-01
4      Eve  32  San Francisco  69285.714286  2025-02-28
6    Unknown  35       Miami  45000.000000  2025-11-10
7    Grace  35   Los Angeles  90000.000000  2025-05-20
--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 7
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Name       7 non-null      object
 1   Age        7 non-null      int32
 2   City       7 non-null      object
 3   Salary     7 non-null      float64
 4   Join_Date  7 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int32(1), object(2)
memory usage: 320.0+ bytes
None

This final DataFrame is free of missing values, duplicates, and obvious errors, with consistent data types. It is now "cleaned data" and ready for the next steps in your data science project.

分享:
扫描分享到社交APP
上一篇
下一篇