杰瑞科技汇

How do you use the pandas drop method correctly?

The drop() method in pandas is a fundamental tool for cleaning and reshaping a DataFrame. It removes rows or columns based on their labels.


Here’s a comprehensive guide covering the most common use cases, from basic to advanced.


The Core Idea: Labels vs. Positions

First, it's crucial to understand that drop() works with labels, not integer positions.

  • Label: The name of the row (index) or column.
  • Position: The 0-based integer location (e.g., the 1st column, 2nd row).

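drop() has no positional mode: the axis parameter only chooses between rows and columns. To drop by position, first convert the positions to labels, for example with df.index[[0, 2]] or df.columns[1], and pass the result to drop().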
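The difference between labels and positions is easiest to see when the index is not a plain 0-based range. The following is a minimal sketch; the frame df_pos and its string labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ
df_pos = pd.DataFrame(
    {"score": [90, 85, 70, 60]},
    index=["w", "x", "y", "z"],
)

# Translate positions 0 and 2 into their labels ("w" and "y"), then drop those
by_position = df_pos.drop(index=df_pos.index[[0, 2]])
print(by_position)

# Dropping the integer 0 directly fails here: no row is labelled 0
try:
    df_pos.drop(index=0)
except KeyError as exc:
    print("KeyError:", exc)
```

Keep in mind that the lookup still happens by label, so if the index contains duplicate labels, every row carrying a dropped label is removed.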


Basic Syntax

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

The most important parameters are:

  • labels: The row or column labels to drop.
  • axis: Specifies whether to drop from the index (axis=0, the default) or from the columns (axis=1).
  • inplace: If True, modifies the DataFrame directly. If False (the default), returns a new DataFrame with the rows/columns dropped.
  • columns: A convenient way to drop columns; columns=labels is equivalent to labels, axis=1.
  • index: A convenient way to drop rows; index=labels is equivalent to labels, axis=0.
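As a quick illustration of how these parameters relate, the sketch below compares the axis form with the keyword form and then uses index and columns together in one call. The frame df_demo is a made-up example.

```python
import pandas as pd

# Hypothetical sample data for illustration
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["NY", "LA", "SF"],
})

# These two calls describe the same operation
via_axis = df_demo.drop(labels="Age", axis=1)
via_keyword = df_demo.drop(columns="Age")
print(via_axis.equals(via_keyword))

# index and columns can be combined to drop rows and columns in one call
trimmed = df_demo.drop(index=[0], columns=["City"])
print(trimmed)
```

Note that labels cannot be mixed with index or columns in the same call; pandas raises a ValueError in that case.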

Dropping Rows

Let's start with a sample DataFrame.

import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 28],
        'City': ['NY', 'LA', 'SF', 'Chicago', 'Boston']}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:

      Name  Age     City
0    Alice   25       NY
1      Bob   30       LA
2  Charlie   35       SF
3    David   40  Chicago
4      Eve   28    Boston

a) Dropping a Single Row by Label

Use the row's index label. Let's drop the row with index 2 ('Charlie').

# By default, axis=0 (rows), so it's optional
df_dropped_row = df.drop(labels=2)
print("\nDataFrame after dropping row with label '2':")
print(df_dropped_row)

Result:

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
4    Eve   28    Boston

Notice that index 2 is now missing. drop() does not renumber the remaining rows; to get a fresh 0-based index, call reset_index(drop=True) on the result.
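A short sketch of the renumbering step, using a made-up frame named df_people. Passing drop=True discards the old index rather than keeping it as a new column.

```python
import pandas as pd

# Hypothetical sample data for illustration
df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
})

dropped = df_people.drop(labels=2)
print(list(dropped.index))      # the gap at 2 remains

# drop=True throws the old index away instead of adding it as a column
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))
```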

b) Dropping Multiple Rows by Label

Pass a list of labels to the labels parameter.

# Drop rows with labels 1 and 3
df_dropped_rows = df.drop(labels=[1, 3])
print("\nDataFrame after dropping rows with labels '1' and '3':")
print(df_dropped_rows)

Result:

      Name  Age    City
0    Alice   25      NY
2  Charlie   35      SF
4      Eve   28  Boston
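In practice the list of labels often comes from a condition rather than being typed by hand. One possible approach, sketched here with a made-up frame df_people, is to select the index labels that match the condition and pass them to drop(). Plain boolean filtering gives the same rows and is usually shorter.

```python
import pandas as pd

# Hypothetical sample data for illustration
df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 40, 28],
})

# Collect the labels of rows where Age is above 30, then drop them
labels_over_30 = df_people.index[df_people["Age"] > 30]
younger = df_people.drop(index=labels_over_30)
print(younger)

# Equivalent result with boolean filtering (keep the rows you want)
same_result = df_people[df_people["Age"] <= 30]
print(younger.equals(same_result))
```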

Dropping Columns

Now, let's remove columns. The most common way is to use the columns parameter or axis=1.

a) Dropping a Single Column

Let's drop the Age column.

# Method 1: Using the 'columns' parameter (recommended for clarity)
df_dropped_col = df.drop(columns='Age')
# Method 2: Using 'axis=1'
# df_dropped_col = df.drop(labels='Age', axis=1)
print("\nDataFrame after dropping the 'Age' column:")
print(df_dropped_col)

Result:

      Name     City
0    Alice       NY
1      Bob       LA
2  Charlie       SF
3    David  Chicago
4      Eve    Boston

b) Dropping Multiple Columns

Pass a list of column names to the columns parameter.

# Drop the 'Age' and 'City' columns
df_dropped_cols = df.drop(columns=['Age', 'City'])
print("\nDataFrame after dropping 'Age' and 'City' columns:")
print(df_dropped_cols)

Result:

      Name
0    Alice
1      Bob
2  Charlie
3    David
4      Eve
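The list of columns does not have to be written out by hand. As a sketch, assuming a made-up frame df_wide whose temporary columns share a "tmp_" prefix, the list can be built from df.columns:

```python
import pandas as pd

# Hypothetical frame; the "tmp_" naming convention is an assumption for this example
df_wide = pd.DataFrame({
    "id": [1, 2],
    "tmp_a": [0.1, 0.2],
    "value": [10, 20],
    "tmp_b": [0.3, 0.4],
})

# Build the list of column labels to remove, then drop them in one call
temp_cols = [col for col in df_wide.columns if col.startswith("tmp_")]
cleaned = df_wide.drop(columns=temp_cols)
print(cleaned)
```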

The inplace Parameter

This is a critical concept. It determines whether you modify the existing DataFrame or create a new one.

  • inplace=False (Default): Returns a new DataFrame. The original df is unchanged.
  • inplace=True: Modifies the DataFrame in place. It returns None, and the original df is permanently changed.

Example with inplace=True

print("Original DataFrame before inplace operation:")
print(df)
# Drop the 'City' column from the original DataFrame
df.drop(columns='City', inplace=True)
print("\nOriginal DataFrame after inplace operation:")
print(df)

Output:

Original DataFrame before inplace operation:
      Name  Age     City
0    Alice   25       NY
1      Bob   30       LA
2  Charlie   35       SF
3    David   40  Chicago
4      Eve   28    Boston

Original DataFrame after inplace operation:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
4      Eve   28

Warning: Using inplace=True can be risky. If you make a mistake, the change is permanent and you might need to reload your data. For this reason, many experienced developers prefer the default (inplace=False) for its safety and clarity.
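A common slip is assigning the result of an inplace call, which leaves you holding None. The sketch below shows that pitfall next to the reassignment pattern; df_a and df_b are made-up frames.

```python
import pandas as pd

# Hypothetical sample data for illustration
df_a = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["NY", "LA"]})

# Pitfall: with inplace=True the call returns None, so `result` is not a DataFrame
result = df_a.drop(columns="City", inplace=True)
print(result)
print(list(df_a.columns))

# Safer pattern: keep the default and reassign the returned DataFrame
df_b = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["NY", "LA"]})
df_b = df_b.drop(columns="City")
print(list(df_b.columns))
```

Reassignment also keeps method chaining possible, since each step returns a DataFrame.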


Advanced Use Cases

a) Handling Errors

What if you try to drop a label that doesn't exist? By default, pandas raises a KeyError.

# This will raise a KeyError
# df.drop(columns='Country') 

You can change this behavior with the errors parameter:

  • errors='raise' (default): Raises an error.
  • errors='ignore': Silently skips labels that are not found; any labels that do exist are still dropped.

# This will not raise an error; the result is an unchanged copy of df
df_ignored = df.drop(columns='Country', errors='ignore')
print("\nDataFrame after trying to drop a non-existent column (ignored):")
print(df_ignored)
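When a list mixes existing and missing labels, errors='ignore' still drops the ones that exist. A minimal sketch with a made-up frame df_cities:

```python
import pandas as pd

# Hypothetical sample data for illustration
df_cities = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "City": ["NY", "LA"],
})

# 'City' exists and is dropped; 'Country' does not exist and is skipped silently
result = df_cities.drop(columns=["City", "Country"], errors="ignore")
print(list(result.columns))
```

This is handy in pipelines where a column may or may not be present, but it also hides typos in label names, so use it deliberately.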

b) Dropping from a MultiIndex DataFrame

If your DataFrame has a hierarchical index, you can drop from a specific level.

# Create a MultiIndex DataFrame
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['letter', 'number'])
df_multi = pd.DataFrame({'Data': [10, 20, 30, 40]}, index=index)
print("\nOriginal MultiIndex DataFrame:")
print(df_multi)
# Drop all rows where the 'letter' level is 'A'
df_multi_dropped = df_multi.drop(labels='A', level='letter')
print("\nMultiIndex DataFrame after dropping level 'letter' == 'A':")
print(df_multi_dropped)

Result:

Original MultiIndex DataFrame:
               Data
letter number
A      1         10
       2         20
B      1         30
       2         40

MultiIndex DataFrame after dropping level 'letter' == 'A':
               Data
letter number
B      1         30
       2         40
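Besides dropping everything under one level value, you can remove a single row by passing its full label as a tuple, or target the inner level by name. The sketch below rebuilds a small frame like the one above under the made-up name df_levels.

```python
import pandas as pd

# Hypothetical MultiIndex frame for illustration
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)],
    names=["letter", "number"],
)
df_levels = pd.DataFrame({"Data": [10, 20, 30, 40]}, index=index)

# Drop exactly one row by giving its complete label as a tuple
without_a1 = df_levels.drop(index=("A", 1))
print(without_a1)

# Drop every row whose 'number' level equals 2
without_number_2 = df_levels.drop(index=2, level="number")
print(without_number_2)
```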

Common Alternatives to drop()

While drop() is the most direct method, other functions achieve similar goals.

a) Dropping Missing Values (dropna())

This is a specialized way to drop rows or columns that contain NaN (Not a Number) values.

# Create a DataFrame with missing values
df_nan = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})
print("\nDataFrame with NaNs:")
print(df_nan)
# Drop any row that has at least one NaN value
print("\nDataFrame after dropping rows with any NaNs:")
print(df_nan.dropna())
# Drop any column that has at least one NaN value
print("\nDataFrame after dropping columns with any NaNs:")
print(df_nan.dropna(axis=1))
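dropna() also accepts parameters that narrow down which missing values count. The sketch below uses a made-up frame df_gaps with the same values as above to show subset, how, and thresh.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values for illustration
df_gaps = pd.DataFrame({
    "A": [1, 2, np.nan],
    "B": [5, np.nan, np.nan],
    "C": [1, 2, 3],
})

# Only look at column 'A' when deciding which rows to drop
only_a = df_gaps.dropna(subset=["A"])

# Drop a row only if every value in it is NaN (no such row here)
all_missing = df_gaps.dropna(how="all")

# Keep rows that have at least 2 non-NaN values
at_least_two = df_gaps.dropna(thresh=2)

print(only_a, all_missing, at_least_two, sep="\n\n")
```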

b) Dropping Duplicates (drop_duplicates())

This is used to remove duplicate rows.

# Create a DataFrame with duplicate rows
df_dup = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]})
print("\nDataFrame with duplicates:")
print(df_dup)
# Keep the first occurrence of each duplicate
print("\nDataFrame after dropping duplicates:")
print(df_dup.drop_duplicates())
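The keep and subset parameters control which duplicates survive and which columns are compared. A minimal sketch, using a made-up frame df_repeats with the same values as above:

```python
import pandas as pd

# Hypothetical frame with a repeated row for illustration
df_repeats = pd.DataFrame({"A": [1, 2, 2, 3], "B": [4, 5, 5, 6]})

# Keep the last occurrence instead of the first
keep_last = df_repeats.drop_duplicates(keep="last")

# Remove every row that has a duplicate, keeping none of them
keep_none = df_repeats.drop_duplicates(keep=False)

# Compare only column 'A' when looking for duplicates
by_column_a = df_repeats.drop_duplicates(subset=["A"])

print(keep_last, keep_none, by_column_a, sep="\n\n")
```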

Summary Table

Goal                            Example
Drop a single row by index      df.drop(labels=2)
Drop multiple rows by index     df.drop(labels=[1, 3])
Drop a single column by name    df.drop(columns='Age')
Drop multiple columns by name   df.drop(columns=['Age', 'City'])
Modify the DataFrame directly   df.drop(..., inplace=True)
Drop rows with missing values   df.dropna()
Drop duplicate rows             df.drop_duplicates()