杰瑞科技汇

How Do You Use ffill and bfill in Python?

ffill and bfill are pandas methods for handling missing data in DataFrames and Series.


Let's break them down with clear explanations, examples, and when to use each.

The Core Problem: Missing Data (NaN)

In data analysis, it's very common to have missing values, which are often represented as NaN (Not a Number). These gaps can cause errors in calculations or skew the results of a model. ffill and bfill are two popular techniques to fill these gaps.
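
Before filling anything, it usually helps to see where the gaps are. The following is a minimal sketch using isna(); the readings Series and its values are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two gaps
readings = pd.Series([10.0, np.nan, 12.0, np.nan, 15.0])

# True wherever a value is missing
missing_mask = readings.isna()

# Number of missing values in the Series
missing_count = int(missing_mask.sum())

print(missing_mask.tolist())
print(missing_count)
```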


ffill (Forward Fill)

ffill stands for "Forward Fill". It propagates the last valid observation forward to fill the missing values.

How it Works:

Imagine a column of data. When ffill encounters a NaN, it looks upward for the most recent non-missing value and copies it down into the NaN. If several NaNs appear in a row, they all receive that same value.
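
A run of consecutive NaNs makes this behavior easier to see. The sketch below uses a made-up Series (s_gap is a hypothetical name) in which two gaps follow the value 10.0.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: two consecutive gaps after 10.0, one gap after 40.0
s_gap = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan])

# Each NaN takes the most recent non-missing value above it
forward_filled = s_gap.ffill()

print(forward_filled.tolist())
```

Both NaNs after 10.0 receive 10.0, not just the first one.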


Analogy:

Think of it like a "carry-forward" rule. If a student is absent on Tuesday (NaN), you assume they still have the same score they had on Monday.

Python Example with Pandas:

import pandas as pd
import numpy as np
# Create a DataFrame with missing values (NaN)
data = {'Product': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Sales': [100, np.nan, 120, np.nan, 150, 160]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
  Product  Sales
0       A  100.0
1       A    NaN
2       A  120.0
3       B    NaN
4       B  150.0
5       B  160.0

Now, let's call ffill() on the DataFrame. Only the 'Sales' column contains NaNs, so only that column changes:

# Forward fill the missing values
df_ffilled = df.ffill()
print("\nDataFrame after ffill():")
print(df_ffilled)

Output:

DataFrame after ffill():
  Product  Sales
0       A  100.0
1       A  100.0  <-- NaN filled with the previous value (100.0)
2       A  120.0
3       B  120.0  <-- NaN filled with the previous value (120.0), which belongs to Product A
4       B  150.0
5       B  160.0
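
Note that a plain ffill() ignores the 'Product' column, so row 3 (Product B) inherits a value from Product A. If that is not what you want, one option is to fill within each group. This is a sketch assuming the same DataFrame as above; the variable name sales_by_product is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Sales': [100, np.nan, 120, np.nan, 150, 160]})

# Forward fill within each product, so values never cross from A into B
sales_by_product = df.groupby('Product')['Sales'].ffill()

print(sales_by_product)
```

Row 1 is still filled with 100.0, but row 3 stays NaN because Product B has no earlier value of its own.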

bfill (Backward Fill)

bfill stands for "Backward Fill". It propagates the next valid observation backward to fill the missing values.

How it Works:

When bfill encounters a NaN, it looks downward for the next non-missing value and copies it up into the NaN. If several NaNs appear in a row, they all receive that same value.

Analogy:

This is like a "pull-backward" rule. If a student's score for Tuesday is missing (NaN), you use their score from Wednesday to fill it in.

Python Example with Pandas:

Using the same original DataFrame:

print("Original DataFrame:")
print(df)

Now, let's call bfill() on the DataFrame. Again, only the 'Sales' column contains NaNs, so only that column changes:

# Backward fill the missing values
df_bfilled = df.bfill()
print("\nDataFrame after bfill():")
print(df_bfilled)

Output:

DataFrame after bfill():
  Product  Sales
0       A  100.0
1       A  120.0  <-- NaN filled with the next value (120.0)
2       A  120.0
3       B  150.0  <-- NaN filled with the next value (150.0)
4       B  150.0
5       B  160.0

Key Differences and When to Use Each

| Feature | ffill (Forward Fill) | bfill (Backward Fill) |
| --- | --- | --- |
| Direction | Fills missing values from the top down. | Fills missing values from the bottom up. |
| Uses | The last known value. | The next known value. |
| Best For | Time-series data where you assume the value remains constant until a new measurement is taken (e.g., sensor readings, stock prices at close). | When you have data that is "collected in advance" or you can reasonably infer a past value from a future one (less common). |
| Leading NaN | A leading NaN (at the very top) cannot be filled because there is no previous value. | A leading NaN can be filled by the next available value. |
| Trailing NaN | A trailing NaN (at the very bottom) can be filled by the previous value. | A trailing NaN cannot be filled because there is no next value. |
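
To illustrate the time-series case, here is a small sketch with hypothetical closing prices recorded only on trading days. The dates and prices are made up; asfreq('D') is used to expose the weekend as NaN rows before forward filling.

```python
import pandas as pd

# Hypothetical closing prices: Friday, Monday, Tuesday
closes = pd.Series(
    [101.5, 102.0, 99.8],
    index=pd.to_datetime(['2024-01-05', '2024-01-08', '2024-01-09']),
)

# Expand to every calendar day; Saturday and Sunday appear as NaN
daily = closes.asfreq('D')

# Carry Friday's close through the weekend
daily_filled = daily.ffill()

print(daily_filled)
```

Here forward fill matches the usual assumption that the last close remains the current price until the market reopens.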

Example of Leading/Trailing NaN:

s = pd.Series([np.nan, 1, np.nan, 4, np.nan])
print("Original Series:\n", s)
print("\nffill result:\n", s.ffill())
# Values: NaN, 1, 1, 4, 4  (the leading NaN stays; the trailing NaN is filled)
print("\nbfill result:\n", s.bfill())
# Values: 1, 1, 4, 4, NaN  (the leading NaN is filled; the trailing NaN stays)

Practical Considerations and Parameters

Both ffill and bfill have useful parameters:

  1. axis: Specifies the axis to fill along.

    • axis=0 (default): Fill values down the rows (vertically).
    • axis=1: Fill values across the columns (horizontally).
    # Example of filling horizontally (axis=1)
    df_h = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 2]})
    print(df_h)
    #    A    B
    # 0  1.0  NaN
    # 1  NaN  2.0
    df_h_filled = df_h.ffill(axis=1)
    print(df_h_filled)
    #    A    B
    # 0  1.0  1.0  <-- NaN in B is filled with value from A in the same row
    # 1  NaN  2.0  <-- NaN in A has no value to its left, so it remains NaN
  2. limit: The maximum number of consecutive NaN values to fill. This is very useful for preventing over-propagation.

    s = pd.Series([1, np.nan, np.nan, np.nan, 5, np.nan])
    print("Original Series:\n", s)
    # Fill a maximum of 2 consecutive NaNs
    print("\nffill with limit=2:\n", s.ffill(limit=2))
    # Output: 1, 1, 1, NaN, 5, 5
    # The first two NaNs are filled, but the third one is skipped because the limit is reached.
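
The limit parameter works the same way for bfill, and the two can be combined to fill only the edges of a long gap. The sketch below is one possible way to do this, not a recommended default; the name partly_filled is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, np.nan, 5, np.nan])

# Fill at most one NaN forward from each valid value,
# then at most one NaN backward from each valid value
partly_filled = s.ffill(limit=1).bfill(limit=1)

print(partly_filled.tolist())
```

The middle of the three-NaN gap stays NaN, because neither neighbor is allowed to reach it.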

Summary: Which one should you choose?

  • Use ffill for most time-series data. It's the most common and intuitive method for filling gaps when you expect values to stay the same until a new event occurs.
  • Use bfill when your data collection process means that future values are known and can be used to infer past ones (e.g., filling in preliminary data with final results).
  • Use limit to control how far the fill propagates, which is a good practice to avoid creating long stretches of identical, potentially inaccurate, data.
  • Consider chained methods: These methods are often chained. For example, df.ffill().bfill() is a common pattern to ensure that no leading or trailing NaNs are left in the DataFrame. It first forward-fills everything it can, then backward-fills any remaining leading NaNs. Older code spells this df.fillna(method='ffill').fillna(method='bfill'), but the method argument of fillna is deprecated in recent pandas versions.
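
As a quick illustration of the chained pattern, the sketch below applies ffill() followed by bfill() to a small made-up Series with both a leading and a trailing NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, 4, np.nan])

# Forward fill handles the middle and trailing gaps;
# backward fill then handles the leading gap that ffill could not reach
no_gaps = s.ffill().bfill()

print(no_gaps.tolist())
```

After both passes, no NaN remains, provided the Series contains at least one valid value.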