杰瑞科技汇

python pandas diff

The DataFrame.diff() method, along with its Series counterpart Series.diff(), calculates the difference between elements of a DataFrame or Series. It is especially useful for time-series analysis, financial data, and any situation where you need the change between consecutive data points.


Here's a guide to using diff(), from the basics to practical examples.

What is diff()?

The diff() method computes the discrete difference of elements. By default, it subtracts the element in the previous row from the current element.

The formula is simple: result[i] = element[i] - element[i-1]
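As a quick sanity check of the formula, the sketch below compares diff() with an explicit subtraction of a shifted copy. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
values = pd.Series([3, 8, 6, 11])

# diff() with the default periods=1
by_diff = values.diff()

# The same computation written out: current value minus previous value
by_shift = values - values.shift(1)

# equals() treats NaN in matching positions as equal
print(by_diff.equals(by_shift))
```

Both results hold NaN in the first position, since there is no previous value to subtract there.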


Basic Syntax

The diff() method can be called on a DataFrame or a Series.

# For a Series (a Series has one axis, so there is no axis parameter)
Series.diff(periods=1)
# For a DataFrame
DataFrame.diff(periods=1, axis=0)

Key Parameters:

  • periods (int, default 1): The number of positions to shift for calculating the difference.
    • periods=1 (default): Difference with the previous row.
    • periods=2: Difference with the row two places back.
    • periods=-1: Difference with the next row (looks forward).
  • axis ({0 or 'index', 1 or 'columns'}, default 0): DataFrame only. The axis to take the difference along.
    • axis=0 or 'index': Calculates the difference between rows (the default).
    • axis=1 or 'columns': Calculates the difference between columns.
Note: diff() does not accept an inplace parameter. It always returns a new object, so assign the result if you want to keep it.
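To make the axis parameter concrete, here is a minimal sketch showing that the integer and string forms select the same direction. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical two-column frame for illustration
frame = pd.DataFrame({"x": [1, 4, 9], "y": [2, 6, 12]})

# axis=0 and axis="index" both difference down the rows
rows_int = frame.diff(axis=0)
rows_name = frame.diff(axis="index")

# axis=1 and axis="columns" both difference across the columns
cols_int = frame.diff(axis=1)
cols_name = frame.diff(axis="columns")

print(rows_int.equals(rows_name), cols_int.equals(cols_name))
```

The string forms tend to read more clearly in code that other people will maintain.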

Examples on a Series

Let's start with a simple Series to understand the core functionality.

import pandas as pd
import numpy as np
# Create a sample Series
s = pd.Series([10, 12, 15, 14, 18, 20])
print("Original Series:")
print(s)

Original Series:

0    10
1    12
2    15
3    14
4    18
5    20
dtype: int64

Default Behavior (periods=1)

Calculates the difference from the previous element.

# Default: difference with the previous element
s_diff_default = s.diff()
print("\nDefault diff (periods=1):")
print(s_diff_default)

Output:

0     NaN  # No previous element for the first item
1     2.0  # 12 - 10
2     3.0  # 15 - 12
3    -1.0  # 14 - 15
4     4.0  # 18 - 14
5     2.0  # 20 - 18
dtype: float64

Notice the first value is NaN (Not a Number) because there is no preceding element to subtract. Because NaN is a floating-point value, the result is float64 even though the input was int64.
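What to do with that leading NaN depends on the analysis. The sketch below shows two common options, dropping it or filling it with 0. The variable names are illustrative, and neither option is the "right" one in general.

```python
import pandas as pd

# Hypothetical readings for illustration
readings = pd.Series([10, 12, 15, 14])
changes = readings.diff()

# Option 1: drop the undefined first row
trimmed = changes.dropna()

# Option 2: treat the first change as zero (an assumption about the data)
zero_filled = changes.fillna(0)

# With the NaN gone, the values can be cast back to integers if desired
as_int = trimmed.astype("int64")

print(trimmed.tolist())
print(zero_filled.tolist())
```

Filling with 0 keeps the original length, which is convenient when the result is assigned back as a column.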

Using periods

You can change how many steps back to look.

# Difference with the element two places back (periods=2)
s_diff_period2 = s.diff(periods=2)
print("\nDiff with periods=2:")
print(s_diff_period2)
# Difference with the next element (periods=-1)
s_diff_next = s.diff(periods=-1)
print("\nDiff with periods=-1 (looking forward):")
print(s_diff_next)

Output:

# Diff with periods=2:
0     NaN  # Not enough history
1     NaN  # Not enough history
2     5.0  # 15 - 10
3     2.0  # 14 - 12
4     3.0  # 18 - 15
5     4.0  # 20 - 14
dtype: float64
# Diff with periods=-1 (looking forward):
0   -2.0  # 10 - 12
1   -3.0  # 12 - 15
2    1.0  # 15 - 14
3   -4.0  # 14 - 18
4   -2.0  # 18 - 20
5     NaN  # No next element
dtype: float64
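The relationship between periods and shift() can be sketched as follows, assuming the same six values as above. The variable names are illustrative.

```python
import pandas as pd

s_demo = pd.Series([10, 12, 15, 14, 18, 20])

# For any k, diff(periods=k) matches subtracting a copy shifted by k
two_back = s_demo.diff(periods=2)
two_back_manual = s_demo - s_demo.shift(2)

# A forward-looking diff is the negated backward diff, moved up one row
forward = s_demo.diff(periods=-1)
forward_manual = -s_demo.diff(periods=1).shift(-1)

print(two_back.equals(two_back_manual))
print(forward.equals(forward_manual))
```

The number of NaN entries equals the absolute value of periods. They appear at the start for positive values and at the end for negative values.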

Examples on a DataFrame

diff() is even more useful on DataFrames. You can apply the difference operation either row-wise (axis=0) or column-wise (axis=1).

# Create a sample DataFrame
data = {'A': [100, 102, 105, 107, 110],
        'B': [5, 7, 6, 8, 9],
        'C': [50, 52, 51, 53, 55]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:

     A  B   C
0  100  5  50
1  102  7  52
2  105  6  51
3  107  8  53
4  110  9  55

Row-wise Difference (axis=0, the default)

This is the most common use case. It calculates the difference for each column between consecutive rows.

# Difference between rows for each column
df_diff_rows = df.diff()
print("\nDataFrame diff (axis=0):")
print(df_diff_rows)

Output:

# DataFrame diff (axis=0):
      A    B     C
0   NaN  NaN   NaN
1   2.0  2.0   2.0
2   3.0 -1.0  -1.0
3   2.0  2.0   2.0
4   3.0  1.0   2.0

Each cell (i, j) contains the value df.iloc[i, j] - df.iloc[i-1, j].

Column-wise Difference (axis=1)

This calculates the difference between columns for each row.

# Difference between columns for each row
df_diff_cols = df.diff(axis=1)
print("\nDataFrame diff (axis=1):")
print(df_diff_cols)

Output:

# DataFrame diff (axis=1):
    A      B     C
0 NaN  -95.0  45.0
1 NaN  -95.0  45.0
2 NaN  -99.0  45.0
3 NaN  -99.0  45.0
4 NaN -101.0  46.0

Each cell (i, j) contains the value df.iloc[i, j] - df.iloc[i, j-1]. The first column is NaN because there's no preceding column.
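A short check of the column-wise rule, using the first three rows of the same data: each column should equal itself minus the column to its left, so column C holds C - B, which is positive here. The variable names are illustrative.

```python
import pandas as pd

table = pd.DataFrame({"A": [100, 102, 105],
                      "B": [5, 7, 6],
                      "C": [50, 52, 51]})

col_diff = table.diff(axis=1)

# Column B of the result is B - A; column C is C - B
b_minus_a = table["B"] - table["A"]
c_minus_b = table["C"] - table["B"]

print(col_diff["B"].tolist(), b_minus_a.tolist())
print(col_diff["C"].tolist(), c_minus_b.tolist())
```

Column order matters with axis=1, so reordering the columns changes the result.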


Handling Missing Data (NaN)

diff() does not skip NaN values. If a value in the original data is NaN, the difference for that row and the difference for the following row are both NaN, because each of those subtractions involves the missing value.

# Create a DataFrame with a missing value
df_nan = pd.DataFrame({'A': [10, 12, np.nan, 18, 20]})
print("\nDataFrame with NaN:")
print(df_nan)
df_nan_diff = df_nan.diff()
print("\nDiff of DataFrame with NaN:")
print(df_nan_diff)

Output:

# DataFrame with NaN:
      A
0  10.0
1  12.0
2   NaN
3  18.0
4  20.0
# Diff of DataFrame with NaN:
      A
0   NaN
1   2.0
2   NaN  # NaN - 12.0 = NaN
3   NaN  # 18.0 - NaN = NaN
4   2.0
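If the gaps are unwanted, one option is to deal with the missing values before calling diff(). The sketch below compares forward-filling with dropping the missing rows. Which one is appropriate is an assumption about what the data means, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

gappy = pd.Series([10, 12, np.nan, 18, 20])

# Plain diff: rows 2 and 3 are NaN
plain = gappy.diff()

# Forward-fill first: the gap is treated as "no change", then a jump of 6
filled = gappy.ffill().diff()

# Drop missing rows first: differences span the gap, index 2 disappears
skipped = gappy.dropna().diff()

print(plain.tolist())
print(filled.tolist())
print(skipped.tolist())
```

Forward-filling keeps the original index, while dropna() shortens the series, which matters if the result is assigned back to a DataFrame.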

Practical Use Cases

Use Case 1: Time-Series Analysis (Daily Price Change)

This is a classic application. Imagine you have daily stock prices.

import pandas as pd
# Create a time-series DataFrame
dates = pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05'])
prices = {'Open': [150, 152, 151, 155, 160],
          'Close': [152, 151, 155, 158, 162]}
df_prices = pd.DataFrame(prices, index=dates)
print("Daily Stock Prices:")
print(df_prices)
# Calculate the daily price change
df_prices['Daily_Change'] = df_prices['Close'].diff()
print("\nDaily Price Change:")
print(df_prices)

Output:

Daily Stock Prices:
            Open  Close
2025-01-01    150    152
2025-01-02    152    151
2025-01-03    151    155
2025-01-04    155    158
2025-01-05    160    162
Daily Price Change:
            Open  Close  Daily_Change
2025-01-01    150    152           NaN
2025-01-02    152    151          -1.0
2025-01-03    151    155           4.0
2025-01-04    155    158           3.0
2025-01-05    160    162           4.0
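Once the daily change is available, it can drive simple change detection. The sketch below rebuilds the Close column from above and picks out the days on which the price fell, plus the day with the largest gain. The variable names are illustrative.

```python
import pandas as pd

dates_demo = pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03",
                             "2025-01-04", "2025-01-05"])
close = pd.Series([152, 151, 155, 158, 162], index=dates_demo, name="Close")

change = close.diff()

# Comparisons with NaN are False, so the first day is never flagged
down_days = close[change < 0]

# idxmax() skips NaN and returns the first date with the largest change
largest_gain_day = change.idxmax()

print(down_days)
print(largest_gain_day)
```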

Use Case 2: Calculating Percentage Change

While pandas has a dedicated .pct_change() method, you can also calculate it using diff() and shift().

# pct_change is simply the difference divided by the previous value
df_prices['Pct_Change_manual'] = df_prices['Close'].diff() / df_prices['Close'].shift(1)
# For comparison, let's use the built-in method
df_prices['Pct_Change_builtin'] = df_prices['Close'].pct_change()
print("\nPercentage Change Calculation:")
print(df_prices[['Close', 'Pct_Change_manual', 'Pct_Change_builtin']])

Output:

Percentage Change Calculation:
            Close  Pct_Change_manual  Pct_Change_builtin
2025-01-01    152                NaN                 NaN
2025-01-02    151           -0.006579           -0.006579
2025-01-03    155            0.026490            0.026490
2025-01-04    158            0.019355            0.019355
2025-01-05    162            0.025316            0.025316

Notice how shift(1) supplies the previous value to divide by. The two methods are closely related: df['Close'].diff() is equivalent to df['Close'] - df['Close'].shift(1).
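Rather than comparing the two columns by eye, the equivalence can be checked programmatically. This is a minimal sketch using pd.testing.assert_series_equal, which compares floats with a tolerance by default. The variable names are illustrative.

```python
import pandas as pd

close_demo = pd.Series([152, 151, 155, 158, 162], dtype="float64")

# Manual percentage change: difference divided by the previous value
manual = close_demo.diff() / close_demo.shift(1)

# Built-in equivalent
builtin = close_demo.pct_change()

# Raises AssertionError if the two differ beyond floating-point tolerance
pd.testing.assert_series_equal(manual, builtin)
print("manual and built-in results match")
```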

Summary

| Feature | Description | Example |
| --- | --- | --- |
| Core Function | Calculates the difference between elements. | df['col'].diff() |
| periods | Controls how many steps to shift for the difference. | df.diff(periods=2) |
| axis | Controls direction of difference (rows or columns). DataFrame only. | df.diff(axis=1) |
| NaN Handling | Does not skip NaN: the NaN row and the row after it both yield NaN. | df_nan.diff() |
| Common Use | Time-series analysis, financial calculations, change detection. | df_prices['Close'].diff() |