
How do you calculate the Spearman correlation coefficient in Python?

Let's take a detailed look at spearmanr() from Python's scipy.stats module.


What is spearmanr()?

spearmanr() is a function used to calculate the Spearman rank correlation coefficient. This is a non-parametric measure of the monotonic relationship between two datasets.

In simple terms, it tells you how well the relationship between two variables can be described by a monotonic function. A monotonic function is one that is either entirely non-increasing or non-decreasing.


Key Concepts: Pearson vs. Spearman

To understand spearmanr(), it's helpful to compare it with the more common Pearson correlation coefficient.

| Feature | Pearson Correlation (pearsonr) | Spearman Correlation (spearmanr) |
|---|---|---|
| Type of relationship | Measures linear relationships. | Measures monotonic relationships (linear or non-linear, as long as it is consistently increasing/decreasing). |
| Data type | Works on the raw data values. | Works on the ranks of the data values. It first converts the data into ranks (1st, 2nd, 3rd, etc.). |
| Robustness | Sensitive to outliers. A single extreme value can dramatically change the result. | Robust to outliers, since an outlier's rank is just its position in the sorted list, not its actual extreme value. |
| Assumptions | Assumes the data is roughly normally distributed and linearly related. | Makes no assumptions about the distribution of the data. It is non-parametric. |
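The rank-based nature of Spearman can be checked directly: computing Pearson on the ranks of the data reproduces the Spearman coefficient. A minimal sketch (the sample numbers are made up for illustration):

```python
import numpy as np
from scipy.stats import rankdata, pearsonr, spearmanr

x = np.array([10, 20, 35, 40, 500])  # note the extreme value 500
y = np.array([1, 3, 2, 4, 5])

# Spearman is Pearson applied to the ranks of the data
r_spearman, _ = spearmanr(x, y)
r_pearson_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))
print(r_spearman, r_pearson_on_ranks)  # the two values are identical
```

Because only ranks matter, the outlier 500 contributes exactly as much as any other largest value would.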

When to use spearmanr()?

  • When your data is not normally distributed.
  • When you have ordinal data (e.g., rankings like "low, medium, high").
  • When you suspect a non-linear but monotonic relationship (e.g., an exponential growth curve).
  • When your data contains significant outliers.
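For the ordinal-data case, a small sketch (the satisfaction and purchase figures below are invented for illustration): encoding the ordered categories as integers preserves their ordering, which is all Spearman uses.

```python
from scipy.stats import spearmanr

# Hypothetical ordinal ratings encoded as integers: 1=low, 2=medium, 3=high
satisfaction = [1, 2, 2, 3, 1, 3, 2, 3]
# A numeric outcome per customer (made-up data)
purchases = [0, 2, 1, 4, 1, 5, 2, 3]

corr, p_value = spearmanr(satisfaction, purchases)
print(f"corr={corr:.3f}, p={p_value:.4f}")
```

Pearson would be questionable here because the gap between "low" and "medium" is not a meaningful numeric distance; Spearman only requires that the categories be ordered.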

How to Use spearmanr()

Import the Function

First, you need to import it from scipy.stats.

from scipy.stats import spearmanr
import numpy as np
import matplotlib.pyplot as plt

Basic Syntax

The basic syntax is spearmanr(x, y), where x and y are arrays or lists of data.

# Two lists of data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# Calculate the Spearman correlation coefficient
corr, p_value = spearmanr(x, y)
print(f"Spearman correlation coefficient: {corr:.4f}")
print(f"P-value: {p_value:.4f}")

Output:

Spearman correlation coefficient: 1.0000
P-value: 0.0000

This is a perfect positive monotonic relationship. As x increases, y also increases perfectly.
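A note on the return value: in recent SciPy releases (1.9+), spearmanr returns a result object whose coefficient attribute is named statistic, while older releases name it correlation; tuple unpacking works in both. A defensive sketch:

```python
from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

res = spearmanr(x, y)

# SciPy >= 1.9 exposes the coefficient as res.statistic;
# older versions call it res.correlation. Handle both.
corr = getattr(res, "statistic", None)
if corr is None:
    corr = res.correlation

print(f"corr={corr}, p={res.pvalue}")
```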


Understanding the Output

The spearmanr() function returns two values:

  1. corr (The Correlation Coefficient):

    • Ranges from -1 to +1.
    • +1: Perfect positive monotonic relationship (as one variable increases, the other always increases).
    • -1: Perfect negative monotonic relationship (as one variable increases, the other always decreases).
    • 0: No monotonic relationship.
  2. p_value (The P-value):

    • This tests the null hypothesis that the two datasets are uncorrelated.
    • A small p_value (typically < 0.05) indicates that you can reject the null hypothesis. In other words, there is a statistically significant correlation.
    • A large p_value (>= 0.05) suggests that there is not enough evidence to conclude that a correlation exists.
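The decision rule above can be written directly in code (the 0.05 threshold is a common convention, not a requirement; the data here is arbitrary sample input):

```python
from scipy.stats import spearmanr

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
y = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8]

corr, p_value = spearmanr(x, y)
alpha = 0.05  # conventional significance level

if p_value < alpha:
    print(f"Significant monotonic correlation (rho={corr:.3f})")
else:
    print(f"No significant correlation detected (p={p_value:.3f})")
```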

Practical Examples

Let's look at different scenarios.

Example 1: Strong Positive Monotonic Relationship (Non-Linear)

This is a classic case where spearmanr() excels over pearsonr.

# Create a non-linear but monotonic relationship (exponential)
x = np.linspace(0, 10, 50)
y = np.exp(x) + np.random.normal(0, 5, 50) # Add some noise
# Calculate Spearman correlation
corr_s, p_s = spearmanr(x, y)
# For comparison, calculate Pearson correlation
from scipy.stats import pearsonr
corr_p, p_p = pearsonr(x, y)
print(f"Spearman Correlation: {corr_s:.4f} (p-value: {p_s:.4f})")
print(f"Pearson Correlation:  {corr_p:.4f} (p-value: {p_p:.4f})")
# Visualize the relationship
plt.figure(figsize=(10, 5))
plt.scatter(x, y)
plt.title("Exponential Relationship with Noise")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

Output (will vary due to randomness):

Spearman Correlation: 0.9878 (p-value: 0.0000)
Pearson Correlation:  0.8902 (p-value: 0.0000)

The Spearman correlation (0.988) is much closer to 1 than the Pearson correlation (0.890), because Spearman correctly identifies the strong, consistent increasing trend, while Pearson is weakened by the non-linearity of the relationship.

Example 2: Strong Negative Monotonic Relationship

x = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
y = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
corr, p_value = spearmanr(x, y)
print(f"Spearman correlation coefficient: {corr:.4f}")
print(f"P-value: {p_value:.4f}")

Output:

Spearman correlation coefficient: -1.0000
P-value: 0.0000

This is a perfect negative monotonic relationship.

Example 3: No Relationship

import random
x = [random.randint(1, 100) for _ in range(50)]
y = [random.randint(1, 100) for _ in range(50)]
corr, p_value = spearmanr(x, y)
print(f"Spearman correlation coefficient: {corr:.4f}")
print(f"P-value: {p_value:.4f}")

Output (will vary due to randomness):

Spearman correlation coefficient: 0.0871
P-value: 0.5529

The coefficient is close to 0, and the p-value is high (> 0.05), indicating no significant correlation.

Example 4: The Effect of Outliers

This example shows why Spearman is more robust.

# Data with a strong linear trend
x1 = np.linspace(1, 10, 20)
y1 = 2 * x1 + np.random.normal(0, 1, 20)
# Add a massive outlier
x2 = np.append(x1, 15)
y2 = np.append(y1, 100) # This point is way off the trend
# Calculate correlations
corr_s_clean, _ = spearmanr(x1, y1)
corr_s_outlier, _ = spearmanr(x2, y2)
corr_p_clean, _ = pearsonr(x1, y1)
corr_p_outlier, _ = pearsonr(x2, y2)
print("--- Without Outlier ---")
print(f"Spearman: {corr_s_clean:.4f}")
print(f"Pearson:  {corr_p_clean:.4f}")
print("\n--- With Outlier ---")
print(f"Spearman: {corr_s_outlier:.4f}")
print(f"Pearson:  {corr_p_outlier:.4f}")

Output (will vary due to randomness):

--- Without Outlier ---
Spearman: 0.9878
Pearson:  0.9872
--- With Outlier ---
Spearman: 0.8353
Pearson:  0.5179

Notice how the Pearson correlation drops much more significantly (from 0.99 to 0.52) than the Spearman correlation (from 0.99 to 0.84) when the outlier is introduced. This demonstrates Spearman's robustness.
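The source of this robustness is visible in the ranks themselves: the outlier's extreme value collapses to just the next rank. A quick illustration using scipy.stats.rankdata, which performs the same ranking step that Spearman relies on:

```python
import numpy as np
from scipy.stats import rankdata

values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # 1000 is an extreme outlier
ranks = rankdata(values)
print(ranks)  # [1. 2. 3. 4. 5.]
# To a rank-based method, 1000 is simply "the largest value" (rank 5);
# how far it sits from the rest of the data is irrelevant.
```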


Handling Ties

If your data has duplicate values (ties), spearmanr() handles them by assigning the average rank. For example, if two values are tied for 2nd and 3rd place, they both receive a rank of 2.5.

x = [1, 2, 2, 3, 4]
y = [5, 6, 7, 8, 9]
# In x, the two '2's are tied. Their ranks are (2+3)/2 = 2.5
# Ranks of x: [1, 2.5, 2.5, 4, 5]
# Ranks of y: [1, 2, 3, 4, 5]
corr, p_value = spearmanr(x, y)
print(f"Spearman correlation with ties: {corr:.4f}")

Output:

Spearman correlation with ties: 0.9747

The function handles ties automatically using average ranks; no extra option is needed. (Note that the nan_policy parameter controls how NaN values are treated, not ties.)
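You can verify this tie handling yourself: rankdata (whose default method is 'average') produces the average ranks, and Pearson applied to those ranks reproduces spearmanr's result:

```python
from scipy.stats import rankdata, pearsonr, spearmanr

x = [1, 2, 2, 3, 4]
y = [5, 6, 7, 8, 9]

print(rankdata(x))  # [1.  2.5 2.5 4.  5. ]  (the tied 2's share rank 2.5)

# Pearson on the average ranks equals the Spearman coefficient
corr_manual, _ = pearsonr(rankdata(x), rankdata(y))
corr_scipy, _ = spearmanr(x, y)
print(corr_manual, corr_scipy)  # identical results
```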

Summary

| Use Case | Best Function | Why? |
|---|---|---|
| You want to measure the strength of a linear relationship. | pearsonr() | It is specifically designed for linear correlation. |
| You want to measure the strength of a monotonic relationship (linear or non-linear). | spearmanr() | It is based on ranks and captures any consistent upward/downward trend. |
| Your data has outliers. | spearmanr() | It is robust to outliers because it uses ranks. |
| Your data is ordinal (ranked) or not normally distributed. | spearmanr() | It is non-parametric and makes no distributional assumptions. |