杰瑞科技汇

How Do You Correctly Calculate Percentiles in Python?

The percentile() function is a core tool in data analysis and statistics: it finds the value below which a given percentage of the data in a dataset falls.

Let's break it down, from the basic concept to practical examples in Python.

What is a Percentile?

Imagine you have a list of exam scores for 100 students. If you are in the 90th percentile, it means you scored better than 90% of the students. In other words, only 10% of the students scored higher than you.
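
To make the idea concrete, here is a minimal sketch that counts the share of scores lying below a given value. The helper name percent_below is hypothetical, and counting strictly lower scores is only one of several conventions in use.

```python
def percent_below(scores, value):
    """Return the percentage of scores that are strictly below value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]

# 10 of the 12 scores are below 95, so the result is about 83.3.
print(round(percent_below(scores, 95), 1))
```

With only 12 scores the result moves in steps of roughly 8.3 percentage points, which is why percentile functions interpolate between data points.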

Key Points:

  • It's a measure of relative standing.
  • The 50th percentile is the same as the median (the middle value of the data).
  • The 25th percentile is also known as the first quartile (Q1).
  • The 75th percentile is the third quartile (Q3).
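
These relationships can be checked with the standard library alone. The sketch below assumes Python 3.8 or later and uses method="inclusive", which treats the list as the complete dataset; other methods can give different Q1 and Q3 values.

```python
import statistics

scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]

# n=4 returns the three quartile cut points: Q1, the median, and Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(q1, q2, q3)                       # 71.0 79.0 86.25
print(q2 == statistics.median(scores))  # True
```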

How to Calculate Percentile in Python

There are two primary ways to calculate percentiles in Python:

  1. Using the NumPy library: The most common and recommended method for numerical data, especially when working with large arrays or dataframes.
  2. Using the statistics module: A built-in Python module, good for simple lists but less flexible than NumPy.

Method 1: Using NumPy (Recommended)

NumPy is the standard for numerical computing in Python. Its numpy.percentile() function is powerful and efficient.

Installation

If you don't have NumPy installed, open your terminal or command prompt and run:

pip install numpy

Syntax

numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, method='linear', keepdims=False)
  • a: The array or list of numbers to compute the percentile for.
  • q: The percentile to compute. This can be a single number (e.g., 90) or a sequence of numbers (e.g., [25, 50, 75]).
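  • axis (optional): The axis along which the percentiles are computed. Useful for multi-dimensional arrays, such as a 2-D table of scores. With the default None, the array is flattened first.
  • method (optional): The method used to estimate the percentile when it falls between two data points. The default, 'linear', interpolates between them and is what most people need. This keyword is available from NumPy 1.22 onward; earlier versions called it interpolation.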
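
As a rough illustration of axis and method, the sketch below uses a small made-up 2-D array (rows as students, columns as exams) and assumes NumPy 1.22 or later, where the method keyword is available.

```python
import numpy as np

# Hypothetical data: each row is a student, each column is an exam.
grid = np.array([
    [55, 62, 68],
    [72, 75, 78],
    [80, 82, 85],
    [90, 95, 100],
])

# axis=0 works down the rows: one median per exam (column).
per_exam = np.percentile(grid, 50, axis=0)
# axis=1 works across the columns: one median per student (row).
per_student = np.percentile(grid, 50, axis=1)
print(per_exam)     # values 76.0, 78.5, 81.5
print(per_student)  # values 62.0, 75.0, 82.0, 95.0

# method decides what to return when the percentile lies between two points.
scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]
by_method = {
    m: float(np.percentile(scores, 90, method=m))
    for m in ("linear", "lower", "higher", "midpoint")
}
print(by_method)
```

For the 90th percentile of these scores, 'lower' and 'higher' return the neighboring data points 90 and 95, 'midpoint' returns their average 92.5, and 'linear' interpolates to 94.5.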

Example: Basic Usage

Let's find the 90th percentile of a list of exam scores.

import numpy as np
# A list of exam scores
scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]
# Calculate the 90th percentile
p90 = np.percentile(scores, 90)
print(f"The list of scores: {scores}")
print(f"The 90th percentile is: {p90}")

Output:

The list of scores: [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]
The 90th percentile is: 94.5

Explanation: The 90th percentile is a value such that 90% of the data is below it. To find it, NumPy sorts the data and computes a fractional position. For 12 data points, the 0-based position is (12 - 1) × 0.90 = 9.9, which falls between the 10th and 11th values (90 and 95). By default, NumPy uses linear interpolation, so the result is 90 + 0.9 × (95 - 90) = 94.5.
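
The interpolation can be reproduced by hand. The function below is a simplified sketch of the default 'linear' rule for a 1-D list, not NumPy's actual implementation, and the name percentile_linear is made up for illustration.

```python
import math

def percentile_linear(data, q):
    """Estimate the q-th percentile (0 <= q <= 100) by linear interpolation."""
    ordered = sorted(data)
    # Fractional 0-based position of the percentile in the sorted data.
    pos = (len(ordered) - 1) * q / 100
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    # Move frac of the way from the lower neighbor to the upper neighbor.
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]

# Position 9.9 lies between 90 and 95, so the result is approximately 94.5.
print(round(percentile_linear(scores, 90), 6))
```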

Example: Multiple Percentiles at Once

You can easily calculate several percentiles in one go by passing a list for the q parameter.

import numpy as np
scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]
# Calculate the 25th, 50th (median), and 75th percentiles
quartiles = np.percentile(scores, [25, 50, 75])
print(f"Quartiles (25th, 50th, 75th): {quartiles}")

Output:

Quartiles (25th, 50th, 75th): [71.   79.   86.25]
  • 25th Percentile (Q1): 71.0
  • 50th Percentile (Median): 79.0
  • 75th Percentile (Q3): 86.25

Method 2: Using the statistics Module

This module is part of Python's standard library, so no installation is needed (the quantiles function requires Python 3.8 or later). It's simpler but less feature-rich than NumPy.

Syntax

statistics.quantiles(data, *, n=4, method='exclusive')

The quantiles function is the most direct way to get percentiles. It returns a list of n-1 cut points that divide the data into n equal-sized groups.

  • data: The list of numbers.
  • n: The number of equal-sized groups to create. To get percentiles, you'd use n=100 for the 99 cut points (1st to 99th percentile).
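  • method: 'exclusive' (the default) or 'inclusive'. 'inclusive' treats the data as the entire population, so the minimum and maximum count as the 0th and 100th percentiles; this matches NumPy's default 'linear' method. 'exclusive' assumes the data is a sample from a larger population that may contain more extreme values.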
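
The choice of method changes the results noticeably on small datasets. A short sketch, using the same exam scores and quartiles (n=4) to keep the output small:

```python
import statistics

scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]

inclusive = statistics.quantiles(scores, n=4, method="inclusive")
exclusive = statistics.quantiles(scores, n=4, method="exclusive")

print(inclusive)  # [71.0, 79.0, 86.25]
print(exclusive)  # [69.0, 79.0, 88.75]
```

The median is the same either way, but 'exclusive' pushes Q1 and Q3 further toward the extremes.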

Example: Basic Usage

To get a specific percentile, you can call quantiles and then pick the value you need.

import statistics
scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]
# To get the 90th percentile, we need the 90th cut point (n=100)
# The result is a list of 99 values.
all_percentiles = statistics.quantiles(scores, n=100, method='inclusive')
# The 90th percentile is the 90th element in the list (index 89)
p90_stats = all_percentiles[89]
print(f"The 90th percentile using statistics.quantiles is: {p90_stats}")

Output:

The 90th percentile using statistics.quantiles is: 94.5

Note: The statistics module does not have a direct percentile() function like NumPy. You have to use quantiles and handle the indexing yourself, which can be less convenient.
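
If the indexing feels error-prone, it can be hidden behind a small wrapper. The percentile helper below is hypothetical (it is not part of the statistics module) and supports only whole-number percentiles from 1 to 99.

```python
import statistics

def percentile(data, q, method="inclusive"):
    """Hypothetical helper: return the q-th percentile for integer q in 1..99."""
    if not (isinstance(q, int) and 1 <= q <= 99):
        raise ValueError("q must be an integer from 1 to 99")
    # quantiles(n=100) returns 99 cut points; the q-th one sits at index q - 1.
    return statistics.quantiles(data, n=100, method=method)[q - 1]

scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]
print(percentile(scores, 90))  # 94.5
print(percentile(scores, 50))  # 79.0
```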


Comparison: NumPy vs. statistics

Feature | NumPy (np.percentile) | statistics (quantiles)
Ease of Use | Excellent. Direct percentile() function. | Good, but less direct. Requires using quantiles and indexing.
Performance | Very fast. Optimized for large arrays. | Slower for large datasets.
Flexibility | Excellent. Handles multi-dimensional arrays, different interpolation methods. | Basic. Primarily for 1D lists.
Dependencies | Requires numpy installation. | Built-in to Python (no installation needed).
Best For | Data analysis, machine learning, large datasets, scientific computing. | Simple, quick calculations on small lists without external libraries.
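
Although the interfaces differ, the two approaches should agree numerically when their settings line up. The sketch below assumes NumPy is installed and compares its default 'linear' method with statistics.quantiles using method="inclusive" for every whole-number percentile.

```python
import math
import statistics

import numpy as np

scores = [55, 62, 68, 72, 75, 78, 80, 82, 85, 90, 95, 100]

# 99 cut points from the standard library: index q - 1 holds the q-th percentile.
cuts = statistics.quantiles(scores, n=100, method="inclusive")

# Compare with a tolerance, since the two libraries round differently.
agree = all(
    math.isclose(cuts[q - 1], float(np.percentile(scores, q)))
    for q in range(1, 100)
)
print(agree)  # True
```

With the statistics default, method="exclusive", the numbers would generally differ from NumPy's.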

Summary and Recommendation

For any serious data analysis, use NumPy. It's the industry standard, more powerful, and more efficient.

  • Start with numpy.percentile() for all your percentile needs.
  • Use the statistics module only if you are in an environment where you cannot install external packages and your dataset is very small.