
Python Spark: How can sortBy efficiently implement multi-column sorting?

In PySpark's DataFrame API there is no method named sortBy; that name belongs to the RDD API (e.g., rdd.sortBy(lambda x: x[1])). For DataFrames, Spark provides two interchangeable methods: orderBy() and sort(), which are aliases of each other.


Let's break down how they relate, when to use each, and walk through clear examples.

The Main Functions: orderBy() vs. sort()

Both methods sort a DataFrame, and despite the two names they are the same method: in the PySpark source, orderBy is defined as an alias of sort.

  • Relationship: orderBy() and sort() build identical plans, so neither is faster than the other.
  • Argument types: both accept string column names, Column objects, or a list of either, plus an ascending keyword (a single bool, or a list of bools, one per column).
  • Related method: sortWithinPartitions() sorts each partition independently, without the shuffle that a full sort requires (more on this below).

In short:

  • Use orderBy() when you want your code to read like SQL's ORDER BY, especially when sorting by expressions (e.g., col("age") + 5, year("date_col")).
  • Use sort() when you prefer the shorter name for straightforward sorting by one or more columns. There is no performance difference; a quick sanity check follows below.
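
As that sanity check, here is a minimal, self-contained sketch showing the two methods are interchangeable (the two-row DataFrame is invented purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.appName("AliasCheck").getOrCreate()
tiny_df = spark.createDataFrame([("a", 2), ("b", 1)], ["k", "v"])

# Both calls build the same plan, so the results match row for row.
assert tiny_df.sort(desc("v")).collect() == tiny_df.orderBy(desc("v")).collect()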

Basic Sorting with sort()

This is the most common use case. You specify one or more columns and the sort direction (asc for ascending, desc for descending).


Example Setup

First, let's create a sample SparkSession and DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, asc
# Create a SparkSession
spark = SparkSession.builder.appName("SortByExample").getOrCreate()
data = [("Alice", 34, 80000),
        ("Bob", 45, 120000),
        ("Charlie", 29, 75000),
        ("David", 45, 110000), # Same age as Bob
        ("Eve", 34, 90000)]   # Same age as Alice
columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 34| 80000|
|    Bob| 45|120000|
|Charlie| 29| 75000|
|  David| 45|110000|
|    Eve| 34| 90000|
+-------+---+------+

Example 1: Sort by a Single Column (Ascending)

# Sort by 'age' in ascending order (default)
sorted_df_asc = df.sort("age")
# Or using the 'asc' function
# sorted_df_asc = df.sort(asc("age"))
sorted_df_asc.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|Charlie| 29| 75000|
|  Alice| 34| 80000|
|    Eve| 34| 90000|
|    Bob| 45|120000|
|  David| 45|110000|
+-------+---+------+

Notice that for equal ages (34 and 45), the rows happen to appear in their original order here. Do not rely on this: Spark's distributed sort is not guaranteed to be stable, and rows with equal keys may come back in a different relative order between runs or cluster layouts. If you need a deterministic order for ties, add an explicit tie-breaking column.
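
A minimal sketch of such a tie-breaker, using the df above (name is an arbitrary choice of secondary key):

# Sort by 'age', breaking ties by 'name' so the output is reproducible
deterministic_df = df.sort("age", "name")
deterministic_df.show()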


Example 2: Sort by a Single Column (Descending)

# Sort by 'salary' in descending order
sorted_df_desc = df.sort(desc("salary"))
# Or equivalently, using the Column.desc() method
# sorted_df_desc = df.sort(col("salary").desc())
sorted_df_desc.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|    Bob| 45|120000|
|  David| 45|110000|
|    Eve| 34| 90000|
|  Alice| 34| 80000|
|Charlie| 29| 75000|
+-------+---+------+

Example 3: Sort by Multiple Columns

You can pass a list of columns to sort(). The order of the list determines the sort priority.

# Sort by 'age' (ascending), and then by 'salary' (descending)
# This will group people by age, and within each age group, sort them by salary from highest to lowest.
multi_sorted_df = df.sort(["age", desc("salary")])
multi_sorted_df.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|Charlie| 29| 75000|
|    Eve| 34| 90000|
|  Alice| 34| 80000|
|    Bob| 45|120000|
|  David| 45|110000|
+-------+---+------+

Explanation:

  1. First, the DataFrame is sorted by age (29, 34, 34, 45, 45).
  2. Then, rows with the same age (34 and 45) are further sorted by salary in descending order: Eve (90000) comes before Alice (80000), and Bob (120000) before David (110000).
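
An equivalent spelling uses the ascending keyword, pairing one boolean with each column; a minimal sketch with the same df:

# Same result: True = ascending for 'age', False = descending for 'salary'
df.sort(["age", "salary"], ascending=[True, False]).show()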

General Sorting with orderBy()

Because orderBy() is an alias of sort(), it behaves identically; it is simply the conventional choice when you sort by complex expressions, since it echoes SQL's ORDER BY.

Example 4: Sorting with orderBy() and Expressions

Let's say we want to sort by a calculation, like salary / age (salary per year of age).

# Derive a 'salary per year of age' column, then sort by it in descending order
expr_sorted_df = df.withColumn("salary_per_age", col("salary") / col("age")) \
                   .orderBy(desc("salary_per_age"))
expr_sorted_df.show()

Output:

+-------+---+------+------------------+
|   name|age|salary|    salary_per_age|
+-------+---+------+------------------+
|    Bob| 45|120000|2666.6666666666665|
|    Eve| 34| 90000|2647.0588235294117|
|Charlie| 29| 75000|2586.2068965517242|
|  David| 45|110000|2444.4444444444446|
|  Alice| 34| 80000|2352.9411764705883|
+-------+---+------+------------------+

Here, orderBy() is the natural choice because we are sorting by a freshly derived column (salary_per_age). Since sort() is an alias, it would work just as well; orderBy() simply reads more like the SQL it mirrors.
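
You can also sort by the expression directly, without materializing a new column first; a minimal sketch with the same df:

# Sort by the computed expression inline, highest ratio first
df.orderBy((col("salary") / col("age")).desc()).show()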


Performance Consideration: Global Sort vs. sortWithinPartitions()

By default, sort() and orderBy() perform a global sort: Spark range-partitions the data with a shuffle so that the entire dataset is totally ordered across all partitions. You can see this if you look at the physical plan.

If you only need each partition sorted internally (for example, before writing files where only per-file order matters), sortWithinPartitions() sorts each partition independently and skips the shuffle, which is much cheaper.

# Global sort (the default): Spark inserts a range-partitioning shuffle.
print("Global sort plan (sort/orderBy):")
df.sort("age").explain()
# Partition-local sort: each partition is sorted independently, no shuffle.
print("\nPartition-local sort plan (sortWithinPartitions):")
df.sortWithinPartitions("age").explain()

When you run explain(), the global sort shows an Exchange rangepartitioning step feeding the Sort operator, indicating the data is moved across the network so that partition boundaries respect the sort order. The sortWithinPartitions() plan shows a Sort whose global flag is false, with no Exchange.
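
A common use for the partition-local variant is pre-sorting data before a write, when only per-file order matters; a minimal sketch (the output path is hypothetical):

# Each output file is internally sorted by 'age'; there is no global order across files.
df.repartition(4).sortWithinPartitions("age").write.mode("overwrite").parquet("/tmp/sorted_by_age")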


Summary and Best Practices

  • Simple sort by column name(s): sort() or orderBy(), e.g. df.sort("age", desc("salary"))
  • Sort by a complex expression: orderBy() (by convention), e.g. df.orderBy(col("salary") / col("age"))
  • Sort by SQL functions: orderBy(), e.g. df.orderBy(year("date_column"))
  • Need a guaranteed global order: sort() or orderBy(); this is their default behavior, e.g. df.sort("age")
  • Sort within partitions (no shuffle): sortWithinPartitions(), e.g. df.repartition(5).sortWithinPartitions("age")

Final Recommendation: For most day-to-day tasks, sort() and orderBy() are interchangeable, so pick one and use it consistently; orderBy() tends to read best for SQL-style, expression-heavy sorts. Reach for sortWithinPartitions() when a per-partition order is enough and you want to avoid the shuffle. All three belong in any PySpark developer's toolkit.
