
Python Spark: How can sortBy efficiently implement multi-column sorting?

In PySpark's DataFrame API there is no method named sortBy; that name belongs to the RDD API (e.g., rdd.sortBy(lambda x: x[1])). For DataFrames, Spark provides two interchangeable methods: orderBy() and sort(), which are aliases of each other.


Let's break down how they relate, when to use each, and walk through clear examples.

The Main Functions: orderBy() vs. sort()

Both methods sort a DataFrame, and despite the two names they are the same method: in the PySpark source, orderBy is defined as an alias of sort.

  • Relationship: orderBy() and sort() build identical plans, so neither is faster than the other.
  • Argument types: both accept string column names, Column objects, or a list of either, plus an ascending keyword (a single bool, or a list of bools, one per column).
  • Related method: sortWithinPartitions() sorts each partition independently, without the shuffle that a full sort requires (more on this below).

In short:

  • Use orderBy() when you want your code to read like SQL's ORDER BY, especially when sorting by expressions (e.g., col("age") + 5, year("date_col")).
  • Use sort() when you prefer the shorter name for straightforward sorting by one or more columns. There is no performance difference; a quick sanity check follows below.
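
As that sanity check, here is a minimal, self-contained sketch showing the two methods are interchangeable (the two-row DataFrame is invented purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.appName("AliasCheck").getOrCreate()
tiny_df = spark.createDataFrame([("a", 2), ("b", 1)], ["k", "v"])

# Both calls build the same plan, so the results match row for row.
assert tiny_df.sort(desc("v")).collect() == tiny_df.orderBy(desc("v")).collect()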

Basic Sorting with sort()

This is the most common use case. You specify one or more columns and the sort direction (asc for ascending, desc for descending).


Example Setup

First, let's create a sample SparkSession and DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, asc
# Create a SparkSession
spark = SparkSession.builder.appName("SortByExample").getOrCreate()
data = [("Alice", 34, 80000),
        ("Bob", 45, 120000),
        ("Charlie", 29, 75000),
        ("David", 45, 110000), # Same age as Bob
        ("Eve", 34, 90000)]   # Same age as Alice
columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 34| 80000|
|    Bob| 45|120000|
|Charlie| 29| 75000|
|  David| 45|110000|
|    Eve| 34| 90000|
+-------+---+------+

Example 1: Sort by a Single Column (Ascending)

# Sort by 'age' in ascending order (default)
sorted_df_asc = df.sort("age")
# Or using the 'asc' function
# sorted_df_asc = df.sort(asc("age"))
sorted_df_asc.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|Charlie| 29| 75000|
|  Alice| 34| 80000|
|    Eve| 34| 90000|
|    Bob| 45|120000|
|  David| 45|110000|
+-------+---+------+

Notice that for equal ages (34 and 45), the rows happen to appear in their original order here. Do not rely on this: Spark's distributed sort is not guaranteed to be stable, and rows with equal keys may come back in a different relative order between runs or cluster layouts. If you need a deterministic order for ties, add an explicit tie-breaking column.
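
A minimal sketch of such a tie-breaker, using the df above (name is an arbitrary choice of secondary key):

# Sort by 'age', breaking ties by 'name' so the output is reproducible
deterministic_df = df.sort("age", "name")
deterministic_df.show()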


Example 2: Sort by a Single Column (Descending)

# Sort by 'salary' in descending order
sorted_df_desc = df.sort(desc("salary"))
# Or equivalently, using the Column.desc() method
# sorted_df_desc = df.sort(col("salary").desc())
sorted_df_desc.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|    Bob| 45|120000|
|  David| 45|110000|
|    Eve| 34| 90000|
|  Alice| 34| 80000|
|Charlie| 29| 75000|
+-------+---+------+

Example 3: Sort by Multiple Columns

You can pass a list of columns to sort(). The order of the list determines the sort priority.

# Sort by 'age' (ascending), and then by 'salary' (descending)
# This will group people by age, and within each age group, sort them by salary from highest to lowest.
multi_sorted_df = df.sort(["age", desc("salary")])
multi_sorted_df.show()

Output:

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|Charlie| 29| 75000|
|    Eve| 34| 90000|
|  Alice| 34| 80000|
|    Bob| 45|120000|
|  David| 45|110000|
+-------+---+------+

Explanation:

  1. First, the DataFrame is sorted by age (29, 34, 34, 45, 45).
  2. Then, rows with the same age (34 and 45) are further sorted by salary in descending order: Eve (90000) comes before Alice (80000), and Bob (120000) before David (110000).
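
An equivalent spelling uses the ascending keyword, pairing one boolean with each column; a minimal sketch with the same df:

# Same result: True = ascending for 'age', False = descending for 'salary'
df.sort(["age", "salary"], ascending=[True, False]).show()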

General Sorting with orderBy()

Because orderBy() is an alias of sort(), it behaves identically; it is simply the conventional choice when you sort by complex expressions, since it echoes SQL's ORDER BY.

Example 4: Sorting with orderBy() and Expressions

Let's say we want to sort by a calculation, like salary / age (salary per year of age).

# Derive a 'salary per year of age' column, then sort by it in descending order
expr_sorted_df = df.withColumn("salary_per_age", col("salary") / col("age")) \
                   .orderBy(desc("salary_per_age"))
expr_sorted_df.show()

Output:

+-------+---+------+------------------+
|   name|age|salary|    salary_per_age|
+-------+---+------+------------------+
|    Bob| 45|120000|2666.6666666666665|
|    Eve| 34| 90000|2647.0588235294117|
|Charlie| 29| 75000|2586.2068965517242|
|  David| 45|110000|2444.4444444444446|
|  Alice| 34| 80000|2352.9411764705883|
+-------+---+------+------------------+

Here, orderBy() is the natural choice because we are sorting by a freshly derived column (salary_per_age). Since sort() is an alias, it would work just as well; orderBy() simply reads more like the SQL it mirrors.
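
You can also sort by the expression directly, without materializing a new column first; a minimal sketch with the same df:

# Sort by the computed expression inline, highest ratio first
df.orderBy((col("salary") / col("age")).desc()).show()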


Performance Consideration: Global Sort vs. sortWithinPartitions()

By default, sort() and orderBy() perform a global sort: Spark range-partitions the data with a shuffle so that the entire dataset is totally ordered across all partitions. You can see this if you look at the physical plan.

If you only need each partition sorted internally (for example, before writing files where only per-file order matters), sortWithinPartitions() sorts each partition independently and skips the shuffle, which is much cheaper.

# Global sort (the default): Spark inserts a range-partitioning shuffle.
print("Global sort plan (sort/orderBy):")
df.sort("age").explain()
# Partition-local sort: each partition is sorted independently, no shuffle.
print("\nPartition-local sort plan (sortWithinPartitions):")
df.sortWithinPartitions("age").explain()

When you run explain(), the global sort shows an Exchange rangepartitioning step feeding the Sort operator, indicating the data is moved across the network so that partition boundaries respect the sort order. The sortWithinPartitions() plan shows a Sort whose global flag is false, with no Exchange.
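
A common use for the partition-local variant is pre-sorting data before a write, when only per-file order matters; a minimal sketch (the output path is hypothetical):

# Each output file is internally sorted by 'age'; there is no global order across files.
df.repartition(4).sortWithinPartitions("age").write.mode("overwrite").parquet("/tmp/sorted_by_age")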


Summary and Best Practices

  • Simple sort by column name(s): sort() or orderBy(), e.g. df.sort("age", desc("salary"))
  • Sort by a complex expression: orderBy() (by convention), e.g. df.orderBy(col("salary") / col("age"))
  • Sort by SQL functions: orderBy(), e.g. df.orderBy(year("date_column"))
  • Need a guaranteed global order: sort() or orderBy(); this is their default behavior, e.g. df.sort("age")
  • Sort within partitions (no shuffle): sortWithinPartitions(), e.g. df.repartition(5).sortWithinPartitions("age")

Final Recommendation: For most day-to-day tasks, sort() and orderBy() are interchangeable, so pick one and use it consistently; orderBy() tends to read best for SQL-style, expression-heavy sorts. Reach for sortWithinPartitions() when a per-partition order is enough and you want to avoid the shuffle. All three belong in any PySpark developer's toolkit.
