Of course! In PySpark, DataFrames don't have a method named sortBy (that name belongs to the lower-level RDD API). Instead, a DataFrame offers two methods, orderBy() and sort(), which are aliases of each other.

Let's break down the differences, use cases, and provide clear examples.
The Main Functions: orderBy() vs. sort()
Both methods sort a DataFrame, and in PySpark they are aliases of each other: orderBy() is simply another name for sort(), so they accept the same arguments and behave identically.

| Feature | orderBy() | sort() |
|---|---|---|
| Behavior | Identical to sort(); it is an alias. | The underlying method. |
| Argument Type | Column objects, string column names, or a list of either; also an ascending keyword. | Same as orderBy(). |
| Typical Style | Reads like SQL's ORDER BY; common in SQL-flavored code. | Common in DataFrame-flavored code. |
| Performance | No difference; both compile to the same physical plan. | No difference. |

In short:
- The choice between the two is purely stylistic; there is no performance difference.
- Both accept complex expressions (e.g., col("age") + 5, year("date_col")) as well as plain column names.
Basic Sorting with sort()
This is the most common use case. You specify one or more columns and the sort direction (asc for ascending, desc for descending).

Example Setup
First, let's create a sample SparkSession and DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, asc
# Create a SparkSession
spark = SparkSession.builder.appName("SortByExample").getOrCreate()
data = [("Alice", 34, 80000),
("Bob", 45, 120000),
("Charlie", 29, 75000),
("David", 45, 110000), # Same age as Bob
("Eve", 34, 90000)] # Same age as Alice
columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------+---+------+
| name|age|salary|
+-------+---+------+
| Alice| 34| 80000|
| Bob| 45|120000|
|Charlie| 29| 75000|
| David| 45|110000|
| Eve| 34| 90000|
+-------+---+------+
Example 1: Sort by a Single Column (Ascending)
# Sort by 'age' in ascending order (default)
sorted_df_asc = df.sort("age")
# Or using the 'asc' function
# sorted_df_asc = df.sort(asc("age"))
sorted_df_asc.show()
Output:
+-------+---+------+
| name|age|salary|
+-------+---+------+
|Charlie| 29| 75000|
| Alice| 34| 80000|
| Eve| 34| 90000|
| Bob| 45|120000|
| David| 45|110000|
+-------+---+------+
Notice that for the equal ages (34 and 45), the rows happen to appear in their original order here. Don't rely on this: Spark's distributed sort is not guaranteed to be stable.

Example 2: Sort by a Single Column (Descending)
# Sort by 'salary' in descending order
sorted_df_desc = df.sort(desc("salary"))
# Or using the equivalent Column method
# sorted_df_desc = df.sort(col("salary").desc())
sorted_df_desc.show()
Output:
+-------+---+------+
|   name|age|salary|
+-------+---+------+
|    Bob| 45|120000|
|  David| 45|110000|
|    Eve| 34| 90000|
|  Alice| 34| 80000|
|Charlie| 29| 75000|
+-------+---+------+
Example 3: Sort by Multiple Columns
You can pass a list of columns to sort(). The order of the list determines the sort priority.
# Sort by 'age' (ascending), and then by 'salary' (descending)
# This will group people by age, and within each age group, sort them by salary from highest to lowest.
multi_sorted_df = df.sort(["age", desc("salary")])
multi_sorted_df.show()
Output:
+-------+---+------+
| name|age|salary|
+-------+---+------+
|Charlie| 29| 75000|
| Eve| 34| 90000|
| Alice| 34| 80000|
| Bob| 45|120000|
| David| 45|110000|
+-------+---+------+
Explanation:
- First, the DataFrame is sorted by age (29, 34, 34, 45, 45).
- Then, rows with the same age (34 and 45) are further sorted by salary in descending order. Hence, Eve (90000) comes before Alice (80000), and Bob (120000) comes before David (110000).
General Sorting with orderBy()
orderBy() is an alias of sort(), so it behaves identically; it is often the more natural-looking choice when you sort by complex expressions, since it mirrors SQL's ORDER BY.
Example 4: Sorting with orderBy() and Expressions
Let's say we want to sort by a calculation, like salary / age (average salary per year of age).
# Calculate salary per age and sort by it
expr_sorted_df = df.withColumn("salary_per_age", col("salary") / col("age")) \
                   .orderBy(desc("salary_per_age"))
expr_sorted_df.show()
Output:
+-------+---+------+------------------+
|   name|age|salary|    salary_per_age|
+-------+---+------+------------------+
|    Bob| 45|120000|2666.6666666666665|
|    Eve| 34| 90000|2647.0588235294117|
|Charlie| 29| 75000|2586.2068965517242|
|  David| 45|110000|2444.4444444444446|
|  Alice| 34| 80000|2352.9411764705883|
+-------+---+------+------------------+
Here, orderBy() reads naturally because we are sorting by a column that was just created (salary_per_age). sort() would work identically, since the two are aliases.
Performance Consideration: Global vs. Partition-Local Sorting
By default, sort() and orderBy() perform a global sort: Spark range-partitions the data in a full shuffle so that the entire DataFrame is ordered across all partitions. If you only need each partition sorted independently and want to avoid the shuffle, use sortWithinPartitions().
# Global sort: a full shuffle (range partitioning) guarantees a total order
# across the whole DataFrame. This is what sort() / orderBy() do by default.
globally_sorted_df = df.sort("age")
# Partition-local sort: no shuffle; each partition is sorted independently,
# but there is no ordering guarantee across partitions.
locally_sorted_df = df.sortWithinPartitions("age")
print("Global Sort Plan (Full Shuffle):")
df.sort("age").explain()
print("\nPartition-Local Sort Plan (No Shuffle):")
df.sortWithinPartitions("age").explain()
When you run explain(), both plans contain a Sort operator, but only the global sort is preceded by an Exchange rangepartitioning step, indicating that data is moved across the network to establish a total order.
Summary and Best Practices
| Your Goal | Recommended Function | Example |
|---|---|---|
| Simple sort by column name(s) | sort() or orderBy() | df.sort("age", desc("salary")) |
| Sort by a complex expression | sort() or orderBy() | df.orderBy(col("salary") / col("age")) |
| Sort by SQL functions | sort() or orderBy() | df.orderBy(year("date_column")) |
| Need a guaranteed global order | sort() or orderBy() (the default) | df.sort("age") |
| Sort within partitions (no shuffle) | sortWithinPartitions() | df.repartition(5).sortWithinPartitions("age") |
Final Recommendation:
For most day-to-day tasks, sort() and orderBy() are interchangeable; pick whichever your team finds more readable. Reach for sortWithinPartitions() only when you explicitly want to avoid a shuffle and a partition-local order is enough. All three are essential tools in any PySpark developer's toolkit.
