杰瑞科技汇

Python表格合并,如何高效实现?

Of course! Merging tables (or more accurately, DataFrames) is a fundamental operation in data analysis, especially when using the Pandas library in Python. It's similar to SQL's JOIN operations.

Python表格合并,如何高效实现?-图1
(图片来源网络,侵删)

Let's break down the different ways to merge tables in Python, focusing on Pandas, which is the standard for this task.

The Core Concept: Keys and Joins

Merging combines two DataFrames based on the values in one or more columns. These columns are called the "key" or "on" columns. The way you specify how to combine the rows based on these keys is called the "join type".


Setup: Creating Sample DataFrames

First, let's create two simple DataFrames to work with.

import pandas as pd
# DataFrame 1: Employees
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'department_id': [101, 102, 101, 103]
})
# DataFrame 2: Departments
departments = pd.DataFrame({
    'department_id': [101, 102, 104], # Note: dept 103 is missing, dept 104 is new
    'department_name': ['Sales', 'Engineering', 'HR']
})
print("--- Employees ---")
print(employees)
print("\n--- Departments ---")
print(departments)

Output:

Python表格合并,如何高效实现?-图2
(图片来源网络,侵删)
--- Employees ---
   employee_id     name  department_id
0            1    Alice           101
1            2      Bob           102
2            3  Charlie           101
3            4    David           103
--- Departments ---
   department_id department_name
0            101           Sales
1            102     Engineering
2            104              HR

Now, let's explore the different merge types using these DataFrames.


pd.merge() - The Main Function

The primary function for merging is pd.merge(). The most important parameters are:

  • left: The left DataFrame.
  • right: The right DataFrame.
  • on: The column name(s) to join on. If not specified, Pandas will look for common columns.
  • how: The type of join to perform. This is the most critical parameter.

The how Parameter: Different Types of Joins

This is where you control how the tables are combined.

a) Inner Join (how='inner')

This is the default type of merge. It returns only the rows where the key exists in both DataFrames.

Python表格合并,如何高效实现?-图3
(图片来源网络,侵删)
  • Result: Departments 101 and 102 are present in both tables. Department 103 (only in employees) and 104 (only in departments) are dropped.
inner_merge = pd.merge(employees, departments, on='department_id', how='inner')
print("--- Inner Merge ---")
print(inner_merge)

Output:

--- Inner Merge ---
   employee_id     name  department_id department_name
0            1    Alice           101           Sales
1            3  Charlie           101           Sales
2            2      Bob           102     Engineering

b) Left Join (how='left')

This returns all rows from the left DataFrame (employees), and matched rows from the right DataFrame (departments). If there's no match in the right DataFrame, the result is NaN (Not a Number).

  • Result: All employees are kept. David (dept 103) gets NaN for department_name. Department 104 is dropped.
left_merge = pd.merge(employees, departments, on='department_id', how='left')
print("\n--- Left Merge ---")
print(left_merge)

Output:

--- Left Merge ---
   employee_id     name  department_id department_name
0            1    Alice           101           Sales
1            2      Bob           102     Engineering
2            3  Charlie           101           Sales
3            4    David           103             NaN

c) Right Join (how='right')

This returns all rows from the right DataFrame (departments), and matched rows from the left DataFrame (employees). If there's no match in the left DataFrame, the result is NaN.

  • Result: All departments are kept. HR (dept 104) gets NaN for the employee columns. David (dept 103) is dropped.
right_merge = pd.merge(employees, departments, on='department_id', how='right')
print("\n--- Right Merge ---")
print(right_merge)

Output:

--- Right Merge ---
   employee_id     name  department_id department_name
0            1    Alice           101           Sales
1            3  Charlie           101           Sales
2            2      Bob           102     Engineering
3          NaN      NaN           104              HR

d) Outer Join (how='outer')

This returns all rows from both DataFrames. Where there is no match, the result is NaN.

  • Result: All employees and all departments are included. David (dept 103) and HR (dept 104) both get NaN in the columns from the other table.
outer_merge = pd.merge(employees, departments, on='department_id', how='outer')
print("\n--- Outer Merge ---")
print(outer_merge)

Output:

--- Outer Merge ---
   employee_id     name  department_id department_name
0            1    Alice           101           Sales
1            3  Charlie           101           Sales
2            2      Bob           102     Engineering
3            4    David           103             NaN
4          NaN      NaN           104              HR

Handling Different Key Column Names

What if the key columns have different names in the two DataFrames? You can use left_on and right_on.

Let's change the departments DataFrame to use dept_id as its key.

departments_renamed = departments.rename(columns={'department_id': 'dept_id'})
print("\n--- Renamed Departments ---")
print(departments_renamed)

Output:

--- Renamed Departments ---
   dept_id department_name
0      101           Sales
1      102     Engineering
2      104              HR

Now, we merge using left_on and right_on:

# Merge employees with the renamed departments
different_key_merge = pd.merge(
    employees,
    departments_renamed,
    left_on='department_id',
    right_on='dept_id',
    how='left'
)
print("\n--- Merge with Different Key Names ---")
print(different_key_merge)

Output:

--- Merge with Different Key Names ---
   employee_id     name  department_id  dept_id department_name
0            1    Alice           101      101           Sales
1            2      Bob           102      102     Engineering
2            3  Charlie           101      101           Sales
3            4    David           103      103             NaN

Merging on Multiple Keys (Composite Keys)

Sometimes you need to merge on more than one column. For example, if you have order lines and products, you might need to match on both order_id and product_sku.

# DataFrame 1: Order Lines
order_lines = pd.DataFrame({
    'order_id': [101, 101, 102, 103],
    'product_sku': ['A-123', 'B-456', 'A-123', 'C-789'],
    'quantity': [2, 1, 5, 1]
})
# DataFrame 2: Products
products = pd.DataFrame({
    'product_sku': ['A-123', 'B-456', 'C-789'],
    'price': [10.00, 15.50, 8.99]
})
# Merge on both order_id and product_sku
# (Note: In this simple case, product_sku is enough, but this demonstrates the concept)
# Let's assume a more complex scenario where we also have order_date
order_lines_complex = pd.DataFrame({
    'order_id': [101, 101, 102, 103],
    'order_date': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-03'],
    'product_sku': ['A-123', 'B-456', 'A-123', 'C-789'],
    'quantity': [2, 1, 5, 1]
})
products_complex = pd.DataFrame({
    'product_sku': ['A-123', 'B-456', 'C-789'],
    'price': [10.00, 15.50, 8.99]
})
# We can merge on product_sku, and Pandas will keep the order_id and order_date from the left
multi_key_merge = pd.merge(
    order_lines_complex,
    products_complex,
    on='product_sku', # We can pass a list of keys: on=['order_id', 'product_sku']
    how='left'
)
print("\n--- Multi-Key Merge (conceptual) ---")
print(multi_key_merge)

Output:

--- Multi-Key Merge (conceptual) ---
   order_id order_date product_sku  quantity  price
0       101  2025-01-01       A-123         2  10.00
1       101  2025-01-01       B-456         1  15.50
2       102  2025-01-02       A-123         5  10.00
3       103  2025-01-03       C-789         1   8.99

Alternative: DataFrame.join()

Pandas also has a .join() method, which is a convenience function. It's most useful when you want to join on the index of one DataFrame to the column(s) of another.

Key difference:

  • merge() is more general and flexible. You join on any column(s).
  • join() is syntactic sugar for merge(). It defaults to joining on the index of the left DataFrame and the index of the right DataFrame, but you can change this with the on and lsuffix/rsuffix parameters.
# Set 'employee_id' as the index for the employees DataFrame
employees_indexed = employees.set_index('employee_id')
# DataFrame with a different index
salaries = pd.DataFrame({
    'salary': [70000, 80000, 120000, 95000]
}, index=[1, 2, 3, 5]) # Note: David (id 4) is missing, and id 5 is new
# Use join to combine employees and their salaries
# We need suffixes because 'department_id' is in both DataFrames
joined_df = employees_indexed.join(salaries, how='left', lsuffix='_emp', rsuffix='_sal')
print("\n--- Using DataFrame.join() ---")
print(joined_df)

Output:

--- Using DataFrame.join() ---
           name  department_id  salary
employee_id
1        Alice           101  70000.0
2          Bob           102  80000.0
3      Charlie           101 120000.0
4        David           103      NaN
5          NaN             NaN  95000.0

Summary: merge() vs. join()

Feature pd.merge() DataFrame.join()
Primary Use General-purpose merging on columns. Joining an index to a column(s).
Keys Explicitly defined with on, left_on, right_on. Defaults to indexes. on can specify a column in the right DataFrame.
Flexibility High. Can join on any combination of columns/indexes. Less flexible. Designed for index-to-column joins.
Suffixes Handled automatically if column names clash. Requires lsuffix and rsuffix if column names clash.
Recommendation Use this 90% of the time. It's the standard and most powerful tool. Use for specific cases, like joining a DataFrame to its own summary statistics (which are often indexed).

Final Recommendation

For almost all table merging tasks in Python with Pandas, start with pd.merge(). It is the most explicit, flexible, and widely understood method. Learn the how parameter (inner, left, right, outer) well, as it controls the logic of your merge.

分享:
扫描分享到社交APP
上一篇
下一篇