14 Join/Merge

1 Two Main Approaches in Python

Approach 1: The pd.merge() function

  • The most explicit way to join two tables.
  • Clearly specifies left table, right table, join type, and keys.

Approach 2: The .merge() method

  • Example: df1.merge(df2, ...)
  • Same result; some find it more natural in method-chaining workflows.

2 Pandas Merge Syntax

pd.merge(
    left=df1,
    right=df2,
    how="inner",        # join type
    on="id"             # key variable(s)
)

Or:

df1.merge(df2, how="inner", on="id")

3 Types of Joins

R concept pandas argument explanation
inner how="inner" rows with matches in both
left how="left" keep all rows from the left
right how="right" keep all rows from the right
full how="outer" union of keys from both tables
semi filtering after merge keep matching rows from left
anti filtering after merge keep non-matching rows from left

4 Example Data

import pandas as pd
students = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"]
})
students.head()
id name
0 1 A
1 2 B
2 3 C
3 4 D
scores = pd.DataFrame({
    "id": [2, 3, 3, 5],
    "score": [80, 90, 85, 70]
})

scores.head()
id score
0 2 80
1 3 90
2 3 85
3 5 70

5 Inner Join

students.merge(scores, how="inner", on="id")
id name score
0 2 B 80
1 3 C 90
2 3 C 85

This returns rows where the key appears in both tables.

Note

If a key appears more than once in either table, pandas automatically performs a many-to-many join, producing all pairwise combinations of matches.

6 Left Join

students.merge(scores, how="left", on="id")
id name score
0 1 A NaN
1 2 B 80.0
2 3 C 90.0
3 3 C 85.0
4 4 D NaN

Keeps all rows from students; unmatched scores show as NaN.

7 Right Join

students.merge(scores, how="right", on="id")
id name score
0 2 B 80
1 3 C 90
2 3 C 85
3 5 NaN 70

Less commonly used but works the same way.

8 Full/Outer Join

students.merge(scores, how="outer", on="id")
id name score
0 1 A NaN
1 2 B 80.0
2 3 C 90.0
3 3 C 85.0
4 4 D NaN
5 5 NaN 70.0

9 Joining by Different Column Names

Equivalent to R’s left_join(df1, df2, by = c("x" = "y"))

Python:

df1.merge(df2, left_on="x", right_on="y")

10 Joining by Multiple Keys

Equivalent to R’s left_join(df1, df2, by = c("id", "year"))

Python:

df1.merge(df2, on=["id", "year"])

11 Semi Join in Python

Keep only rows in df1 that have a match in df2:

df1[df1["id"].isin(df2["id"])]

Or:

df1.merge(df2[["id"]], on="id")

12 Anti Join in Python

Keep only rows in df1 that do not have a match in df2:

df1[~df1["id"].isin(df2["id"])]

13 Summary

Concept tidyverse pandas
Join function left_join() .merge()
Join type left, right, inner, full how=
Keys by= on=, left_on=, right_on=
Semi/anti semi_join(), anti_join() isin(), boolean filtering