import pandas as pd
students = pd.DataFrame({
"id": [1, 2, 3, 4],
"name": ["A", "B", "C", "D"]
})
students.head()| id | name | |
|---|---|---|
| 0 | 1 | A |
| 1 | 2 | B |
| 2 | 3 | C |
| 3 | 4 | D |
merge().pd.merge() function.merge() methoddf1.merge(df2, ...)pd.merge(
left=df1,
right=df2,
how="inner", # join type
on="id" # key variable(s)
)Or:
df1.merge(df2, how="inner", on="id")| R concept | pandas argument | explanation |
|---|---|---|
| inner | how="inner" |
rows with matches in both |
| left | how="left" |
keep all rows from the left |
| right | how="right" |
keep all rows from the right |
| full | how="outer" |
union of keys from both tables |
| semi | filtering after merge | keep matching rows from left |
| anti | filtering after merge | keep non-matching rows from left |
import pandas as pd
students = pd.DataFrame({
"id": [1, 2, 3, 4],
"name": ["A", "B", "C", "D"]
})
students.head()| id | name | |
|---|---|---|
| 0 | 1 | A |
| 1 | 2 | B |
| 2 | 3 | C |
| 3 | 4 | D |
scores = pd.DataFrame({
"id": [2, 3, 3, 5],
"score": [80, 90, 85, 70]
})
scores.head()| id | score | |
|---|---|---|
| 0 | 2 | 80 |
| 1 | 3 | 90 |
| 2 | 3 | 85 |
| 3 | 5 | 70 |
students.merge(scores, how="inner", on="id")| id | name | score | |
|---|---|---|---|
| 0 | 2 | B | 80 |
| 1 | 3 | C | 90 |
| 2 | 3 | C | 85 |
This returns rows where the key appears in both tables.
If a key appears more than once in either table, pandas automatically performs a many-to-many join, producing all pairwise combinations of matches.
students.merge(scores, how="left", on="id")| id | name | score | |
|---|---|---|---|
| 0 | 1 | A | NaN |
| 1 | 2 | B | 80.0 |
| 2 | 3 | C | 90.0 |
| 3 | 3 | C | 85.0 |
| 4 | 4 | D | NaN |
Keeps all rows from students; unmatched scores show as NaN.
students.merge(scores, how="right", on="id")| id | name | score | |
|---|---|---|---|
| 0 | 2 | B | 80 |
| 1 | 3 | C | 90 |
| 2 | 3 | C | 85 |
| 3 | 5 | NaN | 70 |
Less commonly used but works the same way.
students.merge(scores, how="outer", on="id")| id | name | score | |
|---|---|---|---|
| 0 | 1 | A | NaN |
| 1 | 2 | B | 80.0 |
| 2 | 3 | C | 90.0 |
| 3 | 3 | C | 85.0 |
| 4 | 4 | D | NaN |
| 5 | 5 | NaN | 70.0 |
Equivalent to R’s left_join(df1, df2, by = c("x" = "y"))
Python:
df1.merge(df2, left_on="x", right_on="y")Equivalent to R’s left_join(df1, df2, by = c("id", "year"))
Python:
df1.merge(df2, on=["id", "year"])Keep only rows in df1 that have a match in df2:
df1[df1["id"].isin(df2["id"])]Or:
df1.merge(df2[["id"]], on="id")Keep only rows in df1 that do not have a match in df2:
df1[~df1["id"].isin(df2["id"])]| Concept | tidyverse | pandas |
|---|---|---|
| Join function | left_join() |
.merge() |
| Join type | left, right, inner, full |
how= |
| Keys | by= |
on=, left_on=, right_on= |
| Semi/anti | semi_join(), anti_join() |
isin(), boolean filtering |