Introduction
Handling missing data in pandas follows logic similar to the tidyverse:
- Missing value representations:
  - NaN (Not a Number, from NumPy)
  - None (Python’s null object)
- pandas treats both as missing.
- NaN is a float-based missing value and is not equal to anything, even itself.
- Missing values are handled with functions analogous to those in the tidyverse:
| Task | tidyverse | pandas |
| --- | --- | --- |
| identify missing | is.na() | isna() |
| drop missing | drop_na() | dropna() |
| replace missing | replace_na() | fillna() |
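Because NaN is not equal to itself, an equality test cannot locate missing values. The sketch below illustrates this; the variable names are made up for the example, and it assumes only that NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

# NaN is never equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# A Series holding a regular value, a NaN, and a None.
values = pd.Series([1.0, np.nan, None])

# isna() flags both NaN and None as missing.
mask = values.isna()

# An equality comparison against NaN matches nothing at all.
naive_mask = values == np.nan

print(nan_equals_itself)
print(mask.tolist())
print(naive_mask.tolist())
```

This is why missing values are detected with isna() rather than with ==.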
Creating Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40]
})
df
|   | x | y | z |
| --- | --- | --- | --- |
| 0 | 1.0 | A | 10.0 |
| 1 | 2.0 | None | NaN |
| 2 | NaN | C | NaN |
| 3 | 4.0 | D | 40.0 |
Identifying Missing Values
- Equivalent of is.na(df) in R (isnull() is an alias of isna()):
df.isna()
df.isnull() # same thing
|   | x | y | z |
| --- | --- | --- | --- |
| 0 | False | False | False |
| 1 | False | True | True |
| 2 | True | False | True |
| 3 | False | False | False |
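The boolean mask returned by isna() can also be used to select the incomplete rows. The following is a minimal sketch under that assumption; the names row_has_missing, incomplete_rows, and complete_mask are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# True for each row that contains at least one missing value.
row_has_missing = df.isna().any(axis=1)

# Boolean indexing keeps only the incomplete rows.
incomplete_rows = df[row_has_missing]

# notna() is the element-wise complement of isna().
complete_mask = df.notna()

print(incomplete_rows.index.tolist())
```

Inspecting the incomplete rows first can help when deciding whether to drop or fill them.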
Dropping Missing Values
- To drop any row that has at least one missing value:
df.dropna()
|   | x | y | z |
| --- | --- | --- | --- |
| 0 | 1.0 | A | 10.0 |
| 3 | 4.0 | D | 40.0 |
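- Drop rows only if all values are missing (no row in df qualifies, so every row is kept):
df.dropna(how="all")
|   | x | y | z |
| --- | --- | --- | --- |
| 0 | 1.0 | A | 10.0 |
| 1 | 2.0 | None | NaN |
| 2 | NaN | C | NaN |
| 3 | 4.0 | D | 40.0 |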
- Drop rows with missing values in specific columns (like tidyr’s drop_na(x, y)):
df.dropna(subset=["x", "y"])
|   | x | y | z |
| --- | --- | --- | --- |
| 0 | 1.0 | A | 10.0 |
| 3 | 4.0 | D | 40.0 |
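The three dropping strategies above can be compared side by side. This is a sketch rather than a definitive recipe; the result names are illustrative, and the final check shows that dropna() returns a new DataFrame and leaves df unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# Drop rows with at least one missing value (how="any" is the default).
any_missing = df.dropna()

# Drop rows only when every value in the row is missing.
all_missing = df.dropna(how="all")

# Consider only columns x and y when deciding which rows to drop.
subset_xy = df.dropna(subset=["x", "y"])

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(subset_xy.index.tolist())

# The original DataFrame still has all four rows.
print(len(df))
```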
Replacing Missing Values
- Replace with a single value:
df.fillna(0)
|   | x | y | z |
| --- | --- | --- | --- |
| 0 | 1.0 | A | 10.0 |
| 1 | 2.0 | 0 | 0.0 |
| 2 | 0.0 | C | 0.0 |
| 3 | 4.0 | D | 40.0 |
- Replace column-wise using a dictionary:
df.fillna({"x": 0, "y": "missing"})
|   | x | y | z |
| --- | --- | --- | --- |
| 0 | 1.0 | A | 10.0 |
| 1 | 2.0 | missing | NaN |
| 2 | 0.0 | C | NaN |
| 3 | 4.0 | D | 40.0 |
- Replace using summary statistics:
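df["x"].fillna(df["x"].mean())    # fill with the mean of the non-missing values (about 2.33)
df["x"].fillna(df["x"].median())  # fill with the median (2.0); this is the output shown below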
0 1.0
1 2.0
2 2.0
3 4.0
Name: x, dtype: float64
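The same idea can be extended to every numeric column at once by passing a Series of column means to fillna(). This is one possible approach, sketched under the assumption that mean imputation is acceptable for the data; the names median_filled and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# Single column: fill with the median of the non-missing values.
median_filled = df["x"].fillna(df["x"].median())

# All numeric columns: each column is filled with its own mean.
# Non-numeric columns such as y are not in the Series and stay untouched.
column_means = df.mean(numeric_only=True)
filled = df.fillna(column_means)

print(median_filled.tolist())
print(filled)
```

Column y keeps its missing value here, since no mean exists for text data.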
Summaries with Missing Data
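- Mean ignoring missing (default behavior):
df["x"].mean()
np.float64(2.3333333333333335)
- Mean without skipping missing values (the result is NaN if any value is missing):
df["x"].mean(skipna=False)
- Count missing per row (omit axis=1 to count per column instead):
df.isna().sum(axis=1)
0 0
1 2
2 2
3 0
dtype: int64
- Percentage missing per column:
df.isna().mean() * 100
x 25.0
y 25.0
z 50.0
dtype: float64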
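The counts and percentages can be gathered into one small overview table. The sketch below is illustrative only; the column names n_missing and pct_missing are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# One row per column of df: how many values are missing, and what share.
summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})

# skipna=True (the default) ignores NaN; skipna=False propagates it.
mean_default = df["x"].mean()
mean_strict = df["x"].mean(skipna=False)

print(summary)
print(mean_default, mean_strict)
```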
Example
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [23, None, 30, None],
    "income": [50000, 55000, None, 62000]
})
df
|   | id | age | income |
| --- | --- | --- | --- |
| 0 | 1 | 23.0 | 50000.0 |
| 1 | 2 | NaN | 55000.0 |
| 2 | 3 | 30.0 | NaN |
| 3 | 4 | NaN | 62000.0 |
clean = (
    df
    # fill missing ages with the median age
    .assign(age=lambda x: x.age.fillna(x.age.median()))
    # treat missing income as 0
    .assign(income=lambda x: x.income.fillna(0))
)
clean
|   | id | age | income |
| --- | --- | --- | --- |
| 0 | 1 | 23.0 | 50000.0 |
| 1 | 2 | 26.5 | 55000.0 |
| 2 | 3 | 30.0 | 0.0 |
| 3 | 4 | 26.5 | 62000.0 |
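As a sanity check on the example, the sketch below confirms that no missing values remain and that a single dictionary-based fillna() call gives the same result as the chained assign() calls. The names clean_alt, remaining_missing, and same_result are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [23, None, 30, None],
    "income": [50000, 55000, None, 62000],
})

# The pipeline from the example: median age, zero income.
clean = (
    df
    .assign(age=lambda d: d["age"].fillna(d["age"].median()))
    .assign(income=lambda d: d["income"].fillna(0))
)

# The same imputation expressed as one column-wise fillna call.
clean_alt = df.fillna({"age": df["age"].median(), "income": 0})

# Total number of missing values left after cleaning.
remaining_missing = int(clean.isna().sum().sum())

# Whether both approaches produce identical DataFrames.
same_result = clean.equals(clean_alt)

print(remaining_missing, same_result)
```

The assign() form reads well inside a longer method chain, while the dictionary form is more compact when only constant or precomputed fill values are needed.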