12 Handle Missing Data

1 Introduction

Missing data in pandas follows a similar logic to tidyverse:

  • Missing value representations:
    • NaN (Not a Number, from NumPy)
    • None (Python’s null object)
  • pandas treats both as missing.
  • NaN is a float-based missing value and is not equal to anything, even itself.
  • Missing values are handled with functions analogous to tidyverse.
Concept            tidyverse       pandas
identify missing   is.na()         isna()
drop missing       drop_na()       dropna()
replace missing    replace_na()    fillna()
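
The points above can be checked directly. The following is a minimal sketch; the variable names are illustrative only and are not used elsewhere in this chapter.

```python
import numpy as np
import pandas as pd

# NaN never compares equal, not even to itself, so == cannot detect it.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# pd.isna() recognises both representations of missing.
nan_is_missing = pd.isna(np.nan)
none_is_missing = pd.isna(None)
print(nan_is_missing, none_is_missing)

# Element-wise check on a Series that mixes NaN and None.
s = pd.Series([1.0, np.nan, None])
mask = s.isna()
print(mask.tolist())
```

This is why missing values are tested with isna() rather than with a comparison such as s == np.nan, which would be False everywhere.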

2 Creating Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40]
})
df
     x     y     z
0  1.0     A  10.0
1  2.0  None   NaN
2  NaN     C   NaN
3  4.0     D  40.0
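
The output above shows 1.0 rather than 1 because NaN is a float: a numeric column that contains it is stored as float64. A small sketch to confirm this, assuming the same df as above (the Series s is an illustrative extra):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# x and z hold integers plus NaN, so both columns are stored as float64.
print(df.dtypes)

# A None inside a numeric list is converted to NaN as well.
s = pd.Series([1, None, 3])
print(s.dtype)
print(s.isna().sum())
```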

3 Identifying Missing Values

  • Equivalent of R's is.na(df):
df.isna()
df.isnull()   # alias of isna(), returns the same result
       x      y      z
0  False  False  False
1  False   True   True
2   True  False   True
3  False  False  False
  • Count missing:
df.isna().sum()
x    1
y    1
z    2
dtype: int64
  • Count non-missing:
df.notna().sum()
x    3
y    3
z    2
dtype: int64
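
The boolean result of isna() can also be used to select rows. The sketch below, which assumes the df from section 2, keeps the rows that contain at least one missing value and, separately, the rows that are complete. The names rows_with_missing and complete_rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# any(axis=1) is True for a row if at least one of its cells is missing.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing.index.tolist())

# all(axis=1) on notna() is True only for rows with no missing cells.
complete_rows = df[df.notna().all(axis=1)]
print(complete_rows.index.tolist())
```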

4 Dropping Missing Values

  • To drop any row that has at least one missing value:
df.dropna()
     x  y     z
0  1.0  A  10.0
3  4.0  D  40.0
  • Drop rows only if all values are missing (no row in df qualifies, so nothing is dropped):
df.dropna(how="all")
     x     y     z
0  1.0     A  10.0
1  2.0  None   NaN
2  NaN     C   NaN
3  4.0     D  40.0
  • Drop rows with missing values in specific columns (like tidyr's drop_na(x, y)):
df.dropna(subset=["x", "y"])
     x  y     z
0  1.0  A  10.0
3  4.0  D  40.0
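
Two further details of dropna() are worth a quick check: it returns a new DataFrame instead of modifying df, and it can work on columns as well as rows. The following is a sketch that assumes the df from section 2; the axis and thresh parameters are part of dropna(), while the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# dropna() returns a new DataFrame; df itself still has all 4 rows.
complete = df.dropna()
print(len(df), len(complete))

# axis=1 switches from rows to columns; thresh=3 keeps only columns
# with at least 3 non-missing values, so z (2 non-missing) is dropped.
cols_kept = df.dropna(axis=1, thresh=3)
print(list(cols_kept.columns))
```

To keep the result, assign it to a name, for example df = df.dropna().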

5 Replacing Missing Values

  • Replace with a single value:
df.fillna(0)
     x  y     z
0  1.0  A  10.0
1  2.0  0   0.0
2  0.0  C   0.0
3  4.0  D  40.0
  • Replace column-wise using a dictionary:
df.fillna({"x": 0, "y": "missing"})
     x        y     z
0  1.0        A  10.0
1  2.0  missing   NaN
2  0.0        C   NaN
3  4.0        D  40.0
  • Replace using summary statistics:
df["x"].fillna(df["x"].mean())    # fills with the mean of x (about 2.33)
df["x"].fillna(df["x"].median())  # fills with the median of x (2.0); output shown below
0    1.0
1    2.0
2    2.0
3    4.0
Name: x, dtype: float64
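
Besides a constant or a summary statistic, a gap can be filled from a neighbouring value, much like tidyr's fill(). The sketch below assumes the df from section 2 and uses the pandas methods ffill() and bfill(); it also shows that fillna() and its relatives return a new object, so the result has to be assigned to be kept. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

s = df["x"]

# Forward fill: carry the last observed value down into the gap.
filled_forward = s.ffill()
print(filled_forward.tolist())

# Backward fill: pull the next observed value up into the gap.
filled_backward = s.bfill()
print(filled_backward.tolist())

# None of these calls changed df; assign() returns a copy with x replaced.
df_filled = df.assign(x=s.fillna(s.median()))
print(df["x"].isna().sum(), df_filled["x"].isna().sum())
```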

6 Summaries with Missing Data

  • Mean ignoring missing (default behavior):
df["x"].mean()
np.float64(2.3333333333333335)
  • Mean without skipping missing values (the result is NaN if any value is missing):
df["x"].mean(skipna=False)
np.float64(nan)
  • Count missing per column:
df.isna().sum()
x    1
y    1
z    2
dtype: int64
  • Count missing per row:
df.isna().sum(axis=1)
0    0
1    2
2    2
3    0
dtype: int64
  • Percentage missing:
df.isna().mean() * 100
x    25.0
y    25.0
z    50.0
dtype: float64
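
The per-column count and percentage can be combined into a single overview table. This is a sketch, assuming the df from section 2; the name summary and the column labels n_missing and pct_missing are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# One row per column of df: number and share of missing values.
summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})
print(summary)
```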

7 Example

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [23, None, 30, None],
    "income": [50000, 55000, None, 62000]
})
df
   id   age   income
0   1  23.0  50000.0
1   2   NaN  55000.0
2   3  30.0      NaN
3   4   NaN  62000.0
clean = (
    df
      .assign(age = lambda x: x.age.fillna(x.age.median()))
      .assign(income = lambda x: x.income.fillna(0))
)
clean
   id   age   income
0   1  23.0  50000.0
1   2  26.5  55000.0
2   3  30.0      0.0
3   4  26.5  62000.0
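
The same cleaning steps can be wrapped in a function and followed by a check that no missing values remain. This is only a sketch: fill_missing is a hypothetical helper name, and the choice of median for age and 0 for income simply mirrors the example above.

```python
import pandas as pd

def fill_missing(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median-impute age and zero-fill income."""
    return data.assign(
        age=lambda d: d["age"].fillna(d["age"].median()),
        income=lambda d: d["income"].fillna(0),
    )

raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [23, None, 30, None],
    "income": [50000, 55000, None, 62000],
})

clean = fill_missing(raw)

# Total number of missing cells before and after cleaning.
print(raw.isna().sum().sum(), clean.isna().sum().sum())
```

Because assign() returns a new DataFrame, raw keeps its missing values and can be cleaned again with a different strategy.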