12 Handle Missing Data – Python for Data Science

1 Introduction

Missing data in pandas follows a logic similar to the tidyverse.

Missing value representations:
- NaN (Not a Number, from NumPy)
- None (Python’s null object)

pandas treats both as missing. NaN is a floating-point value and is not equal to anything, even itself, so missing values have to be detected with isna() rather than with ==.

The functions for handling missing values have direct tidyverse counterparts:

Concept	tidyverse	pandas
identify missing	`is.na()`	`isna()`
drop missing	`drop_na()`	`dropna()`
replace missing	`replace_na()`	`fillna()`
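
As a small illustration of why equality tests cannot be used to find missing values, the sketch below compares NaN to itself and then checks the same values with isna(). The Series s is made up for this example.

```python
import numpy as np
import pandas as pd

# NaN never compares equal, not even to itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)              # False

# so an equality test finds no missing values ...
s = pd.Series([1.0, np.nan, 3.0])     # illustrative data
found_with_eq = int((s == np.nan).sum())
print(found_with_eq)                  # 0

# ... while isna() does, and it treats None the same way as NaN
found_with_isna = int(s.isna().sum())
print(found_with_isna)                # 1
print(pd.isna(None), pd.isna(np.nan)) # True True
```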

2 Creating Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40]
})
df

	x	y	z
0	1.0	A	10.0
1	2.0	None	NaN
2	NaN	C	NaN
3	4.0	D	40.0

3 Identifying Missing Values

Equivalent of is.na(df) in R:

df.isna()     # True wherever a value is missing
df.isnull()   # alias of isna(); returns the same result

	x	y	z
0	False	False	False
1	False	True	True
2	True	False	True
3	False	False	False

Count missing:

df.isna().sum()

x    1
y    1
z    2
dtype: int64

Count non-missing:

df.notna().sum()

x    3
y    3
z    2
dtype: int64
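
The Boolean table from isna() can also be used to pull out the incomplete rows themselves. A minimal sketch, rebuilding the same df so that it runs on its own; the names row_has_na and incomplete are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# True for every row that has at least one missing value
row_has_na = df.isna().any(axis=1)

# Boolean indexing keeps only those rows
incomplete = df[row_has_na]
print(incomplete)
```

Here rows 1 and 2 are returned, since each of them has at least one missing cell.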

4 Dropping Missing Values

To drop any row that has at least one missing value:

df.dropna()

	x	y	z
0	1.0	A	10.0
3	4.0	D	40.0

Drop rows only if all values are missing (no row of df is entirely missing, so nothing is dropped here):

df.dropna(how="all")

	x	y	z
0	1.0	A	10.0
1	2.0	None	NaN
2	NaN	C	NaN
3	4.0	D	40.0

Drop rows with missing values in specific columns (like tidyr’s drop_na(df, x, y)):

df.dropna(subset=["x", "y"])

	x	y	z
0	1.0	A	10.0
3	4.0	D	40.0
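
Two further details may be useful. dropna() returns a new DataFrame and leaves df unchanged, and the thresh argument keeps rows that have at least a given number of non-missing values. A short sketch on the same data (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# dropna() does not modify df; assign the result to keep it
complete = df.dropna()
print(len(df), len(complete))          # 4 2

# number of non-missing values in each row
non_missing = df.notna().sum(axis=1)
print(non_missing.tolist())            # [3, 1, 1, 3]

# keep rows with at least 2 non-missing values
at_least_two = df.dropna(thresh=2)
print(at_least_two.index.tolist())     # [0, 3]
```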

5 Replacing Missing Values

Replace every missing value with a single value (note that the string column y also receives 0):

df.fillna(0)

	x	y	z
0	1.0	A	10.0
1	2.0	0	0.0
2	0.0	C	0.0
3	4.0	D	40.0

Replace column-wise using a dictionary:

df.fillna({"x": 0, "y": "missing"})

	x	y	z
0	1.0	A	10.0
1	2.0	missing	NaN
2	0.0	C	NaN
3	4.0	D	40.0

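Replace using summary statistics (the output below is from the median version; the mean of x is about 2.33):

df["x"].fillna(df["x"].mean())     # fill with the column mean
df["x"].fillna(df["x"].median())   # fill with the column median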

0    1.0
1    2.0
2    2.0
3    4.0
Name: x, dtype: float64
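
The same idea can be applied to all numeric columns at once. The sketch below assumes that filling each numeric column with its own mean is acceptable for the analysis; fillna() accepts a Series whose index matches the column names, and columns not listed there (here y) are left untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# one mean per numeric column, indexed by column name
col_means = df.mean(numeric_only=True)

# each numeric column is filled with its own mean
filled = df.fillna(col_means)
print(filled)
```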

6 Summaries with Missing Data

Mean ignoring missing (default behavior):

df["x"].mean()

np.float64(2.3333333333333335)

Mean without skipping missing values (the result is NaN as soon as one value is missing):

df["x"].mean(skipna=False)

np.float64(nan)
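
One related point to keep in mind: because sum() also skips missing values by default, the sum of a column that is entirely missing is 0 rather than NaN. The min_count argument changes this. A small sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

all_missing = pd.Series([np.nan, np.nan])   # illustrative data

# missing values are skipped, so nothing is left to add up
total_default = all_missing.sum()
print(total_default)                        # 0.0

# require at least one non-missing value, otherwise return NaN
total_strict = all_missing.sum(min_count=1)
print(total_strict)                         # nan
```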

Count missing per column:

df.isna().sum()

x    1
y    1
z    2
dtype: int64

Count missing per row:

df.isna().sum(axis=1)

0    0
1    2
2    2
3    0
dtype: int64

Percentage missing per column (the mean of a True/False column is the share of True values):

df.isna().mean() * 100

x    25.0
y    25.0
z    50.0
dtype: float64
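
Counts and percentages can be combined into one overview table. This is only one possible layout; the column names n_missing and pct_missing are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan, 4],
    "y": ["A", None, "C", "D"],
    "z": [10, np.nan, np.nan, 40],
})

# one row per column of df, sorted so the worst column comes first
summary = (
    pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    .sort_values("pct_missing", ascending=False)
)
print(summary)
```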

7 Example

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [23, None, 30, None],
    "income": [50000, 55000, None, 62000]
})
df

	id	age	income
0	1	23.0	50000.0
1	2	NaN	55000.0
2	3	30.0	NaN
3	4	NaN	62000.0

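Fill the missing ages with the median age and treat missing income as 0:

clean = (
    df
    .assign(age=lambda d: d["age"].fillna(d["age"].median()))   # median of 23 and 30 is 26.5
    .assign(income=lambda d: d["income"].fillna(0))             # missing income becomes 0
)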
clean

	id	age	income
0	1	23.0	50000.0
1	2	26.5	55000.0
2	3	30.0	0.0
3	4	26.5	62000.0
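
After filling, the imputed values can no longer be told apart from the observed ones. One possible variation, sketched below, records a flag for each column before it is filled. The flag column names age_was_missing and income_was_missing are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [23, None, 30, None],
    "income": [50000, 55000, None, 62000],
})

# assign() evaluates its keyword arguments in order,
# so the flags are computed before the columns are filled
clean = df.assign(
    age_was_missing=lambda d: d["age"].isna(),
    income_was_missing=lambda d: d["income"].isna(),
    age=lambda d: d["age"].fillna(d["age"].median()),
    income=lambda d: d["income"].fillna(0),
)
print(clean)
```

The flags make it possible to check later whether a result depends on the imputed rows.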