11 Pandas DataFrame

The pandas package provides the DataFrame data type for working with structured data — that is, data organized in rows and columns.

While NumPy arrays and Python dictionaries can also represent structured data, pandas DataFrames are much more flexible and powerful for data manipulation, analysis, and visualization.

If you are familiar with R’s tidyverse, pandas is Python’s equivalent to the dplyr + tibble ecosystem.

This lecture presents pandas operations using the well-known Palmer Penguins dataset.

1 Introduction to DataFrames

import pandas as pd
import numpy as np

A pandas DataFrame is like a tibble — it stores data in a table with rows and columns, and its columns can hold different data types (numbers, text, and more).

Creating a DataFrame

employee_dict = {
    'eid': [214, 174, 126, 227, 151, 175, 193, 146],
    'name': ['Drew', 'Faye', 'Beth', 'Chad', 'Hana', 'Gary', 'Alex', 'Emma'],
    'age': [25, 19, 42, 31, 25, 28, 31, 25],
    'rate': [11.50, 12.30, 14.60, 12.50, None, 15.60, 13.50, 14.30],
    'hours': [38, 32, 40, 29, 40, 36, 24, 20]
}

employee_df = pd.DataFrame(employee_dict)
employee_df.head()
eid name age rate hours
0 214 Drew 25 11.5 38
1 174 Faye 19 12.3 32
2 126 Beth 42 14.6 40
3 227 Chad 31 12.5 29
4 151 Hana 25 NaN 40
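A DataFrame can also be built from other structures, for example a list of row dictionaries. A small sketch with made-up rows:

```python
import pandas as pd

# each dict is one row; keys become column names
rows = [
    {"eid": 214, "name": "Drew", "age": 25},
    {"eid": 174, "name": "Faye", "age": 19},
]
df = pd.DataFrame(rows)
print(df.shape)     # (2, 3)
```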

2 Data Import

# Importing CSV files:
df = pd.read_csv("data.csv") 

# Other common CSV options:
df = pd.read_csv("file.csv", sep=",")
df = pd.read_csv("file.csv", header=0)
df = pd.read_csv("file.csv", na_values=["", "NA"])
df = pd.read_csv("file.csv", encoding="utf-8")

# Importing TSV or other delimiters:
df = pd.read_csv("data.tsv", sep="\t")

# Importing from URLs:
df = pd.read_csv("https://example.com/data.csv")
# Importing Excel files
# Requires `openpyxl` or `xlrd` package.
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Import multiple sheets
dfs = pd.read_excel("data.xlsx", sheet_name=None) 
# Stata (.dta)
df = pd.read_stata("data.dta") 
# SPSS (.sav), SAS (.sas7bdat)
import pyreadstat
df, meta = pyreadstat.read_sav("data.sav")
df, meta = pyreadstat.read_sas7bdat("data.sas7bdat")
# R (.RData)
import pyreadr
result = pyreadr.read_r("file.RData")
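Since read_csv() also accepts any file-like object, the options above can be tried on an in-memory string without touching the disk. A minimal sketch with made-up data:

```python
import io
import pandas as pd

# an in-memory "CSV file" with a non-default delimiter and an NA marker
csv_text = "name;score\nAda;91\nBen;NA\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",            # non-default delimiter
    na_values=["NA"],   # treat "NA" as missing
)
print(df)
```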

3 Penguins data

penguins = pd.read_csv("data/penguins.csv") 
  • Quick look at the data:
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
  • Using describe() we can get a summary of numeric variables:
penguins.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
count 342.000000 342.000000 342.000000 342.000000 344.000000
mean 43.921930 17.151170 200.915205 4201.754386 2008.029070
std 5.459584 1.974793 14.061714 801.954536 0.818356
min 32.100000 13.100000 172.000000 2700.000000 2007.000000
25% 39.225000 15.600000 190.000000 3550.000000 2007.000000
50% 44.450000 17.300000 197.000000 4050.000000 2008.000000
75% 48.500000 18.700000 213.000000 4750.000000 2009.000000
max 59.600000 21.500000 231.000000 6300.000000 2009.000000

4 Accessing Elements

  • Each DataFrame has row labels (index) and column labels (names).
  • Use iloc[] to access by integer position and loc[] to access by label.
penguins.iloc[2, 3]        # 3rd row, 4th column (position-based)
penguins.iloc[4, 1]        # 5th row, 2nd column (position-based)
penguins.loc[4, 'island']  # label-based
'Torgersen'
Note

Here, the row label (4) happens to match its position, but this is not always true (for example, after filtering or sorting), so use loc with care.

  • Both loc and iloc can take two arguments:
df.loc[row_selector , column_selector]
df.iloc[row_selector , column_selector]

The column selector is optional. If you don’t specify it, pandas assumes “all columns”. For example, df.loc[1] is equivalent to df.loc[1, :] — both select all columns from the row with label 1.
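To see why the label/position distinction matters, here is a small sketch with a frame whose row labels differ from their positions:

```python
import pandas as pd

# row labels are deliberately out of order
df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

print(df.loc[0, "x"])   # label 0 is the second row  -> 20
print(df.iloc[0, 0])    # position 0 is the first row -> 10
```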

5 Selecting Columns

  • In pandas, use column selection (df[['col']] or .loc) when you want to keep only certain columns from a DataFrame.
# select single column (Series)
species = penguins['species']
type(species), species.head()
(pandas.core.series.Series,
 0    Adelie
 1    Adelie
 2    Adelie
 3    Adelie
 4    Adelie
 Name: species, dtype: object)
# select multiple columns (DataFrame)
peng_sel = penguins[['species', 'island', 'sex', 'bill_length_mm', 'body_mass_g']]
peng_sel.head()
species island sex bill_length_mm body_mass_g
0 Adelie Torgersen male 39.1 3750.0
1 Adelie Torgersen female 39.5 3800.0
2 Adelie Torgersen female 40.3 3250.0
3 Adelie Torgersen NaN NaN NaN
4 Adelie Torgersen female 36.7 3450.0
  • You can also use .loc for column selection:
penguins.loc[:, ['species', 'sex']].head()
species sex
0 Adelie male
1 Adelie female
2 Adelie female
3 Adelie NaN
4 Adelie female
  • Select a range of rows and columns with loc (label slices include both endpoints):
penguins.loc[1:3, 'bill_length_mm':'body_mass_g']
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 39.5 17.4 186.0 3800.0
2 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
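Two other handy selectors, shown on a small made-up frame: select_dtypes() picks columns by type, and filter(like=...) picks columns by name pattern:

```python
import pandas as pd

df = pd.DataFrame({
    "bill_length_mm": [39.1, 40.3],
    "bill_depth_mm": [18.7, 18.0],
    "species": ["Adelie", "Adelie"],
})

numeric = df.select_dtypes("number")   # only numeric columns
bills = df.filter(like="bill")         # columns whose name contains "bill"

print(list(numeric.columns))
print(list(bills.columns))
```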

6 Filtering Rows

  • In pandas, use row filtering (df[condition] or .query()) when you want to keep only the rows that satisfy a condition.
# boolean mask
males = penguins[penguins['sex'] == 'male']
males.shape
(168, 8)
# multiple conditions: males with body mass > 3500 g
sel = (penguins['sex'] == 'male') & (penguins['body_mass_g'] > 3500)
penguins.loc[sel, ['species', 'sex', 'body_mass_g']].head()
species sex body_mass_g
0 Adelie male 3750.0
5 Adelie male 3650.0
7 Adelie male 4675.0
13 Adelie male 3800.0
14 Adelie male 4400.0
# using query() (tidy-like syntax)
penguins.query("sex == 'female' and body_mass_g > 3500").head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
6 Adelie Torgersen 38.9 17.8 181.0 3625.0 female 2007
15 Adelie Torgersen 36.6 17.8 185.0 3700.0 female 2007
22 Adelie Biscoe 35.9 19.2 189.0 3800.0 female 2007
25 Adelie Biscoe 35.3 18.9 187.0 3800.0 female 2007
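Another common filter is membership, the pandas analogue of R's %in%. A small sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
})

# isin() keeps rows whose value is in the given list
sel = df[df["species"].isin(["Adelie", "Gentoo"])]

# the same filter in query() syntax
sel_q = df.query("species in ['Adelie', 'Gentoo']")
print(sel)
```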

7 Arranging Rows

  • In pandas, use sort_values() when you want to sort the rows of a DataFrame by one or more columns.
# sort by body_mass_g descending
penguins.sort_values(by='body_mass_g', ascending=False).head()

# sort by species then bill_length_mm (asc, desc)
penguins.sort_values(by=['species', 'bill_length_mm'], ascending=[True, False]).head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
19 Adelie Torgersen 46.0 21.5 194.0 4200.0 male 2007
73 Adelie Torgersen 45.8 18.9 197.0 4150.0 male 2008
111 Adelie Biscoe 45.6 20.3 191.0 4600.0 male 2009
43 Adelie Dream 44.1 19.7 196.0 4400.0 male 2007
129 Adelie Torgersen 44.1 18.0 210.0 4000.0 male 2009
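For "top n by a column", nlargest() and nsmallest() are a convenient shortcut for sort_values() followed by head(). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "mass": [3750, 5200, 3250, 4675]})

# two heaviest rows, equivalent to sort_values("mass", ascending=False).head(2)
top2 = df.nlargest(2, "mass")
print(top2)
```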

8 Slicing Rows

  • In pandas, use iloc row slicing when you want to select rows by their numeric positions (such as first n rows or a specific range).

  • Equivalent of R’s slice() / slice_head() / slice_sample():

# first 6 rows (like slice_head(n=6))
penguins.iloc[:6]

# rows at positions 10 through 15 (iloc, like Python slices, excludes the end)
penguins.iloc[10:16]

# random sample of 5 rows (slice_sample)
penguins.sample(n=5, random_state=42)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
194 Gentoo Biscoe 45.3 13.7 210.0 4300.0 female 2008
157 Gentoo Biscoe 46.5 13.5 210.0 4550.0 female 2007
225 Gentoo Biscoe 46.5 14.8 217.0 5200.0 female 2008
208 Gentoo Biscoe 43.8 13.9 208.0 4300.0 female 2008
318 Chinstrap Dream 50.9 19.1 196.0 3550.0 male 2008

9 Pulling Columns

  • In pandas, use df['col'] or df['col'].tolist() when you want to extract a single column as a Series or a Python list.
mass = penguins['body_mass_g']             # pandas Series
names_list = penguins['species'].tolist()  # Python list
type(mass), type(names_list), names_list[:6]
(pandas.core.series.Series,
 list,
 ['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie'])
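If you need a NumPy array rather than a list (for example to feed into NumPy functions), to_numpy() does the job. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([3750.0, 3800.0, 3250.0], name="body_mass_g")

arr = s.to_numpy()   # plain NumPy array, no index attached
print(type(arr))
```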

10 Mutating Columns

  • In pandas, use assign() or direct column assignment when you want to create new variables or modify existing ones.
# create a new column: bill_ratio = bill_length_mm / bill_depth_mm
peng = penguins.copy()
peng = peng.assign(bill_ratio = peng['bill_length_mm'] / peng['bill_depth_mm'])
peng[['bill_length_mm','bill_depth_mm','bill_ratio']].head()
bill_length_mm bill_depth_mm bill_ratio
0 39.1 18.7 2.090909
1 39.5 17.4 2.270115
2 40.3 18.0 2.238889
3 NaN NaN NaN
4 36.7 19.3 1.901554

Recoding with np.select()

  • Use np.select() or pd.cut() for binning. np.select() mimics dplyr::case_when().
conds = [
    peng['body_mass_g'].isna(),
    peng['body_mass_g'] < 3500,
    (peng['body_mass_g'] >= 3500) & (peng['body_mass_g'] < 4500),
    peng['body_mass_g'] >= 4500
]
choices = ['missing', 'light', 'medium', 'heavy']

peng['mass_category'] = np.select(conds, choices, default='unknown')
peng[['body_mass_g','mass_category']].head(10)
body_mass_g mass_category
0 3750.0 medium
1 3800.0 medium
2 3250.0 light
3 NaN missing
4 3450.0 light
5 3650.0 medium
6 3625.0 medium
7 4675.0 heavy
8 3475.0 light
9 4250.0 medium
  • Alternative with pd.cut():
bins = [0, 3500, 4500, np.inf]
labels = ['light', 'medium', 'heavy']
peng['mass_bin'] = pd.cut(peng['body_mass_g'], bins=bins, labels=labels)
peng[['body_mass_g','mass_bin']].head(10)
body_mass_g mass_bin
0 3750.0 medium
1 3800.0 medium
2 3250.0 light
3 NaN NaN
4 3450.0 light
5 3650.0 medium
6 3625.0 medium
7 4675.0 heavy
8 3475.0 light
9 4250.0 medium
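For a simple two-branch recode, np.where() is a lighter alternative to np.select(); note that missing values need explicit handling, since NaN compares as False:

```python
import numpy as np
import pandas as pd

mass = pd.Series([3750.0, 3250.0, None, 4675.0])

# catch NaN first, then split on the threshold
label = np.where(mass.isna(), "missing",
                 np.where(mass < 3500, "light", "not light"))
print(label)
```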

11 Renaming Columns

  • In pandas, use rename(columns={...}) when you want to change column names.
peng = peng.rename(columns={'bill_length_mm': 'bill_length',
                            'bill_depth_mm': 'bill_depth',
                            'body_mass_g': 'mass_g'})
peng.columns
Index(['species', 'island', 'bill_length', 'bill_depth', 'flipper_length_mm',
       'mass_g', 'sex', 'year', 'bill_ratio', 'mass_category', 'mass_bin'],
      dtype='object')
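To rewrite many column names at once (for example stripping a common unit suffix), the .str methods on df.columns can help. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"bill_length_mm": [39.1], "bill_depth_mm": [18.7]})

# drop the trailing "_mm" from every column name in one step
df.columns = df.columns.str.replace("_mm$", "", regex=True)
print(list(df.columns))
```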

12 Grouping and Aggregation

  • In pandas, use groupby().agg() when you want to calculate summary statistics for groups of rows.
  • Equivalent to group_by() + summarise() in the tidyverse.
# mean bill length and mass by species
peng.groupby('species', as_index=False).agg(
    mean_bill_length=('bill_length', 'mean'),
    mean_mass=('mass_g', 'mean'),
    n=('species', 'count')
)
species mean_bill_length mean_mass n
0 Adelie 38.791391 3700.662252 152
1 Chinstrap 48.833824 3733.088235 68
2 Gentoo 47.504878 5076.016260 124
  • Multiple aggregations on same column:
peng.groupby(['species', 'island'], as_index=False).agg(
    n=('species', 'count'),
    mean_mass=('mass_g', 'mean'),
    sd_mass=('mass_g', 'std'),
    median_bill=('bill_length', 'median')
).sort_values(by='n', ascending=False)
species island n mean_mass sd_mass median_bill
4 Gentoo Biscoe 124 5076.016260 504.116237 47.30
3 Chinstrap Dream 68 3733.088235 384.335081 49.55
1 Adelie Dream 56 3688.392857 455.146437 38.55
2 Adelie Torgersen 52 3706.372549 445.107940 38.90
0 Adelie Biscoe 44 3709.659091 487.733722 38.70
  • Use groupby().filter() to keep only the groups with enough observations:
grouped = peng.groupby('species').filter(lambda df: df['mass_g'].notna().sum() >= 30)
grouped['species'].unique()
array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)
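Related to aggregation, groupby().transform() returns a result aligned with the original rows, like mutate() after group_by() in dplyr. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["A", "A", "B", "B"],
    "mass": [3000.0, 4000.0, 5000.0, 5200.0],
})

# transform() broadcasts the group statistic back to every row
df["group_mean"] = df.groupby("species")["mass"].transform("mean")
df["centered"] = df["mass"] - df["group_mean"]
print(df)
```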

13 Row/Column Operations (axis)

  • Row/column reductions:
# column means (for numeric cols)
peng.select_dtypes('number').mean()
bill_length            43.921930
bill_depth             17.151170
flipper_length_mm     200.915205
mass_g               4201.754386
year                 2008.029070
bill_ratio              2.605649
dtype: float64
# sums along an axis (illustrated with a small matrix)
M = np.arange(12).reshape(3,4)
M
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
pd.DataFrame(M).sum(axis=0)
0    12
1    15
2    18
3    21
dtype: int64
pd.DataFrame(M).sum(axis=1)
0     6
1    22
2    38
dtype: int64

14 Chaining

  • Combine several steps into one readable pipeline with method chaining (the pandas analogue of the tidyverse pipe):
summary = (
    peng
    .dropna(subset=['mass_g','bill_length'])
    .assign(bill_ratio=lambda df: df['bill_length'] / df['bill_depth'])
    .groupby(['species'], as_index=False)
    .agg(mean_mass=('mass_g','mean'), mean_bill_ratio=('bill_ratio','mean'))
)
summary
species mean_mass mean_bill_ratio
0 Adelie 3700.662252 2.119726
1 Chinstrap 3733.088235 2.653756
2 Gentoo 5076.016260 3.175592

15 Counting Values

  • In pandas, use value_counts() when you want to count how many times each category or value appears.
penguins.value_counts('species')
species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64
  • For multiple columns, pass a list. You can also turn it into a DataFrame and reset the index for tidy output:
penguins.value_counts(['species', 'sex']).reset_index()
species sex count
0 Adelie female 73
1 Adelie male 73
2 Gentoo male 61
3 Gentoo female 58
4 Chinstrap male 34
5 Chinstrap female 34
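Passing normalize=True turns counts into proportions. A small sketch:

```python
import pandas as pd

s = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap"])

# proportions instead of raw counts; they sum to 1
props = s.value_counts(normalize=True)
print(props)
```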

16 Combining DataFrames

  • pd.concat() is a general function for joining DataFrames vertically or horizontally.
  • It keeps column names, data types, and indexes by default.
  • Data is aligned by column names, and missing values are filled with NaN when needed.

Row-binding (stacking rows)

# Example using penguins data
a = penguins.iloc[0:2]
b = penguins.iloc[2:4]

pd.concat([a, b], axis=0, ignore_index=True)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007

Column-binding (stacking columns)

x = penguins[['bill_length_mm']]
y = penguins[['bill_depth_mm']]

pd.concat([x, y], axis=1)
bill_length_mm bill_depth_mm
0 39.1 18.7
1 39.5 17.4
2 40.3 18.0
3 NaN NaN
4 36.7 19.3
... ... ...
339 55.8 19.8
340 43.5 18.1
341 49.6 18.2
342 50.8 19.0
343 50.2 18.7

344 rows × 2 columns

  • axis=0 → add rows
  • axis=1 → add columns
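pd.concat() only stacks; for key-based joins (like dplyr's left_join()), use pd.merge(). A small sketch with made-up summary tables:

```python
import pandas as pd

left = pd.DataFrame({"species": ["Adelie", "Gentoo"],
                     "mean_mass": [3700, 5076]})
right = pd.DataFrame({"species": ["Adelie", "Chinstrap"],
                      "n": [152, 68]})

# left join on the key column; unmatched keys get NaN
joined = pd.merge(left, right, on="species", how="left")
print(joined)
```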

17 Useful tidy equivalents

tidyverse (R)                 pandas (Python)
select()                      df[['col1','col2']] or df.loc[:, cols]
filter()                      df[cond] or df.query()
arrange()                     df.sort_values()
mutate()                      df['new'] = ... or df.assign()
case_when()                   np.select() or pd.cut()
rename()                      df.rename(columns={})
pull()                        df['col'] or df['col'].tolist()
slice()                       df.iloc[...]
group_by() %>% summarise()    df.groupby().agg()

18 Exercises

  1. Select only the columns: species, island, bill_length, bill_depth, mass_g.
  2. Filter penguins of species "Adelie" and body mass > 3500 g. Show species, island, mass_g.
  3. Arrange penguins by mass_g descending and show top 5.
  4. Slice the middle 10 rows of the dataset (use .iloc).
  5. Mutate: create bill_area = bill_length * bill_depth.
  6. Case_when: create size_category with: mass < 3500 → 'small', 3500 ≤ mass < 4500 → 'medium', mass ≥ 4500 → 'large', NA → 'missing'.
  7. Rename mass_g to body_mass_g and bill_length to bill_len_mm.
  8. Pull the species column as a Python list.
  9. Group by species and compute mean body_mass_g and standard deviation, plus sample size.