The pandas package provides the DataFrame data type for working with structured data — that is, data organized in rows and columns.
NumPy arrays and Python dictionaries can also represent structured data, but pandas DataFrames add labeled rows and columns, per-column data types, and built-in methods for data manipulation, analysis, and visualization.
If you are familiar with R’s tidyverse, pandas is Python’s equivalent to the dplyr + tibble ecosystem.
This lecture presents pandas operations using the well-known Palmer Penguins dataset.
1 Introduction to DataFrames
import pandas as pd
import numpy as np
A pandas DataFrame is like a tibble — it stores data in a table with rows and columns, allowing both numeric and text data.
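As a minimal sketch of this idea, the snippet below builds a small DataFrame by hand. The name toy and its values are made up for illustration and are not taken from the Palmer Penguins dataset.

```python
import pandas as pd

# Hypothetical toy data (not the real penguins dataset):
# each dictionary key becomes a column, each list position becomes a row
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],  # text column
    "body_mass_g": [3750.0, 5000.0, 3500.0],       # floating-point column
    "year": [2007, 2008, 2009],                    # integer column
})

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # each column carries its own data type
```

Unlike a plain NumPy array, each column keeps its own type, so text and numbers can sit side by side in the same table.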
The column selector is optional. If you don’t specify it, pandas assumes “all columns”. For example, df.loc[1] is equivalent to df.loc[1, :] — both select all columns from the row with label 1.
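You can check this equivalence on a small example. The sketch below uses a hypothetical toy frame (made-up values) with the default integer row labels, so the label 1 refers to the second row.

```python
import pandas as pd

# Hypothetical toy frame with the default integer row labels 0, 1, 2
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5000.0, 3500.0],
})

row_short = toy.loc[1]     # column selector omitted
row_full = toy.loc[1, :]   # explicit "all columns" selector

# Both calls return the same Series, indexed by column name
print(row_short.equals(row_full))
```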
5 Selecting Columns
In pandas, use column selection (df[['col']] or .loc) when you want to keep only certain columns from a DataFrame.
# select a single column (returns a Series)
species = penguins['species']
type(species), species.head()
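The type of the result depends on the kind of brackets you use. The sketch below, on a hypothetical toy frame with made-up values, contrasts single brackets, double brackets, and .loc.

```python
import pandas as pd

# Hypothetical toy frame (values are illustrative only)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "body_mass_g": [3750.0, 5000.0, 3500.0],
})

as_series = toy["species"]      # single brackets: a Series
as_frame = toy[["species"]]     # double brackets: a one-column DataFrame
subset = toy.loc[:, ["species", "body_mass_g"]]  # .loc: all rows, listed columns

print(type(as_series).__name__, type(as_frame).__name__)
print(subset.columns.tolist())
```

Keeping a DataFrame (double brackets or .loc) is usually the closer match to dplyr's select(), which always returns a table.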
# multiple conditions: males with body mass > 3500 g
sel = (penguins['sex'] == 'male') & (penguins['body_mass_g'] > 3500)
penguins.loc[sel, ['species', 'sex', 'body_mass_g']].head()
    species   sex  body_mass_g
0    Adelie  male       3750.0
5    Adelie  male       3650.0
7    Adelie  male       4675.0
13   Adelie  male       3800.0
14   Adelie  male       4400.0
# using query() (tidyverse-like syntax)
penguins.query("sex == 'female' and body_mass_g > 3500").head()
    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
1    Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
6    Adelie  Torgersen            38.9           17.8              181.0       3625.0  female  2007
15   Adelie  Torgersen            36.6           17.8              185.0       3700.0  female  2007
22   Adelie     Biscoe            35.9           19.2              189.0       3800.0  female  2007
25   Adelie     Biscoe            35.3           18.9              187.0       3800.0  female  2007
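A boolean mask and query() are two ways of writing the same filter. The sketch below checks this on a hypothetical toy frame with made-up values; note that each condition in the mask version needs its own parentheses, because & binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical toy frame (values are illustrative only)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "body_mass_g": [3750.0, 3800.0, 3400.0, 3300.0],
})

# Boolean-mask version: parentheses around each condition are required
mask = (toy["sex"] == "female") & (toy["body_mass_g"] > 3500)
by_mask = toy.loc[mask]

# query() version: the same filter written as a string expression
by_query = toy.query("sex == 'female' and body_mass_g > 3500")

print(by_mask.equals(by_query))
```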
7 Arranging Rows
In pandas, use sort_values() when you want to sort the rows of a DataFrame by one or more columns.
# sort by body_mass_g descending
penguins.sort_values(by='body_mass_g', ascending=False).head()
# sort by species then bill_length_mm (asc, desc)
penguins.sort_values(by=['species', 'bill_length_mm'], ascending=[True, False]).head()
     species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g   sex  year
19    Adelie  Torgersen            46.0           21.5              194.0       4200.0  male  2007
73    Adelie  Torgersen            45.8           18.9              197.0       4150.0  male  2008
111   Adelie     Biscoe            45.6           20.3              191.0       4600.0  male  2009
43    Adelie      Dream            44.1           19.7              196.0       4400.0  male  2007
129   Adelie  Torgersen            44.1           18.0              210.0       4000.0  male  2009
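Notice in the output above that the row labels (19, 73, 111, ...) travel with their rows. The sketch below, on a hypothetical toy frame with made-up values, shows the same effect and one way to renumber the rows afterwards with reset_index().

```python
import pandas as pd

# Hypothetical toy frame (values are illustrative only)
toy = pd.DataFrame({
    "species": ["Gentoo", "Adelie", "Adelie", "Gentoo"],
    "bill_length_mm": [45.3, 39.5, 46.0, 50.9],
})

# species ascending, then bill_length_mm descending within each species
ordered = toy.sort_values(by=["species", "bill_length_mm"],
                          ascending=[True, False])
print(ordered.index.tolist())   # original row labels, in sorted order

# sort_values() returns a new DataFrame; toy itself is unchanged
# reset_index(drop=True) replaces the old labels with 0, 1, 2, ...
renumbered = ordered.reset_index(drop=True)
print(renumbered)
```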
8 Slicing Rows
In pandas, use iloc row slicing when you want to select rows by their numeric positions (such as first n rows or a specific range).
Equivalent of R’s slice() / slice_head() / slice_sample():
# first 6 rows (like slice_head(n = 6))
penguins.iloc[:6]
# rows at 0-based positions 10 to 15; the slice end is excluded (R is 1-based, so this matches slice(11:16))
penguins.iloc[10:16]
# random sample of 5 rows (like slice_sample(n = 5))
penguins.sample(n=5, random_state=42)
       species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
194     Gentoo  Biscoe            45.3           13.7              210.0       4300.0  female  2008
157     Gentoo  Biscoe            46.5           13.5              210.0       4550.0  female  2007
225     Gentoo  Biscoe            46.5           14.8              217.0       5200.0  female  2008
208     Gentoo  Biscoe            43.8           13.9              208.0       4300.0  female  2008
318  Chinstrap   Dream            50.9           19.1              196.0       3550.0    male  2008
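The positional rules are easier to see on a frame whose contents are predictable. The sketch below uses a hypothetical 20-row toy frame; it also shows that fixing random_state makes sample() reproducible.

```python
import pandas as pd

# Hypothetical toy frame with 20 rows, labeled 0..19 by default
toy = pd.DataFrame({"value": range(100, 120)})

first_six = toy.iloc[:6]    # positions 0..5
middle = toy.iloc[10:16]    # positions 10..15 (end of slice excluded)

# The same random_state gives the same sample on every call
picked = toy.sample(n=5, random_state=42)
picked_again = toy.sample(n=5, random_state=42)

print(len(first_six), middle.index.tolist())
print(picked.equals(picked_again))
```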
9 Pulling Columns
In pandas, use df['col'] or df['col'].tolist() when you want to extract a single column as a Series or a Python list.
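A short sketch of both forms, on a hypothetical toy frame with made-up values. The Series keeps the row labels and pandas methods, while tolist() gives a plain Python list of values.

```python
import pandas as pd

# Hypothetical toy frame (values are illustrative only)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5000.0, 3500.0],
})

mass_series = toy["body_mass_g"]         # Series: keeps the index and pandas methods
mass_list = toy["body_mass_g"].tolist()  # plain Python list of floats

print(mass_series.mean())
print(mass_list)
```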
In pandas, use groupby().agg() when you want to calculate summary statistics for groups of rows.
This is equivalent to group_by() + summarise() in the tidyverse.
# mean bill length and mass by species
peng.groupby('species', as_index=False).agg(
    mean_bill_length=('bill_length', 'mean'),
    mean_mass=('mass_g', 'mean'),
    n=('species', 'count')
)
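To see what this named-aggregation pattern returns, here is a minimal sketch on a hypothetical toy frame. The values are made up, and the column names bill_length and mass_g are assumed to match the renamed columns of peng. Unlike the call above, this sketch counts a value column rather than the grouping column; keep in mind that 'count' skips missing values.

```python
import pandas as pd

# Hypothetical toy frame (values are illustrative only)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length": [39.0, 41.0, 46.0, 48.0],
    "mass_g": [3700.0, 3900.0, 5000.0, 5200.0],
})

# Each keyword is new_column=(source_column, aggregation_function)
summary = toy.groupby("species", as_index=False).agg(
    mean_bill_length=("bill_length", "mean"),
    mean_mass=("mass_g", "mean"),
    n=("mass_g", "count"),   # number of non-missing values per group
)
print(summary)
```

With as_index=False the grouping key stays an ordinary column, so the result has one row per species, much like the tibble returned by summarise().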