11 Pandas DataFrame

The pandas package provides the DataFrame data type for working with structured data — that is, data organized in rows and columns.

While NumPy arrays and Python dictionaries can also represent structured data, pandas DataFrames are much more flexible and powerful for data manipulation, analysis, and visualization.

If you are familiar with R’s tidyverse, pandas is Python’s equivalent to the dplyr + tibble ecosystem.

This lecture presents pandas operations using the well-known Palmer Penguins dataset.

1 Introduction to DataFrames

import pandas as pd
import numpy as np

A pandas DataFrame is like a tibble — it stores data in a table with rows and columns, and its columns can hold different data types (numbers, text, and more).

Creating a DataFrame

employee_dict = {
    'eid': [214, 174, 126, 227, 151, 175, 193, 146],
    'name': ['Drew', 'Faye', 'Beth', 'Chad', 'Hana', 'Gary', 'Alex', 'Emma'],
    'age': [25, 19, 42, 31, 25, 28, 31, 25],
    'rate': [11.50, 12.30, 14.60, 12.50, None, 15.60, 13.50, 14.30],
    'hours': [38, 32, 40, 29, 40, 36, 24, 20]
}

employee_df = pd.DataFrame(employee_dict)
employee_df.head()
eid name age rate hours
0 214 Drew 25 11.5 38
1 174 Faye 19 12.3 32
2 126 Beth 42 14.6 40
3 227 Chad 31 12.5 29
4 151 Hana 25 NaN 40
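A DataFrame can also be built from other structures, for example a list of row dictionaries. A small sketch with made-up rows:

```python
import pandas as pd

# each dict is one row; keys become column names
rows = [
    {"eid": 214, "name": "Drew", "age": 25},
    {"eid": 174, "name": "Faye", "age": 19},
]
df = pd.DataFrame(rows)
print(df.shape)     # (2, 3)
```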

2 Data Import

# Importing CSV files:
df = pd.read_csv("data.csv") 

# Other common CSV options:
df = pd.read_csv("file.csv", sep=",")
df = pd.read_csv("file.csv", header=0)
df = pd.read_csv("file.csv", na_values=["", "NA"])
df = pd.read_csv("file.csv", encoding="utf-8")

# Importing TSV or other delimiters:
df = pd.read_csv("data.tsv", sep="\t")

# Importing from URLs:
df = pd.read_csv("https://example.com/data.csv")
# Importing Excel files
# Requires `openpyxl` or `xlrd` package.
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Import multiple sheets
dfs = pd.read_excel("data.xlsx", sheet_name=None) 
# Stata (.dta)
df = pd.read_stata("data.dta") 
# SPSS (.sav), SAS (.sas7bdat)
import pyreadstat
df, meta = pyreadstat.read_sav("data.sav")
df, meta = pyreadstat.read_sas7bdat("data.sas7bdat")
# R (.RData)
import pyreadr
result = pyreadr.read_r("file.RData")
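Since read_csv() also accepts any file-like object, the options above can be tried on an in-memory string without touching the disk. A minimal sketch with made-up data:

```python
import io
import pandas as pd

# an in-memory "CSV file" with a non-default delimiter and an NA marker
csv_text = "name;score\nAda;91\nBen;NA\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",            # non-default delimiter
    na_values=["NA"],   # treat "NA" as missing
)
print(df)
```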

3 Penguins data

penguins = pd.read_csv("data/penguins.csv") 
  • Quick look at the data:
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
  • Using describe() we can get a summary of numeric variables:
penguins.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
count 342.000000 342.000000 342.000000 342.000000 344.000000
mean 43.921930 17.151170 200.915205 4201.754386 2008.029070
std 5.459584 1.974793 14.061714 801.954536 0.818356
min 32.100000 13.100000 172.000000 2700.000000 2007.000000
25% 39.225000 15.600000 190.000000 3550.000000 2007.000000
50% 44.450000 17.300000 197.000000 4050.000000 2008.000000
75% 48.500000 18.700000 213.000000 4750.000000 2009.000000
max 59.600000 21.500000 231.000000 6300.000000 2009.000000

4 Accessing Elements

  • Each DataFrame has row labels (index) and column labels (names).
  • Use iloc[] to access by integer position and loc[] to access by label.
penguins.iloc[2, 3]        # 3rd row, 4th column (position-based)
penguins.iloc[4, 1]        # 5th row, 2nd column (position-based)
penguins.loc[4, 'island']  # label-based
'Torgersen'
Note

Here, the row label (4) happens to match its position, but this is not always true (for example, after filtering or sorting), so use loc with care.

  • Both loc and iloc can take two arguments:
df.loc[row_selector , column_selector]
df.iloc[row_selector , column_selector]

The column selector is optional. If you don’t specify it, pandas assumes “all columns”. For example, df.loc[1] is equivalent to df.loc[1, :] — both select all columns from the row with label 1.
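To see why the label/position distinction matters, here is a small sketch with a frame whose row labels differ from their positions:

```python
import pandas as pd

# row labels are deliberately out of order
df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

print(df.loc[0, "x"])   # label 0 is the second row  -> 20
print(df.iloc[0, 0])    # position 0 is the first row -> 10
```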

5 Selecting Columns

  • In pandas, use column selection (df[['col']] or .loc) when you want to keep only certain columns from a DataFrame.
# select single column (Series)
species = penguins['species']
type(species), species.head()
(pandas.core.series.Series,
 0    Adelie
 1    Adelie
 2    Adelie
 3    Adelie
 4    Adelie
 Name: species, dtype: object)
# select multiple columns (DataFrame)
peng_sel = penguins[['species', 'island', 'sex', 'bill_length_mm', 'body_mass_g']]
peng_sel.head()
species island sex bill_length_mm body_mass_g
0 Adelie Torgersen male 39.1 3750.0
1 Adelie Torgersen female 39.5 3800.0
2 Adelie Torgersen female 40.3 3250.0
3 Adelie Torgersen NaN NaN NaN
4 Adelie Torgersen female 36.7 3450.0
  • You can also use .loc for column selection:
penguins.loc[:, ['species', 'sex']].head()
species sex
0 Adelie male
1 Adelie female
2 Adelie female
3 Adelie NaN
4 Adelie female
  • Select a range of rows and columns with loc (label slices include both endpoints):
penguins.loc[1:3, 'bill_length_mm':'body_mass_g']
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 39.5 17.4 186.0 3800.0
2 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
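Two other handy selectors, shown on a small made-up frame: select_dtypes() picks columns by type, and filter(like=...) picks columns by name pattern:

```python
import pandas as pd

df = pd.DataFrame({
    "bill_length_mm": [39.1, 40.3],
    "bill_depth_mm": [18.7, 18.0],
    "species": ["Adelie", "Adelie"],
})

numeric = df.select_dtypes("number")   # only numeric columns
bills = df.filter(like="bill")         # columns whose name contains "bill"

print(list(numeric.columns))
print(list(bills.columns))
```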

6 Filtering Rows

  • In pandas, use row filtering (df[condition] or .query()) when you want to keep only the rows that satisfy a condition.
# boolean mask
males = penguins[penguins['sex'] == 'male']
males.shape
(168, 8)
# multiple conditions: males with body mass > 3500 g
sel = (penguins['sex'] == 'male') & (penguins['body_mass_g'] > 3500)
penguins.loc[sel, ['species', 'sex', 'body_mass_g']].head()
species sex body_mass_g
0 Adelie male 3750.0
5 Adelie male 3650.0
7 Adelie male 4675.0
13 Adelie male 3800.0
14 Adelie male 4400.0
# using query() (tidy-like syntax)
penguins.query("sex == 'female' and body_mass_g > 3500").head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
6 Adelie Torgersen 38.9 17.8 181.0 3625.0 female 2007
15 Adelie Torgersen 36.6 17.8 185.0 3700.0 female 2007
22 Adelie Biscoe 35.9 19.2 189.0 3800.0 female 2007
25 Adelie Biscoe 35.3 18.9 187.0 3800.0 female 2007
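Another common filter is membership, the pandas analogue of R's %in%. A small sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
})

# isin() keeps rows whose value is in the given list
sel = df[df["species"].isin(["Adelie", "Gentoo"])]

# the same filter in query() syntax
sel_q = df.query("species in ['Adelie', 'Gentoo']")
print(sel)
```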

7 Arranging Rows

  • In pandas, use sort_values() when you want to sort the rows of a DataFrame by one or more columns.
# sort by body_mass_g descending
penguins.sort_values(by='body_mass_g', ascending=False).head()

# sort by species then bill_length_mm (asc, desc)
penguins.sort_values(by=['species', 'bill_length_mm'], ascending=[True, False]).head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
19 Adelie Torgersen 46.0 21.5 194.0 4200.0 male 2007
73 Adelie Torgersen 45.8 18.9 197.0 4150.0 male 2008
111 Adelie Biscoe 45.6 20.3 191.0 4600.0 male 2009
43 Adelie Dream 44.1 19.7 196.0 4400.0 male 2007
129 Adelie Torgersen 44.1 18.0 210.0 4000.0 male 2009
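For "top n by a column", nlargest() and nsmallest() are a convenient shortcut for sort_values() followed by head(). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "mass": [3750, 5200, 3250, 4675]})

# two heaviest rows, equivalent to sort_values("mass", ascending=False).head(2)
top2 = df.nlargest(2, "mass")
print(top2)
```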

8 Slicing Rows

  • In pandas, use iloc row slicing when you want to select rows by their numeric positions (such as first n rows or a specific range).

  • Equivalent of R’s slice() / slice_head() / slice_sample():

# first 6 rows (like slice_head(n=6))
penguins.iloc[:6]

# rows at positions 10 through 15 (iloc, like Python slices, excludes the end)
penguins.iloc[10:16]

# random sample of 5 rows (slice_sample)
penguins.sample(n=5, random_state=42)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
194 Gentoo Biscoe 45.3 13.7 210.0 4300.0 female 2008
157 Gentoo Biscoe 46.5 13.5 210.0 4550.0 female 2007
225 Gentoo Biscoe 46.5 14.8 217.0 5200.0 female 2008
208 Gentoo Biscoe 43.8 13.9 208.0 4300.0 female 2008
318 Chinstrap Dream 50.9 19.1 196.0 3550.0 male 2008

9 Pulling Columns

  • In pandas, use df['col'] or df['col'].tolist() when you want to extract a single column as a Series or a Python list.
mass = penguins['body_mass_g']             # pandas Series
names_list = penguins['species'].tolist()  # Python list
type(mass), type(names_list), names_list[:6]
(pandas.core.series.Series,
 list,
 ['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie'])
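If you need a NumPy array rather than a list (for example to feed into NumPy functions), to_numpy() does the job. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([3750.0, 3800.0, 3250.0], name="body_mass_g")

arr = s.to_numpy()   # plain NumPy array, no index attached
print(type(arr))
```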

10 Mutating Columns

  • In pandas, use assign() or direct column assignment when you want to create new variables or modify existing ones.
# create a new column: bill_ratio = bill_length_mm / bill_depth_mm
peng = penguins.copy()
peng = peng.assign(bill_ratio = peng['bill_length_mm'] / peng['bill_depth_mm'])
peng[['bill_length_mm','bill_depth_mm','bill_ratio']].head()
bill_length_mm bill_depth_mm bill_ratio
0 39.1 18.7 2.090909
1 39.5 17.4 2.270115
2 40.3 18.0 2.238889
3 NaN NaN NaN
4 36.7 19.3 1.901554

Recoding with np.select()

  • Use np.select() or pd.cut() for binning. np.select() mimics dplyr::case_when().
conds = [
    peng['body_mass_g'].isna(),
    peng['body_mass_g'] < 3500,
    (peng['body_mass_g'] >= 3500) & (peng['body_mass_g'] < 4500),
    peng['body_mass_g'] >= 4500
]
choices = ['missing', 'light', 'medium', 'heavy']

peng['mass_category'] = np.select(conds, choices, default='unknown')
peng[['body_mass_g','mass_category']].head(10)
body_mass_g mass_category
0 3750.0 medium
1 3800.0 medium
2 3250.0 light
3 NaN missing
4 3450.0 light
5 3650.0 medium
6 3625.0 medium
7 4675.0 heavy
8 3475.0 light
9 4250.0 medium
  • Alternative with pd.cut():
bins = [0, 3500, 4500, np.inf]
labels = ['light', 'medium', 'heavy']
peng['mass_bin'] = pd.cut(peng['body_mass_g'], bins=bins, labels=labels)
peng[['body_mass_g','mass_bin']].head(10)
body_mass_g mass_bin
0 3750.0 medium
1 3800.0 medium
2 3250.0 light
3 NaN NaN
4 3450.0 light
5 3650.0 medium
6 3625.0 medium
7 4675.0 heavy
8 3475.0 light
9 4250.0 medium
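For a simple two-branch recode, np.where() is a lighter alternative to np.select(); note that missing values need explicit handling, since NaN compares as False:

```python
import numpy as np
import pandas as pd

mass = pd.Series([3750.0, 3250.0, None, 4675.0])

# catch NaN first, then split on the threshold
label = np.where(mass.isna(), "missing",
                 np.where(mass < 3500, "light", "not light"))
print(label)
```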

11 Renaming Columns

  • In pandas, use rename(columns={...}) when you want to change column names.
peng = peng.rename(columns={'bill_length_mm': 'bill_length',
                            'bill_depth_mm': 'bill_depth',
                            'body_mass_g': 'mass_g'})
peng.columns
Index(['species', 'island', 'bill_length', 'bill_depth', 'flipper_length_mm',
       'mass_g', 'sex', 'year', 'bill_ratio', 'mass_category', 'mass_bin'],
      dtype='object')
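To rewrite many column names at once (for example stripping a common unit suffix), the .str methods on df.columns can help. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"bill_length_mm": [39.1], "bill_depth_mm": [18.7]})

# drop the trailing "_mm" from every column name in one step
df.columns = df.columns.str.replace("_mm$", "", regex=True)
print(list(df.columns))
```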

12 Grouping and Aggregation

  • In pandas, use groupby().agg() when you want to calculate summary statistics for groups of rows.
  • Equivalent to group_by() + summarise() in the tidyverse.
# mean bill length and mass by species
peng.groupby('species', as_index=False).agg(
    mean_bill_length=('bill_length', 'mean'),
    mean_mass=('mass_g', 'mean'),
    n=('species', 'count')
)
species mean_bill_length mean_mass n
0 Adelie 38.791391 3700.662252 152
1 Chinstrap 48.833824 3733.088235 68
2 Gentoo 47.504878 5076.016260 124
  • Multiple aggregations on same column:
peng.groupby(['species', 'island'], as_index=False).agg(
    n=('species', 'count'),
    mean_mass=('mass_g', 'mean'),
    sd_mass=('mass_g', 'std'),
    median_bill=('bill_length', 'median')
).sort_values(by='n', ascending=False)
species island n mean_mass sd_mass median_bill
4 Gentoo Biscoe 124 5076.016260 504.116237 47.30
3 Chinstrap Dream 68 3733.088235 384.335081 49.55
1 Adelie Dream 56 3688.392857 455.146437 38.55
2 Adelie Torgersen 52 3706.372549 445.107940 38.90
0 Adelie Biscoe 44 3709.659091 487.733722 38.70
  • Use groupby().filter() to keep only the groups with enough observations:
grouped = peng.groupby('species').filter(lambda df: df['mass_g'].notna().sum() >= 30)
grouped['species'].unique()
array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)
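Related to aggregation, groupby().transform() returns a result aligned with the original rows, like mutate() after group_by() in dplyr. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["A", "A", "B", "B"],
    "mass": [3000.0, 4000.0, 5000.0, 5200.0],
})

# transform() broadcasts the group statistic back to every row
df["group_mean"] = df.groupby("species")["mass"].transform("mean")
df["centered"] = df["mass"] - df["group_mean"]
print(df)
```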

13 Row/Column Operations (axis)

  • Row/column reductions:
# column means (for numeric cols)
peng.select_dtypes('number').mean()
bill_length            43.921930
bill_depth             17.151170
flipper_length_mm     200.915205
mass_g               4201.754386
year                 2008.029070
bill_ratio              2.605649
dtype: float64
# sums along an axis (illustrated with a small matrix)
M = np.arange(12).reshape(3,4)
M
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
pd.DataFrame(M).sum(axis=0)
0    12
1    15
2    18
3    21
dtype: int64
pd.DataFrame(M).sum(axis=1)
0     6
1    22
2    38
dtype: int64

14 Chaining

  • Combine several steps into one readable pipeline with method chaining (the pandas analogue of the tidyverse pipe):
summary = (
    peng
    .dropna(subset=['mass_g','bill_length'])
    .assign(bill_ratio=lambda df: df['bill_length'] / df['bill_depth'])
    .groupby(['species'], as_index=False)
    .agg(mean_mass=('mass_g','mean'), mean_bill_ratio=('bill_ratio','mean'))
)
summary
species mean_mass mean_bill_ratio
0 Adelie 3700.662252 2.119726
1 Chinstrap 3733.088235 2.653756
2 Gentoo 5076.016260 3.175592

15 Counting Values

  • In pandas, use value_counts() when you want to count how many times each category or value appears.
penguins.value_counts('species')
species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64
  • For multiple columns, pass a list. You can also turn it into a DataFrame and reset the index for tidy output:
penguins.value_counts(['species', 'sex']).reset_index()
species sex count
0 Adelie female 73
1 Adelie male 73
2 Gentoo male 61
3 Gentoo female 58
4 Chinstrap male 34
5 Chinstrap female 34
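Passing normalize=True turns counts into proportions. A small sketch:

```python
import pandas as pd

s = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap"])

# proportions instead of raw counts; they sum to 1
props = s.value_counts(normalize=True)
print(props)
```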

16 Combining DataFrames

  • pd.concat() is a general function for joining DataFrames vertically or horizontally.
  • It keeps column names, data types, and indexes by default.
  • Data is aligned by column names, and missing values are filled with NaN when needed.

Row-binding (stacking rows)

# Example using penguins data
a = penguins.iloc[0:2]
b = penguins.iloc[2:4]

pd.concat([a, b], axis=0, ignore_index=True)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007

Column-binding (stacking columns)

x = penguins[['bill_length_mm']]
y = penguins[['bill_depth_mm']]

pd.concat([x, y], axis=1)
bill_length_mm bill_depth_mm
0 39.1 18.7
1 39.5 17.4
2 40.3 18.0
3 NaN NaN
4 36.7 19.3
... ... ...
339 55.8 19.8
340 43.5 18.1
341 49.6 18.2
342 50.8 19.0
343 50.2 18.7

344 rows × 2 columns

  • axis=0 → add rows
  • axis=1 → add columns
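pd.concat() only stacks; for key-based joins (like dplyr's left_join()), use pd.merge(). A small sketch with made-up summary tables:

```python
import pandas as pd

left = pd.DataFrame({"species": ["Adelie", "Gentoo"],
                     "mean_mass": [3700, 5076]})
right = pd.DataFrame({"species": ["Adelie", "Chinstrap"],
                      "n": [152, 68]})

# left join on the key column; unmatched keys get NaN
joined = pd.merge(left, right, on="species", how="left")
print(joined)
```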

17 Useful tidy equivalents

tidyverse (R)                 pandas (Python)
select()                      df[['col1','col2']] or df.loc[:, cols]
filter()                      df[cond] or df.query()
arrange()                     df.sort_values()
mutate()                      df['new'] = ... or df.assign()
case_when()                   np.select() or pd.cut()
rename()                      df.rename(columns={})
pull()                        df['col'] or df['col'].tolist()
slice()                       df.iloc[...]
group_by() %>% summarise()    df.groupby().agg()

18 Exercises

  1. Select only the columns: species, island, bill_length, bill_depth, mass_g.
  2. Filter penguins of species "Adelie" and body mass > 3500 g. Show species, island, mass_g.
  3. Arrange penguins by mass_g descending and show top 5.
  4. Slice the middle 10 rows of the dataset (use .iloc).
  5. Mutate: create bill_area = bill_length * bill_depth.
  6. Case_when: create size_category with: mass < 3500 → 'small', 3500 ≤ mass < 4500 → 'medium', mass ≥ 4500 → 'large', NA → 'missing'.
  7. Rename mass_g to body_mass_g and bill_length to bill_len_mm.
  8. Pull the species column as a Python list.
  9. Group by species and compute mean body_mass_g and standard deviation, plus sample size.