17 SciPy + Statsmodel

1 Introduction

SciPy and Statsmodels are complementary Python libraries widely used for statistical analysis, but they serve different purposes within the analytical workflow.

  • SciPy (scipy.stats) provides core statistical functionality, including probability distributions, summary statistics, and classical hypothesis tests such as t-tests, chi-square tests, and ANOVA. It is typically used for standalone inferential procedures.
  • Statsmodels is designed for statistical modeling and inference. It offers a unified framework for estimating regression models, conducting hypothesis tests within models, and performing diagnostic checks, with a formula interface similar to R.
  • Practical distinction: Use SciPy when the goal is to perform individual hypothesis tests. Use Statsmodels when the objective is to model relationships among variables and obtain detailed inferential output.

Load data:

import numpy as np
import pandas as pd
penguins = pd.read_csv("data/penguins.csv").dropna()
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007

2 One Sample \(t\) Test

Importing scipy.stats:

from scipy import stats

Research question: Is the mean body mass different from 4000 g?

stats.ttest_1samp(penguins["body_mass_g"], popmean=4000)
TtestResult(statistic=np.float64(4.6924522789847325), pvalue=np.float64(3.952443195315995e-06), df=np.int64(332))

Decision rule: Reject \(H_0\) if \(p < 0.05\).


3 Independent Two Sample \(t\) Test

Research question: Do male and female penguins differ in mean body mass?

male = penguins.loc[penguins.sex == "male", "body_mass_g"]
female = penguins.loc[penguins.sex == "female", "body_mass_g"]

stats.ttest_ind(male, female, equal_var=False)
TtestResult(statistic=np.float64(8.554537231165762), pvalue=np.float64(4.793891255051492e-16), df=np.float64(323.89588102864843))

4 Chi Square Test of Independence

Research question: Is species associated with sex?

table = pd.crosstab(penguins["species"], penguins["sex"])
stats.chi2_contingency(table)
Chi2ContingencyResult(statistic=np.float64(0.04860717014078318), pvalue=np.float64(0.9759893689765846), dof=2, expected_freq=array([[72.34234234, 73.65765766],
       [33.69369369, 34.30630631],
       [58.96396396, 60.03603604]]))

5 One Way ANOVA

Research question: Does mean body mass differ across species?

adelie = penguins.loc[penguins.species == "Adelie", "body_mass_g"]
chinstrap = penguins.loc[penguins.species == "Chinstrap", "body_mass_g"]
gentoo = penguins.loc[penguins.species == "Gentoo", "body_mass_g"]

stats.f_oneway(adelie, chinstrap, gentoo)
F_onewayResult(statistic=np.float64(341.8948949481461), pvalue=np.float64(3.74450512630046e-81))

6 Simple Linear Regression

Statsmodels uses a formula interface similar to R, which simplifies handling of categorical variables and interactions.

Model: Body mass predicted by flipper length

import statsmodels.formula.api as smf

model1 = smf.ols(
  "body_mass_g ~ flipper_length_mm",
  data=penguins
).fit()

model1.summary()
OLS Regression Results
Dep. Variable: body_mass_g R-squared: 0.762
Model: OLS Adj. R-squared: 0.761
Method: Least Squares F-statistic: 1060.
Date: Sat, 21 Feb 2026 Prob (F-statistic): 3.13e-105
Time: 00:24:52 Log-Likelihood: -2461.1
No. Observations: 333 AIC: 4926.
Df Residuals: 331 BIC: 4934.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -5872.0927 310.285 -18.925 0.000 -6482.472 -5261.713
flipper_length_mm 50.1533 1.540 32.562 0.000 47.123 53.183
Omnibus: 5.922 Durbin-Watson: 2.116
Prob(Omnibus): 0.052 Jarque-Bera (JB): 5.876
Skew: 0.325 Prob(JB): 0.0530
Kurtosis: 3.025 Cond. No. 2.90e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

7 Multiple Linear Regression

model2 = smf.ols(
  "body_mass_g ~ flipper_length_mm + bill_length_mm + C(sex)",
  data=penguins
).fit()

model2.summary()
OLS Regression Results
Dep. Variable: body_mass_g R-squared: 0.807
Model: OLS Adj. R-squared: 0.805
Method: Least Squares F-statistic: 457.1
Date: Sat, 21 Feb 2026 Prob (F-statistic): 5.89e-117
Time: 00:24:52 Log-Likelihood: -2426.7
No. Observations: 333 AIC: 4861.
Df Residuals: 329 BIC: 4877.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -5433.5336 286.558 -18.961 0.000 -5997.251 -4869.816
C(sex)[T.male] 358.6309 41.572 8.627 0.000 276.851 440.411
flipper_length_mm 48.2093 1.841 26.179 0.000 44.587 51.832
bill_length_mm -5.2012 4.860 -1.070 0.285 -14.762 4.360
Omnibus: 0.549 Durbin-Watson: 1.751
Prob(Omnibus): 0.760 Jarque-Bera (JB): 0.664
Skew: 0.081 Prob(JB): 0.718
Kurtosis: 2.852 Cond. No. 3.03e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.03e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

8 Residual Diagnostics

import matplotlib.pyplot as plt

resid = model2.resid
fitted = model2.fittedvalues

plt.scatter(fitted, resid)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()


9 Normality of Residuals

import statsmodels.api as sm

sm.qqplot(resid, line="45")
plt.show()