Installation and Setup
Getting started is straightforward:
python
pip install statsmodels
You’ll also want the usual suspects:
python
pip install numpy pandas matplotlib scipy
Statsmodels plays nicely with pandas DataFrames, which is where it shines. Most examples you’ll find (including mine) assume you’re working with pandas.
python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
The two main imports — sm for the API approach and smf for the formula approach — give you different ways to specify models. We'll explore both.
Linear Regression: The Foundation
Let’s start with the basics. Linear regression in statsmodels gives you way more information than scikit-learn’s version.
The API Approach
python
import statsmodels.api as sm
import pandas as pd
import numpy as np
# Generate sample data
np.random.seed(42)
X = pd.DataFrame({
'feature1': np.random.randn(100),
'feature2': np.random.randn(100),
'feature3': np.random.randn(100)
})
y = 2 * X['feature1'] + 3 * X['feature2'] + np.random.randn(100)
# Add constant term (intercept)
X = sm.add_constant(X)
# Fit model
model = sm.OLS(y, X).fit()
# Get comprehensive summary
print(model.summary())
That summary() output is gold. It shows you coefficients, standard errors, t-statistics, p-values, confidence intervals, R-squared, and diagnostic tests—everything a statistician could want.
Understanding the Summary Output
The summary table answers critical questions:
Coefficient values: How much does each feature affect the outcome?
P-values: Is each relationship statistically significant? (p < 0.05 usually means yes)
Confidence intervals: What’s the range of plausible values for each coefficient?
R-squared: What percentage of variance does the model explain?
I’ve used this to defend model choices in meetings. “Feature X has a coefficient of 2.3 with p-value 0.001 — if the feature had no real effect, a relationship this strong would almost never show up by chance.” That shuts down the skeptics quickly.
The Formula Approach (R-style)
If you’re coming from R or prefer formula notation, statsmodels has you covered:
python
import statsmodels.formula.api as smf
# Create DataFrame with all variables
df = pd.DataFrame({
'y': y,
'feature1': X['feature1'],
'feature2': X['feature2'],
'feature3': X['feature3']
})
# Fit using formula notation
model = smf.ols('y ~ feature1 + feature2 + feature3', data=df).fit()
print(model.summary())
The formula approach automatically adds the intercept and handles categorical variables beautifully. I prefer it for exploratory analysis because it’s more readable.
Logistic Regression: Statistical Classification
Scikit-learn does logistic regression, but statsmodels tells you why your model works.
Binary Classification with Statistics
python
import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
# Load data
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
# Select a few features for interpretability
X_subset = X[['mean radius', 'mean texture', 'mean smoothness']]
X_subset = sm.add_constant(X_subset)
# Fit logistic regression
logit_model = sm.Logit(y, X_subset).fit()
# Get detailed summary
print(logit_model.summary())
# Get odds ratios (more interpretable than raw coefficients)
print("\nOdds Ratios:")
print(np.exp(logit_model.params))
The odds ratios are incredibly useful. An odds ratio of 2.5 for a feature means a one-unit increase in that feature multiplies the odds of the positive class by 2.5. Way more interpretable than raw coefficients.
Predictions with Confidence Intervals
Unlike scikit-learn, statsmodels gives you confidence intervals on its predictions (for Logit, get_prediction requires a recent statsmodels version):
python
predictions = logit_model.get_prediction(X_subset)
# Get predicted probabilities
pred_probs = predictions.predicted_mean
# Get confidence intervals for predictions
pred_intervals = predictions.conf_int()
print(f"Predicted probability: {pred_probs[0]:.3f}")
print(f"95% CI: [{pred_intervals[0, 0]:.3f}, {pred_intervals[0, 1]:.3f}]")
This is huge for business applications. You’re not just saying “70% chance of conversion” — you’re saying “70% chance, with a 95% confidence interval of 65% to 75% for the true probability.” That’s proper quantification of uncertainty.
Handling Categorical Variables Properly
Statsmodels makes categorical variables easy, especially with the formula API.
Automatic Dummy Encoding
python
import statsmodels.formula.api as smf
import pandas as pd
# Sample data with categorical variable
df = pd.DataFrame({
'sales': [100, 150, 200, 180, 220, 250],
'region': ['North', 'South', 'North', 'South', 'West', 'West'],
'marketing_spend': [10, 15, 12, 18, 20, 22]
})
# Formula automatically creates dummy variables
model = smf.ols('sales ~ C(region) + marketing_spend', data=df).fit()
print(model.summary())
The C(region) notation tells statsmodels to treat region as categorical. It automatically creates dummy variables and handles the reference category. No manual one-hot encoding needed.
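By default, the reference category is the first level alphabetically. If you want a different baseline, patsy’s Treatment coding (which the formula API uses under the hood) lets you pick it. A sketch reusing the toy data above — the choice of 'West' as reference is purely illustrative:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'sales': [100, 150, 200, 180, 220, 250],
    'region': ['North', 'South', 'North', 'South', 'West', 'West'],
    'marketing_spend': [10, 15, 12, 18, 20, 22],
})

# Treatment(reference=...) sets the baseline level for the dummies
model = smf.ols(
    'sales ~ C(region, Treatment(reference="West")) + marketing_spend',
    data=df,
).fit()
print(model.params.index.tolist())
```

Now every region coefficient is interpreted relative to West, which matters when stakeholders think of one region as the natural baseline.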
Interactions Between Variables
Testing if effects vary by category is trivial:
python
model = smf.ols('sales ~ C(region) * marketing_spend', data=df).fit()
print(model.summary())
The * creates interaction terms. Now you can see if marketing spend has different effects in different regions—critical for targeted strategies.
I used this to show that our marketing campaigns worked great in urban areas but flopped in rural ones. Changed our entire regional strategy based on these interaction effects.
Time Series Analysis: Beyond Basic Forecasting
Time series is where statsmodels really shines. It’s got tools that scikit-learn doesn’t even attempt.
ARIMA Models for Forecasting
ARIMA (AutoRegressive Integrated Moving Average) is the classic time series model:
python
import statsmodels.api as sm
import pandas as pd
import numpy as np
# Generate sample time series
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=200, freq='D')
ts = pd.Series(np.cumsum(np.random.randn(200)) + 100, index=dates)
# Fit ARIMA model (p=1, d=1, q=1)
model = sm.tsa.ARIMA(ts, order=(1, 1, 1)).fit()
# Get summary with diagnostics
print(model.summary())
# Make forecasts
forecast = model.forecast(steps=30)
print(forecast)
The order=(p, d, q) parameters control model complexity. Choosing them — typically by inspecting ACF/PACF plots and comparing information criteria — is a whole tutorial in itself.
Seasonal Decomposition
Separating trend, seasonality, and residuals is straightforward:
python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose time series
decomposition = seasonal_decompose(ts, model='additive', period=7)
# Plot components
import matplotlib.pyplot as plt
fig = decomposition.plot()
plt.show()
This reveals patterns you’d miss in raw data. I caught a weekly seasonality pattern that wasn’t obvious until decomposition made it visible. Changed our forecasting approach completely.
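Beyond the plot, the decomposition components are plain pandas Series you can analyze directly. A sketch on a synthetic series with a built-in weekly cycle (the trend slope and cycle amplitude are arbitrary):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=200, freq='D')
# Linear trend + weekly sine cycle + noise
ts = pd.Series(
    np.arange(200) * 0.5
    + 10 * np.sin(2 * np.pi * np.arange(200) / 7)
    + np.random.randn(200),
    index=dates,
)

decomposition = seasonal_decompose(ts, model='additive', period=7)

trend = decomposition.trend        # NaN at the edges (centered moving average)
seasonal = decomposition.seasonal  # exactly repeats every 7 days
resid = decomposition.resid
print(seasonal.iloc[:7].round(2))
</imports>```

Having the components as Series means you can, for instance, subtract the seasonal part to get a deseasonalized series before modeling.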
Stationarity Testing
Time series models assume stationarity. Statsmodels tests for it:
python
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test
result = adfuller(ts)
print(f"ADF Statistic: {result[0]:.4f}")
print(f"p-value: {result[1]:.4f}")
if result[1] < 0.05:
    print("Series is stationary")
else:
    print("Series is non-stationary - consider differencing")
Non-stationary series need differencing before modeling. This test tells you if transformation is needed.
Generalized Linear Models: Beyond OLS
Real data often violates linear regression assumptions. GLMs (Generalized Linear Models) handle different response distributions.
Poisson Regression for Count Data
Predicting counts (number of sales, events, failures) with normal regression is wrong. Poisson regression is right:
python
import statsmodels.api as sm
import pandas as pd
import numpy as np
# Generate count data
np.random.seed(42)
X = pd.DataFrame({
'feature1': np.random.randn(100),
'feature2': np.random.randn(100)
})
X = sm.add_constant(X)
# Counts can't be negative
lambda_vals = np.exp(0.5 * X['feature1'] - 0.3 * X['feature2'])
y = np.random.poisson(lambda_vals)
# Fit Poisson regression
poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_model.summary())
Used this for modeling customer support ticket volumes. Normal regression predicted negative tickets sometimes (impossible). Poisson regression respected the count nature and gave better predictions.
Negative Binomial for Overdispersed Counts
When count variance exceeds the mean (overdispersion), negative binomial is better:
python
import statsmodels.api as sm
# Fit negative binomial
nb_model = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(nb_model.summary())
A quick check for overdispersion: divide the Pearson chi-square reported in the GLM summary by the residual degrees of freedom. A ratio well above 1 suggests overdispersion and favors the negative binomial over the Poisson.
Model Diagnostics: Trust But Verify
Statsmodels makes assumption checking easy. Don’t skip this step — violated assumptions mean unreliable results.
Residual Analysis
python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
# After fitting a model
residuals = model.resid
fitted_values = model.fittedvalues
# Residual plot
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs fitted
axes[0, 0].scatter(fitted_values, residuals)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Q-Q plot for normality
sm.qqplot(residuals, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
# Histogram of residuals
axes[1, 0].hist(residuals, bins=30, edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_title('Histogram of Residuals')
# Scale-location plot
axes[1, 1].scatter(fitted_values, np.sqrt(np.abs(residuals)))
axes[1, 1].set_xlabel('Fitted Values')
axes[1, 1].set_ylabel('√|Residuals|')
axes[1, 1].set_title('Scale-Location Plot')
plt.tight_layout()
plt.show()
What to look for:
- Residuals randomly scattered (no patterns)
- Q-Q plot points on the line (normality)
- Histogram roughly bell-shaped
- Constant variance across fitted values
Patterns in these plots mean model assumptions are violated. Address them before trusting results.
Statistical Tests for Assumptions
Statsmodels includes diagnostic tests:
python
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
# Test for heteroscedasticity
lm, lm_pval, fval, fval_pval = het_breuschpagan(residuals, X)
print(f"Breusch-Pagan test p-value: {lm_pval:.4f}")
if lm_pval < 0.05:
    print("Heteroscedasticity detected - consider robust standard errors")
These tests quantify what visual inspection suggests. P-value < 0.05 means assumption is likely violated.
Robust Standard Errors: When Assumptions Fail
Sometimes assumptions fail but you can’t fix them. Robust standard errors adjust for violations:
python
import statsmodels.api as sm
# Fit model with robust standard errors
model = sm.OLS(y, X).fit(cov_type='HC3')
print(model.summary())
The cov_type='HC3' uses heteroscedasticity-consistent standard errors. Your coefficients stay the same, but p-values and confidence intervals adjust for violated assumptions.
This saved me once when I had heteroscedasticity I couldn’t fix. Robust standard errors let me report valid inference despite the violation.
Comparing Models: Statistical Tests
Should you include that extra feature? Statistical tests answer this formally.
F-Test for Nested Models
python
model_full = sm.OLS(y, X_full).fit()
model_reduced = sm.OLS(y, X_reduced).fit()
# F-test comparing models
from statsmodels.stats.anova import anova_lm
print(anova_lm(model_reduced, model_full))
If p-value < 0.05, the additional features significantly improve the model. If not, the simpler model is preferable (parsimony principle).
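Here’s the comparison end to end on synthetic data, where x2 is pure noise by construction (the variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

np.random.seed(42)
df = pd.DataFrame({'x1': np.random.randn(100),
                   'x2': np.random.randn(100)})
# Only x1 actually drives y; x2 is noise
df['y'] = 2 * df['x1'] + np.random.randn(100)

model_reduced = smf.ols('y ~ x1', data=df).fit()
model_full = smf.ols('y ~ x1 + x2', data=df).fit()

# One row per model; the F-test appears on the full model's row
result = anova_lm(model_reduced, model_full)
print(result)
```

The models must be nested — the reduced model’s features must be a subset of the full model’s — for the F-test to be valid.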
AIC and BIC for Model Selection
python
print(f"Model 1 AIC: {model1.aic:.2f}")
print(f"Model 2 AIC: {model2.aic:.2f}")
print(f"Model 1 BIC: {model1.bic:.2f}")
print(f"Model 2 BIC: {model2.bic:.2f}")
Lower AIC/BIC is better. AIC favors predictive accuracy, BIC penalizes complexity more heavily. Use BIC when you want simpler models.
Feature Selection with Statistical Testing
Statsmodels makes statistically-informed feature selection possible.
Backward Elimination
python
import statsmodels.api as sm
import pandas as pd
def backward_elimination(X, y, significance_level=0.05):
    X = sm.add_constant(X)
    features = X.columns.tolist()
    while len(features) > 1:
        model = sm.OLS(y, X[features]).fit()
        p_values = model.pvalues[1:]  # Exclude intercept
        max_p_value = p_values.max()
        if max_p_value > significance_level:
            excluded_feature = p_values.idxmax()
            features.remove(excluded_feature)
            print(f"Removed {excluded_feature} (p-value: {max_p_value:.4f})")
        else:
            break
    final_model = sm.OLS(y, X[features]).fit()
    return final_model, features
final_model, selected_features = backward_elimination(X, y)
print(f"\nSelected features: {selected_features}")
print(final_model.summary())
This removes features with p-values above your threshold iteratively. You’re left with only statistically significant predictors.
I prefer this to scikit-learn’s automated feature selection when interpretability matters. Every feature in the final model has statistical justification.
Weighted Least Squares: Handling Unequal Variance
When different observations have different reliability, WLS (Weighted Least Squares) accounts for this:
python
import statsmodels.api as sm
import numpy as np
# Observations with different variance
weights = 1 / variance_estimates  # higher weight for lower variance; variance_estimates is your per-observation variance, assumed known
# Fit weighted model
wls_model = sm.WLS(y, X, weights=weights).fit()
print(wls_model.summary())
Used this for survey data where some responses were more reliable than others. Weighting by reliability improved model quality significantly.
Practical Tips and Best Practices
After using statsmodels for years, here’s what actually matters:
Always Check Model Diagnostics
Don’t skip residual plots and diagnostic tests. I’ve seen people report results from models with severe assumption violations. Those results are unreliable at best, misleading at worst.
Use Formula API for Exploration
The formula interface is more readable and handles categoricals automatically. Great for initial exploration. Switch to API approach when you need more control.
Report Confidence Intervals, Not Just P-Values
P-values tell you if an effect exists. Confidence intervals tell you the plausible magnitude. Both matter for decision-making.
Understand What Statistical Significance Means
P-value < 0.05 doesn’t mean the relationship is important, just that it’s unlikely due to chance. A statistically significant but tiny effect might not matter practically.
Combine Statsmodels and Scikit-Learn
Use statsmodels for understanding and inference, scikit-learn for prediction. They complement each other beautifully. I routinely build initial models in statsmodels, understand what matters, then deploy optimized versions in scikit-learn.
When to Use Statsmodels vs Scikit-Learn
Let me settle this directly.
Use statsmodels when:
- You need to explain which features matter and why
- Statistical significance testing is required
- You’re doing time series analysis
- Academic or regulatory requirements demand proper inference
- Understanding uncertainty is critical
Use scikit-learn when:
- Pure prediction accuracy is the goal
- You’re deploying production models at scale
- You need ensemble methods or neural networks
- Speed and efficiency matter more than interpretability
- You’re prototyping quickly
Use both when:
- You need to understand AND predict
- Results require both accuracy and statistical rigor
- Stakeholders want both predictions and explanations
I’ve never regretted learning both. They solve different problems, and real-world projects usually need both.
The Statistical Mindset
Here’s what statsmodels taught me: prediction and understanding are different goals. Scikit-learn optimizes for prediction. Statsmodels optimizes for understanding.
Sometimes you need to know your model is 85% accurate. Other times you need to know Feature X has a statistically significant effect of 2.3 units with 95% confidence between 1.8 and 2.8. Different questions, different tools.
The best data scientists I know use both approaches fluently. They predict when prediction matters and test hypotheses when understanding matters. Statsmodels gives you the statistical rigor that pure machine learning sometimes lacks.
Don’t treat statistics as outdated. P-values and confidence intervals aren’t relics — they’re tools for quantifying uncertainty properly. In a field where everyone claims their model “works,” statistics lets you say how well it works and how certain you are about that claim.
Now go build models you can actually explain statistically :)