Installation and Setup
Getting started is straightforward:
python
pip install statsmodels
You’ll also want the usual suspects:
python
pip install numpy pandas matplotlib scipy
Statsmodels plays nicely with pandas DataFrames, which is where it shines. Most examples you’ll find (including mine) assume you’re working with pandas.
python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
The two main imports — sm for the API approach and smf for the formula approach — give you different ways to specify models. We'll explore both.
Linear Regression: The Foundation
Let’s start with the basics. Linear regression in statsmodels gives you way more information than scikit-learn’s version.
The API Approach
python
import statsmodels.api as sm
import pandas as pd
import numpy as np
# Generate sample data
np.random.seed(42)
X = pd.DataFrame({
'feature1': np.random.randn(100),
'feature2': np.random.randn(100),
'feature3': np.random.randn(100)
})
y = 2 * X['feature1'] + 3 * X['feature2'] + np.random.randn(100)
# Add constant term (intercept)
X = sm.add_constant(X)
# Fit model
model = sm.OLS(y, X).fit()
# Get comprehensive summary
print(model.summary())
That summary() output is gold. It shows you coefficients, standard errors, t-statistics, p-values, confidence intervals, R-squared, and diagnostic tests—everything a statistician could want.
Understanding the Summary Output
The summary table answers critical questions:
Coefficient values: How much does each feature affect the outcome?
P-values: Is each relationship statistically significant? (p < 0.05 usually means yes)
Confidence intervals: What’s the range of plausible values for each coefficient?
R-squared: What percentage of variance does the model explain?
I’ve used this to defend model choices in meetings. “Feature X has a coefficient of 2.3 with p-value 0.001 — if the feature had no real effect, a relationship this strong would almost never show up by chance.” That shuts down the skeptics quickly.
The Formula Approach (R-style)
If you’re coming from R or prefer formula notation, statsmodels has you covered:
python
import statsmodels.formula.api as smf
# Create DataFrame with all variables
df = pd.DataFrame({
'y': y,
'feature1': X['feature1'],
'feature2': X['feature2'],
'feature3': X['feature3']
})
# Fit using formula notation
model = smf.ols('y ~ feature1 + feature2 + feature3', data=df).fit()
print(model.summary())
The formula approach automatically adds the intercept and handles categorical variables beautifully. I prefer it for exploratory analysis because it’s more readable.
Logistic Regression: Statistical Classification
Scikit-learn does logistic regression, but statsmodels tells you why your model works.
Binary Classification with Statistics
python
import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
# Load data
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
# Select a few features for interpretability
X_subset = X[['mean radius', 'mean texture', 'mean smoothness']]
X_subset = sm.add_constant(X_subset)
# Fit logistic regression
logit_model = sm.Logit(y, X_subset).fit()
# Get detailed summary
print(logit_model.summary())
# Get odds ratios (more interpretable than raw coefficients)
print("\nOdds Ratios:")
print(np.exp(logit_model.params))
The odds ratios are incredibly useful. An odds ratio of 2.5 for a feature means a one-unit increase in that feature multiplies the odds of the positive class by 2.5. Way more interpretable than raw coefficients.
Predictions with Confidence Intervals
Unlike scikit-learn, statsmodels gives you confidence intervals on its predictions (for Logit, get_prediction requires a recent statsmodels version):
python
predictions = logit_model.get_prediction(X_subset)
# Get predicted probabilities
pred_probs = predictions.predicted_mean
# Get confidence intervals for predictions
pred_intervals = predictions.conf_int()
print(f"Predicted probability: {pred_probs[0]:.3f}")
print(f"95% CI: [{pred_intervals[0, 0]:.3f}, {pred_intervals[0, 1]:.3f}]")
This is huge for business applications. You’re not just saying “70% chance of conversion” — you’re saying “70% chance, with a 95% confidence interval of 65% to 75% for the true probability.” That’s proper quantification of uncertainty.
Handling Categorical Variables Properly
Statsmodels makes categorical variables easy, especially with the formula API.
Automatic Dummy Encoding
python
import statsmodels.formula.api as smf
import pandas as pd
# Sample data with categorical variable
df = pd.DataFrame({
'sales': [100, 150, 200, 180, 220, 250],
'region': ['North', 'South', 'North', 'South', 'West', 'West'],
'marketing_spend': [10, 15, 12, 18, 20, 22]
})
# Formula automatically creates dummy variables
model = smf.ols('sales ~ C(region) + marketing_spend', data=df).fit()
print(model.summary())
The C(region) notation tells statsmodels to treat region as categorical. It automatically creates dummy variables and handles the reference category. No manual one-hot encoding needed.
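By default, the reference category is the first level alphabetically. If you want a different baseline, patsy’s Treatment coding (which the formula API uses under the hood) lets you pick it. A sketch reusing the toy data above — the choice of 'West' as reference is purely illustrative:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'sales': [100, 150, 200, 180, 220, 250],
    'region': ['North', 'South', 'North', 'South', 'West', 'West'],
    'marketing_spend': [10, 15, 12, 18, 20, 22],
})

# Treatment(reference=...) sets the baseline level for the dummies
model = smf.ols(
    'sales ~ C(region, Treatment(reference="West")) + marketing_spend',
    data=df,
).fit()
print(model.params.index.tolist())
```

Now every region coefficient is interpreted relative to West, which matters when stakeholders think of one region as the natural baseline.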
Interactions Between Variables
Testing if effects vary by category is trivial:
python
model = smf.ols('sales ~ C(region) * marketing_spend', data=df).fit()
print(model.summary())
The * creates interaction terms. Now you can see if marketing spend has different effects in different regions—critical for targeted strategies.
I used this to show that our marketing campaigns worked great in urban areas but flopped in rural ones. Changed our entire regional strategy based on these interaction effects.
Time Series Analysis: Beyond Basic Forecasting
Time series is where statsmodels really shines. It’s got tools that scikit-learn doesn’t even attempt.
ARIMA Models for Forecasting
ARIMA (AutoRegressive Integrated Moving Average) is the classic time series model:
python
import statsmodels.api as sm
import pandas as pd
import numpy as np
# Generate sample time series
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=200, freq='D')
ts = pd.Series(np.cumsum(np.random.randn(200)) + 100, index=dates)
# Fit ARIMA model (p=1, d=1, q=1)
model = sm.tsa.ARIMA(ts, order=(1, 1, 1)).fit()
# Get summary with diagnostics
print(model.summary())
# Make forecasts
forecast = model.forecast(steps=30)
print(forecast)
The order=(p, d, q) parameters control model complexity. Choosing them — typically by inspecting ACF/PACF plots and comparing information criteria — is a whole tutorial in itself.
Seasonal Decomposition
Separating trend, seasonality, and residuals is straightforward:
python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose time series
decomposition = seasonal_decompose(ts, model='additive', period=7)
# Plot components
import matplotlib.pyplot as plt
fig = decomposition.plot()
plt.show()
This reveals patterns you’d miss in raw data. I caught a weekly seasonality pattern that wasn’t obvious until decomposition made it visible. Changed our forecasting approach completely.
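Beyond the plot, the decomposition components are plain pandas Series you can analyze directly. A sketch on a synthetic series with a built-in weekly cycle (the trend slope and cycle amplitude are arbitrary):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=200, freq='D')
# Linear trend + weekly sine cycle + noise
ts = pd.Series(
    np.arange(200) * 0.5
    + 10 * np.sin(2 * np.pi * np.arange(200) / 7)
    + np.random.randn(200),
    index=dates,
)

decomposition = seasonal_decompose(ts, model='additive', period=7)

trend = decomposition.trend        # NaN at the edges (centered moving average)
seasonal = decomposition.seasonal  # exactly repeats every 7 days
resid = decomposition.resid
print(seasonal.iloc[:7].round(2))
</imports>```

Having the components as Series means you can, for instance, subtract the seasonal part to get a deseasonalized series before modeling.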
Stationarity Testing
Time series models assume stationarity. Statsmodels tests for it:
python
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test
result = adfuller(ts)
print(f"ADF Statistic: {result[0]:.4f}")
print(f"p-value: {result[1]:.4f}")
if result[1] < 0.05:
    print("Series is stationary")
else:
    print("Series is non-stationary - consider differencing")
Non-stationary series need differencing before modeling. This test tells you if transformation is needed.
Generalized Linear Models: Beyond OLS
Real data often violates linear regression assumptions. GLMs (Generalized Linear Models) handle different response distributions.
Poisson Regression for Count Data
Predicting counts (number of sales, events, failures) with normal regression is wrong. Poisson regression is right:
python
import statsmodels.api as sm
import pandas as pd
import numpy as np
# Generate count data
np.random.seed(42)
X = pd.DataFrame({
'feature1': np.random.randn(100),
'feature2': np.random.randn(100)
})
X = sm.add_constant(X)
# Counts can't be negative
lambda_vals = np.exp(0.5 * X['feature1'] - 0.3 * X['feature2'])
y = np.random.poisson(lambda_vals)
# Fit Poisson regression
poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_model.summary())
Used this for modeling customer support ticket volumes. Normal regression predicted negative tickets sometimes (impossible). Poisson regression respected the count nature and gave better predictions.
Negative Binomial for Overdispersed Counts
When count variance exceeds the mean (overdispersion), negative binomial is better:
python
import statsmodels.api as sm
# Fit negative binomial
nb_model = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(nb_model.summary())
A quick check for overdispersion: divide the Pearson chi-square reported in the GLM summary by the residual degrees of freedom. A ratio well above 1 suggests overdispersion and favors the negative binomial over the Poisson.
Model Diagnostics: Trust But Verify
Statsmodels makes assumption checking easy. Don’t skip this step — violated assumptions mean unreliable results.
Residual Analysis
python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
# After fitting a model
residuals = model.resid
fitted_values = model.fittedvalues
# Residual plot
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs fitted
axes[0, 0].scatter(fitted_values, residuals)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Q-Q plot for normality
sm.qqplot(residuals, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
# Histogram of residuals
axes[1, 0].hist(residuals, bins=30, edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_title('Histogram of Residuals')
# Scale-location plot
axes[1, 1].scatter(fitted_values, np.sqrt(np.abs(residuals)))
axes[1, 1].set_xlabel('Fitted Values')
axes[1, 1].set_ylabel('√|Residuals|')
axes[1, 1].set_title('Scale-Location Plot')
plt.tight_layout()
plt.show()
What to look for:
- Residuals randomly scattered (no patterns)
- Q-Q plot points on the line (normality)
- Histogram roughly bell-shaped
- Constant variance across fitted values
Patterns in these plots mean model assumptions are violated. Address them before trusting results.
Statistical Tests for Assumptions
Statsmodels includes diagnostic tests:
python
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
# Test for heteroscedasticity
lm, lm_pval, fval, fval_pval = het_breuschpagan(residuals, X)
print(f"Breusch-Pagan test p-value: {lm_pval:.4f}")
if lm_pval < 0.05:
    print("Heteroscedasticity detected - consider robust standard errors")
These tests quantify what visual inspection suggests. P-value < 0.05 means assumption is likely violated.
Robust Standard Errors: When Assumptions Fail
Sometimes assumptions fail but you can’t fix them. Robust standard errors adjust for violations:
python
import statsmodels.api as sm
# Fit model with robust standard errors
model = sm.OLS(y, X).fit(cov_type='HC3')
print(model.summary())
The cov_type='HC3' uses heteroscedasticity-consistent standard errors. Your coefficients stay the same, but p-values and confidence intervals adjust for violated assumptions.
This saved me once when I had heteroscedasticity I couldn’t fix. Robust standard errors let me report valid inference despite the violation.
Comparing Models: Statistical Tests
Should you include that extra feature? Statistical tests answer this formally.
F-Test for Nested Models
python
model_full = sm.OLS(y, X_full).fit()
model_reduced = sm.OLS(y, X_reduced).fit()
# F-test comparing models
from statsmodels.stats.anova import anova_lm
print(anova_lm(model_reduced, model_full))
If p-value < 0.05, the additional features significantly improve the model. If not, the simpler model is preferable (parsimony principle).
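Here’s the comparison end to end on synthetic data, where x2 is pure noise by construction (the variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

np.random.seed(42)
df = pd.DataFrame({'x1': np.random.randn(100),
                   'x2': np.random.randn(100)})
# Only x1 actually drives y; x2 is noise
df['y'] = 2 * df['x1'] + np.random.randn(100)

model_reduced = smf.ols('y ~ x1', data=df).fit()
model_full = smf.ols('y ~ x1 + x2', data=df).fit()

# One row per model; the F-test appears on the full model's row
result = anova_lm(model_reduced, model_full)
print(result)
```

The models must be nested — the reduced model’s features must be a subset of the full model’s — for the F-test to be valid.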
AIC and BIC for Model Selection
python
print(f"Model 1 AIC: {model1.aic:.2f}")
print(f"Model 2 AIC: {model2.aic:.2f}")
print(f"Model 1 BIC: {model1.bic:.2f}")
print(f"Model 2 BIC: {model2.bic:.2f}")
Lower AIC/BIC is better. AIC favors predictive accuracy, BIC penalizes complexity more heavily. Use BIC when you want simpler models.
Feature Selection with Statistical Testing
Statsmodels makes statistically-informed feature selection possible.
Backward Elimination
python
import statsmodels.api as sm
import pandas as pd
def backward_elimination(X, y, significance_level=0.05):
    X = sm.add_constant(X)
    features = X.columns.tolist()
    while len(features) > 1:
        model = sm.OLS(y, X[features]).fit()
        p_values = model.pvalues[1:]  # Exclude intercept
        max_p_value = p_values.max()
        if max_p_value > significance_level:
            excluded_feature = p_values.idxmax()
            features.remove(excluded_feature)
            print(f"Removed {excluded_feature} (p-value: {max_p_value:.4f})")
        else:
            break
    final_model = sm.OLS(y, X[features]).fit()
    return final_model, features
final_model, selected_features = backward_elimination(X, y)
print(f"\nSelected features: {selected_features}")
print(final_model.summary())
This removes features with p-values above your threshold iteratively. You’re left with only statistically significant predictors.
I prefer this to scikit-learn’s automated feature selection when interpretability matters. Every feature in the final model has statistical justification.
Weighted Least Squares: Handling Unequal Variance
When different observations have different reliability, WLS (Weighted Least Squares) accounts for this:
python
import statsmodels.api as sm
import numpy as np
# Observations with different variance
weights = 1 / variance_estimates  # higher weight for lower variance; variance_estimates is your per-observation variance, assumed known
# Fit weighted model
wls_model = sm.WLS(y, X, weights=weights).fit()
print(wls_model.summary())
Used this for survey data where some responses were more reliable than others. Weighting by reliability improved model quality significantly.
Practical Tips and Best Practices
After using statsmodels for years, here’s what actually matters:
Always Check Model Diagnostics
Don’t skip residual plots and diagnostic tests. I’ve seen people report results from models with severe assumption violations. Those results are unreliable at best, misleading at worst.
Use Formula API for Exploration
The formula interface is more readable and handles categoricals automatically. Great for initial exploration. Switch to API approach when you need more control.
Report Confidence Intervals, Not Just P-Values
P-values tell you if an effect exists. Confidence intervals tell you the plausible magnitude. Both matter for decision-making.
Understand What Statistical Significance Means
P-value < 0.05 doesn’t mean the relationship is important, just that it’s unlikely due to chance. A statistically significant but tiny effect might not matter practically.
Combine Statsmodels and Scikit-Learn
Use statsmodels for understanding and inference, scikit-learn for prediction. They complement each other beautifully. I routinely build initial models in statsmodels, understand what matters, then deploy optimized versions in scikit-learn.
When to Use Statsmodels vs Scikit-Learn
Let me settle this directly.
Use statsmodels when:
- You need to explain which features matter and why
- Statistical significance testing is required
- You’re doing time series analysis
- Academic or regulatory requirements demand proper inference
- Understanding uncertainty is critical
Use scikit-learn when:
- Pure prediction accuracy is the goal
- You’re deploying production models at scale
- You need ensemble methods or neural networks
- Speed and efficiency matter more than interpretability
- You’re prototyping quickly
Use both when:
- You need to understand AND predict
- Results require both accuracy and statistical rigor
- Stakeholders want both predictions and explanations
I’ve never regretted learning both. They solve different problems, and real-world projects usually need both.
The Statistical Mindset
Here’s what statsmodels taught me: prediction and understanding are different goals. Scikit-learn optimizes for prediction. Statsmodels optimizes for understanding.
Sometimes you need to know your model is 85% accurate. Other times you need to know Feature X has a statistically significant effect of 2.3 units with 95% confidence between 1.8 and 2.8. Different questions, different tools.
The best data scientists I know use both approaches fluently. They predict when prediction matters and test hypotheses when understanding matters. Statsmodels gives you the statistical rigor that pure machine learning sometimes lacks.
Don’t treat statistics as outdated. P-values and confidence intervals aren’t relics — they’re tools for quantifying uncertainty properly. In a field where everyone claims their model “works,” statistics lets you say how well it works and how certain you are about that claim.
Now go build models you can actually explain statistically :)