Boosting: Sequential Model Improvement
Boosting trains models sequentially, with each new model focusing on examples the previous models got wrong.
AdaBoost: The Original Boosting Algorithm
python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# Create AdaBoost
adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learner
    n_estimators=50,     # Number of boosting rounds
    learning_rate=1.0,   # Contribution of each classifier
    random_state=42
)
# Train
adaboost.fit(X_train, y_train)
# Predict
predictions = adaboost.predict(X_test)
accuracy = adaboost.score(X_test, y_test)
print(f"AdaBoost Accuracy: {accuracy:.4f}")
How AdaBoost works:
- Train weak classifier
- Identify misclassified examples
- Increase weights on those examples
- Train new classifier on reweighted data
- Repeat
- Combine all classifiers with weighted voting
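The reweighting loop above can be sketched by hand. This is a simplified, illustrative AdaBoost for binary labels in {-1, +1} (not sklearn's implementation, which handles multi-class and other details):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def manual_adaboost(X, y, n_rounds=10):
    """Simplified AdaBoost for labels in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1 / n)              # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))  # weighted error (w sums to 1)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # classifier weight
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        w /= w.sum()                           # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def manual_adaboost_predict(stumps, alphas, X):
    # Weighted vote: sign of the alpha-weighted sum of stump predictions
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```

Each round trains a stump on the current weights, then raises the weight of every example that stump got wrong, so the next stump is forced to focus on the hard cases.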
AdaBoost is elegant but has largely been superseded by Gradient Boosting.
Gradient Boosting: The Modern Standard
python
from sklearn.ensemble import GradientBoostingClassifier
# Create Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=100,     # Number of boosting stages
    learning_rate=0.1,    # Shrinks contribution of each tree
    max_depth=3,          # Maximum tree depth
    min_samples_split=5,
    min_samples_leaf=3,
    subsample=0.8,        # Fraction of samples for each tree
    random_state=42
)
# Train
gb.fit(X_train, y_train)
# Predict
predictions = gb.predict(X_test)
accuracy = gb.score(X_test, y_test)
print(f"Gradient Boosting Accuracy: {accuracy:.4f}")
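Under the hood, gradient boosting with squared loss fits each new tree to the residuals of the current ensemble. A minimal regression sketch (illustrative only; sklearn generalizes this to arbitrary differentiable losses):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_gradient_boost(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Squared-loss gradient boosting: each tree fits the current residuals."""
    baseline = y.mean()                  # initial constant prediction
    pred = np.full(len(y), baseline)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred             # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)           # new tree models what's still wrong
        pred += learning_rate * tree.predict(X)  # shrunken update
        trees.append(tree)
    return baseline, trees

def manual_gb_predict(baseline, trees, X, learning_rate=0.1):
    pred = np.full(len(X), baseline)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```

The learning_rate shrinks each tree's correction, which is why a lower rate needs more trees to reach the same training fit.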
Key hyperparameters:
learning_rate (shrinkage):
- Lower = more conservative = better generalization
- 0.01–0.3 typical range
- Lower learning rate requires more estimators
- Start with 0.1
n_estimators (boosting rounds):
- More = better performance (until overfitting)
- 100–1000 typical range
- Monitor validation performance
- Use early stopping when possible
max_depth (tree complexity):
- 3–5 typical for boosting (shallow trees work better)
- Deeper trees = more overfitting risk
- Start with 3
subsample (stochastic gradient boosting):
- Fraction of samples per tree (< 1.0 adds randomness)
- 0.5–1.0 typical range
- Helps prevent overfitting
- 0.8 is good default
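Early stopping is built into GradientBoostingClassifier via n_iter_no_change and validation_fraction — set n_estimators high and let the validation score decide when to stop. A sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gb_es = GradientBoostingClassifier(
    n_estimators=1000,         # generous upper bound
    learning_rate=0.1,
    validation_fraction=0.1,   # held-out fraction used for monitoring
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    tol=1e-4,
    random_state=42
)
gb_es.fit(X_train, y_train)

# n_estimators_ reports how many stages were actually used
print(f"Stopped after {gb_es.n_estimators_} of 1000 stages")
print(f"Test accuracy: {gb_es.score(X_test, y_test):.4f}")
```

This avoids guessing n_estimators up front: you pay only for the rounds that actually improve validation performance.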
Gradient Boosting Regression Example
python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Create regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create model
gb_reg = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
# Train
gb_reg.fit(X_train, y_train)
# Predict
predictions = gb_reg.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MSE: {mse:.4f}")
print(f"R²: {r2:.4f}")
Histogram-Based Gradient Boosting (Faster)
For large datasets, use the histogram-based variant:
python
from sklearn.ensemble import HistGradientBoostingClassifier
# Much faster on large datasets
hist_gb = HistGradientBoostingClassifier(
    max_iter=100,        # Like n_estimators
    learning_rate=0.1,
    max_depth=10,
    random_state=42
)
hist_gb.fit(X_train, y_train)
accuracy = hist_gb.score(X_test, y_test)
print(f"Histogram GB Accuracy: {accuracy:.4f}")
This is dramatically faster than regular GradientBoosting on datasets with 10K+ samples. Use it when training time matters.
Stacking: Meta-Learning from Multiple Models
Stacking trains a meta-model to combine predictions from multiple base models.
Basic Stacking Implementation
python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Define base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('nb', GaussianNB())
]
# Define meta-model
meta_model = LogisticRegression()
# Create stacking ensemble
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5  # Cross-validation for base model predictions
)
# Train
stacking.fit(X_train, y_train)
# Predict
predictions = stacking.predict(X_test)
accuracy = stacking.score(X_test, y_test)
print(f"Stacking Accuracy: {accuracy:.4f}")
How stacking works:
- Train base models on training data
- Generate out-of-fold predictions using CV
- Train meta-model on base model predictions
- Final predictions combine all models through meta-model
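The out-of-fold mechanics can be reproduced by hand with cross_val_predict — a simplified sketch of what StackingClassifier does internally (two base models and probability features, for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base = [RandomForestClassifier(n_estimators=50, random_state=42), GaussianNB()]

# 1) Out-of-fold probabilities: each row is predicted by a model that never saw it
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for m in base
])

# 2) The meta-model trains on those out-of-fold predictions
meta = LogisticRegression().fit(oof, y_train)

# 3) At test time, base models refit on all training data feed the meta-model
test_features = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base
])
print(f"Manual stacking accuracy: {meta.score(test_features, y_test):.4f}")
```

Step 1 is the crucial one: because every meta-feature comes from a fold the base model didn't train on, the meta-model sees honest estimates of each base model's reliability.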
Advanced Stacking with passthrough
python
stacking_passthrough = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,
    passthrough=True  # Meta-model also sees the original features
)
stacking_passthrough.fit(X_train, y_train)
accuracy = stacking_passthrough.score(X_test, y_test)
print(f"Stacking (with passthrough) Accuracy: {accuracy:.4f}")
Including original features often improves performance — the meta-model can learn when to trust base models vs. original features.
Stacking Best Practices
Choose diverse base models:
- Different algorithm types (trees, linear, SVM)
- Different hyperparameters
- Different feature subsets
- Models that fail differently
Keep meta-model simple:
- Logistic Regression (classification)
- Ridge/Lasso Regression (regression)
- Avoid complex meta-models (overfitting risk)
Use cross-validation:
- Prevents information leakage
- Creates proper out-of-fold predictions
- Essential for valid stacking
Voting Classifiers: Simple Ensemble
Sometimes you don’t need stacking’s complexity — just combine predictions directly:
Hard Voting
python
from sklearn.ensemble import VotingClassifier
# Hard voting (majority vote)
voting_hard = VotingClassifier(
    estimators=base_models,
    voting='hard'  # Majority vote
)
voting_hard.fit(X_train, y_train)
accuracy = voting_hard.score(X_test, y_test)
print(f"Hard Voting Accuracy: {accuracy:.4f}")
Soft Voting (Better)
python
voting_soft = VotingClassifier(
    estimators=base_models,
    voting='soft'  # Average predicted probabilities
)
voting_soft.fit(X_train, y_train)
accuracy = voting_soft.score(X_test, y_test)
print(f"Soft Voting Accuracy: {accuracy:.4f}")
Soft voting usually outperforms hard voting because it considers prediction confidence, not just the final class.
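Concretely, soft voting averages each model's predict_proba output and takes the argmax. A quick sanity check that the manual average matches VotingClassifier's decisions (assumes probability-capable base models):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = [('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
          ('nb', GaussianNB()),
          ('lr', LogisticRegression(max_iter=1000))]

soft = VotingClassifier(estimators=models, voting='soft').fit(X_train, y_train)

# Average the individual predict_proba outputs by hand
probs = np.mean([m.fit(X_train, y_train).predict_proba(X_test)
                 for _, m in models], axis=0)
manual_pred = probs.argmax(axis=1)

print((manual_pred == soft.predict(X_test)).all())
```

Because the average keeps each model's confidence, a model that is 99% sure can outvote two models that are 51% sure the other way — exactly the information hard voting throws away.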
Comparing Ensemble Methods
Let’s compare all methods on the same dataset:
python
from sklearn.metrics import accuracy_score
import numpy as np
# Create models
models = {
    'Single Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=50, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Stacking': stacking,
    'Voting (Soft)': voting_soft
}
# Train and evaluate
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    results[name] = accuracy
    print(f"{name}: {accuracy:.4f}")
# Best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} ({results[best_model]:.4f})")
Typical results you’ll see:
- Single model: 82–85%
- Random Forest: 86–89%
- Gradient Boosting: 88–91%
- Stacking: 89–92%
Ever wonder why Kaggle winners almost always use ensembles? This is why. In my own competitions, scores improved by 5–10% once I started ensembling properly.
Feature Importance from Ensembles
Ensemble methods provide feature importance:
python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest, then read its impurity-based importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
# Plot
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices])
plt.xlabel("Feature Index")
plt.ylabel("Importance")
plt.show()
# Print top features
for i in range(10):
    print(f"Feature {indices[i]}: {importances[indices[i]]:.4f}")
This tells you which features drive predictions — invaluable for understanding your model and communicating with stakeholders.
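One caveat: impurity-based importances can be biased toward high-cardinality features. Permutation importance — shuffle one feature on held-out data and measure the score drop — is often a more trustworthy complement:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the score drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"Feature {i}: {result.importances_mean[i]:.4f} "
          f"± {result.importances_std[i]:.4f}")
```

Because it is computed on held-out data, permutation importance reflects what the model actually uses at prediction time, not just what the trees split on during training.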
Common Mistakes to Avoid
Learn from these ensemble failures:
Mistake 1: Ensembling Identical Models
python
models = [
    ('rf1', RandomForestClassifier(random_state=42)),
    ('rf2', RandomForestClassifier(random_state=42)),
    ('rf3', RandomForestClassifier(random_state=42))
]
Identical models make identical predictions. You gain nothing. Use diverse models.
Mistake 2: Not Using Cross-Validation in Stacking
python
# Bad - hand-rolled stacking: base models predict on data they trained on (leakage)
train_preds = np.column_stack([m.fit(X_train, y_train).predict(X_train)
                               for _, m in base_models])
meta_model.fit(train_preds, y_train)
# Good - CV generates out-of-fold predictions and prevents leakage
stacking = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
Without CV, your meta-model trains on predictions from models that saw the training data. That’s cheating.
Mistake 3: Over-Tuning Individual Models
Ensemble power comes from diversity. Don’t spend hours perfectly tuning each base model. Use reasonable defaults and let ensemble averaging handle the rest.
Mistake 4: Forgetting Computational Cost
python
huge_ensemble = StackingClassifier(
    estimators=[
        ('rf1', RandomForestClassifier(n_estimators=1000)),
        ('rf2', RandomForestClassifier(n_estimators=1000)),
        ('gb', GradientBoostingClassifier(n_estimators=1000))
    ],
    cv=10  # 10-fold CV means each base model is trained many times over
)
Ensembles multiply computational cost. Balance performance against training time.
The Bottom Line
Ensemble methods are why production ML systems work reliably and why Kaggle winners win. Single models are fine for learning, but ensembles are essential for serious ML work.
Use Random Forest when: You want good performance with minimal tuning on tabular data
Use Gradient Boosting when: You need maximum accuracy and have time to tune
Use Stacking when: You’re competing or need every last percentage point
Use Voting when: You want ensemble benefits without stacking’s complexity
Start with Random Forest for baseline. Add Gradient Boosting if you need better performance. Use stacking when you’re competing or accuracy is critical.
Installation is simple (you probably have it):
bash
pip install scikit-learn
Stop training single models. Start ensembling. Your accuracy scores — and your career — will thank you. The difference between 85% and 92% accuracy is often the difference between "interesting prototype" and "production system." Ensemble methods bridge that gap.