Boosting: Sequential Model Improvement
Boosting trains models sequentially, with each new model focusing on examples the previous models got wrong.
AdaBoost: The Original Boosting Algorithm
python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# Create AdaBoost
adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learner
    n_estimators=50,     # Number of boosting rounds
    learning_rate=1.0,   # Contribution of each classifier
    random_state=42
)
# Train
adaboost.fit(X_train, y_train)
# Predict
predictions = adaboost.predict(X_test)
accuracy = adaboost.score(X_test, y_test)
print(f"AdaBoost Accuracy: {accuracy:.4f}")
How AdaBoost works:
- Train weak classifier
- Identify misclassified examples
- Increase weights on those examples
- Train new classifier on reweighted data
- Repeat
- Combine all classifiers with weighted voting
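The reweighting loop above can be sketched by hand. This is a simplified, illustrative AdaBoost for binary labels in {-1, +1} (not sklearn's implementation, which handles multi-class and other details):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def manual_adaboost(X, y, n_rounds=10):
    """Simplified AdaBoost for labels in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1 / n)              # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))  # weighted error (w sums to 1)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # classifier weight
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        w /= w.sum()                           # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def manual_adaboost_predict(stumps, alphas, X):
    # Weighted vote: sign of the alpha-weighted sum of stump predictions
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```

Each round trains a stump on the current weights, then raises the weight of every example that stump got wrong, so the next stump is forced to focus on the hard cases.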
AdaBoost is elegant but has largely been superseded by Gradient Boosting.
Gradient Boosting: The Modern Standard
python
from sklearn.ensemble import GradientBoostingClassifier
# Create Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=100,     # Number of boosting stages
    learning_rate=0.1,    # Shrinks contribution of each tree
    max_depth=3,          # Maximum tree depth
    min_samples_split=5,
    min_samples_leaf=3,
    subsample=0.8,        # Fraction of samples for each tree
    random_state=42
)
# Train
gb.fit(X_train, y_train)
# Predict
predictions = gb.predict(X_test)
accuracy = gb.score(X_test, y_test)
print(f"Gradient Boosting Accuracy: {accuracy:.4f}")
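Under the hood, gradient boosting with squared loss fits each new tree to the residuals of the current ensemble. A minimal regression sketch (illustrative only; sklearn generalizes this to arbitrary differentiable losses):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_gradient_boost(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Squared-loss gradient boosting: each tree fits the current residuals."""
    baseline = y.mean()                  # initial constant prediction
    pred = np.full(len(y), baseline)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred             # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)           # new tree models what's still wrong
        pred += learning_rate * tree.predict(X)  # shrunken update
        trees.append(tree)
    return baseline, trees

def manual_gb_predict(baseline, trees, X, learning_rate=0.1):
    pred = np.full(len(X), baseline)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```

The learning_rate shrinks each tree's correction, which is why a lower rate needs more trees to reach the same training fit.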
Key hyperparameters:
learning_rate (shrinkage):
- Lower = more conservative = better generalization
- 0.01–0.3 typical range
- Lower learning rate requires more estimators
- Start with 0.1
n_estimators (boosting rounds):
- More = better performance (until overfitting)
- 100–1000 typical range
- Monitor validation performance
- Use early stopping when possible
max_depth (tree complexity):
- 3–5 typical for boosting (shallow trees work better)
- Deeper trees = more overfitting risk
- Start with 3
subsample (stochastic gradient boosting):
- Fraction of samples per tree (< 1.0 adds randomness)
- 0.5–1.0 typical range
- Helps prevent overfitting
- 0.8 is good default
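Early stopping is built into GradientBoostingClassifier via n_iter_no_change and validation_fraction — set n_estimators high and let the validation score decide when to stop. A sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gb_es = GradientBoostingClassifier(
    n_estimators=1000,         # generous upper bound
    learning_rate=0.1,
    validation_fraction=0.1,   # held-out fraction used for monitoring
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    tol=1e-4,
    random_state=42
)
gb_es.fit(X_train, y_train)

# n_estimators_ reports how many stages were actually used
print(f"Stopped after {gb_es.n_estimators_} of 1000 stages")
print(f"Test accuracy: {gb_es.score(X_test, y_test):.4f}")
```

This avoids guessing n_estimators up front: you pay only for the rounds that actually improve validation performance.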
Gradient Boosting Regression Example
python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Create regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create model
gb_reg = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
# Train
gb_reg.fit(X_train, y_train)
# Predict
predictions = gb_reg.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MSE: {mse:.4f}")
print(f"R²: {r2:.4f}")
Histogram-Based Gradient Boosting (Faster)
For large datasets, use the histogram-based variant:
python
from sklearn.ensemble import HistGradientBoostingClassifier
# Much faster on large datasets
hist_gb = HistGradientBoostingClassifier(
    max_iter=100,        # Like n_estimators
    learning_rate=0.1,
    max_depth=10,
    random_state=42
)
hist_gb.fit(X_train, y_train)
accuracy = hist_gb.score(X_test, y_test)
print(f"Histogram GB Accuracy: {accuracy:.4f}")
This is dramatically faster than regular GradientBoosting on datasets with 10K+ samples. Use it when training time matters.
Stacking: Meta-Learning from Multiple Models
Stacking trains a meta-model to combine predictions from multiple base models.
Basic Stacking Implementation
python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Define base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('nb', GaussianNB())
]
# Define meta-model
meta_model = LogisticRegression()
# Create stacking ensemble
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5  # Cross-validation for base model predictions
)
# Train
stacking.fit(X_train, y_train)
# Predict
predictions = stacking.predict(X_test)
accuracy = stacking.score(X_test, y_test)
print(f"Stacking Accuracy: {accuracy:.4f}")
How stacking works:
- Train base models on training data
- Generate out-of-fold predictions using CV
- Train meta-model on base model predictions
- Final predictions combine all models through meta-model
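The out-of-fold mechanics can be reproduced by hand with cross_val_predict — a simplified sketch of what StackingClassifier does internally (two base models and probability features, for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base = [RandomForestClassifier(n_estimators=50, random_state=42), GaussianNB()]

# 1) Out-of-fold probabilities: each row is predicted by a model that never saw it
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for m in base
])

# 2) The meta-model trains on those out-of-fold predictions
meta = LogisticRegression().fit(oof, y_train)

# 3) At test time, base models refit on all training data feed the meta-model
test_features = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base
])
print(f"Manual stacking accuracy: {meta.score(test_features, y_test):.4f}")
```

Step 1 is the crucial one: because every meta-feature comes from a fold the base model didn't train on, the meta-model sees honest estimates of each base model's reliability.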
Advanced Stacking with passthrough
python
stacking_passthrough = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,
    passthrough=True  # Meta-model also sees the original features
)
stacking_passthrough.fit(X_train, y_train)
accuracy = stacking_passthrough.score(X_test, y_test)
print(f"Stacking (with passthrough) Accuracy: {accuracy:.4f}")
Including original features often improves performance — the meta-model can learn when to trust base models vs. original features.
Stacking Best Practices
Choose diverse base models:
- Different algorithm types (trees, linear, SVM)
- Different hyperparameters
- Different feature subsets
- Models that fail differently
Keep meta-model simple:
- Logistic Regression (classification)
- Ridge/Lasso Regression (regression)
- Avoid complex meta-models (overfitting risk)
Use cross-validation:
- Prevents information leakage
- Creates proper out-of-fold predictions
- Essential for valid stacking
Voting Classifiers: Simple Ensemble
Sometimes you don’t need stacking’s complexity — just combine predictions directly:
Hard Voting
python
from sklearn.ensemble import VotingClassifier
# Hard voting (majority vote)
voting_hard = VotingClassifier(
    estimators=base_models,
    voting='hard'  # Majority vote
)
voting_hard.fit(X_train, y_train)
accuracy = voting_hard.score(X_test, y_test)
print(f"Hard Voting Accuracy: {accuracy:.4f}")
Soft Voting (Better)
python
voting_soft = VotingClassifier(
    estimators=base_models,
    voting='soft'  # Average predicted probabilities
)
voting_soft.fit(X_train, y_train)
accuracy = voting_soft.score(X_test, y_test)
print(f"Soft Voting Accuracy: {accuracy:.4f}")
Soft voting usually outperforms hard voting because it considers prediction confidence, not just the final class.
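Concretely, soft voting averages each model's predict_proba output and takes the argmax. A quick sanity check that the manual average matches VotingClassifier's decisions (assumes probability-capable base models):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = [('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
          ('nb', GaussianNB()),
          ('lr', LogisticRegression(max_iter=1000))]

soft = VotingClassifier(estimators=models, voting='soft').fit(X_train, y_train)

# Average the individual predict_proba outputs by hand
probs = np.mean([m.fit(X_train, y_train).predict_proba(X_test)
                 for _, m in models], axis=0)
manual_pred = probs.argmax(axis=1)

print((manual_pred == soft.predict(X_test)).all())
```

Because the average keeps each model's confidence, a model that is 99% sure can outvote two models that are 51% sure the other way — exactly the information hard voting throws away.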
Comparing Ensemble Methods
Let’s compare all methods on the same dataset:
python
from sklearn.metrics import accuracy_score
import numpy as np
# Create models
models = {
    'Single Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=50, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Stacking': stacking,
    'Voting (Soft)': voting_soft
}
# Train and evaluate
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    results[name] = accuracy
    print(f"{name}: {accuracy:.4f}")
# Best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} ({results[best_model]:.4f})")
Typical results you’ll see:
- Single model: 82–85%
- Random Forest: 86–89%
- Gradient Boosting: 88–91%
- Stacking: 89–92%
Ever wonder why Kaggle winners almost always use ensembles? This is why. In my own competitions, scores improved by 5–10% once I started ensembling properly.
Feature Importance from Ensembles
Ensemble methods provide feature importance:
python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest, then read its impurity-based importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
# Plot
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices])
plt.xlabel("Feature Index")
plt.ylabel("Importance")
plt.show()
# Print top features
for i in range(10):
    print(f"Feature {indices[i]}: {importances[indices[i]]:.4f}")
This tells you which features drive predictions — invaluable for understanding your model and communicating with stakeholders.
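One caveat: impurity-based importances can be biased toward high-cardinality features. Permutation importance — shuffle one feature on held-out data and measure the score drop — is often a more trustworthy complement:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the score drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"Feature {i}: {result.importances_mean[i]:.4f} "
          f"± {result.importances_std[i]:.4f}")
```

Because it is computed on held-out data, permutation importance reflects what the model actually uses at prediction time, not just what the trees split on during training.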
Common Mistakes to Avoid
Learn from these ensemble failures:
Mistake 1: Ensembling Identical Models
python
models = [
    ('rf1', RandomForestClassifier(random_state=42)),
    ('rf2', RandomForestClassifier(random_state=42)),
    ('rf3', RandomForestClassifier(random_state=42))
]
Identical models make identical predictions. You gain nothing. Use diverse models.
Mistake 2: Not Using Cross-Validation in Stacking
python
# Bad - hand-rolled stacking: base models predict on data they trained on (leakage)
train_preds = np.column_stack([m.fit(X_train, y_train).predict(X_train)
                               for _, m in base_models])
meta_model.fit(train_preds, y_train)
# Good - CV generates out-of-fold predictions and prevents leakage
stacking = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
Without CV, your meta-model trains on predictions from models that saw the training data. That’s cheating.
Mistake 3: Over-Tuning Individual Models
Ensemble power comes from diversity. Don’t spend hours perfectly tuning each base model. Use reasonable defaults and let ensemble averaging handle the rest.
Mistake 4: Forgetting Computational Cost
python
huge_ensemble = StackingClassifier(
    estimators=[
        ('rf1', RandomForestClassifier(n_estimators=1000)),
        ('rf2', RandomForestClassifier(n_estimators=1000)),
        ('gb', GradientBoostingClassifier(n_estimators=1000))
    ],
    cv=10  # 10-fold CV means each base model is trained many times over
)
Ensembles multiply computational cost. Balance performance against training time.
The Bottom Line
Ensemble methods are why production ML systems work reliably and why Kaggle winners win. Single models are fine for learning, but ensembles are essential for serious ML work.
Use Random Forest when: You want good performance with minimal tuning on tabular data
Use Gradient Boosting when: You need maximum accuracy and have time to tune
Use Stacking when: You’re competing or need every last percentage point
Use Voting when: You want ensemble benefits without stacking’s complexity
Start with Random Forest for baseline. Add Gradient Boosting if you need better performance. Use stacking when you’re competing or accuracy is critical.
Installation is simple (you probably have it):
bash
pip install scikit-learn
Stop training single models. Start ensembling. Your accuracy scores — and your career — will thank you. The difference between 85% and 92% accuracy is often the difference between "interesting prototype" and "production system." Ensemble methods bridge that gap.