
Imbalanced-learn (imblearn): Handle Imbalanced Datasets Like a Pro

So you’ve built a classifier with 95% accuracy, and you’re feeling pretty good about yourself. Then someone points out that your dataset is 95% negative cases, and your model literally just predicts “negative” for everything. Congratulations — you’ve built a very expensive way to always say “no.”

I’ve been there. Built a fraud detection model that “worked great” until I realized it caught exactly zero fraudulent transactions. Turns out 99.5% accuracy means nothing when you’re trying to find that 0.5% of actual fraud. That’s when I discovered imbalanced-learn, and honestly, it changed how I approach classification problems.

Let me show you how to actually handle imbalanced datasets instead of just pretending accuracy matters.

Why Imbalanced Data Breaks Everything

Here’s the uncomfortable truth: most real-world classification problems are imbalanced. Fraud detection, disease diagnosis, equipment failure prediction, spam filtering — the interesting class is always rare.

Your model learns to take shortcuts. Why bother learning complex patterns when you can get 99% accuracy by always predicting the majority class? It’s like studying for an exam by just writing “B” for every multiple choice question. Sometimes it works, but you haven’t actually learned anything.

The damage imbalanced data causes:

  • Models ignore minority classes completely
  • High accuracy masks terrible recall
  • Predictions are useless for the class you actually care about
  • Standard algorithms optimize for the wrong thing
  • You deploy confident garbage to production

I spent a month optimizing a customer churn model before realizing it predicted “no churn” for everyone. Perfect accuracy on 92% of cases, zero value for the business.
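You can reproduce that trap in a few lines with scikit-learn's DummyClassifier on made-up data (the 99:1 split here is hypothetical, just to make the point):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 99:1 imbalanced dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:10] = 1  # only 1% positives

# A "classifier" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.99 - looks great
print(f"Recall:   {recall_score(y, y_pred):.2f}")    # 0.00 - catches nothing
```

Ninety-nine percent accuracy, zero recall on the class you care about. That's the whole problem in two print statements.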

Understanding Imbalanced-Learn (imblearn)

Imbalanced-learn is a Python library built specifically for this problem. It integrates seamlessly with scikit-learn and provides tools for resampling, algorithm modifications, and ensemble methods designed for imbalanced data.

The core philosophy? Don’t let your majority class bully the minority class into invisibility.

Installing and Basic Setup

bash

pip install imbalanced-learn

That’s it. Now you’ve got access to over-sampling, under-sampling, and hybrid techniques that actually work.

python

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from collections import Counter
# Check your class distribution first
print(f"Original distribution: {Counter(y_train)}")

Always check your distribution first. You need to know what you’re dealing with. If you’ve got 10,000 samples and only 50 positives, that’s a 200:1 imbalance. Standard algorithms will fail miserably.

SMOTE: The Synthetic Oversampling Game-Changer

SMOTE (Synthetic Minority Over-sampling Technique) is probably the most popular resampling method, and for good reason. Instead of just duplicating minority samples, it creates synthetic examples.

How SMOTE Actually Works

SMOTE picks a minority sample, finds its k nearest neighbors (also minority class), and creates new samples along the lines connecting them. It’s like interpolating between real examples to create plausible new ones.

python

from imblearn.over_sampling import SMOTE
# Original imbalanced data
print(f"Before SMOTE: {Counter(y_train)}")
# Output: Counter({0: 9500, 1: 500})
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {Counter(y_resampled)}")
# Output: Counter({0: 9500, 1: 9500})

Boom. You’ve just created 9,000 synthetic minority samples. Your classes are now balanced, and your model can actually learn patterns from both.
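Under the hood, the interpolation is dead simple. Here's an illustrative numpy sketch of how one synthetic sample gets created (not imblearn's actual implementation, just the core idea):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two real minority samples (hypothetical feature vectors)
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 4.0])

# SMOTE draws a random gap in [0, 1] and interpolates along the line
gap = rng.uniform(0, 1)
x_synthetic = x_i + gap * (x_neighbor - x_i)
print(x_synthetic)  # lands somewhere on the segment between x_i and x_neighbor
```

Every synthetic sample sits on a line segment between two real minority samples, which is why SMOTE produces plausible points in dense regions and weird ones near outliers.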

SMOTE Variations for Different Situations

Regular SMOTE works great, but imblearn offers variants for specific problems:

SMOTE variants I actually use:

  • SMOTE — Standard synthetic oversampling
  • ADASYN — Adaptive synthetic sampling (focuses on harder examples)
  • BorderlineSMOTE — Only oversamples near decision boundaries
  • SVMSMOTE — Uses SVM to identify boundary cases

python

from imblearn.over_sampling import ADASYN, BorderlineSMOTE
# ADASYN adapts to data density
adasyn = ADASYN(random_state=42)
X_ada, y_ada = adasyn.fit_resample(X_train, y_train)
# BorderlineSMOTE focuses on decision boundary
border_smote = BorderlineSMOTE(random_state=42)
X_border, y_border = border_smote.fit_resample(X_train, y_train)

I typically start with regular SMOTE. If results aren’t great, I try BorderlineSMOTE since it focuses on the samples that matter most — the ones near the decision boundary.

When SMOTE Can Backfire

SMOTE isn’t magic. I learned this when applying it to high-dimensional data with lots of noise. Ever wondered why your carefully balanced dataset still produces mediocre results?

SMOTE problems:

  • Creates noise in high-dimensional spaces
  • Can generate unrealistic synthetic samples
  • Doesn’t work well with overlapping classes
  • Amplifies outliers if you’re not careful
  • Increases training time significantly

For a credit card fraud project, SMOTE actually made things worse. The synthetic samples were too similar to legitimate transactions, and the model got confused. I switched to under-sampling and saw immediate improvement.

Under-Sampling: The Aggressive Approach

Under-sampling is the opposite strategy — remove majority class samples until classes balance. Sounds wasteful, right? You’re throwing away data. But sometimes it’s exactly what you need.

Random Under-Sampling

The simplest approach: randomly delete majority samples until you hit your target ratio.

python

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)
print(f"After under-sampling: {Counter(y_under)}")
# Now you have equal classes, but less total data

I use this when I have massive datasets where throwing away 90% of the majority class still leaves plenty of samples. Better a balanced dataset of 10,000 samples than an imbalanced one of 100,000.

Smart Under-Sampling Techniques

Random deletion is crude. Imblearn offers smarter approaches that keep the most informative majority samples.

python

from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours
# Tomek Links removes majority samples that are too close to minority
tomek = TomekLinks()
X_tomek, y_tomek = tomek.fit_resample(X_train, y_train)
# ENN removes majority samples whose neighbors are mostly minority
enn = EditedNearestNeighbours()
X_enn, y_enn = enn.fit_resample(X_train, y_train)

TomekLinks identifies pairs of samples from opposite classes that are each other’s nearest neighbors, then removes the majority class sample. This cleans up the decision boundary.

EditedNearestNeighbours looks at each sample’s k nearest neighbors. If most neighbors are from the opposite class, it’s probably noise — remove it.

These techniques don't balance classes completely, but they clean up noise and make the boundary clearer. I often use them before SMOTE for cleaner synthetic sampling.

NearMiss: The Selective Curator

NearMiss keeps only those majority samples that are close to minority samples. Different versions use different distance criteria.

python

from imblearn.under_sampling import NearMiss
# NearMiss-1: Select majority samples with smallest average distance to 3 closest minority samples
nm1 = NearMiss(version=1)
X_nm, y_nm = nm1.fit_resample(X_train, y_train)

This is great when you want to focus your model on the difficult cases near the decision boundary. The trade-off? You lose information about the majority class distribution.

Combination Methods: Best of Both Worlds

Why choose between over-sampling and under-sampling when you can do both? Combination methods give you balanced classes without extreme approaches.

SMOTEENN: Clean Then Synthesize

SMOTEENN applies SMOTE first, then uses EditedNearestNeighbours to clean up noisy synthetic samples.

python

from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=42)
X_combined, y_combined = smote_enn.fit_resample(X_train, y_train)

This is my go-to for medium-sized datasets. You get the benefits of synthetic sampling without amplifying noise. The ENN step removes problematic synthetic samples that landed in weird places.

SMOTETomek: Synthesize Then Clean Boundaries

SMOTETomek does SMOTE first, then removes Tomek links to clean the decision boundary.

python

from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X_train, y_train)

Slightly different from SMOTEENN — it focuses on cleaning the boundary rather than removing all noisy samples. I prefer this when I want a cleaner separation between classes.

Ensemble Methods for Imbalanced Data

Sometimes resampling isn’t enough. Ensemble methods in imblearn train multiple classifiers on different balanced subsets, then combine their predictions.

Balanced Random Forest

BalancedRandomForestClassifier automatically balances each tree’s bootstrap sample. No manual resampling needed.

python

from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(
    n_estimators=100,
    random_state=42,
    sampling_strategy='auto'  # automatically balance
)
brf.fit(X_train, y_train)
predictions = brf.predict(X_test)

This thing is beautiful. Each tree in the forest gets a balanced bootstrap sample, so every tree learns from both classes equally. The ensemble smooths out individual tree biases.

I’ve used this on fraud detection where resampling was too slow. Just plug in your imbalanced data and go. The results? Often better than manually resampling + standard random forest.

Balanced Bagging Classifier

BalancedBaggingClassifier works with any base classifier. It creates balanced bootstrap samples and trains multiple base classifiers.

python

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bbc = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=10,
    random_state=42
)
bbc.fit(X_train, y_train)

More flexible than BalancedRandomForest since you control the base classifier. Want to ensemble logistic regression? Go for it. SVM? Why not.

EasyEnsemble: The Under-Sampling Ensemble

EasyEnsembleClassifier creates multiple balanced subsets by under-sampling the majority class, trains a classifier on each, then combines them.

python

from imblearn.ensemble import EasyEnsembleClassifier
eec = EasyEnsembleClassifier(
    n_estimators=10,
    random_state=42
)
eec.fit(X_train, y_train)

This is brilliant for huge datasets with extreme imbalance. Instead of one model seeing all the majority class data (causing bias), you train multiple models on different majority class subsets. Each model gets a balanced view.

Used this for a project with 1 million samples and 99:1 imbalance. Training took a fraction of the time compared to SMOTE, and results were better.

Pipeline Integration: Doing It Right

Here’s where most people mess up: they resample their data before splitting train/test. This causes data leakage. Your test samples influenced the resampling of your training set.

The right way? Use imblearn’s Pipeline.

The Correct Pipeline Approach

python

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Build pipeline with imblearn's Pipeline (not sklearn's!)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Resampling happens inside CV, no leakage
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"F1 Score: {scores.mean():.3f} (+/- {scores.std():.3f})")

Now SMOTE only sees the training folds during cross-validation. Each fold gets independently resampled. No leakage, honest results.

Pipeline with GridSearchCV

You can even tune resampling parameters alongside model hyperparameters:

python

from sklearn.model_selection import GridSearchCV
param_grid = {
    'smote__k_neighbors': [3, 5, 7],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, 30]
}
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")

This finds the optimal number of SMOTE neighbors alongside the best random forest parameters. Everything tuned together, everything properly validated.

Metrics That Actually Matter

Stop using accuracy. Seriously, just stop. With imbalanced data, accuracy is worse than useless — it’s actively misleading.

The Metrics You Should Track

python

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
# After training your model
y_pred = pipeline.predict(X_test)
# Get comprehensive metrics
print(classification_report(y_test, y_pred))
# Confusion matrix shows what's actually happening
print(confusion_matrix(y_test, y_pred))
# ROC-AUC for probability-based assessment
y_proba = pipeline.predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

The metrics that matter:

  • Precision: Of your positive predictions, how many were correct?
  • Recall: Of actual positives, how many did you catch?
  • F1-Score: Harmonic mean of precision and recall
  • ROC-AUC: Overall discrimination ability
  • Precision-Recall AUC: Better than ROC-AUC for extreme imbalance

IMO, F1-score is your friend for imbalanced problems. It forces you to care about both false positives and false negatives.
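That last bullet is worth a demo. sklearn exposes precision-recall AUC as average_precision_score; on a heavily imbalanced test set it tells a much harsher story than ROC-AUC. The scores below are made-up, just to illustrate the gap:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical scores on a roughly 100:1 imbalanced test set
rng = np.random.default_rng(42)
y_test = np.zeros(1010, dtype=int)
y_test[:10] = 1
y_proba = rng.uniform(0.0, 0.4, size=1010)     # negatives score low
y_proba[:10] = rng.uniform(0.2, 0.9, size=10)  # positives score higher, with overlap

print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
print(f"PR-AUC:  {average_precision_score(y_test, y_proba):.3f}")
# With 1% prevalence, PR-AUC is usually far lower than ROC-AUC -
# it gives no credit for correctly ranking the huge pool of easy negatives
```

If your ROC-AUC looks heroic but your PR-AUC is in the basement, trust the PR-AUC.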

The Confusion Matrix Reality Check

Always look at your confusion matrix. It tells you what’s really happening:

python

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
ConfusionMatrixDisplay.from_estimator(
    pipeline,
    X_test,
    y_test,
    cmap='Blues'
)
plt.show()

That 95% accuracy might hide that you’re missing 80% of fraud cases. The confusion matrix shows you the ugly truth.
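If you'd rather read the numbers than a plot, ravel() unpacks a binary confusion matrix into its four cells. A tiny hand-made example (labels are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 7 negatives, 3 positives (say, fraud)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 0])

# For binary problems, ravel() unpacks the four cells directly
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=6 FP=1 FN=2 TP=1

# The number that matters: how much of the rare class did we catch?
print(f"Minority recall: {tp / (tp + fn):.2f}")  # 0.33
```

Eighty percent accuracy here, but we caught one fraud case out of three. The four cells don't lie.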

Real-World Strategy: What Actually Works

After years of fighting imbalanced data, here’s my battle-tested approach:

Step 1: Understand Your Data

python

from collections import Counter
import numpy as np
# Check class distribution
class_counts = Counter(y_train)
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"Class distribution: {class_counts}")
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
# Are classes separable?
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)
# Plot and visually inspect separation

Know your enemy. How imbalanced are we talking? 2:1 is barely imbalanced. 100:1 requires serious intervention.

Step 2: Start Simple

python

from imblearn.ensemble import BalancedRandomForestClassifier
# Try the easiest solution first
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
from sklearn.metrics import f1_score
baseline_f1 = f1_score(y_test, brf.predict(X_test))
print(f"Baseline F1: {baseline_f1:.3f}")

Don’t overcomplicate. Start with BalancedRandomForest. It works surprisingly often, and you save yourself hours of tuning.

Step 3: Experiment Systematically

python

from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
techniques = {
    'SMOTE': SMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'SMOTEENN': SMOTEENN(random_state=42),
    'SMOTETomek': SMOTETomek(random_state=42)
}
results = {}
for name, sampler in techniques.items():
    pipeline = Pipeline([
        ('sampler', sampler),
        ('classifier', RandomForestClassifier(random_state=42))
    ])
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
best_technique = max(results, key=results.get)
print(f"\nBest technique: {best_technique}")

Test multiple approaches. Your dataset might respond better to one technique over another. There’s no universal “best” method.

Step 4: Fine-Tune the Winner

Once you’ve identified the best approach, tune it properly:

python

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
best_pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
param_dist = {
    'smote__k_neighbors': randint(3, 10),
    'classifier__n_estimators': randint(50, 300),
    'classifier__max_depth': randint(10, 50),
    'classifier__min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
    best_pipeline,
    param_dist,
    n_iter=50,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
final_model = random_search.best_estimator_

Now you’ve got a properly tuned model that handles imbalance intelligently.

Common Mistakes That Cost Me Weeks

Learn from my pain.

Mistake 1: Resampling Before Train/Test Split

python

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# WRONG - data leakage!
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled)

# RIGHT - split first, resample training only
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)

Resampling before splitting means your test set influenced training. Your results are lies.

Mistake 2: Over-Sampling the Test Set

Never, ever resample your test set:

python

# WRONG - test set should reflect real distribution!
X_test_res, y_test_res = SMOTE().fit_resample(X_test, y_test)

# RIGHT - only resample training
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)
# Test set stays original

Your test set should reflect production reality. Production data won’t be balanced, so neither should your test set.

Mistake 3: Using SMOTE with High-Dimensional Data Blindly

In high dimensions, SMOTE can create unrealistic synthetic samples:

python

# For high-dimensional data, try dimensionality reduction first
from sklearn.decomposition import PCA
pipeline = Pipeline([
    ('pca', PCA(n_components=50)),  # reduce dimensions first
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

Or just use ensemble methods that don’t need resampling.

When to Use What: The Decision Tree

Here’s my mental model for choosing techniques:

Use BalancedRandomForest when:

  • You want something that “just works”
  • Dataset size is moderate (1K-100K samples)
  • You don’t want to manually tune resampling
  • Random forests are appropriate for your problem

Use SMOTE when:

  • Imbalance is moderate (< 20:1)
  • Low to medium dimensionality
  • You have enough minority samples (50+)
  • Classes are reasonably separable

Use under-sampling when:

  • Massive datasets where you can afford to lose data
  • Extreme imbalance (100:1 or worse)
  • SMOTE creates too much noise
  • Training time is a major constraint

Use combination methods (SMOTEENN, SMOTETomek) when:

  • Your data has overlap and noise
  • Standard SMOTE doesn’t work well
  • You need cleaner decision boundaries

Use ensemble methods (EasyEnsemble) when:

  • Extreme imbalance with huge datasets
  • Multiple balanced views are better than one imbalanced view
  • Computational resources allow parallel training

The truth? You’ll probably try 2–3 approaches before finding what works for your specific dataset. That’s normal.

The Real Talk on Imbalanced Data

Here’s what the tutorials don’t tell you: handling imbalanced data is messy. There’s no magic bullet that works everywhere. What crushes it on fraud detection might fail spectacularly on medical diagnosis.

The key is systematic experimentation. Try multiple techniques. Measure with appropriate metrics (not accuracy!). Validate properly without data leakage. Pick the approach that works for your data, not the one that sounds fanciest.

And FYI — sometimes the problem isn’t really imbalance. Sometimes you just need better features or a more appropriate model. I’ve seen “imbalanced data problems” disappear completely after adding domain-specific features that actually captured the signal.

Start with imblearn’s tools. They’ll handle 90% of imbalanced scenarios. For the remaining 10%, you’ll need domain expertise and creativity. But at least you’ll have a solid foundation to build on.

Now go balance those classes and build something that actually catches the rare cases you care about :)
