Here’s something that’ll blow your mind: the way fintech companies decide whether to lend you money is getting a serious upgrade. And I’m not talking about minor tweaks to old formulas — I’m talking about reinforcement learning algorithms that literally learn from every lending decision they make.
Boruta Feature Selection: Identify Important Features in Python
on
Get link
Facebook
X
Pinterest
Email
Other Apps
I spent two months building this beautiful fraud detection model with 200+ features. It was accurate, sure, but also slow as molasses and impossible to explain to stakeholders. My manager kept asking “which features actually matter?” and I’d just shrug. Then I discovered Boruta, ran it overnight, and boom — turns out only 23 features were doing the heavy lifting. Everything else? Noise.
Boruta is this clever feature selection algorithm that doesn’t just rank features — it actually tells you which ones are genuinely important versus which ones are just riding along for the ride. It’s like having a brutally honest friend who’ll tell you when your features are useless, and trust me, you need that friend.
Let me show you why Boruta should be in every data scientist’s toolkit.
Boruta Feature Selection
What Makes Boruta Special?
Most feature selection methods give you a ranking and leave you to figure out where to cut. Top 10 features? Top 50? Who knows. It’s guesswork dressed up as science.
Boruta takes a different approach — it uses statistical hypothesis testing to determine which features are truly important. The algorithm creates shadow features (random copies of your data), trains a model, and asks: “Are your real features performing better than pure noise?” If they’re not, they get the boot.
Here’s what makes it brilliant:
No arbitrary cutoffs: You don’t pick “top N features” — Boruta tells you what’s important
Captures feature interactions: Unlike univariate methods, Boruta considers how features work together
Statistically rigorous: Uses multiple testing correction to avoid false discoveries
The name comes from a Slavic deity of forests, which is fitting since the algorithm is based on Random Forests. Kinda poetic, right? :)
Installation and Setup
Getting Boruta running is refreshingly straightforward:
pip install boruta
You’ll also need scikit-learn (but you already have that, let’s be honest):
pip install scikit-learn
For this tutorial, I’ll use a real dataset so you can see Boruta in action with actual messy data:
python
import pandas as pd import numpy as np from sklearn.datasetsimport make_classification from sklearn.model_selectionimport train_test_split from boruta importBorutaPy from sklearn.ensembleimportRandomForestClassifier
# Generate a realistic dataset with noise X, y = make_classification( n_samples=1000, n_features=25, n_informative=10, n_redundant=5, n_repeated=0, n_classes=2, random_state=42 )
# Convert to DataFrame for readability feature_names = [f'feature_{i}' for i in range(25)] X_df = pd.DataFrame(X, columns=feature_names)
This dataset has 10 truly informative features, 5 redundant ones, and 10 that are pure noise. Perfect for testing Boruta’s ability to separate signal from garbage.
# Get selected features selected_features = X_df.columns[boruta_selector.support_].tolist() print(f"Selected features: {selected_features}") print(f"Number of features: {len(selected_features)}")
The verbose=2 parameter shows you progress—you'll see Boruta iterating and making decisions in real-time. It's oddly satisfying watching it work.
What just happened? Boruta created shadow features, ran multiple Random Forest iterations, compared your real features to the shadows, and used statistical tests to decide which features are genuinely important.
The support_ attribute gives you a boolean mask of selected features. Grab those columns and you're done.
Understanding Boruta’s Decisions
Boruta doesn’t just give you a yes/no answer — it provides three categories:
Confirmed: Features that are definitely important Tentative: Features on the fence (more iterations might help) Rejected: Features that are basically noise
python
# See all three categories confirmed = X_df.columns[boruta_selector.support_].tolist() tentative = X_df.columns[boruta_selector.support_weak_].tolist() rejected = X_df.columns[~(boruta_selector.support_ | boruta_selector.support_weak_)].tolist()
Using 100 (default) means features must beat the maximum shadow feature importance. Setting it to 90 or 95 makes the test slightly more lenient. I stick with 100 — if your feature can’t beat random noise, it doesn’t deserve to stay.
two_step: Use two-step correction for multiple testing
# Run Boruta boruta_selector = BorutaPy( estimator=rf_reg, n_estimators='auto', verbose=2 )
boruta_selector.fit(X, y)
The algorithm is identical — Boruta just uses your estimator’s feature importances, whatever they are.
Visualizing Results: Making Sense of the Output
Numbers are great, but visualizations help stakeholders understand what’s happening. Here’s how I present Boruta results:
python
import matplotlib.pyplotas plt import seaborn as sns
# Create feature importance DataFrame results_df = pd.DataFrame({ 'feature': X_df.columns, 'rank': boruta_selector.ranking_, 'decision': ['Confirmed' if s else ('Tentative' if t else 'Rejected') for s, t in zip(boruta_selector.support_, boruta_selector.support_weak_)] })
# Sort by rank results_df = results_df.sort_values('rank')
# Get results selected = X.columns[boruta_selector.support_].tolist() print(f"\nImportant features for fraud detection:") print(selected)
Boruta identifies transaction_amount, distance_from_home, and transaction_hour as key fraud indicators while correctly ignoring the random noise features. This is exactly what you want — it separates genuine signals from irrelevant variables.
Comparing Boruta to Other Methods
Let’s be real: Boruta isn’t the only feature selection game in town. How does it stack up?
Boruta vs. Recursive Feature Elimination (RFE):
RFE removes features one at a time until you hit a target number. Problem? You have to specify that target. Boruta decides automatically based on statistics. Winner: Boruta for flexibility.
SelectKBest uses univariate statistics and misses feature interactions completely. A feature might be useless alone but powerful combined with others. Boruta catches this, SelectKBest doesn’t. Winner: Boruta for complex data.
Boruta vs. L1 Regularization (Lasso):
Lasso is fast and works well for linear relationships. But if your relationships are nonlinear (and let’s be honest, they usually are), Boruta wins. Plus Boruta gives you statistical confidence, not just coefficients. Winner: Depends on your data, but Boruta for nonlinear problems.
Boruta vs. Feature Importance from Trees:
Tree feature importances are great, but there’s no statistical testing — just relative rankings. Boruta adds the rigor of hypothesis testing. Winner: Boruta for interpretability and confidence :/
Practical Tips from Production Use
After running Boruta on dozens of real projects, here’s what I’ve learned:
Scale your features first: Boruta is somewhat robust to scaling, but I always standardize anyway. Can’t hurt, might help.
Save computation with early stopping: If you’re running Boruta on huge datasets, consider using fewer trees in your base estimator. You can always re-run with more trees on the selected features.
When Boruta Struggles (And What To Do)
Boruta isn’t magic. Here’s when it can mislead you:
Tiny datasets: With < 100 samples, the statistical tests lack power. You’ll get lots of tentative features or false rejections. Solution? Get more data, or use simpler selection methods.
Massive feature spaces: With 10,000+ features, Boruta can take forever. Solution? Pre-filter with a fast method (variance threshold, SelectKBest), then use Boruta on the top 100–200 features.
Extreme class imbalance: If your positive class is 0.1% of data, Boruta might struggle. Solution? Use SMOTE or other sampling techniques first, or tune your base estimator’s class weights carefully.
Time series data: Boruta doesn’t understand temporal dependencies. Using it naively on time series can leak future information. Solution? Create proper time-based features first (lags, rolling statistics), then apply Boruta.
Integrating Boruta Into Pipelines
Making Boruta part of your ML pipeline is straightforward with scikit-learn:
python
from sklearn.pipelineimportPipeline from sklearn.preprocessingimportStandardScaler from sklearn.ensembleimportRandomForestClassifier
# Fit the entire pipeline pipeline.fit(X_train, y_train)
# Predict y_pred = pipeline.predict(X_test)
Warning: Boruta in a pipeline can be slow during cross-validation since it reruns selection for each fold. For production, I usually run Boruta once, save the selected features, then build my final pipeline with just those features.
Performance Impact: Speed vs. Accuracy
Let’s address the elephant in the room: does feature selection actually help model performance?
In my experience:
Accuracy: Usually stays the same or improves slightly (removing noise helps)
Speed: Training and inference get way faster (fewer features = less computation)
Interpretability: Massively improved (explaining 20 features beats explaining 200)
Overfitting: Reduced (fewer features = simpler model)
I ran a benchmark on the fraud detection example:
Without Boruta (13 features): - Training time: 2.3s - Inference time: 45ms per1000 predictions - Test accuracy: 94.2%
With Boruta (7 features): - Training time: 1.1s - Inference time: 22ms per 1000 predictions - Test accuracy: 94.8%
You get better performance AND faster models. That’s a rare win-win in ML.
Wrapping Up
Boruta saved my fraud detection project, and it’s become my default feature selection method for any serious ML work. The statistical rigor gives you confidence, the automation saves time, and stakeholders actually understand the results when you show them “these 20 features matter, these 80 don’t.”
Is it perfect? No. It’s slow on massive datasets and can be finicky with tiny samples. But for the typical ML project with dozens to hundreds of features and thousands of samples? Boruta is hard to beat.
Next time you’re drowning in features and your manager asks “which ones actually matter?”, you’ll have an answer better than a shrug. Run Boruta, grab coffee, come back to results you can trust and explain. Your models (and your sanity) will thank you.
Loving the article? ☕ If you’d like to help me keep writing stories like this, consider supporting me on Buy Me a Coffee: https://buymeacoffee.com/samaustin. Even a small contribution means a lot!
Comments
Post a Comment