Arbitrary Value Imputation
Sometimes you want to impute with a specific value that indicates “missing”:
python
from feature_engine.imputation import ArbitraryNumberImputer

arbitrary_imputer = ArbitraryNumberImputer(
    arbitrary_number=-999,
    variables=['days_since_last_purchase']
)
arbitrary_imputer.fit(X_train)
X_train = arbitrary_imputer.transform(X_train)
This is super useful when missingness itself is informative. If “days since last purchase” is missing, it might mean the customer never purchased anything — that’s valuable information!
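If you want to capture that signal explicitly, the same idea can be sketched in plain pandas — flag the missingness before filling with the sentinel (the column name and sentinel value here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"days_since_last_purchase": [3.0, np.nan, 12.0, np.nan]})

# Flag the missingness first, so the signal survives imputation
df["never_purchased"] = df["days_since_last_purchase"].isna().astype(int)

# Then fill with a sentinel well outside the valid range
df["days_since_last_purchase"] = df["days_since_last_purchase"].fillna(-999)
```

Feature-engine's `AddMissingIndicator` automates the flag step if you want to keep everything inside a pipeline.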
Categorical Missing Values
python
from feature_engine.imputation import CategoricalImputer

# Impute with most frequent category
cat_imputer = CategoricalImputer(
    imputation_method='frequent',
    variables=['city', 'product_category']
)

# Or use a custom string
cat_imputer = CategoricalImputer(
    imputation_method='missing',
    fill_value='Unknown',
    variables=['city', 'product_category']
)
FYI, I almost always use ‘missing’ or ‘Unknown’ for categorical imputation rather than the mode. Why? Because the most frequent category can be unstable across samples and over time, and imputing with it quietly merges genuinely missing values into a real category — a dedicated ‘Unknown’ label keeps that information visible.
Encoding Categorical Variables
This is where Feature-engine really shines. The encoding options go way beyond sklearn’s basic encoders.
One-Hot Encoding Done Right
python
from feature_engine.encoding import OneHotEncoder

# Standard one-hot encoding
ohe = OneHotEncoder(
    variables=['city', 'product_type'],
    drop_last=True  # Avoid multicollinearity
)
ohe.fit(X_train)
X_train_encoded = ohe.transform(X_train)
But here’s where it gets interesting. What happens when your test set has categories that weren’t in training? Feature-engine handles this gracefully: the encoder creates dummy variables only for the categories it saw during fit, so unseen categories simply get zeros across all dummy columns. A related convenience is `drop_last_binary`, which keeps a single dummy for binary variables:
python
ohe = OneHotEncoder(
    variables=['city'],
    drop_last=True,
    drop_last_binary=True  # binary variables get one dummy, not two
)
No more “ValueError: Found unknown categories” crashes in production.
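To see why this matters, here is the failure mode in plain pandas and the manual re-alignment that Feature-engine spares you from (toy data):

```python
import pandas as pd

train = pd.DataFrame({"city": ["NYC", "LA", "NYC"]})
test = pd.DataFrame({"city": ["NYC", "Tokyo"]})  # Tokyo unseen at fit time

train_dummies = pd.get_dummies(train["city"])

# Re-align test dummies to the training columns: unseen categories
# are dropped, and categories absent from test are filled with 0
test_dummies = pd.get_dummies(test["city"]).reindex(
    columns=train_dummies.columns, fill_value=0
)
```

The unseen "Tokyo" row simply gets zeros in every training-time column — exactly the behaviour Feature-engine gives you automatically.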
Ordinal Encoding
python
from feature_engine.encoding import OrdinalEncoder

# 'ordered' ranks categories by their mean target value;
# 'arbitrary' numbers them in order of appearance
ordinal_enc = OrdinalEncoder(
    encoding_method='ordered',
    variables=['education_level']
)

# 'ordered' needs the target to learn the ranking
ordinal_enc.fit(X_train, y_train)
X_train_encoded = ordinal_enc.transform(X_train)
Note that Feature-engine learns the mapping from the data rather than taking a hand-written dictionary — if you need an exact custom order (say, High School < Bachelor < Master < PhD), map the values with pandas before the pipeline. Either way, a fitted encoder plays nice with pipelines and prevents you from accidentally forgetting to apply the same mapping to your test set.
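When you do need a hand-specified order rather than one learned from the data, a plain pandas `map` before the pipeline does the job — the categories and ranks below are illustrative:

```python
import pandas as pd

# Hand-specified ordinal relationship
education_order = {"High School": 1, "Bachelor": 2, "Master": 3, "PhD": 4}

df = pd.DataFrame({"education_level": ["PhD", "Bachelor", "High School"]})
df["education_level"] = df["education_level"].map(education_order)
```

Just remember to apply the identical dictionary to the test set — which is exactly the bookkeeping a fitted transformer does for you.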
Target-Based Encoding
This one’s powerful but dangerous if you’re not careful:
python
from feature_engine.encoding import MeanEncoder

# Encode based on target mean
mean_encoder = MeanEncoder(
    variables=['customer_id', 'product_id']
)

# MUST fit on training data only!
mean_encoder.fit(X_train, y_train)
X_train_encoded = mean_encoder.transform(X_train)
X_test_encoded = mean_encoder.transform(X_test)
Mean encoding replaces categories with the average target value for that category. It’s incredibly effective for high-cardinality features but can cause massive overfitting if not used carefully. Always use cross-validation to validate this approach.
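Under the hood, mean encoding is just a groupby on the training data. A minimal pandas sketch (toy data) of what the encoder learns and applies:

```python
import pandas as pd

train = pd.DataFrame({
    "product_id": ["A", "A", "B", "B", "B"],
    "target":     [1,   0,   1,   1,   0],
})

# Learn the per-category target mean on training data only
category_means = train.groupby("product_id")["target"].mean()

# Apply the learned mapping to any new data
test = pd.DataFrame({"product_id": ["A", "B"]})
test["product_id_encoded"] = test["product_id"].map(category_means)
```

Seeing it spelled out makes the leakage risk obvious: if the means were computed on data that includes your validation rows, the target sneaks into the features.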
Feature Transformation Techniques
Log Transformations
python
from feature_engine.transformation import LogTransformer

# Apply log transformation to skewed variables
# (values must be strictly positive)
log_transformer = LogTransformer(
    variables=['income', 'transaction_amount', 'property_value']
)
log_transformer.fit(X_train)
X_train_log = log_transformer.transform(X_train)
Log transformations are essential for right-skewed distributions. They help normalize your data and can dramatically improve linear model performance. I use these on pretty much every financial or count variable.
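The effect is easy to see with bare numpy — the log compresses values spanning several orders of magnitude into a narrow, well-behaved range (illustrative numbers):

```python
import numpy as np

income = np.array([1_000, 10_000, 100_000, 1_000_000], dtype=float)

# Requires strictly positive values
log_income = np.log(income)

# The max/min ratio shrinks from 1000x in raw dollars to ~2x on the log scale
ratio_raw = income.max() / income.min()
ratio_log = log_income.max() / log_income.min()
```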
Power Transformations
python
from feature_engine.transformation import BoxCoxTransformer

# Box-Cox transformation (requires strictly positive values);
# the optimal lambda is found automatically during fit
power_transformer = BoxCoxTransformer(
    variables=['sales', 'visits']
)
power_transformer.fit(X_train)
X_train_transformed = power_transformer.transform(X_train)
Power transformations are like log transformations but more flexible: the Box-Cox family covers log, square root, reciprocal and more, and fitting finds the optimal parameter for you. If your data contains zeros or negative values, use YeoJohnsonTransformer instead.
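If you want to see the lambda search itself, `scipy.stats.boxcox` performs the same optimisation outside of Feature-engine (assuming SciPy is available; the data below is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Right-skewed, strictly positive data
sales = rng.lognormal(mean=3.0, sigma=1.0, size=500)

# boxcox returns the transformed data and the fitted lambda
transformed, best_lambda = stats.boxcox(sales)
# For lognormal data, best_lambda lands near 0, i.e. close to a plain log
```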
Reciprocal Transformation
python
from feature_engine.transformation import ReciprocalTransformer

# Take the reciprocal (1/x); variables must not contain zeros
reciprocal_transformer = ReciprocalTransformer(
    variables=['time_to_event', 'distance']
)
reciprocal_transformer.fit(X_train)
X_train_recip = reciprocal_transformer.transform(X_train)
This is underrated but incredibly useful when you want to convert “time until event” into “rate” or “distance” into “proximity”. Changes the interpretation completely.
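A quick numpy illustration of that interpretation flip (illustrative numbers):

```python
import numpy as np

# "Time until event" in hours; must be non-zero for the reciprocal
time_to_event = np.array([0.5, 2.0, 10.0])

# Reciprocal turns waiting times into rates: short waits -> high rates
rate = 1.0 / time_to_event
```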
Outlier Handling
Outliers can wreck your models. Feature-engine gives you sophisticated options beyond simple capping.
Winsorization
python
from feature_engine.outliers import Winsorizer

# Cap outliers at the 5th and 95th percentiles
winsorizer = Winsorizer(
    capping_method='quantiles',
    tail='both',
    fold=0.05,
    variables=['income', 'age', 'spending']
)
winsorizer.fit(X_train)
X_train_capped = winsorizer.transform(X_train)
Winsorization is my go-to outlier handling technique. Instead of removing outliers, you cap them at reasonable thresholds. This preserves your sample size while reducing the impact of extreme values.
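The mechanics are simple — learn quantile thresholds on training data, then clip. A numpy sketch of the same idea:

```python
import numpy as np

income = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 900], dtype=float)

# Learn the capping thresholds (5th / 95th percentiles) from training data
low, high = np.percentile(income, [5, 95])

# Cap instead of dropping: sample size is preserved,
# but the extreme value no longer dominates
income_capped = np.clip(income, low, high)
```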
IQR-Based Capping
python
iqr_capper = Winsorizer(
    capping_method='iqr',
    tail='both',
    fold=1.5,  # thresholds at Q1 - 1.5*IQR and Q3 + 1.5*IQR
    variables=['transaction_value']
)
The IQR method is more robust than using fixed percentiles, especially when your data distribution changes over time.
Outlier Trimming
python
from feature_engine.outliers import OutlierTrimmer

# Remove outliers entirely
trimmer = OutlierTrimmer(
    capping_method='iqr',
    tail='both',
    fold=3.0,  # Only trim extreme outliers
    variables=['measurement_error']
)
trimmer.fit(X_train)
X_train_trimmed = trimmer.transform(X_train)
I rarely use trimming because it changes your sample size, but it’s useful when outliers are clearly data errors rather than legitimate extreme values :/
Feature Creation
Creating new features is where the magic happens. Feature-engine makes this systematic and reproducible.
Mathematical Combinations
python
from feature_engine.creation import MathFeatures

# Create new features through mathematical operations
math_features = MathFeatures(
    variables=['income', 'age'],
    func=['sum', 'prod', 'mean', 'std'],
    new_variables_names=['income_age_sum', 'income_age_prod',
                         'income_age_mean', 'income_age_std']
)
math_features.fit(X_train)
X_train_features = math_features.transform(X_train)
These combinations often capture interaction effects that individual features miss. Income × Age might be way more predictive than either alone.
Cyclical Features
python
from feature_engine.creation import CyclicalFeatures

# Transform cyclical variables (months, hours, days)
cyclical = CyclicalFeatures(
    variables=['month', 'hour_of_day', 'day_of_week']
)
cyclical.fit(X_train)
X_train_cyclical = cyclical.transform(X_train)
This creates sine and cosine transformations so your model understands that December (12) is close to January (1). Crucial for time-based features.
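The transformation itself is just sine and cosine over the cycle length. A numpy sketch showing December and January ending up close together:

```python
import numpy as np

def month_to_cyclical(month, period=12):
    """Map a month number onto a point on the unit circle."""
    angle = 2 * np.pi * month / period
    return np.sin(angle), np.cos(angle)

dec = np.array(month_to_cyclical(12))
jan = np.array(month_to_cyclical(1))
jun = np.array(month_to_cyclical(6))

# In (sin, cos) space, December sits next to January but far from June
dist_dec_jan = np.linalg.norm(dec - jan)
dist_dec_jun = np.linalg.norm(dec - jun)
```

On the raw integer scale, |12 − 1| = 11 makes the two months look maximally far apart; on the circle they are neighbours.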
Relative Features
python
from feature_engine.creation import RelativeFeatures

# Combine each variable with each reference variable
relative = RelativeFeatures(
    variables=['revenue'],
    reference=['cost'],
    func=['sub', 'div']
)
# Creates: revenue - cost (profit) and revenue / cost (margin)
relative.fit(X_train)
X_train_relative = relative.transform(X_train)
Financial ratios, performance metrics, efficiency measures — these are gold for predictive models. Revenue and cost individually might be noisy, but profit margin is often highly predictive.
Discretization and Binning
Sometimes continuous variables work better as categories.
Equal-Width Binning
python
from feature_engine.discretisation import EqualWidthDiscretiser

# Create equal-width bins
ew_disc = EqualWidthDiscretiser(
    bins=5,
    variables=['age', 'income']
)
ew_disc.fit(X_train)
X_train_binned = ew_disc.transform(X_train)
Equal-Frequency Binning
python
from feature_engine.discretisation import EqualFrequencyDiscretiser

# Create bins with an equal number of observations
ef_disc = EqualFrequencyDiscretiser(
    q=5,  # Number of quantiles
    variables=['transaction_value']
)
Equal-frequency is usually better than equal-width because it handles skewed distributions more gracefully. Each bin has roughly the same number of samples.
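The pandas equivalents make the difference concrete — `cut` slices the value range into equal widths, while `qcut` slices the observations into equal counts (toy skewed data):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])  # one extreme value

# Equal-width: the outlier stretches the range, cramming
# almost everything into the first bin
equal_width = pd.cut(values, bins=5, labels=False)

# Equal-frequency: every bin gets the same number of observations
equal_freq = pd.qcut(values, q=5, labels=False)

width_counts = equal_width.value_counts()
freq_counts = equal_freq.value_counts()
```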
Decision Tree Binning
python
from feature_engine.discretisation import DecisionTreeDiscretiser

# Let a decision tree find optimal splits
dt_disc = DecisionTreeDiscretiser(
    cv=5,
    scoring='roc_auc',
    regression=False,  # classification target
    variables=['credit_score', 'debt_ratio']
)

# Requires the target variable
dt_disc.fit(X_train, y_train)
X_train_binned = dt_disc.transform(X_train)
This is incredibly powerful. The binning is supervised — it finds splits that maximize predictive power. I’ve seen this single technique boost model performance by 3–5% on its own.
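Conceptually, this wraps a shallow decision tree per feature and uses its leaves as bins. A sklearn sketch of the idea on synthetic data (the real transformer adds cross-validated depth selection and handles multiple variables for you):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
credit_score = rng.uniform(300, 850, size=1000).reshape(-1, 1)
# Synthetic target: default risk drops as the score rises
y = (credit_score.ravel() + rng.normal(0, 50, size=1000) < 600).astype(int)

# A depth-limited tree learns the split points that best separate the target...
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(credit_score, y)

# ...and each leaf becomes one bin
bins = tree.apply(credit_score)
n_bins = len(np.unique(bins))
```

Because the splits are chosen to maximise class separation, the resulting bins carry far more signal than arbitrary quantile edges.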
Building Production Pipelines
Here’s where everything comes together. Feature-engine was built for pipelines.
Complete Feature Engineering Pipeline
python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.outliers import Winsorizer
from feature_engine.transformation import LogTransformer
from feature_engine.creation import MathFeatures
from feature_engine.encoding import OneHotEncoder

# Build a comprehensive pipeline
feature_pipeline = Pipeline([
    # 1. Handle missing values
    ('impute_numerical', MeanMedianImputer(
        imputation_method='median',
        variables=['age', 'income']
    )),
    ('impute_categorical', CategoricalImputer(
        imputation_method='missing',
        variables=['city', 'occupation']
    )),
    # 2. Handle outliers
    ('cap_outliers', Winsorizer(
        capping_method='iqr',
        tail='both',
        fold=1.5,
        variables=['income', 'spending']
    )),
    # 3. Transform skewed features
    ('log_transform', LogTransformer(
        variables=['income', 'property_value']
    )),
    # 4. Create new features
    ('create_features', MathFeatures(
        variables=['income', 'age'],
        func=['prod', 'mean']
    )),
    # 5. Encode categoricals
    ('encode_categorical', OneHotEncoder(
        variables=['city', 'occupation'],
        drop_last=True
    )),
    # 6. Train model
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Fit everything at once
feature_pipeline.fit(X_train, y_train)

# Predict on new data
predictions = feature_pipeline.predict(X_test)
This is beautiful. Every transformation is reproducible, properly fitted on training data only, and applies consistently to any new data. No more scattered preprocessing code across multiple notebooks.
Cross-Validation with Feature Engineering
python
from sklearn.model_selection import cross_val_score

# Cross-validate the entire pipeline
scores = cross_val_score(
    feature_pipeline,
    X_train,
    y_train,
    cv=5,
    scoring='roc_auc'
)
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
The pipeline ensures no data leakage during cross-validation. Each fold fits the transformers on its own training data and applies them to the validation data independently.
Advanced Tricks and Best Practices
Combining Multiple Encoders
You don’t have to use the same encoding for all categorical variables:
python
# Low cardinality: one-hot encode
ohe = OneHotEncoder(variables=['city', 'product_type'])

# High cardinality: target encode
mean_enc = MeanEncoder(variables=['customer_id', 'merchant_id'])

# Ordinal relationships: ordinal encode
ord_enc = OrdinalEncoder(
    encoding_method='ordered',
    variables=['satisfaction_level']
)

# Chain them in a pipeline; each encoder only touches its own variables
encoder_pipeline = Pipeline([
    ('one_hot', ohe),
    ('target', mean_enc),
    ('ordinal', ord_enc)
])
This mixed approach is often the best strategy. Different variable types need different encodings.
Handling Rare Categories
python
from feature_engine.encoding import RareLabelEncoder

# Group rare categories together
rare_encoder = RareLabelEncoder(
    tol=0.05,  # categories below 5% frequency are grouped
    n_categories=10,  # only group variables with at least 10 categories
    variables=['city', 'product_category']
)
rare_encoder.fit(X_train)
X_train_encoded = rare_encoder.transform(X_train)
Rare categories are notorious for causing overfitting and encoding issues. This groups them into a single “Rare” category automatically (the label is configurable via replace_with).
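A pandas sketch of what the grouping does (toy data; "Rare" mirrors Feature-engine's default replacement label):

```python
import pandas as pd

city = pd.Series(["NYC"] * 15 + ["LA"] * 8 + ["Oslo", "Lima"])

# Learn category frequencies on training data
freqs = city.value_counts(normalize=True)

# Keep categories at or above the 5% threshold; group the rest
frequent = freqs[freqs >= 0.05].index
city_grouped = city.where(city.isin(frequent), "Rare")
```

The fitted encoder remembers the frequent labels, so a category that appears only in the test set also gets mapped to "Rare" instead of crashing downstream encoders.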
Custom Transformers
Need something specific? Create your own transformer:
python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn parameters from training data
        self.custom_param_ = X[self.variables].mean()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.variables] = X[self.variables] / self.custom_param_
        return X
It integrates seamlessly with pipelines and follows scikit-learn conventions.
Common Mistakes to Avoid
1. Fitting on the Entire Dataset
Don’t do this:
python
encoder.fit(pd.concat([X_train, X_test]))
Always fit only on training data:
python
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
2. Not Specifying Variables
python
transformer = LogTransformer(
    variables=['income', 'price']
)
If you don’t specify variables, some transformers apply to all numerical columns, which might not be what you want.
3. Over-Engineering Features
More features ≠ better model. I’ve seen people create 200 features from 10 original variables and wonder why their model overfits horribly.
Start simple. Add complexity only when it demonstrably improves cross-validation scores.
Performance Comparison: Before and After
Let me share real numbers from a recent project. Starting with basic preprocessing:
python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Basic approach: ~0.78 AUC
basic_pipeline = Pipeline([
    ('impute', SimpleImputer()),
    ('scale', StandardScaler()),
    ('model', LogisticRegression())
])
After applying Feature-engine techniques:
python
# Feature-engine approach: ~0.84 AUC
advanced_pipeline = Pipeline([
    ('impute_num', MeanMedianImputer(imputation_method='median')),
    ('impute_cat', CategoricalImputer(imputation_method='missing')),
    ('rare_labels', RareLabelEncoder(tol=0.05)),
    ('outliers', Winsorizer(capping_method='iqr')),
    ('log_transform', LogTransformer()),
    ('encode', OneHotEncoder(drop_last=True)),
    ('create_features', MathFeatures(
        variables=['income', 'age'], func=['prod', 'mean']
    )),
    ('model', LogisticRegression())
])
That’s a 6-point AUC improvement from better feature engineering alone. Same model, same data, just smarter preprocessing.
Final Thoughts
Feature-engine has become an essential part of my ML toolkit. It takes the tedious, error-prone parts of feature engineering and makes them systematic, reproducible, and actually enjoyable (okay, maybe enjoyable is a stretch, but definitely less painful).
The real value isn’t in any single technique — it’s in how everything works together seamlessly. You build pipelines once, they work consistently in production, and you’re not debugging preprocessing issues at 3 AM.
Start with the basics: imputation, encoding, and outlier handling. Get comfortable with pipelines. Then gradually add more sophisticated techniques like target encoding, feature creation, and decision tree discretization.
Your models will thank you. Your production engineers will thank you. And honestly? Future you, looking at clean pipeline code instead of scattered Pandas operations, will thank you most of all. Trust me on this one.