Scikit-optimize (skopt): Bayesian Optimization for Hyperparameters

You’ve been running grid search on your model for six hours. You’re testing every combination of learning rates, regularization values, and layer sizes. The search space has 1,000 possible combinations, and you’re maybe 30% through. Your laptop sounds like a jet engine. You’re burning electricity and time testing parameters that are obviously terrible, but grid search doesn’t know any better.

I spent my first year of machine learning doing exactly this — exhaustively testing parameter combinations like some kind of brute-force cave person. Then I discovered Bayesian optimization, and suddenly I was getting better results in 1/10th the time. Scikit-optimize (skopt) made this accessible without needing a PhD in Gaussian processes. Turns out, smart search beats exhaustive search every single time.

Let me show you how to stop wasting compute on bad hyperparameters and start finding optimal settings efficiently.

Scikit-optimize (skopt)

What Is Bayesian Optimization and Why It’s Better

Before we get into skopt specifically, understand why Bayesian optimization destroys grid search and random search:

Grid Search:

  • Tests every possible combination
  • Wastes time on obviously bad regions
  • Exponentially grows with parameters
  • No learning from previous trials

Random Search:

  • Tests random combinations
  • Better than grid search (surprisingly)
  • Still wastes evaluations on bad areas
  • No intelligence about what to try next

Bayesian Optimization:

  • Builds a probabilistic model of your objective function
  • Learns from each evaluation
  • Focuses search on promising regions
  • Balances exploration (trying new areas) vs. exploitation (refining good areas)

Ever wonder how research labs tune models so efficiently? They’re using smart optimization, not brute force.
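To put numbers on the "grows exponentially" point above, here's a quick stdlib sketch (the parameter grids are hypothetical, just for counting):

```python
from itertools import product

# Hypothetical grid: 5 values per hyperparameter
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
layer_sizes = [32, 64, 128, 256, 512]
dropouts = [0.0, 0.1, 0.2, 0.3, 0.4]
optimizers = ['adam', 'sgd', 'rmsprop', 'adagrad', 'adamw']

# Grid search trains one model per combination
grid = list(product(learning_rates, layer_sizes, dropouts, optimizers))
print(len(grid))  # 625 full training runs for just 4 parameters

# Add one more 5-value parameter and the grid quintuples
print(5 ** 5)  # 3125 runs; a Bayesian search typically uses ~50
```

Every extra parameter multiplies the bill, which is exactly why a method that learns where to look pays off.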

What Is Scikit-optimize (skopt)?

Scikit-optimize is a Python library that implements Bayesian optimization using Gaussian processes and other surrogate models. It’s designed to work seamlessly with scikit-learn but works with any Python function.

What skopt provides:

  • Bayesian optimization algorithms
  • Integration with scikit-learn’s API
  • Visualization tools
  • Checkpoint and resume functionality
  • Multiple acquisition functions
  • Support for different search spaces

Think of it as “grid search, but smart.” Same easy API, dramatically better results.

Installation and Setup

Getting started is straightforward:

bash

pip install scikit-optimize

Import the essentials:

python

from skopt import gp_minimize, forest_minimize, gbrt_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
from skopt import BayesSearchCV
import numpy as np

That’s all you need. Now let’s make it actually do something useful.

Your First Bayesian Optimization (Simple Example)

Let’s start with a basic optimization problem to understand the mechanics:

python

from skopt import gp_minimize
from skopt.space import Real

# Define a function to minimize (e.g., our model's validation error)
def objective_function(params):
    x, y = params
    # Some arbitrary function (imagine this is your model's validation error)
    return (x - 2)**2 + (y + 3)**2 + 5

# Define search space
search_space = [
    Real(-5.0, 5.0, name='x'),
    Real(-5.0, 5.0, name='y')
]

# Run optimization
result = gp_minimize(
    objective_function,
    search_space,
    n_calls=20,       # Number of evaluations
    random_state=42
)

print(f"Best parameters: {result.x}")
print(f"Best score: {result.fun}")

This runs just 20 evaluations and lands close to the true optimum at (2, −3). Compare that to grid search, which would need hundreds of evaluations to cover the same space at comparable resolution.

Understanding the Result Object

python

# Best parameters found (close to the true optimum [2.0, -3.0])
print(result.x)
# Best score achieved (close to the true minimum of 5.0)
print(result.fun)
# All evaluated parameters
print(result.x_iters)    # List of all tested combinations
# All scores
print(result.func_vals)  # Corresponding scores
# Optimization space
print(result.space)      # The search space used

The result object contains everything about the optimization run. Useful for analysis and debugging.
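A useful piece of intuition: `result.x` and `result.fun` are simply the best entry from `result.x_iters` and `result.func_vals`. A pure-Python sketch of that relationship, with mock lists standing in for a real result:

```python
# Mock optimization history (stand-ins for result.x_iters / result.func_vals)
x_iters = [[0.0, 0.0], [1.5, -2.0], [2.1, -2.9], [4.0, 1.0]]
func_vals = [18.0, 6.25, 5.02, 25.0]

# result.x / result.fun correspond to the lowest score seen
best_idx = min(range(len(func_vals)), key=func_vals.__getitem__)
best_x, best_fun = x_iters[best_idx], func_vals[best_idx]
print(best_x, best_fun)  # [2.1, -2.9] 5.02
```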

Defining Search Spaces (Getting It Right)

The search space definition is critical:

Real-Valued Parameters

python

from skopt.space import Real
# Linear scale (default)
learning_rate = Real(1e-6, 1e-1, name='learning_rate')
# Log scale (better for learning rates)
learning_rate_log = Real(1e-6, 1e-1, prior='log-uniform', name='learning_rate')
# Uniform distribution
dropout = Real(0.0, 0.5, name='dropout')

Use log-uniform for parameters that span multiple orders of magnitude (like learning rates). Use uniform for parameters in a smaller range.
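To see why this matters, here's a stdlib sketch of the two sampling schemes (my own illustration, not skopt internals). Uniform sampling over [1e-6, 1e-1] almost never lands below 1e-3, while log-uniform covers every decade evenly:

```python
import math
import random

random.seed(0)
low, high = 1e-6, 1e-1

# Uniform: sample directly on the linear scale
uniform_draws = [random.uniform(low, high) for _ in range(10_000)]

# Log-uniform: sample uniformly in log space, then exponentiate
log_lo, log_hi = math.log10(low), math.log10(high)
loguniform_draws = [10 ** random.uniform(log_lo, log_hi) for _ in range(10_000)]

# Fraction of samples below 1e-3 (the region where good learning rates often live)
tiny = lambda xs: sum(x < 1e-3 for x in xs) / len(xs)
print(f"uniform below 1e-3:     {tiny(uniform_draws):.1%}")    # about 1%
print(f"log-uniform below 1e-3: {tiny(loguniform_draws):.1%}")  # about 60%
```

With a linear prior, roughly 99% of the budget goes to values above 1e-3; with log-uniform, each order of magnitude gets equal attention.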

Integer Parameters

python

from skopt.space import Integer
# Number of layers
n_layers = Integer(1, 10, name='n_layers')
# Hidden units
hidden_units = Integer(32, 512, name='hidden_units')
# Batch size (often works better as powers of 2)
batch_size = Integer(16, 256, name='batch_size')

Categorical Parameters

python

from skopt.space import Categorical
# Optimizer choice
optimizer = Categorical(['adam', 'sgd', 'rmsprop'], name='optimizer')
# Activation function
activation = Categorical(['relu', 'tanh', 'sigmoid'], name='activation')
# Model architecture
model_type = Categorical(['resnet', 'vgg', 'efficientnet'], name='model')

Categorical parameters let you search over discrete choices.


Combining Different Types

python

from skopt.space import Real, Integer, Categorical

search_space = [
    Real(1e-6, 1e-1, prior='log-uniform', name='learning_rate'),
    Integer(32, 512, name='hidden_units'),
    Integer(1, 5, name='n_layers'),
    Real(0.0, 0.5, name='dropout'),
    Categorical(['adam', 'sgd', 'rmsprop'], name='optimizer')
]

Mix and match parameter types to define complex search spaces.

Optimizing Scikit-Learn Models (The Easy Way)

Skopt integrates directly with scikit-learn through BayesSearchCV:

Basic Usage

python

from skopt import BayesSearchCV
from skopt.space import Real, Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define search space
search_space = {
    'n_estimators': Integer(10, 200),
    'max_depth': Integer(3, 20),
    'min_samples_split': Integer(2, 20),
    'min_samples_leaf': Integer(1, 10),
    'max_features': Real(0.1, 1.0)
}

# Create optimizer
opt = BayesSearchCV(
    RandomForestClassifier(),
    search_space,
    n_iter=50,   # Number of parameter settings sampled
    cv=3,
    n_jobs=-1,
    verbose=1
)

# Run optimization
opt.fit(X_train, y_train)

# Best parameters
print(f"Best parameters: {opt.best_params_}")
print(f"Best score: {opt.best_score_}")

# Use best model
best_model = opt.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test score: {test_score}")

This is almost identical to scikit-learn’s GridSearchCV API, but uses Bayesian optimization under the hood. IMO, you should replace every GridSearchCV in your code with BayesSearchCV.

Advanced BayesSearchCV Options

python

opt = BayesSearchCV(
    estimator=RandomForestClassifier(),
    search_spaces=search_space,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    scoring='accuracy',        # or 'f1', 'roc_auc', custom scorer
    verbose=2,
    random_state=42,
    return_train_score=True,
    refit=True,                # Refit on entire dataset with best params
)

Optimizing Custom Functions (Maximum Flexibility)

For custom models or non-scikit-learn code:

Neural Network Example

python

from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
import tensorflow as tf

# Define search space
search_space = [
    Integer(32, 256, name='units_1'),
    Integer(32, 256, name='units_2'),
    Real(1e-5, 1e-2, prior='log-uniform', name='learning_rate'),
    Real(0.0, 0.5, name='dropout'),
    Categorical(['relu', 'tanh'], name='activation')
]

# Define objective function (assumes X_train, y_train are already loaded)
@use_named_args(search_space)
def objective(**params):
    # Build model with params
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(params['units_1'], activation=params['activation']),
        tf.keras.layers.Dropout(params['dropout']),
        tf.keras.layers.Dense(params['units_2'], activation=params['activation']),
        tf.keras.layers.Dropout(params['dropout']),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(params['learning_rate']),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    # Train
    history = model.fit(
        X_train, y_train,
        validation_split=0.2,
        epochs=10,
        batch_size=32,
        verbose=0
    )

    # Return validation loss (what we want to minimize)
    return min(history.history['val_loss'])

# Run optimization
result = gp_minimize(
    objective,
    search_space,
    n_calls=30,
    random_state=42,
    verbose=True
)

print(f"Best validation loss: {result.fun}")
print(f"Best parameters: {result.x}")

The @use_named_args decorator converts the list of parameters into keyword arguments, making the code cleaner.
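Under the hood, the decorator does something like the following. This is a simplified stand-in for intuition, not skopt's actual implementation: it zips the ordered parameter values with the dimension names and calls your function with keyword arguments.

```python
import functools

def named_args(dimension_names):
    """Simplified stand-in for skopt's @use_named_args decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(params):
            # Zip the positional values with the dimension names,
            # then call the wrapped function with keyword arguments
            return func(**dict(zip(dimension_names, params)))
        return wrapper
    return decorator

@named_args(['learning_rate', 'hidden_units'])
def objective(**params):
    return f"lr={params['learning_rate']}, units={params['hidden_units']}"

# The optimizer passes a plain list; the function sees named kwargs
print(objective([0.01, 128]))  # lr=0.01, units=128
```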

Different Optimization Strategies

Skopt provides multiple optimization algorithms:

Gaussian Process (GP) Optimization

python

from skopt import gp_minimize
# Best for: Smooth, expensive functions
# Pros: Most sample-efficient, models uncertainty well
# Cons: Slower for many evaluations, doesn't scale well beyond ~20D
result = gp_minimize(objective, search_space, n_calls=50)

This is the “classic” Bayesian optimization. Best for expensive evaluations where you want maximum efficiency.

Random Forest (Forest) Optimization

python

from skopt import forest_minimize
# Best for: Larger search spaces, faster evaluations
# Pros: Scales better, handles high dimensions
# Cons: Less sample-efficient than GP
result = forest_minimize(objective, search_space, n_calls=100)

Random forest surrogates handle higher-dimensional spaces better than GP.

Gradient Boosting (GBRT) Optimization

python

from skopt import gbrt_minimize
# Best for: When you want something between GP and random forest
# Pros: Good balance of efficiency and scalability
# Cons: Not as well-studied as GP or random forest
result = gbrt_minimize(objective, search_space, n_calls=75)

Gradient boosting machines as surrogates. Often works well in practice.

Which One to Use?

Use GP (gp_minimize) when:

  • Evaluations are expensive (minutes or hours per trial)
  • Search space is relatively low-dimensional (<15D)
  • You want maximum sample efficiency

Use Random Forest (forest_minimize) when:

  • Search space is high-dimensional (>15D)
  • Evaluations are relatively fast
  • GP is taking too long to fit

Use GBRT (gbrt_minimize) when:

  • You want to try something different
  • Forest and GP aren’t working well

Honestly, start with GP and switch if it’s too slow. FYI, I use GP for about 80% of my projects.

Acquisition Functions (Balancing Exploration and Exploitation)

Acquisition functions determine what point to evaluate next:

python

result = gp_minimize(
    objective,
    search_space,
    n_calls=50,
    acq_func='EI'   # Expected Improvement
)

Available acquisition functions:

  • ‘EI’ (Expected Improvement) — default, good balance
  • ‘PI’ (Probability of Improvement) — more exploitative
  • ‘LCB’ (Lower Confidence Bound) — more exploratory
  • ‘gp_hedge’ — learns which acquisition function works best during the run

For most cases, stick with ‘EI’ (the default). If you’re not finding good results, try ‘LCB’ for more exploration or ‘PI’ for more exploitation.
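For the curious, expected improvement has a closed form under a Gaussian surrogate. Here's a small sketch using the stdlib normal distribution (minimization convention; mu and sigma are the surrogate's predicted mean and standard deviation at a candidate point):

```python
from statistics import NormalDist

def expected_improvement(mu, sigma, best_so_far):
    """EI for minimization, given a Gaussian prediction (mu, sigma)."""
    if sigma <= 0:
        return 0.0
    z = (best_so_far - mu) / sigma
    norm = NormalDist()
    # Expected gain below the incumbent, plus an uncertainty bonus
    return (best_so_far - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# A confident, slightly-worse point vs. an uncertain one at the same mean:
print(expected_improvement(mu=1.1, sigma=0.01, best_so_far=1.0))  # ~0: no hope here
print(expected_improvement(mu=1.1, sigma=1.0, best_so_far=1.0))   # sizable: worth exploring
```

Notice how the second point wins purely on uncertainty: that second term is the "exploration" half of the trade-off.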

Checkpointing and Resuming (Save Your Progress)

Long optimizations should be checkpointed:

python

from skopt.callbacks import CheckpointSaver

# Create checkpoint callback
checkpoint_saver = CheckpointSaver("./checkpoint.pkl", compress=9)

# Run optimization with checkpointing
result = gp_minimize(
    objective,
    search_space,
    n_calls=100,
    callback=[checkpoint_saver]
)

# Resume from checkpoint later
from skopt import load
previous_result = load('./checkpoint.pkl')

# Continue optimization, seeded with the previous evaluations
result = gp_minimize(
    objective,
    search_space,
    n_calls=50,                    # Additional calls
    x0=previous_result.x_iters,    # Previously evaluated points
    y0=previous_result.func_vals,  # Their scores
    callback=[checkpoint_saver]
)

Essential for long-running optimizations that might crash or get interrupted.

Visualization (Understanding Your Optimization)

Skopt provides excellent visualization tools:

Convergence Plot

python

from skopt.plots import plot_convergence
import matplotlib.pyplot as plt
plot_convergence(result)
plt.show()

Shows how the best found value improves over iterations. Helps you decide if you need more evaluations.
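The curve it draws is just the running minimum of `result.func_vals`, so you can compute the same thing yourself if you'd rather log it than plot it (mock scores here for illustration):

```python
from itertools import accumulate

# Mock per-iteration scores (stand-in for result.func_vals)
func_vals = [9.2, 7.5, 8.1, 6.0, 6.4, 5.1, 5.3, 5.05]

# Running best: the same curve plot_convergence draws
running_best = list(accumulate(func_vals, min))
print(running_best)  # [9.2, 7.5, 7.5, 6.0, 6.0, 5.1, 5.1, 5.05]
```

If the tail of this list is still dropping, you likely need more calls; if it has been flat for a while, you can stop early.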

Objective Function Plot

python

from skopt.plots import plot_objective
# Visualize objective function in parameter space
plot_objective(result)
plt.tight_layout()
plt.show()

Shows how the objective varies with each parameter. Helps understand parameter importance and interactions.

Evaluations Plot

python

from skopt.plots import plot_evaluations
# Shows all evaluated points
plot_evaluations(result)
plt.tight_layout()
plt.show()

Visualizes the search trajectory through parameter space.

Parallel Optimization (Faster Searches)

Evaluate multiple points simultaneously:

python

from skopt import Optimizer
from joblib import Parallel, delayed
import numpy as np

# Create optimizer
opt = Optimizer(search_space, base_estimator='GP', n_initial_points=10)

# Function to evaluate
def evaluate_point(point):
    return objective(point)

# Parallel optimization loop
for i in range(10):               # 10 iterations
    # Ask for multiple points to evaluate
    points = opt.ask(n_points=4)  # Get 4 points

    # Evaluate in parallel
    scores = Parallel(n_jobs=4)(delayed(evaluate_point)(point) for point in points)

    # Tell the optimizer the results
    opt.tell(points, scores)

# Get best result
best_point = opt.Xi[np.argmin(opt.yi)]
best_score = min(opt.yi)
print(f"Best parameters: {best_point}")
print(f"Best score: {best_score}")

This evaluates multiple parameter settings simultaneously, speeding up optimization when you have multiple CPUs/GPUs.

Real-World Example: Optimizing XGBoost

Let’s optimize an XGBoost model with Bayesian optimization:

python

from skopt import BayesSearchCV
from skopt.space import Real, Integer
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define comprehensive search space
search_space = {
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'max_depth': Integer(3, 10),
    'min_child_weight': Integer(1, 10),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.5, 1.0),
    'gamma': Real(0, 5),
    'reg_alpha': Real(1e-5, 1, prior='log-uniform'),
    'reg_lambda': Real(1e-5, 1, prior='log-uniform'),
    'n_estimators': Integer(50, 300)
}

# Create optimizer
opt = BayesSearchCV(
    xgb.XGBClassifier(eval_metric='logloss'),
    search_space,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    scoring='roc_auc',
    verbose=1,
    random_state=42
)

# Run optimization
opt.fit(X_train, y_train)

# Results
print(f"Best parameters: {opt.best_params_}")
print(f"Best cross-validation score: {opt.best_score_:.4f}")

# Test performance
test_score = opt.score(X_test, y_test)
print(f"Test score: {test_score:.4f}")

# Visualize
from skopt.plots import plot_convergence
import matplotlib.pyplot as plt
plot_convergence(opt.optimizer_results_[0])
plt.show()

This searches a 9-dimensional space efficiently, finding good parameters in 50 evaluations instead of the thousands grid search would need.

Common Mistakes and How to Avoid Them

Learn from these optimization failures:

Mistake 1: Wrong Search Space Bounds

python

# Bad - search space too narrow
learning_rate = Real(0.01, 0.03)
# Good - give the optimizer room to search
learning_rate = Real(1e-5, 1e-1, prior='log-uniform')

If your optimal value is at the boundary of your search space, you defined the space wrong. Expand it.
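A quick way to catch this automatically: a small helper (my own, not part of skopt) that flags any best parameter sitting within a few percent of its bounds.

```python
def at_boundary(best_params, bounds, tol=0.05):
    """Flag parameters whose best value sits within `tol` (as a fraction
    of the range) of either bound, a sign the search space is too narrow."""
    flagged = []
    for name, value in best_params.items():
        low, high = bounds[name]
        margin = (high - low) * tol
        if value <= low + margin or value >= high - margin:
            flagged.append(name)
    return flagged

# Hypothetical result: learning_rate pinned against its upper bound
bounds = {'learning_rate': (0.01, 0.03), 'max_depth': (3, 20)}
best = {'learning_rate': 0.0299, 'max_depth': 10}
print(at_boundary(best, bounds))  # ['learning_rate'] -> widen that range
```

Run this after every optimization; anything it flags means rerun with a wider range for that parameter.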

Mistake 2: Not Using Log-Uniform for Wide Ranges

python

# Bad - linear scale for multiple orders of magnitude
learning_rate = Real(0.0001, 0.1) # Biased toward larger values
# Good - log scale
learning_rate = Real(0.0001, 0.1, prior='log-uniform') # Samples evenly in log space

Use log-uniform for parameters spanning multiple orders of magnitude. This is especially important for learning rates and regularization.

Mistake 3: Too Few Evaluations

python

# Bad - not enough evaluations for Bayesian optimization to help
result = gp_minimize(objective, search_space, n_calls=5)
# Good - give it enough calls to learn
result = gp_minimize(objective, search_space, n_calls=50)

Bayesian optimization needs ~10–20 evaluations to build a useful model. With only 5 evaluations, you might as well use random search.

Mistake 4: Not Validating Properly

python

# Bad - optimizing on training data
def bad_objective(params):
    model.fit(X_train, y_train)
    return -model.score(X_train, y_train)  # Training score!

# Good - using a held-out validation set
def good_objective(params):
    model.fit(X_train, y_train)
    return -model.score(X_val, y_val)      # Validation score

Always optimize based on validation performance, not training performance. This is basic ML hygiene but people forget it constantly. :/

Mistake 5: Not Saving Progress

python

# Bad - hours of optimization lost if it crashes
result = gp_minimize(objective, search_space, n_calls=1000)

# Good - checkpoint regularly
from skopt.callbacks import CheckpointSaver
result = gp_minimize(
    objective, search_space, n_calls=1000,
    callback=[CheckpointSaver("./checkpoint.pkl")]
)

Long optimizations will crash. Murphy’s Law guarantees it. Checkpoint your progress.

The Bottom Line for ML Practitioners

Grid search is brute force. Random search is slightly smarter brute force. Bayesian optimization is actual intelligence applied to hyperparameter search. Scikit-optimize makes this accessible without requiring deep knowledge of Gaussian processes or acquisition functions.

Use skopt when:

  • Hyperparameter tuning takes significant time
  • Search space is large or high-dimensional
  • You want better results with fewer evaluations
  • You’re tired of wasting compute on obviously bad parameters

Stick with grid/random search when:

  • Evaluations are nearly instant (milliseconds)
  • Search space is tiny (2–3 parameters, few values each)
  • You’re doing quick experiments

For most real ML projects, Bayesian optimization is the right choice. It’s more efficient, finds better parameters, and your compute budget will thank you.

Installation is simple:

bash

pip install scikit-optimize

Replace your next GridSearchCV with BayesSearchCV. Compare the results. You’ll probably never go back to exhaustive search. Stop testing obviously bad hyperparameters and start finding optimal settings efficiently. Your models — and your electricity bill — will thank you. :)
