
Sacred Experiment Tracking: Organize ML Research and Reproducibility

I’ll never forget the PhD student who came to me in tears. She’d just gotten amazing results on her model, showed them to her advisor, and then… couldn’t reproduce them. Different random seed? Wrong hyperparameter? Old code version? Who knows. The experiment was lost forever, and with it, six weeks of GPU time. That’s when I introduced her to Sacred, and it literally saved her dissertation.

Sacred is this beautifully designed Python library that tracks every single detail of your ML experiments automatically. It’s like having a paranoid lab assistant who writes down everything — hyperparameters, code versions, dependencies, outputs, you name it. And the best part? It stays out of your way while doing it.

Let me show you why Sacred is the experiment tracking tool you didn’t know you desperately needed.

Sacred Experiment Tracking

Why Experiment Tracking Matters (More Than You Think)

Here’s a dirty secret about ML research: most experiments are never truly reproducible. You run something, get great results, write them down in a Jupyter notebook named “final_FINAL_v3.ipynb”, and move on. Three months later when the reviewer asks you to re-run with a different metric? Good luck.

Sacred solves this by automatically capturing:

  • Every hyperparameter you use
  • The exact code that ran (including git commit)
  • All dependencies and their versions
  • Random seeds and their states
  • Console output and file artifacts
  • System information (GPU, CPU, OS)

You don’t have to remember to log things. Sacred just… does it. It’s honestly kind of magical.

Getting Started: Installation and Your First Experiment

Installation is refreshingly simple:

```bash
pip install sacred
```

For the full experience with database storage and a web UI, install the MongoDB driver alongside it:

```bash
pip install sacred pymongo
```

Now let’s create your first Sacred experiment. I’ll start with something familiar — a simple neural network:

```python
from sacred import Experiment
from sacred.observers import MongoObserver
import torch
import torch.nn as nn

# Create the experiment
ex = Experiment('mnist_classifier')

# Add MongoDB observer (optional but recommended)
ex.observers.append(MongoObserver(url='localhost:27017', db_name='sacred'))

@ex.config
def my_config():
    learning_rate = 0.001
    batch_size = 64
    epochs = 10
    hidden_size = 128

@ex.automain
def train(_run, learning_rate, batch_size, epochs, hidden_size):
    model = nn.Sequential(
        nn.Linear(784, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, 10)
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(epochs):
        # Your training loop goes here (train_epoch is a placeholder)
        loss = train_epoch(model, optimizer, batch_size)

        # Log metrics to Sacred
        _run.log_scalar("training.loss", loss, epoch)

    return {"final_loss": loss}
```

Run it like a normal Python script:

```bash
python train.py
```

That’s it. Sacred now knows everything about this run. The _run parameter Sacred injects gives you access to logging functions, and the @ex.config decorator defines your hyperparameters.
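That injection step is less magic than it looks: a captured function's missing arguments get filled in from the config. Here's a simplified, hypothetical sketch of the idea (a toy illustration, not Sacred's actual implementation):

```python
import inspect

def capture(fn, config):
    """Toy version of Sacred's capture: fill missing args from config."""
    def wrapper(**kwargs):
        params = inspect.signature(fn).parameters
        for name in params:
            if name not in kwargs and name in config:
                kwargs[name] = config[name]
        return fn(**kwargs)
    return wrapper

config = {'learning_rate': 0.001, 'batch_size': 64}

def train(learning_rate, batch_size):
    return f"lr={learning_rate}, bs={batch_size}"

captured = capture(train, config)
print(captured())                 # both values come from config
print(captured(batch_size=128))  # explicit arguments win over config
```

The real library does far more (nested configs, seeds, validation), but this is the mental model: your function signature declares what it needs, and the config supplies it.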

Configuration Magic: Sacred’s Secret Sauce

Sacred’s configuration system is where it really shines. You can define configs in multiple ways:

Function-based configs (what we just saw):

```python
@ex.config
def cfg():
    learning_rate = 0.001
    batch_size = 64
    optimizer = 'adam'
```
Dictionary configs:

```python
ex.add_config({
    'learning_rate': 0.001,
    'batch_size': 64,
    'optimizer': 'adam'
})
```

Config files (JSON or YAML):

```python
ex.add_config('config.yaml')
```

Here’s where it gets cool. You can override any config value from the command line:

```bash
python train.py with learning_rate=0.01 batch_size=128
```

Ever wondered why managing hyperparameters in ML projects feels like herding cats? Sacred makes it systematic.
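To see why the `with key=value` syntax scales, here's a hedged, stdlib-only sketch of how dotted overrides can be folded into a nested config dict (an illustration of the mechanism, not Sacred's actual parser):

```python
import ast

def apply_overrides(config, overrides):
    """Fold 'a.b=value' style override strings into a nested config dict."""
    for item in overrides:
        key, _, raw = item.partition('=')
        try:
            value = ast.literal_eval(raw)  # '0.01' -> 0.01, '128' -> 128
        except (ValueError, SyntaxError):
            value = raw                    # leave plain strings as-is
        node = config
        *path, leaf = key.split('.')
        for part in path:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

config = {'learning_rate': 0.001, 'model': {'hidden_size': 128}}
apply_overrides(config, ['learning_rate=0.01', 'model.hidden_size=256'])
print(config)  # {'learning_rate': 0.01, 'model': {'hidden_size': 256}}
```

Every override is just a path into the config tree plus a parsed value, which is why the same syntax works for flat and nested settings alike.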

Named configurations let you define presets:

```python
@ex.named_config
def small_model():
    hidden_size = 64
    batch_size = 32

@ex.named_config
def large_model():
    hidden_size = 512
    batch_size = 128
    dropout = 0.5
```

Run them with:

```bash
python train.py with small_model
python train.py with large_model learning_rate=0.01
```

This is perfect for comparing architectures or doing ablation studies. I use named configs for every paper experiment now :)


Capturing Everything: What Sacred Tracks

Sacred’s automatic tracking is ridiculously comprehensive. Let me break down what it captures without you lifting a finger:

Code snapshot: Sacred captures the exact state of your code at runtime, including uncommitted changes. It even records the git commit hash if you’re in a repo.

Dependencies: Every package and version gets logged. No more “but it worked on my machine” excuses.

Hardware info: CPU, GPU models, memory — all recorded automatically.

Random seeds: Sacred generates and records a seed for every run and automatically seeds Python’s random module and NumPy. For PyTorch or TensorFlow, you seed manually from the injected `_seed` value, so those frameworks stay reproducible too.

Console output: Every print statement gets saved. Great for debugging failed runs.
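Of those, seed management pays off most directly. Sacred injects each run's seed as `_seed` (which you can pass to `torch.manual_seed` yourself), and replaying a recorded seed replays the randomness exactly. A stdlib-only sketch of why that guarantees reproducibility:

```python
import random

recorded_seed = 424242  # the kind of value Sacred stores with each run

def simulated_run(seed):
    """Pretend 'experiment': three pseudo-random draws from a seeded RNG."""
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(3)]

# Re-running with the recorded seed replays exactly the same draws...
assert simulated_run(recorded_seed) == simulated_run(recorded_seed)

# ...while a different seed gives a different trajectory.
assert simulated_run(recorded_seed) != simulated_run(recorded_seed + 1)
```

That's the whole trick: reproducibility is just "same seed in, same numbers out", and Sacred makes sure the seed is never lost.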

Here’s how to log additional artifacts:

```python
@ex.automain
def train(_run):
    # Log scalars (losses, metrics)
    _run.log_scalar("train.loss", train_loss, step)
    _run.log_scalar("val.accuracy", val_acc, step)

    # Save artifacts (model checkpoints, plots)
    _run.add_artifact("model.pth")
    _run.add_artifact("confusion_matrix.png")

    # Store custom info
    _run.info["best_epoch"] = best_epoch
    _run.info["dataset_stats"] = compute_stats()
```

Database Storage: MongoDB Integration

Storing experiments in a database is where Sacred becomes a game-changer for research teams. Here’s the setup:

Install and start MongoDB:

```bash
# Ubuntu/Debian
sudo apt-get install mongodb
sudo systemctl start mongodb

# macOS
brew tap mongodb/brew
brew install mongodb-community
brew services start mongodb-community
```

Connect Sacred to MongoDB:

```python
from sacred.observers import MongoObserver

ex.observers.append(MongoObserver(
    url='localhost:27017',
    db_name='my_experiments'
))
```

Now every experiment run gets stored in MongoDB with a unique ID. You can query your runs programmatically:

```python
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.my_experiments
runs = db.runs

# Find best performing runs
best_runs = runs.find(
    {'result.val_accuracy': {'$gt': 0.95}}
).sort('result.val_accuracy', -1)

for run in best_runs:
    print(f"Run {run['_id']}: {run['result']['val_accuracy']}")
    print(f"Config: {run['config']}")
```

This is insanely useful for hyperparameter analysis and comparing architectures.
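Once the run documents are in hand, the analysis itself is plain Python. A small stdlib sketch, using made-up run dicts shaped like Sacred's documents (a `config` and a `result`), that averages accuracy per learning rate for an ablation:

```python
from collections import defaultdict

# Hypothetical run documents in the shape Sacred stores them
runs = [
    {'config': {'learning_rate': 0.01}, 'result': {'val_accuracy': 0.91}},
    {'config': {'learning_rate': 0.01}, 'result': {'val_accuracy': 0.93}},
    {'config': {'learning_rate': 0.1},  'result': {'val_accuracy': 0.85}},
]

# Group accuracies by the hyperparameter under study
by_lr = defaultdict(list)
for run in runs:
    by_lr[run['config']['learning_rate']].append(run['result']['val_accuracy'])

for lr, accs in sorted(by_lr.items()):
    print(f"lr={lr}: mean val_accuracy={sum(accs) / len(accs):.3f}")
# lr=0.01: mean val_accuracy=0.920
# lr=0.1: mean val_accuracy=0.850
```

Swap the hardcoded list for `db.runs.find(...)` and the same grouping logic works on your real experiments.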

Omniboard: The Visual Interface You Deserve

Sacred’s command-line interface is fine, but let’s be honest — we want pretty dashboards. Enter Omniboard, a gorgeous web UI for Sacred experiments.

Installation:

```bash
npm install -g omniboard
```

Launch it:

```bash
omniboard -m localhost:27017:my_experiments
```

Visit http://localhost:9000 and you'll see a beautiful table of all your experiments with:

  • Sortable columns for any metric
  • Filtering by hyperparameters
  • Config diffs between runs
  • Interactive plots of logged metrics
  • Artifact downloads

IMO, Omniboard is what makes Sacred viable for serious research. I keep it open in a browser tab all day during heavy experimentation phases.

Advanced Patterns: Ingredients and Modularity

Sacred has this concept called ingredients that lets you modularize experiment configs. Perfect for complex projects:

```python
from sacred import Experiment, Ingredient

# Define reusable components
data_ingredient = Ingredient('dataset')
model_ingredient = Ingredient('model')

@data_ingredient.config
def data_cfg():
    name = 'cifar10'
    train_split = 0.8
    augmentation = True

@model_ingredient.config
def model_cfg():
    architecture = 'resnet18'
    pretrained = True
    dropout = 0.5

# Main experiment uses ingredients
ex = Experiment('image_classifier',
                ingredients=[data_ingredient, model_ingredient])

@ex.automain
def train(dataset, model):
    # Access ingredient configs
    data = load_data(dataset['name'], dataset['train_split'])
    net = create_model(model['architecture'], model['pretrained'])
    # ... training code
```

Override ingredient configs like this:

```bash
python train.py with dataset.name=imagenet model.architecture=resnet50
```

This is perfect for papers where you’re testing multiple datasets and models. You define each component once and mix-and-match in experiments.
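The namespacing is easy to picture: each ingredient's config lives under the ingredient's name in the final run config. A toy illustration with plain dicts (illustrative values, not Sacred objects):

```python
# How ingredient configs nest in the final run config (illustrative dicts)
dataset_cfg = {'name': 'cifar10', 'train_split': 0.8, 'augmentation': True}
model_cfg = {'architecture': 'resnet18', 'pretrained': True, 'dropout': 0.5}

run_config = {
    'learning_rate': 0.001,   # experiment-level value
    'dataset': dataset_cfg,   # from Ingredient('dataset')
    'model': model_cfg,       # from Ingredient('model')
}

# An override like 'with dataset.name=imagenet' targets this nested slot:
run_config['dataset']['name'] = 'imagenet'
print(run_config['dataset']['name'])  # imagenet
```

That's why the dotted command-line syntax and the ingredient names line up: the ingredient name is simply the key its config sits under.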

Best Practices from the Field

After managing thousands of Sacred experiments, here’s what actually matters:

Always add a description:

```python
@ex.main
def train(_run):
    _run.info['description'] = 'Testing new regularization technique'
    # ... training code
```

Future you will thank present you when browsing old experiments.

Use custom metrics wisely:

```python
@ex.capture
def evaluate_model(model, _run):
    metrics = compute_metrics(model)

    # Log them all at once
    for name, value in metrics.items():
        _run.log_scalar(f"eval.{name}", value)
```

Structure your artifact names:

```python
# Bad
_run.add_artifact("model.pth")

# Good
_run.add_artifact(f"checkpoints/model_epoch_{epoch}.pth")
_run.add_artifact(f"plots/{dataset_name}_confusion_matrix.png")
```

This keeps your artifact storage organized as experiments pile up.

Fail gracefully:

```python
import traceback

@ex.automain
def train(_run):
    try:
        # Training code
        result = train_model()
        return result
    except Exception as e:
        _run.info['error'] = str(e)
        _run.info['traceback'] = traceback.format_exc()
        raise
```

Sacred will mark the run as failed, but you’ll have debugging info.

Integration with Other Tools

Sacred plays nicely with the ML ecosystem. Here are combinations I use constantly:

Sacred + PyTorch Lightning:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback

class SacredCallback(Callback):
    def __init__(self, sacred_run):
        self.run = sacred_run

    def on_validation_end(self, trainer, pl_module):
        metrics = trainer.callback_metrics
        for name, value in metrics.items():
            # callback_metrics values are tensors, so convert before logging
            self.run.log_scalar(name, float(value), trainer.current_epoch)

@ex.automain
def train(_run):
    model = LitModel()
    trainer = Trainer(callbacks=[SacredCallback(_run)])
    trainer.fit(model)
```

Sacred + Weights & Biases (yes, you can use both):

```python
import wandb

@ex.automain
def train(_run):
    # Initialize W&B with Sacred config
    wandb.init(project="my_project", config=_run.config)

    # Train and log to both
    for epoch in range(epochs):
        loss = train_epoch()
        _run.log_scalar("loss", loss, epoch)  # Sacred
        wandb.log({"loss": loss})             # W&B
```

Why use both? Sacred gives you airtight reproducibility and local storage. W&B gives you polished dashboards and collaboration features. Best of both worlds.

Grid Search and Hyperparameter Optimization

Sacred doesn’t have built-in grid search (that’s not its job), but it makes running sweeps trivial:

```python
# sweep.py
import itertools
import subprocess

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]
hidden_sizes = [128, 256, 512]

for lr, bs, hs in itertools.product(learning_rates, batch_sizes, hidden_sizes):
    cmd = f"python train.py with learning_rate={lr} batch_size={bs} hidden_size={hs}"
    subprocess.run(cmd, shell=True)
```

Run this and Sacred logs every combination automatically. Then use Omniboard or MongoDB queries to find the best config:

```python
best_run = db.runs.find_one(sort=[('result.val_accuracy', -1)])
print(f"Best config: {best_run['config']}")
```

For fancier hyperparameter optimization, integrate with Optuna:

```python
import optuna

def objective(trial):
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])

    # Run Sacred experiment
    run = ex.run(config_updates={
        'learning_rate': learning_rate,
        'batch_size': batch_size
    })

    return run.result['val_accuracy']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
```

Now you’ve got both Optuna’s smart search and Sacred’s comprehensive logging. Pretty slick.

Distributed Experiments: Running on Clusters

Sacred works beautifully on computing clusters. Here’s a SLURM example:

```bash
#!/bin/bash
#SBATCH --job-name=sacred_exp
#SBATCH --gres=gpu:1
#SBATCH --array=0-9

# Each array task runs a different config
python train.py with \
    learning_rate=0.$((SLURM_ARRAY_TASK_ID + 1)) \
    seed=$SLURM_ARRAY_TASK_ID
```

Submit with sbatch run_experiments.sh and all 10 runs log to the same MongoDB instance. You can monitor them in real-time with Omniboard.

FYI, this is how I run all my ablation studies now — fire off 50 jobs, grab coffee, come back to organized results.
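Rather than deriving hyperparameters arithmetically from the task ID, I often map the index into an explicit grid. A hedged sketch (the grid values here are made up) that each array task could run before launching training:

```python
import itertools
import os

# Hypothetical sweep grid; each SLURM array task picks one combination
learning_rates = [0.001, 0.01]
seeds = [0, 1, 2]
grid = list(itertools.product(learning_rates, seeds))

# SLURM sets SLURM_ARRAY_TASK_ID for each job in the array
task_id = int(os.environ.get('SLURM_ARRAY_TASK_ID', 0))
lr, seed = grid[task_id]

# This is the command the array task would execute
print(f"python train.py with learning_rate={lr} seed={seed}")
```

With `--array=0-5` covering the six combinations, every grid point gets exactly one job, and the config stored by Sacred tells you which point it was.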

Reproducing Experiments: The Whole Point

Reproducing an experiment is Sacred’s killer feature. Every run gets assigned an ID, and its complete config, seed included, is stored alongside it. To reproduce run 42, pull that stored config back out of MongoDB and re-run with it:

```python
from pymongo import MongoClient

db = MongoClient('localhost', 27017).my_experiments
stored = db.runs.find_one({'_id': 42})

# Re-run with the exact config (seed included) from run 42
ex.run(config_updates=stored['config'])
```

Because Sacred recorded the exact config, random seed, and code version for that run, the re-run matches the original. If you’ve been storing artifacts, you can even resume from a checkpoint.

For paper submissions, I create a “reproduce” script:

```python
# reproduce_paper_results.py
from pymongo import MongoClient

db = MongoClient('localhost', 27017).my_experiments
experiment_ids = [123, 124, 125]  # Paper's main results

for exp_id in experiment_ids:
    stored = db.runs.find_one({'_id': exp_id})
    ex.run(config_updates=stored['config'])
    print(f"Reproduced experiment {exp_id}")
```

Reviewers love this. You can literally give them a single command to reproduce every result in your paper.

Common Gotchas (Learn From My Pain)

MongoDB connection issues: If Sacred can’t reach MongoDB, runs can fail to record, or get dumped to disk instead if you configured the observer’s failure directory. Always check the console output for observer warnings.

Disk space: Artifacts pile up fast, especially model checkpoints. Set up a cleanup policy:

```python
import gridfs
from datetime import datetime, timedelta

# MongoObserver stores artifacts in GridFS; delete ones older than 30 days
fs = gridfs.GridFS(db)
cutoff = datetime.now() - timedelta(days=30)

for run in db.runs.find({'start_time': {'$lt': cutoff}}):
    # Remove artifacts but keep the run's metadata
    for artifact in run.get('artifacts', []):
        fs.delete(artifact['file_id'])
```

Large configs: Sacred logs your entire config to MongoDB. If you’re storing huge objects (like entire datasets), use references instead:

```python
@ex.config
def cfg():
    data_path = '/data/imagenet'  # Store the path, not the data
    model_config_path = 'configs/resnet50.yaml'  # Store the path, not the config
```

Nested config updates: Command-line updates reach nested values via dotted paths (e.g. with model.dropout=0.3), but for deeply nested or complex hierarchies a config file is usually easier to manage.

When Sacred Isn’t the Right Choice

Real talk: Sacred isn’t perfect for everything. Skip it if:

  • You’re doing quick prototyping and don’t care about reproducibility yet
  • Your “experiments” are one-off scripts that’ll never run again
  • You need real-time collaboration features (use W&B or MLflow instead)
  • You’re already invested in another experiment tracking system

Sacred shines for research projects where reproducibility matters and you’re running hundreds of experiments. For production ML pipelines, you probably want something with more deployment features.

Wrapping Up

Sacred transformed how I do ML research. What used to be chaotic experimentation — scattered notebooks, forgotten hyperparameters, irreproducible results — became systematic and organized. The PhD student I mentioned? She finished her dissertation with 200+ perfectly documented experiments, any of which she could reproduce in seconds.

The learning curve is gentle, the overhead is minimal, and the payoff is huge. Next time you get great results and think “I should probably write these settings down somewhere,” just use Sacred instead. Future you will send a thank-you note.

Now go forth and track those experiments. Your research deserves better than “final_FINAL_v3.ipynb”.

