Weights & Biases (wandb) Tutorial: Track ML Experiments Like a Pro

You’ve just finished training a model. It achieved 89% accuracy — your best result yet. Two weeks later, you need to reproduce it. You stare at your terminal history trying to remember what hyperparameters you used. You check your notebooks, but there are 47 files named “experiment_final_v2_REAL_final.ipynb”. You have zero idea what learning rate, batch size, or architecture produced that 89%. Your experiments are a black hole where results go to be forgotten forever.

I lived this nightmare for a year before discovering Weights & Biases. Now every experiment is logged, visualized, and reproducible. I can compare 50 runs instantly, see exactly what hyperparameters worked, and share results with teammates effortlessly. W&B transformed my experiment tracking from “hope I wrote it down somewhere” to actual professional engineering. Let me show you how to stop losing track of your experiments.

Weights & Biases (wandb) Tutorial

What Is Weights & Biases (wandb)?

Weights & Biases is an experiment tracking platform for machine learning. It logs metrics, hyperparameters, system stats, model artifacts, and more — automatically visualizing everything in a web dashboard.

What W&B tracks:

  • Training/validation metrics
  • Hyperparameters
  • System metrics (GPU/CPU/memory)
  • Model checkpoints
  • Code versions
  • Dataset versions
  • Visualizations (images, plots, etc.)

What problems it solves:

  • “What hyperparameters did I use?”
  • “Why is this experiment different from last week’s?”
  • “How do I compare 20 training runs?”
  • “Can I reproduce this result?”
  • “How do I share results with my team?”

Think of W&B as git for ML experiments — version control for training runs instead of code.

Installation and Setup (Actually Easy)

Getting started takes 2 minutes:

bash

# Install
pip install wandb
# Login (creates account if needed)
wandb login

The login opens a browser, you authenticate, and you’re done. The free tier is generous (100GB storage, unlimited runs), and honestly sufficient for most individual use.
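On a headless box or in CI, where the browser flow isn't an option, wandb also picks up an API key from the environment, so no interactive prompt is needed. A quick sketch (the key value is a placeholder — grab your own from the authorize page on wandb.ai):

```shell
# Non-interactive auth: wandb reads WANDB_API_KEY automatically,
# so the interactive `wandb login` prompt never appears.
# (the value below is a placeholder — use your own key)
export WANDB_API_KEY="your-api-key-here"
```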

Your First Experiment (Stupidly Simple)

Here’s basic tracking:

python

import wandb
import random

# Initialize run
wandb.init(project="my-project", name="experiment-1")

# Train loop (simplified)
for epoch in range(10):
    # Fake training metrics
    train_loss = random.random() * (0.9 ** epoch)
    train_acc = random.random() * 0.5 + 0.5

    # Log metrics
    wandb.log({
        "train_loss": train_loss,
        "train_accuracy": train_acc,
        "epoch": epoch
    })

# Finish run
wandb.finish()

Run this, check the URL printed in terminal, and you’ll see beautiful charts tracking your metrics over time. That’s it. You’re logging experiments.

Real PyTorch Training Example

Let’s track an actual training loop:

python

import torch
import torch.nn as nn
import torch.optim as optim
import wandb

# Initialize wandb
wandb.init(
    project="image-classification",
    name="resnet18-cifar10",
    config={
        "learning_rate": 0.001,
        "architecture": "ResNet18",
        "dataset": "CIFAR-10",
        "epochs": 10,
        "batch_size": 32
    }
)

# Access config
config = wandb.config

# Create model (create_model, train_loader, and val_loader are assumed
# to be defined elsewhere in your project)
model = create_model(config.architecture)
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(config.epochs):
    model.train()
    train_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        # Forward pass
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)

        # Backward pass
        loss.backward()
        optimizer.step()

        # Track accuracy
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()
        train_loss += loss.item()

        # Log batch metrics (optional, can be noisy)
        if batch_idx % 100 == 0:
            wandb.log({
                "batch_loss": loss.item(),
                "batch_idx": batch_idx + epoch * len(train_loader)
            })

    # Validation
    model.eval()
    val_loss = 0
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            loss = criterion(output, target)

            val_loss += loss.item()
            _, predicted = output.max(1)
            val_total += target.size(0)
            val_correct += predicted.eq(target).sum().item()

    # Log epoch metrics
    train_acc = 100. * correct / total
    val_acc = 100. * val_correct / val_total

    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss / len(train_loader),
        "train_accuracy": train_acc,
        "val_loss": val_loss / len(val_loader),
        "val_accuracy": val_acc,
        "learning_rate": optimizer.param_groups[0]['lr']
    })

    print(f'Epoch: {epoch}, Train Acc: {train_acc:.2f}%, Val Acc: {val_acc:.2f}%')

# Save model
torch.save(model.state_dict(), 'model.pth')

# Log model artifact
wandb.save('model.pth')

wandb.finish()

This logs:

  • All hyperparameters
  • Training/validation metrics each epoch
  • Model checkpoint
  • System metrics automatically

Check the W&B dashboard and you’ll see real-time charts updating as training progresses.

Advanced Logging: Images, Histograms, and More

W&B logs more than just scalars:

Logging Images

python

import wandb
import numpy as np
import matplotlib.pyplot as plt
# Log single image
image = wandb.Image(image_array, caption="Sample prediction")
wandb.log({"example": image})
# Log multiple images
images = [wandb.Image(img, caption=f"Image {i}") for i, img in enumerate(image_list)]
wandb.log({"examples": images})
# Log images with predictions
def log_predictions(images, labels, predictions):
    table = wandb.Table(columns=["image", "true_label", "prediction"])
    for img, label, pred in zip(images, labels, predictions):
        table.add_data(wandb.Image(img), label, pred)
    wandb.log({"predictions": table})

log_predictions(test_images[:10], test_labels[:10], model_predictions[:10])

Perfect for visualizing model predictions, data samples, or failure cases.

Logging Histograms

python

# Log weight distributions
for name, param in model.named_parameters():
    wandb.log({f"{name}_histogram": wandb.Histogram(param.data.cpu())})

# Log gradient norms
for name, param in model.named_parameters():
    if param.grad is not None:
        wandb.log({f"{name}_grad_norm": param.grad.norm().item()})

Helps debug training issues — see if gradients are exploding/vanishing, weights are updating properly, etc.



Logging Custom Plots

python

# Create matplotlib figure
fig, ax = plt.subplots()
ax.plot(epochs, train_losses, label='Train')
ax.plot(epochs, val_losses, label='Val')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.legend()
# Log to wandb
wandb.log({"loss_curves": wandb.Image(fig)})
plt.close(fig)

Logging Confusion Matrix

python

from sklearn.metrics import confusion_matrix
import wandb
# Get predictions
y_true = ...
y_pred = ...
# Log confusion matrix
wandb.log({
    "confusion_matrix": wandb.plot.confusion_matrix(
        probs=None,
        y_true=y_true,
        preds=y_pred,
        class_names=class_names
    )
})

Hyperparameter Sweeps (Game-Changer)

W&B makes hyperparameter sweeps almost too easy:

Define Sweep Config

python

# sweep.yaml or in Python
sweep_config = {
    'method': 'bayes',  # or 'grid', 'random'
    'metric': {
        'name': 'val_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'adamw']
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")

Training Function for Sweep

python

def train():
    # Initialize run (wandb.init is called by the sweep agent)
    run = wandb.init()

    # Get hyperparameters from sweep
    config = wandb.config

    # Build model with sweep hyperparameters
    model = create_model(config.dropout)

    if config.optimizer == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
    elif config.optimizer == 'sgd':
        optimizer = optim.SGD(model.parameters(), lr=config.learning_rate, momentum=0.9)
    else:
        optimizer = optim.AdamW(model.parameters(), lr=config.learning_rate)

    # Training loop (same as before)
    for epoch in range(10):
        train_loss, train_acc = train_epoch(model, optimizer, train_loader, config.batch_size)
        val_loss, val_acc = validate(model, val_loader)

        wandb.log({
            'train_loss': train_loss,
            'train_accuracy': train_acc,
            'val_loss': val_loss,
            'val_accuracy': val_acc,
            'epoch': epoch
        })

# Run sweep
wandb.agent(sweep_id, train, count=50)  # Run 50 experiments

W&B automatically:

  • Runs 50 different hyperparameter combinations
  • Uses Bayesian optimization to find best settings
  • Visualizes all results
  • Identifies optimal hyperparameters

I’ve replaced entire weeks of manual hyperparameter tuning with W&B sweeps. It’s legitimately one of the best features.

Organizing Experiments: Projects, Groups, Tags

Keep experiments organized as they grow:

python

# Projects: Top-level organization
wandb.init(project="image-classification")

# Groups: Related runs (e.g., same architecture)
wandb.init(
    project="image-classification",
    group="resnet-experiments"
)

# Tags: Flexible categorization
wandb.init(
    project="image-classification",
    group="resnet-experiments",
    tags=["baseline", "augmentation", "lr-sweep"]
)

# Notes: Describe the run
wandb.init(
    project="image-classification",
    notes="Testing new augmentation strategy"
)

This structure makes finding specific experiments later actually possible.

Comparing Runs (Finally Useful)

The dashboard makes comparing runs visual:

  • Select multiple runs in UI
  • View parallel coordinates plot
  • Compare metrics side-by-side
  • Filter by hyperparameters
  • Identify what changed between runs

No more manually maintained spreadsheets or scattered Jupyter notebooks. Everything's visual and interactive.
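If you'd rather compare runs in code than in the UI, the public API (`wandb.Api`) can pull finished runs programmatically. Here's a minimal sketch — the project path `my-entity/image-classification` and the `val_accuracy` metric name are placeholders, and the live fetch requires a logged-in session:

```python
def best_runs(runs, metric="val_accuracy", top_k=3):
    """Sort runs by a summary metric, best first; runs missing it are skipped."""
    scored = [r for r in runs if r.summary.get(metric) is not None]
    return sorted(scored, key=lambda r: r.summary[metric], reverse=True)[:top_k]

def compare_project(path="my-entity/image-classification"):
    import wandb  # imported here so the helper above works without a W&B login
    api = wandb.Api()
    return best_runs(api.runs(path))
```

Each returned run exposes `.name`, `.config`, and `.summary`, so printing `run.name`, `run.config.get("learning_rate")`, and `run.summary["val_accuracy"]` gives you a quick leaderboard in the terminal.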

Integrations with Popular Frameworks

W&B integrates with everything:

PyTorch Lightning

python

from pytorch_lightning.loggers import WandbLogger
import pytorch_lightning as pl
wandb_logger = WandbLogger(project="lightning-runs")
trainer = pl.Trainer(logger=wandb_logger, max_epochs=10)
trainer.fit(model)

Automatic logging of everything Lightning tracks.

Keras/TensorFlow

python

from wandb.keras import WandbCallback

model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    callbacks=[WandbCallback()]
)

Hugging Face Transformers

python

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',
    report_to='wandb',  # Enable wandb logging
    run_name='bert-finetuning'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()

Scikit-learn

python

from sklearn.ensemble import RandomForestClassifier
import wandb

wandb.init(project="sklearn-models")

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Log metrics
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

wandb.log({
    "train_accuracy": train_score,
    "test_accuracy": test_score
})

# Log classifier plots (class_names is your list of label names)
y_pred = model.predict(X_test)
y_probas = model.predict_proba(X_test)
wandb.sklearn.plot_classifier(model, X_train, X_test, y_train, y_test, y_pred, y_probas, labels=class_names)

Artifact Tracking (Models, Datasets, etc.)

Track versions of models and datasets:

python

# Log model artifact
run = wandb.init(project="artifacts-demo")
# Save model
model.save('model.h5')
# Log as artifact
artifact = wandb.Artifact('model', type='model')
artifact.add_file('model.h5')
run.log_artifact(artifact)
# Later: Download artifact
run = wandb.init(project="artifacts-demo")
artifact = run.use_artifact('model:latest')
artifact_dir = artifact.download()

Track dataset versions:

python

# Log dataset
artifact = wandb.Artifact('cifar10', type='dataset')
artifact.add_dir('data/cifar10')
run.log_artifact(artifact)
# Use dataset
artifact = run.use_artifact('cifar10:v0')
artifact_dir = artifact.download()

This creates lineage: which model was trained on which dataset version with which code version.

Common Patterns and Best Practices

Pattern 1: Resume Training from Checkpoint

python

# Save run ID when starting
run = wandb.init(project="my-project")
run_id = run.id

# Later: Resume
run = wandb.init(
    project="my-project",
    id=run_id,
    resume="must"  # Resume existing run
)

Pattern 2: Log System Metrics

python

# Automatic system metrics
wandb.init(
    project="my-project",
    config=config,
    settings=wandb.Settings(
        _disable_stats=False  # Enable system metrics (default)
    )
)

W&B automatically tracks GPU/CPU usage, memory, network, etc.

Pattern 3: Conditional Logging

python

# Log less frequently for large datasets
if batch_idx % 100 == 0:  # Every 100 batches
    wandb.log({"batch_loss": loss.item()})

# Always log epoch metrics
wandb.log({"epoch_loss": epoch_loss}, step=epoch)

Reduces logging overhead while keeping important metrics.

Pattern 4: Offline Mode

python

# Work offline (syncs later)
import os
os.environ["WANDB_MODE"] = "offline"
wandb.init(project="my-project")
# Train normally
wandb.finish()
# Later: sync offline runs
# wandb sync <run_directory>

Perfect for running on compute without internet.

Common Mistakes to Avoid

Learn from these W&B failures:

Mistake 1: Not Calling wandb.finish()

python

# Bad - run stays active
wandb.init()
train()
# Script ends without wandb.finish()
# Good - explicitly finish
wandb.init()
train()
wandb.finish()

Always call wandb.finish() or use a context manager. IMO, unfinished runs are annoying to clean up.

Mistake 2: Logging Too Much

python

# Bad - log every batch (slow, noisy)
for batch in train_loader:
    loss = train_step(batch)
    wandb.log({"loss": loss})  # Every single batch!

# Good - log periodically
for i, batch in enumerate(train_loader):
    loss = train_step(batch)
    if i % 100 == 0:
        wandb.log({"loss": loss})

Logging every batch creates massive overhead and noisy charts.

Mistake 3: Not Logging Config

python

# Bad - no hyperparameter tracking
wandb.init(project="my-project")

# Good - log all hyperparameters
wandb.init(
    project="my-project",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "architecture": "ResNet18"
    }
)

Config is crucial for reproducing results.

Mistake 4: Not Using Descriptive Names

python

# Bad - meaningless names
wandb.init(project="project1", name="run1")

# Good - descriptive names
wandb.init(
    project="image-classification",
    name="resnet18-lr001-batch32",
    tags=["baseline", "no-augmentation"]
)

Future you will thank present you for descriptive names. FYI, I’ve wasted hours finding specific runs with bad names. :/

Free vs. Paid: What You Actually Need

Free tier includes:

  • Unlimited runs
  • 100GB storage
  • Public/private projects
  • All visualization features
  • Basic collaboration

Paid tiers add:

  • More storage
  • Team features
  • Advanced security
  • Priority support
  • Custom deployment

For individuals and small teams, free tier is completely sufficient. I’ve used free tier for years without hitting limits.

The Bottom Line

Weights & Biases transforms experiment tracking from “I think I used these hyperparameters” to “here’s the exact configuration that produced that result.” It’s not just logging — it’s reproducibility, comparison, and collaboration made trivial.

Use W&B when:

  • Running multiple experiments
  • Need to reproduce results
  • Comparing different approaches
  • Working in a team
  • Doing hyperparameter sweeps

Skip W&B when:

  • Running one-off scripts
  • Learning ML basics (focus on fundamentals first)
  • Offline-only requirements (though offline mode exists)

For any serious ML work, W&B should be in your stack from day one. The time saved finding old experiments, comparing runs, and reproducing results pays for itself immediately.

Installation:

bash

pip install wandb
wandb login

Stop tracking experiments in your head, spreadsheets, or scattered Jupyter notebooks. Start using W&B. Your experiments will be reproducible, comparable, and actually useful for learning what works. The difference between "I can't remember what I did" and "here's every detail of every experiment" is the difference between amateur experimentation and professional ML engineering.

Now go log some experiments. Your future self will thank you when you can actually reproduce that great result you got last month. :)
