How to Use Matplotlib Subplots for ML Model Comparison Visualizations
You know that moment when you’ve trained five different models and you’re trying to compare their performance by flipping between separate plots like some kind of data science maniac? Yeah, I’ve been there. It’s messy, it’s inefficient, and honestly, it makes you look unprofessional when presenting results.
Matplotlib subplots are the answer you didn’t know you needed. They let you arrange multiple visualizations side-by-side, making model comparisons actually meaningful instead of a memory test. Once you master subplots, your analysis notebooks go from amateur hour to publication-ready faster than you can say “overfitting.”
Let me walk you through everything you need to know about using subplots specifically for ML model comparison. No fluff, just the practical stuff that actually matters.
Why Subplots Matter for Model Comparison
Here’s the thing: your brain is terrible at remembering visual details between separate plots. You look at Model A’s confusion matrix, then Model B’s confusion matrix, and by the time you’re on Model C, you’ve already forgotten what Model A looked like.
Subplots solve this by putting everything in one view. You can see patterns instantly. Is one model consistently better across all metrics? Are there trade-offs between precision and recall? You’ll spot these trends immediately when visualizations are adjacent.
Plus, let’s be honest — stakeholders have the attention span of a goldfish. One comprehensive figure beats ten separate plots every single time. They want the story at a glance, not a slideshow.
The Basics: Creating Your First Subplot Grid
Starting simple here because complexity comes later. The most common way to create subplots uses plt.subplots():
python
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
This creates a 2x2 grid of subplots. The fig object controls the overall figure, while axes is an array containing individual subplot axes. Think of it like a container (fig) holding multiple canvases (axes).
The figsize parameter is crucial—don't skip it. Default sizes are tiny and useless for presentations. I typically use (12, 10) for a 2x2 grid and scale proportionally for other layouts.
You access individual subplots using array indexing: axes[0, 0] for top-left, axes[0, 1] for top-right, and so on. Each subplot behaves like its own independent plot.
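To make that concrete, here's a minimal sketch (with throwaway curves, not real model output) showing how each index maps to a position in the grid:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

x = np.linspace(0, 1, 50)
axes[0, 0].plot(x, x)           # top-left
axes[0, 0].set_title('Linear')
axes[0, 1].plot(x, x ** 2)      # top-right
axes[0, 1].set_title('Quadratic')
axes[1, 0].plot(x, np.sqrt(x))  # bottom-left
axes[1, 0].set_title('Square root')
axes[1, 1].plot(x, -x)          # bottom-right
axes[1, 1].set_title('Negative')

plt.tight_layout()
```

Each axes object has the full plotting API, so anything you'd call on plt works on an individual subplot.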
Comparing Model Performance Metrics Side-by-Side
Let’s get practical. You’ve trained multiple models and want to compare their accuracy, precision, recall, and F1 scores. Here’s how you visualize that effectively:
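Here's a sketch with made-up scores for three hypothetical models (swap in your own evaluation results), giving each metric its own subplot:

```python
import matplotlib.pyplot as plt

# Illustrative scores only; substitute your real evaluation results
results = {
    'Logistic Regression': [0.86, 0.84, 0.81, 0.82],
    'Random Forest':       [0.91, 0.90, 0.88, 0.89],
    'XGBoost':             [0.94, 0.93, 0.92, 0.92],
}
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, metric in zip(axes.flatten(), metrics):
    idx = metrics.index(metric)
    names = list(results)
    scores = [results[name][idx] for name in names]
    ax.bar(names, scores, color=['#95a5a6', '#2ecc71', '#3498db'])
    ax.set_title(metric, fontsize=13, fontweight='bold')
    ax.set_ylim(0, 1)  # consistent scale across all four subplots
    ax.tick_params(axis='x', rotation=15)

plt.tight_layout()
plt.show()
```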
See how quickly you can spot that XGBoost dominates across all metrics? That’s the power of subplots. FYI, I always use tight_layout()—it prevents overlapping labels and makes everything look clean.
Confusion Matrix Comparison: The Real MVP
Confusion matrices are essential for classification problems, and comparing them across models reveals tons of insights. Which model handles false positives better? Let’s find out:
python
from sklearn.metrics import confusion_matrix
import seaborn as sns
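A minimal sketch using hypothetical labels and predictions; the point is the side-by-side layout and the shared color scale:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and per-model predictions on the same test set
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
predictions = {
    'Random Forest': np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1]),
    'XGBoost':       np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0]),
}

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
vmax = len(y_true)  # shared color scale so intensities are comparable
for ax, (name, y_pred) in zip(axes, predictions.items()):
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                vmin=0, vmax=vmax, cbar=False, ax=ax)
    ax.set_title(name, fontsize=13, fontweight='bold')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.tight_layout()
plt.show()
```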
Pro tip: Use consistent color scales across all confusion matrices. If you don’t, your brain gets confused trying to compare different color intensities. Keep it simple, keep it consistent.
I’ve caught so many subtle model behaviors this way — like one model being overly conservative with positive predictions while another is trigger-happy. You won’t notice these patterns looking at metrics alone.
ROC Curves: All Models, One Plot
ROC curves are interesting because sometimes you want them on separate subplots, sometimes together. For model comparison, I usually prefer one subplot with all curves overlaid:
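Here's a sketch on synthetic labels and scores; the AUC values describe the toy data, not any real model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Synthetic labels and predicted probabilities for two hypothetical models
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
model_scores = {
    'Logistic Regression': np.clip(y_true * 0.4 + rng.random(200) * 0.6, 0, 1),
    'XGBoost':             np.clip(y_true * 0.6 + rng.random(200) * 0.4, 0, 1),
}

fig, ax = plt.subplots(figsize=(8, 6))
for name, scores in model_scores.items():
    fpr, tpr, _ = roc_curve(y_true, scores)
    ax.plot(fpr, tpr, label=f'{name} (AUC = {auc(fpr, tpr):.2f})')

ax.plot([0, 1], [0, 1], 'k--', label='Chance')  # random-guess baseline
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves: All Models', fontweight='bold')
ax.legend(loc='lower right')
plt.tight_layout()
plt.show()
```

Putting the AUC in each legend label means the figure answers "which model?" without a separate table.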
Overlaying ROC curves makes it painfully obvious which model performs best. If curves don’t cross, the choice is clear. If they do cross, you’ve got interesting trade-offs to discuss with stakeholders.
Feature Importance Comparison: Who Uses What?
Ever wondered if different models value the same features? Let’s visualize that:
python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder setup; substitute your trained models' feature_importances_
rng = np.random.default_rng(0)
feature_names = [f'feature_{i}' for i in range(20)]
models_importance = [('Random Forest', rng.random(20)),
                     ('XGBoost', rng.random(20))]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for idx, (model_name, importances) in enumerate(models_importance):
    # Get top 10 features
    indices = np.argsort(importances)[-10:]
    axes[idx].barh(range(len(indices)), importances[indices], color='#3498db')
    axes[idx].set_yticks(range(len(indices)))
    axes[idx].set_yticklabels([feature_names[i] for i in indices])
    axes[idx].set_xlabel('Importance', fontsize=11)
    axes[idx].set_title(model_name, fontsize=13, fontweight='bold')
    axes[idx].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()
I’ve caught models focusing on completely different features more times than I can count. Sometimes it reveals data leakage, sometimes it’s just fascinating model behavior. Either way, you need to visualize this.
Learning Curves: Training vs Validation Performance
Want to diagnose overfitting across models at a glance? Learning curves in subplots make it obvious:
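A runnable sketch using scikit-learn's learning_curve on synthetic data, with an unpruned decision tree standing in as the overfitting-prone model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),  # overfitting-prone
}

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    ax.plot(sizes, train_scores.mean(axis=1), 'o-', label='Training')
    ax.plot(sizes, val_scores.mean(axis=1), 'o-', label='Validation')
    ax.set_title(name, fontsize=13, fontweight='bold')
    ax.set_xlabel('Training set size')
    ax.legend()
axes[0].set_ylabel('Accuracy')

plt.tight_layout()
plt.show()
```

The sharey=True argument forces a common y-axis, so the size of the train/validation gap is directly comparable between panels.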
That massive gap between training and validation scores? That’s overfitting, my friend. Learning curves expose this brutally, and having them side-by-side shows you which models generalize well and which ones are memorizing training data.
Residual Plots for Regression Models
For regression problems, residual plots are your diagnostic tool. Are errors randomly distributed or is there a pattern? Let’s check across models:
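Here's a sketch with simulated predictions: one model with purely random errors, and a deliberately biased one whose residuals trace a pattern:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated predictions from two hypothetical regression models
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 100, 300)
predictions = {
    'Linear Regression': y_true + rng.normal(0, 5, 300),             # random errors
    'Biased Model':      y_true * 0.8 + 10 + rng.normal(0, 5, 300),  # patterned errors
}

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
for ax, (name, y_pred) in zip(axes, predictions.items()):
    residuals = y_true - y_pred
    ax.scatter(y_pred, residuals, alpha=0.5, s=15)
    ax.axhline(0, color='red', linestyle='--')  # zero-error reference line
    ax.set_title(name, fontsize=13, fontweight='bold')
    ax.set_xlabel('Predicted value')
axes[0].set_ylabel('Residual')

plt.tight_layout()
plt.show()
```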
Random scatter around zero? Good. Funnel shape or curved pattern? Houston, we have a problem. IMO, residual plots are criminally underused in model evaluation, and subplots make them actually practical to review.
Hyperparameter Tuning Visualization
When you’re tuning hyperparameters, you want to see how different values affect performance across multiple metrics. Subplots make this comparison meaningful:
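Here's a sketch with illustrative (made-up) tuning results for a max_depth-style hyperparameter, one metric per subplot:

```python
import matplotlib.pyplot as plt

# Illustrative tuning results; substitute your own grid-search scores
max_depths = [2, 4, 6, 8, 10, 12]
scores = {
    'Accuracy':  [0.80, 0.86, 0.90, 0.91, 0.90, 0.89],
    'Precision': [0.78, 0.85, 0.91, 0.90, 0.88, 0.86],
    'Recall':    [0.75, 0.82, 0.86, 0.89, 0.90, 0.90],
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True)
for ax, (metric, values) in zip(axes, scores.items()):
    ax.plot(max_depths, values, 'o-')
    best = max_depths[values.index(max(values))]
    ax.axvline(best, color='red', linestyle='--', alpha=0.5)  # best value per metric
    ax.set_title(metric, fontsize=13, fontweight='bold')
    ax.set_xlabel('max_depth')
    ax.grid(alpha=0.3)
axes[0].set_ylabel('Score')

plt.tight_layout()
plt.show()
```

In this toy data, accuracy and precision peak at different depths than recall, which is exactly the kind of trade-off the layout is meant to surface.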
This shows you the sweet spot for your hyperparameter across different metrics simultaneously. Sometimes the optimal value differs between metrics — subplots reveal these trade-offs instantly.
Custom Layouts: Beyond Simple Grids
Not every comparison needs a perfect grid. Sometimes you want one large plot with smaller supporting plots. GridSpec is your friend here:
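A minimal GridSpec sketch: one wide plot spanning the top row, three supporting plots below. The data and titles are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(12, 8))
gs = GridSpec(2, 3, figure=fig)

# One large plot spanning the full first row...
ax_main = fig.add_subplot(gs[0, :])
x = np.linspace(0, 10, 100)
ax_main.plot(x, np.sin(x))
ax_main.set_title('Main comparison', fontweight='bold')

# ...with three smaller supporting plots underneath
for col in range(3):
    ax = fig.add_subplot(gs[1, col])
    ax.set_title(f'Detail {col + 1}')

plt.tight_layout()
plt.show()
```

Slicing the GridSpec (gs[0, :]) is what lets a single axes span multiple grid cells.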
These small tweaks transform amateur-looking plots into professional visualizations. Trust me, stakeholders notice this stuff even if they don’t consciously realize it.
Common Mistakes (That I’ve Definitely Made)
Let me save you some headaches:
Mistake 1: Forgetting to flatten axes arrays. When you create a single row or column of subplots, axes is 1D, not 2D. This breaks your indexing logic. Use axes.flatten() for consistency.
Mistake 2: Inconsistent scales. If you’re comparing models, use the same y-axis limits across subplots. Different scales make visual comparison meaningless.
Mistake 3: Too many subplots. More than 6–8 subplots in one figure gets overwhelming. Split into multiple figures if needed.
Mistake 4: Tiny fonts. Default font sizes are too small for presentations. Bump everything up by at least 2 points.
Mistake 5: Not saving high-res images. Use plt.savefig('plot.png', dpi=300, bbox_inches='tight') for publication-quality output.
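To tie a few of those fixes together, here's a small template sketch: a flattened 1D axes array, consistent y-limits, bumped-up fonts, and a high-res save:

```python
import matplotlib.pyplot as plt

# Single row: axes comes back as a 1D array, so flatten() keeps indexing uniform
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax in axes.flatten():
    ax.set_ylim(0, 1)             # same scale on every subplot
    ax.tick_params(labelsize=12)  # readable ticks for presentations
    ax.set_title('Model', fontsize=14)

plt.tight_layout()
plt.savefig('plot.png', dpi=300, bbox_inches='tight')  # publication-quality output
```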
I’ve committed every single one of these crimes against data visualization :/
Putting It All Together: Complete Comparison Dashboard
python
# Assumes fig and its subplots have been populated with the comparisons above
plt.suptitle('Complete ML Model Comparison Dashboard', fontsize=18, fontweight='bold', y=0.995)
plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
This is what good model comparison looks like. Everything relevant in one comprehensive view. No clicking through tabs, no mental gymnastics remembering what the previous plot showed. Just pure, visual insight.
Final Thoughts: Make Subplots Your Default
Stop making separate plots for model comparison. Seriously, just stop. Subplots aren’t harder — they’re actually easier once you learn the syntax, and they make your analysis infinitely more useful.
Your stakeholders will understand your results faster. Your teammates will appreciate the clarity. And honestly, you’ll catch insights you’d miss with separate visualizations.
Start simple with basic grids, then gradually experiment with custom layouts and styling. The investment pays off immediately. Every ML practitioner should have subplot templates ready to go — it’s that fundamental.
Now go make some beautiful model comparisons and impress everyone with your visualization game! :)