TorchServe Tutorial: Deploy PyTorch Models in Production

Your PyTorch model works perfectly in your notebook. Then someone asks you to put it in production. You Google “deploy PyTorch model” and find a dozen different approaches — Flask, FastAPI, custom serving infrastructure, Docker, Kubernetes, cloud services. You try Flask first, write 300 lines of boilerplate for request handling, batching, error handling, and monitoring. Two weeks later, you have a fragile serving system that crashes under load and has zero observability.

I went through this exact pain before discovering TorchServe. It’s PyTorch’s official serving framework, and it handles all the production infrastructure you’d otherwise build yourself — batching, multi-model serving, versioning, metrics, scaling, and GPU management. What took me weeks to build poorly now takes 10 minutes to configure correctly. TorchServe is the difference between “I deployed something that technically works” and “I have production-grade model serving.”

Let me show you how to stop reinventing serving infrastructure and deploy PyTorch models properly.

TorchServe Tutorial

What Is TorchServe and Why It Exists

TorchServe is an official model serving framework from PyTorch and AWS. It provides production-ready model serving without custom infrastructure code.

What TorchServe provides:

  • REST and gRPC APIs automatically
  • Dynamic batching for throughput
  • Multi-model serving on one instance
  • Model versioning and A/B testing
  • Metrics and monitoring (Prometheus)
  • GPU management and optimization
  • Multi-worker scaling

What problems it solves:

  • Writing serving infrastructure from scratch
  • Inefficient request handling
  • Poor GPU utilization
  • No model management system
  • Missing observability
  • Complex scaling logic

Think of TorchServe as “the serving framework you would build if you had six months and a team — but pre-built and tested.”

Installation and Setup

Installing TorchServe is straightforward; note that it needs a Java runtime (JDK 11 or newer) for its frontend:

bash

# Install TorchServe and torch-model-archiver
pip install torchserve torch-model-archiver
# Optional: CPU-only PyTorch wheels for machines without CUDA
pip install torchserve torch-model-archiver --extra-index-url https://download.pytorch.org/whl/cpu

Verify installation:

bash

torchserve --help

That’s it. TorchServe is ready to serve models.

Your First Model Deployment (Simple Example)

Let’s serve a basic image classifier:

Step 1: Save Your Trained Model

python

import torch
import torchvision.models as models
# Load or train your model (pretrained= is deprecated; use weights=)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()
# Save model
torch.save(model.state_dict(), "resnet18.pth")

Step 2: Create Handler (Optional for Standard Models)

TorchServe includes default handlers for common tasks. For custom logic, create a handler:

python

# handler.py
import io
import json

import torch
import torchvision.transforms as transforms
from PIL import Image


class ImageClassifier:
    def __init__(self):
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])
        self.initialized = False

    def initialize(self, context):
        """Initialize model and load state."""
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")

        # Load model
        self.model = torch.jit.load(f"{model_dir}/model.pt")
        self.model.eval()

        # Load class names
        with open(f"{model_dir}/index_to_name.json") as f:
            self.mapping = json.load(f)

        self.initialized = True

    def preprocess(self, data):
        """Preprocess input data."""
        images = []
        for row in data:
            image = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(image))
            image = self.transform(image)
            images.append(image)

        return torch.stack(images)

    def inference(self, data):
        """Run inference."""
        with torch.no_grad():
            outputs = self.model(data)
            probabilities = torch.nn.functional.softmax(outputs, dim=1)
        return probabilities

    def postprocess(self, data):
        """Postprocess predictions."""
        predictions = []
        for output in data:
            top5_prob, top5_idx = torch.topk(output, 5)
            result = {
                self.mapping[str(idx.item())]: prob.item()
                for idx, prob in zip(top5_idx, top5_prob)
            }
            predictions.append(result)
        return predictions


_service = ImageClassifier()


def handle(data, context):
    """Entry point for TorchServe."""
    if not _service.initialized:
        _service.initialize(context)

    if data is None:
        return None

    data = _service.preprocess(data)
    data = _service.inference(data)
    data = _service.postprocess(data)

    return data

Step 3: Create Model Archive

bash

# Create .mar file (model archive)
torch-model-archiver \
    --model-name resnet18 \
    --version 1.0 \
    --serialized-file resnet18.pth \
    --handler image_classifier \
    --export-path model-store \
    --extra-files index_to_name.json

For custom handlers:

bash

torch-model-archiver \
    --model-name resnet18 \
    --version 1.0 \
    --model-file model.py \
    --serialized-file resnet18.pth \
    --handler handler.py \
    --export-path model-store

Step 4: Start TorchServe

bash

# Start TorchServe
torchserve --start \
    --model-store model-store \
    --models resnet18=resnet18.mar \
    --ncs

Step 5: Make Predictions

bash

# Inference API (port 8080)
curl -X POST http://localhost:8080/predictions/resnet18 \
    -T test_image.jpg
# Response:
# {
#   "dog": 0.89,
#   "cat": 0.08,
#   "horse": 0.02,
#   ...
# }

That’s a complete production-ready serving setup in 5 steps.
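The same call works from Python with nothing but the standard library. A minimal client sketch; the helper names `prediction_url` and `predict` are mine, not part of TorchServe, and the host and port match the defaults above:

```python
import urllib.request


def prediction_url(model_name, host="localhost", port=8080, version=None):
    """Build the TorchServe inference endpoint URL for a model."""
    url = f"http://{host}:{port}/predictions/{model_name}"
    if version is not None:
        url += f"/{version}"
    return url


def predict(model_name, image_path, **kwargs):
    """POST raw image bytes to the Inference API; return the JSON body."""
    with open(image_path, "rb") as f:
        payload = f.read()
    req = urllib.request.Request(
        prediction_url(model_name, **kwargs), data=payload, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()


# Example (requires a running server):
# print(predict("resnet18", "test_image.jpg"))
```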


TorchServe Architecture

Understanding the architecture helps with configuration:

APIs:

  • Inference API (port 8080): Model predictions
  • Management API (port 8081): Register/unregister models, scale workers
  • Metrics API (port 8082): Prometheus metrics

Components:

  • Frontend: Handles HTTP/gRPC requests
  • Backend: Manages workers and model loading
  • Workers: Execute model inference
  • Model Store: Directory containing .mar files

Management API: Dynamic Model Management

Register models without restarting:

bash

# Register new model
curl -X POST "http://localhost:8081/models?url=resnet18.mar&initial_workers=2&synchronous=true"
# List models
curl http://localhost:8081/models
# Describe model
curl http://localhost:8081/models/resnet18
# Scale workers
curl -X PUT "http://localhost:8081/models/resnet18?min_worker=4&max_worker=8"
# Unregister model
curl -X DELETE http://localhost:8081/models/resnet18

This enables zero-downtime model updates and dynamic scaling.
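If you script these calls from a deployment pipeline, building the query strings programmatically avoids quoting mistakes. A small sketch; the helper names are mine, and the management address assumes the default port 8081:

```python
import urllib.parse

MANAGEMENT = "http://localhost:8081"


def register_url(mar_file, initial_workers=2, synchronous=True):
    """URL to register a model archive via the Management API (POST)."""
    params = {
        "url": mar_file,
        "initial_workers": initial_workers,
        "synchronous": str(synchronous).lower(),
    }
    return f"{MANAGEMENT}/models?" + urllib.parse.urlencode(params)


def scale_url(model_name, min_worker, max_worker):
    """URL to scale a model's workers via the Management API (PUT)."""
    params = {"min_worker": min_worker, "max_worker": max_worker}
    return f"{MANAGEMENT}/models/{model_name}?" + urllib.parse.urlencode(params)


print(register_url("resnet18.mar"))
print(scale_url("resnet18", 4, 8))
```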

Batching: Maximize Throughput

TorchServe automatically batches requests for better GPU utilization:

properties

# config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# Batching (TorchServe applies batching per model; these values can also
# be passed when registering a model: &batch_size=8&max_batch_delay=100)
batch_size=8
max_batch_delay=100
# Worker configuration
number_of_gpu=1
number_of_netty_threads=4
default_workers_per_model=2

TorchServe collects requests for up to max_batch_delay milliseconds or until reaching batch_size, then processes them together. This dramatically improves throughput.
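The queuing rule is easy to internalize with a toy simulation. This is an illustration of the rule, not TorchServe's actual scheduler; in particular, it only checks the delay when a new request arrives:

```python
def form_batches(arrival_times_ms, batch_size, max_batch_delay_ms):
    """Group request arrival times into batches using the rule:
    dispatch when the batch reaches batch_size, or when more than
    max_batch_delay_ms has elapsed since its first request arrived."""
    batches, current = [], []
    for t in arrival_times_ms:
        if current and t - current[0] > max_batch_delay_ms:
            batches.append(current)  # delay expired: flush partial batch
            current = []
        current.append(t)
        if len(current) == batch_size:
            batches.append(current)  # batch full: dispatch immediately
            current = []
    if current:
        batches.append(current)
    return batches


# Eight requests arriving 30 ms apart, batch_size=4, max_batch_delay=100 ms:
print(form_batches([0, 30, 60, 90, 120, 150, 180, 210], 4, 100))
# → [[0, 30, 60, 90], [120, 150, 180, 210]]
```

With sparse traffic the delay dominates and batches stay small; under load the size limit dominates and the GPU sees full batches.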

Your handler receives batched inputs automatically:

python

def preprocess(self, data):
    """data is already a list of requests forming one batch."""
    batch = []
    for item in data:
        # Per-item preprocessing (decode, transform, etc.)
        processed_item = self.transform_one(item)  # hypothetical helper
        batch.append(processed_item)
    return torch.stack(batch)

Multi-Model Serving

Serve multiple models on one instance:

bash

# Start with multiple models
torchserve --start \
--model-store model-store \
--models \
resnet18=resnet18.mar \
mobilenet=mobilenet.mar \
efficientnet=efficientnet.mar

Each model gets its own endpoint:

bash

curl -X POST http://localhost:8080/predictions/resnet18 -T image.jpg
curl -X POST http://localhost:8080/predictions/mobilenet -T image.jpg
curl -X POST http://localhost:8080/predictions/efficientnet -T image.jpg

Models share infrastructure but run independently. Perfect for serving multiple versions or model variants.

Model Versioning and A/B Testing

Deploy multiple versions simultaneously:

bash

# Register v1.0
curl -X POST "http://localhost:8081/models?url=resnet18_v1.mar&model_name=resnet18&initial_workers=2"
# Register v2.0
curl -X POST "http://localhost:8081/models?url=resnet18_v2.mar&model_name=resnet18&initial_workers=1"
# Unversioned requests go to the model's default version
curl -X POST http://localhost:8080/predictions/resnet18 -T image.jpg
# Or target a specific version explicitly
curl -X POST http://localhost:8080/predictions/resnet18/2.0 -T image.jpg
# Promote v2.0 to default when it proves itself
curl -X PUT http://localhost:8081/models/resnet18/2.0/set-default

This enables:

  • Canary deployments
  • A/B testing
  • Gradual rollouts
  • Blue-green deployments
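TorchServe routes unversioned requests to the default version, so for fine-grained canary control a common pattern is to split traffic client-side or in a gateway. A sketch of deterministic weighted routing; the version names and weights here are made up:

```python
import hashlib
from itertools import accumulate


def route_version(request_id, weights):
    """Deterministically map a request ID to a model version
    according to canary weights (e.g. 90% v1.0, 10% v2.0).
    The same request ID always routes to the same version."""
    versions, w = zip(*weights.items())
    cum = list(accumulate(w))
    # Hash the request ID into [0, total_weight)
    h = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % cum[-1]
    for version, bound in zip(versions, cum):
        if h < bound:
            return version


weights = {"1.0": 90, "2.0": 10}
counts = {}
for i in range(1000):
    v = route_version(f"req-{i}", weights)
    counts[v] = counts.get(v, 0) + 1
print(counts)  # roughly a 90/10 split
```

Hash-based routing keeps a given client on one version for the whole experiment, which matters for A/B test validity.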

Production Configuration

Create config.properties for production settings:

properties

# Inference settings
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# Performance tuning
batch_size=32
max_batch_delay=100
number_of_gpu=2
number_of_netty_threads=8
default_workers_per_model=4
# Model store
model_store=./model-store
load_models=all
# Logging
default_response_timeout=120
enable_metrics_api=true
# CORS (if needed)
cors_allowed_origin=*
cors_allowed_methods=GET,POST,PUT,OPTIONS
cors_allowed_headers=*
# SSL (production)
keystore=keystore.jks
keystore_pass=changeit
keystore_type=JKS
private_key_file=private-key.pem
certificate_file=certificate.pem

Start with config:

bash

torchserve --start --ts-config config.properties

Metrics and Monitoring

TorchServe exposes Prometheus metrics:

bash

# Get metrics
curl http://localhost:8082/metrics
# Example metrics:
# ts_inference_requests_total
# ts_inference_latency_microseconds
# ts_queue_latency_microseconds
# model_prediction_latency
# cpu_utilization
# memory_used
# disk_available

Integrate with Prometheus:

yaml

# prometheus.yml
scrape_configs:
  - job_name: 'torchserve'
    static_configs:
      - targets: ['localhost:8082']

Then visualize in Grafana. Essential for production monitoring.
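When debugging without a full Prometheus setup, you can scrape the metrics endpoint yourself and parse the text exposition format. A minimal parser sketch; the sample metric names mirror those above but the label values are illustrative:

```python
def parse_metrics(text):
    """Parse Prometheus text exposition format into
    {metric_name: [(label_string, value), ...]}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = name_part, ""
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics


sample = """# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="resnet18",model_version="1.0"} 42.0
ts_queue_latency_microseconds{model_name="resnet18"} 153.7
"""
print(parse_metrics(sample))
```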

Custom Handlers for Complex Logic

For advanced preprocessing/postprocessing:

python

# custom_handler.py
import torch
from ts.torch_handler.base_handler import BaseHandler


class CustomHandler(BaseHandler):
    def __init__(self):
        super().__init__()
        self.initialized = False

    def initialize(self, context):
        """Initialize model and resources."""
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )

        # Load model
        model_file = self.manifest['model']['modelFile']
        model_path = f"{model_dir}/{model_file}"

        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.eval()

        self.initialized = True

    def preprocess(self, requests):
        """Transform raw input into model input."""
        inputs = []
        for req in requests:
            data = req.get("body") or req.get("data")

            # Custom preprocessing
            processed = self.custom_preprocess(data)
            inputs.append(processed)

        return torch.stack(inputs).to(self.device)

    def inference(self, model_input):
        """Run model inference."""
        with torch.no_grad():
            model_output = self.model(model_input)
        return model_output

    def postprocess(self, inference_output):
        """Transform model output into response format."""
        predictions = []
        for output in inference_output:
            # Custom postprocessing
            result = self.custom_postprocess(output)
            predictions.append(result)

        return predictions

    def custom_preprocess(self, data):
        """Your custom preprocessing logic."""
        # Implement your logic here
        pass

    def custom_postprocess(self, output):
        """Your custom postprocessing logic."""
        # Implement your logic here
        pass


_service = CustomHandler()


def handle(data, context):
    """Entry point."""
    if not _service.initialized:
        _service.initialize(context)

    if data is None:
        return None

    data = _service.preprocess(data)
    data = _service.inference(data)
    data = _service.postprocess(data)

    return data

Docker Deployment

Deploy with Docker for isolation:

dockerfile

# Dockerfile
FROM pytorch/torchserve:latest
# Copy model archive
COPY model-store /home/model-server/model-store
# Copy config
COPY config.properties /home/model-server/config.properties
# Expose ports
EXPOSE 8080 8081 8082
# Start TorchServe
CMD ["torchserve", \
"--start", \
"--ts-config", "/home/model-server/config.properties", \
"--model-store", "/home/model-server/model-store", \
"--models", "all"]

Build and run:

bash

docker build -t my-model-server .
docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 my-model-server

Kubernetes Deployment

Scale across a cluster:

yaml

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
    spec:
      containers:
        - name: torchserve
          image: my-model-server:latest
          ports:
            - containerPort: 8080
              name: inference
            - containerPort: 8081
              name: management
            - containerPort: 8082
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              memory: "4Gi"
              cpu: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: torchserve-service
spec:
  type: LoadBalancer
  selector:
    app: torchserve
  ports:
    - port: 8080
      targetPort: 8080
      name: inference
    - port: 8081
      targetPort: 8081
      name: management

Deploy:

bash

kubectl apply -f deployment.yaml

Best Practices

Practice 1: Use TorchScript

python

# Convert to TorchScript for faster inference
model = MyModel()
model.eval()
# Trace or script
traced_model = torch.jit.trace(model, example_input)
# or
scripted_model = torch.jit.script(model)
# Save
torch.jit.save(traced_model, "model.pt")

TorchScript models run faster and are more portable than pure Python models.

Practice 2: Implement Health Checks

bash

# TorchServe ships a built-in health check on the Inference API
curl http://localhost:8080/ping
# {"status": "Healthy"}

Point your load balancer's liveness and readiness probes at this endpoint.

Practice 3: Profile and Optimize

bash

# Pin the model to a single worker so per-request latency is easy to read
curl -X PUT "http://localhost:8081/models/resnet18/1.0?min_worker=1&max_worker=1"
# Run inference
curl -X POST http://localhost:8080/predictions/resnet18 -T image.jpg
# Check metrics for bottlenecks
curl http://localhost:8082/metrics

Monitor model_prediction_latency and ts_queue_latency_microseconds to identify bottlenecks.
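When reading those latency metrics, percentiles matter more than averages, since a few slow requests can hide behind a healthy mean. A quick nearest-rank percentile helper over recorded samples; the latency values below are invented for illustration:

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a list of latency samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Nearest-rank method: rank = ceil(p/100 * n), 1-indexed
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Invented samples: mostly ~1.3 ms with two slow outliers
latencies_us = [1200, 1350, 1100, 8000, 1250, 1300, 1280, 1320, 1290, 15000]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_us, p)} us")
```

Here the mean (~3.3 ms) looks acceptable while the tail latency is 5x worse, which is exactly what queue buildup under load looks like.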

Practice 4: Version Models Explicitly

bash

# Include version in filename
torch-model-archiver --model-name resnet18 --version 1.0 ...
# Always version in production
curl -X POST "http://localhost:8081/models?url=resnet18_v1.0.mar"

Common Mistakes to Avoid

Learn from these TorchServe failures:

Mistake 1: Not Batching Correctly

python

# Bad - processes one request at a time
def inference(self, data):
    results = []
    for item in data:
        result = self.model(item.unsqueeze(0))
        results.append(result)
    return results

# Good - processes the whole batch in one forward pass
def inference(self, data):
    return self.model(data)  # data is already batched

TorchServe batches for you. Process the entire batch together for GPU efficiency.

Mistake 2: Blocking Operations in Handler

python

# Bad - blocks the worker during preprocessing
import time

def preprocess(self, data):
    time.sleep(1)  # simulating slow I/O
    return processed_data

Handlers run in worker processes. Blocking operations reduce throughput. Use async operations or separate services for slow operations.

Mistake 3: Not Setting Timeouts

properties

# In config.properties: set an appropriate timeout (seconds)
default_response_timeout=120

Without timeouts, stuck requests block workers forever.

Mistake 4: Incorrect Model Archive

bash

# Bad - missing required files
torch-model-archiver --model-name model --serialized-file model.pth
# Good - includes all dependencies
torch-model-archiver \
    --model-name model \
    --version 1.0 \
    --model-file model.py \
    --serialized-file model.pth \
    --handler handler.py \
    --extra-files vocab.json,config.json

Include every file your model needs in the archive. In my experience, missing dependencies are the most common cause of failed deployments.

The Bottom Line

TorchServe eliminates the need to build custom serving infrastructure for PyTorch models. It provides production-grade features out of the box — batching, multi-model serving, metrics, scaling, and versioning — that would take months to build correctly yourself.

Use TorchServe when:

  • Deploying PyTorch models to production
  • Need scalable serving infrastructure
  • Want proper monitoring and metrics
  • Serving multiple models
  • Need GPU optimization

Consider alternatives when:

  • Extremely simple use cases (Flask might suffice)
  • Non-PyTorch models (look at TensorFlow Serving, Triton)
  • Need framework-agnostic serving (Triton Inference Server)

For production PyTorch deployment, TorchServe should be your default choice. It’s officially supported, actively maintained, and handles the complexity you’d otherwise build yourself.

Installation:

bash

pip install torchserve torch-model-archiver

Stop building fragile serving infrastructure from scratch. Start using TorchServe to deploy PyTorch models with production-grade features built-in. The difference between a Flask script and proper model serving is reliability, performance, and observability — TorchServe gives you all three.
