Object Detection with YOLO: Step-by-Step Tutorial for Beginners

The first time I got YOLO running and watched it detect objects in real-time, I literally said “holy crap” out loud. Watching an algorithm draw boxes around cars, people, and dogs in a video feed — while maintaining 30+ frames per second — felt like actual magic.

You’re about to experience that same moment. YOLO (You Only Look Once) is one of the most popular families of object detection models for a reason: it’s fast, accurate, and surprisingly easy to get working. No PhD required, no weeks of training: with a pre-trained model, we’re going from zero to detecting objects in about an hour.

Let’s build something that’ll make you look like a computer vision wizard.

What Makes YOLO Different (And Why You Should Care)

Object detection used to be painfully slow. Older two-stage detectors such as R-CNN would propose candidate regions first, then classify each one separately. It was a whole production. YOLO said “screw that” and processes the entire image in one pass.

That’s the “You Only Look Once” part. One forward pass through the network, and boom — you get all detected objects with their locations and class predictions. It’s elegant, fast, and perfect for real-time applications.

Think about what you can build with real-time object detection:

Smart security cameras that alert you when someone’s at your door
Retail analytics counting products on shelves
Traffic monitoring systems
Sports analytics tracking players and balls
Autonomous vehicle perception systems

YOLO makes all of this accessible to regular developers, not just research labs with million-dollar budgets.

Understanding YOLO’s Magic (Without the Math Headache)

Here’s how YOLO works at a high level. The image gets divided into a grid — let’s say 13x13. Each grid cell predicts bounding boxes and confidence scores for objects.

If a grid cell contains the center of an object, that cell is responsible for detecting it. The network outputs:

Bounding box coordinates (x, y, width, height)
Confidence score (how sure the model is there’s an object)
Class probabilities (is it a car, person, dog, etc.?)

The genius is doing this all simultaneously in one network pass. Modern GPUs eat this up and spit out results at lightning speed.
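
To make the “which cell is responsible” rule concrete, here’s a minimal sketch in plain Python. The function name responsible_cell and the 416x416 example image are my own assumptions for illustration; real YOLO models do this bookkeeping inside the network and training code, not in a helper like this.

```python
def responsible_cell(box, image_size, grid_size=13):
    """Return (row, col) of the grid cell that contains the box center.

    box is (x1, y1, x2, y2) in pixels; image_size is (width, height).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    # Scale the center into grid units. Clamp so a center sitting exactly
    # on the right or bottom edge still maps to the last cell.
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col


# A hypothetical object centered at (208, 312) in a 416x416 image
print(responsible_cell((158, 262, 258, 362), (416, 416)))  # (9, 6)
```

The center lands halfway across and three quarters of the way down the image, so on a 13x13 grid that is column 6, row 9.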

Different YOLO versions (we’re using YOLOv8 in 2026) have improved this basic idea with better architectures and training techniques. Earlier versions added anchor boxes, while YOLOv8 uses an anchor-free detection head. But the core concept remains: process the whole image once, get all detections.

Setting Up Your Environment (Don’t Skip This)

You need Python 3.8 or newer. If you’re still rocking Python 2.7, it’s time to join us in the present.

Install the Ultralytics package — it’s the easiest way to use YOLOv8:

bash

pip install ultralytics

That’s it. Seriously. The Ultralytics team made this absurdly simple. The package handles model downloads, dependencies, everything.

Want to verify it worked? Open Python and try:

python

from ultralytics import YOLO
print("YOLO is ready to rock")

If you see the message without errors, you’re golden. If not, check that your pip is updated (pip install --upgrade pip) and try again.

Optional but Recommended

Install OpenCV for better video handling:

bash

pip install opencv-python

And if you’ve got an NVIDIA GPU and want blazing speed, install PyTorch with CUDA support. Check pytorch.org for your specific CUDA version. But honestly? YOLO runs fine on CPU for learning, and optimization can come later.
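
Not sure whether your install can actually see the GPU? Here’s a small sketch that checks. The helper name pick_device is made up for this example; it only relies on torch.cuda.is_available(), and it falls back to the CPU if PyTorch isn’t installed at all.

```python
import importlib.util


def pick_device():
    """Return "cuda:0" if PyTorch is installed and sees a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so there is nothing to ask
    import torch

    return "cuda:0" if torch.cuda.is_available() else "cpu"


print(f"Running on: {pick_device()}")
```

The returned string should be usable as the device argument when you run a model, for example model('image.jpg', device=pick_device()).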

Your First YOLO Detection (The 5-Minute Version)

Let’s detect objects in an image right now. Create a file called yolo_detect.py:

python

from ultralytics import YOLO

# Load a pre-trained model
model = YOLO('yolov8n.pt')  # n = nano (fastest, smallest)

# Run detection on an image
results = model('path/to/your/image.jpg')

# Display results
results[0].show()

Replace 'path/to/your/image.jpg' with an actual image path. Run it.

Did you just see boxes around detected objects? Congrats, you’re doing object detection. That was almost too easy, right?

Breaking Down What Just Happened

Let’s talk through each line because understanding beats copy-pasting.

Loading the model: YOLO('yolov8n.pt') downloads and loads a pre-trained YOLOv8 nano model. The first run takes a moment while the weights download; after that, the model loads from the local copy.

YOLOv8 comes in different sizes:

n (nano): Fastest, least accurate, ~6MB
s (small): Balanced, ~22MB
m (medium): Better accuracy, ~52MB
l (large): High accuracy, ~87MB
x (extra-large): Best accuracy, slowest, ~136MB

Start with nano. You can always upgrade later.

Running detection: model('image.jpg') does the actual detection. It preprocesses your image, runs inference, and returns results. One line handles everything.

Displaying results: results[0].show() displays the image with bounding boxes drawn around detected objects. Labels show the class and confidence score.

Getting Detailed Results

Displaying images is nice, but you probably want to actually use the detection data. Here’s how to access everything:

python

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
results = model('your_image.jpg')

# Get the first result (we only processed one image)
result = results[0]

# Access detection data
boxes = result.boxes
for box in boxes:
    # Bounding box corners in pixels, converted to plain Python floats
    x1, y1, x2, y2 = box.xyxy[0].tolist()

    # Confidence score
    confidence = float(box.conf[0])

    # Class ID and name
    class_id = int(box.cls[0])
    class_name = result.names[class_id]
    
    print(f"Detected {class_name} with {confidence:.2f} confidence")
    print(f"Location: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")

Now you can do whatever you want with this data — save to a database, trigger alerts, count objects, whatever your project needs.
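
As one way to get that data out of the model’s objects and into something you can store, here’s a sketch that flattens a result into JSON-friendly dictionaries. The function name detections_to_records is my own, and the SimpleNamespace objects are stand-ins that only mimic the attributes used in the loop above (boxes, names, xyxy, conf, cls) so the example runs without a model. With a real result you’d pass results[0] instead.

```python
import json
from types import SimpleNamespace


def detections_to_records(result):
    """Convert one result object into a list of JSON-friendly dicts."""
    records = []
    for box in result.boxes:
        x1, y1, x2, y2 = (float(v) for v in box.xyxy[0])
        class_id = int(box.cls[0])
        records.append({
            "class_id": class_id,
            "class_name": result.names[class_id],
            "confidence": round(float(box.conf[0]), 4),
            "box": [x1, y1, x2, y2],
        })
    return records


# Stand-in for results[0], so the sketch runs without a model or an image
fake_box = SimpleNamespace(xyxy=[[34.0, 50.0, 210.0, 380.0]], conf=[0.8731], cls=[0.0])
fake_result = SimpleNamespace(boxes=[fake_box], names={0: "person"})

print(json.dumps(detections_to_records(fake_result), indent=2))
```

Plain dicts like these can go straight into a JSON file, a database row, or an alert payload.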

Real-Time Object Detection from Webcam

Static images are boring. Let’s detect objects in real-time from your webcam:

python

from ultralytics import YOLO
import cv2

# Load model
model = YOLO('yolov8n.pt')

# Open webcam
cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    
    # Run YOLO detection
    results = model(frame, verbose=False)
    
    # Get the annotated frame
    annotated_frame = results[0].plot()
    
    # Display
    cv2.imshow('YOLO Detection', annotated_frame)
    
    # Press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Run this and watch YOLO detect objects in real-time. Move objects in and out of frame, see the detections update. This never gets old, I swear :)

The verbose=False parameter suppresses the log line that YOLO normally prints for every frame, which keeps your console clean during video processing.
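
Curious what frame rate you’re actually getting? Here’s a minimal rolling FPS counter in plain Python. The FPSMeter class is a hypothetical helper, not part of Ultralytics or OpenCV; the optional now argument exists only so the timing can be checked with fixed numbers.

```python
import time
from collections import deque


class FPSMeter:
    """Rolling frames-per-second estimate over the last `window` frames."""

    def __init__(self, window=30):
        self.timestamps = deque(maxlen=window)

    def tick(self, now=None):
        """Record one frame and return the current FPS estimate."""
        self.timestamps.append(time.perf_counter() if now is None else now)
        if len(self.timestamps) < 2:
            return 0.0  # need at least two frames to measure an interval
        elapsed = self.timestamps[-1] - self.timestamps[0]
        return (len(self.timestamps) - 1) / elapsed if elapsed > 0 else 0.0


meter = FPSMeter(window=3)
for fake_time in (0.0, 0.5, 1.0):
    fps = meter.tick(now=fake_time)
print(f"{fps:.1f} FPS")  # 2.0 FPS
```

In the webcam loop you’d create one meter before the while loop and call meter.tick() once per frame, then print the value or draw it onto the frame with cv2.putText.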

Processing Video Files

Got a video file you want to analyze? YOLO handles that just as easily:

python

from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Process video file
results = model('video.mp4', save=True)

print(f"Processed {len(results)} frames")
print("Annotated video saved under runs/detect/ (predict, predict2, ...)")

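YOLO processes every frame, draws detections, and saves the output video automatically. The save=True parameter tells it to save the annotated video; without it, YOLO just returns detection data. For long videos, consider passing stream=True, which returns a generator so the results for every frame are not all held in memory at once.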

Want to process only every Nth frame to save time?

python

results = model('video.mp4', save=True, vid_stride=3)  # Process every 3rd frame

Filtering Detections (Because You Don’t Need Everything)

Pre-trained YOLO detects 80 different object classes. Sometimes you only care about specific objects — maybe just people, or just vehicles.

Filter by Confidence

Ignore low-confidence detections to reduce false positives:

python

results = model('image.jpg', conf=0.5)  # Only detections with 50%+ confidence

A higher threshold gives you fewer detections but also fewer false positives. A lower one catches more objects but includes sketchy predictions. Ultralytics defaults to 0.25; I usually start at 0.5 and adjust from there.
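
To see the trade-off in numbers, here’s a toy sketch. The detections are made up, and the helper precision_recall is my own; in a real project you’d get the “was this detection correct?” flags by comparing predictions against labeled ground truth.

```python
def precision_recall(detections, num_ground_truth, threshold):
    """detections is a list of (confidence, is_true_positive) pairs."""
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    true_positives = sum(kept)
    # With nothing kept there are no false positives, so call precision 1.0
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / num_ground_truth
    return precision, recall


# Made-up detections for an image that contains 5 real objects
toy_detections = [
    (0.95, True), (0.88, True), (0.74, True), (0.61, False),
    (0.55, True), (0.41, False), (0.33, True), (0.28, False),
]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(toy_detections, num_ground_truth=5, threshold=threshold)
    print(f"conf={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 pushes precision up while recall drops, which is the pattern you’ll usually be balancing.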

Filter by Class

Only detect specific object types:

python

# Only detect people (class 0)
results = model('image.jpg', classes=[0])

# Detect people and cars (classes 0 and 2)
results = model('image.jpg', classes=[0, 2])

The full list of 80 COCO classes includes: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, and so on. Google “COCO dataset classes” for the complete list.
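
Remembering class numbers gets old fast, so here’s a small sketch that turns names into the IDs the classes argument expects. The helper class_ids_for and the COCO_SUBSET dictionary are my own; the dictionary only covers the ten classes listed above, and with a loaded model you could pass model.names instead to get all 80.

```python
# First ten COCO classes, in the order the pre-trained models use
COCO_SUBSET = {
    0: "person", 1: "bicycle", 2: "car", 3: "motorcycle", 4: "airplane",
    5: "bus", 6: "train", 7: "truck", 8: "boat", 9: "traffic light",
}


def class_ids_for(wanted, names=COCO_SUBSET):
    """Translate class names into the integer IDs that `classes=` expects."""
    name_to_id = {name: class_id for class_id, name in names.items()}
    missing = [name for name in wanted if name not in name_to_id]
    if missing:
        raise KeyError(f"Unknown class names: {missing}")
    return [name_to_id[name] for name in wanted]


vehicle_ids = class_ids_for(["car", "bus", "truck"])
print(vehicle_ids)  # [2, 5, 7]
```

You could then call model('image.jpg', classes=vehicle_ids) to keep only those vehicles.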

Custom Filtering in Code

For more complex filtering, process the results manually:

python

results = model('image.jpg')
boxes = results[0].boxes

# Filter for high-confidence person detections
people = [box for box in boxes 
          if box.cls[0] == 0 and box.conf[0] > 0.7]

print(f"Found {len(people)} people with >70% confidence")
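
Manual filtering also lets you ask questions the built-in arguments can’t, like “is anyone standing in this part of the frame?” Here’s a sketch using plain tuples in place of real detections. The center_in_region helper and the doorway coordinates are hypothetical; with real results you’d build the tuples from box.cls, box.conf, and box.xyxy.

```python
def center_in_region(box, region):
    """True if the center of box (x1, y1, x2, y2) lies inside region."""
    x1, y1, x2, y2 = box
    rx1, ry1, rx2, ry2 = region
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return rx1 <= cx <= rx2 and ry1 <= cy <= ry2


# Plain tuples standing in for detections: (class_id, confidence, box)
detections = [
    (0, 0.91, (50, 80, 150, 380)),    # person on the left
    (0, 0.84, (400, 90, 500, 400)),   # person on the right
    (2, 0.77, (420, 300, 620, 460)),  # car on the right
]
doorway = (350, 0, 640, 480)  # hypothetical region of interest, in pixels

people_at_door = [
    d for d in detections
    if d[0] == 0 and d[1] > 0.7 and center_in_region(d[2], doorway)
]
print(f"People in the doorway region: {len(people_at_door)}")  # 1
```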

Counting Objects (Super Useful for Real Applications)

Object counting is one of the most practical applications. Here’s how to count specific objects:

python

from ultralytics import YOLO
from collections import Counter

model = YOLO('yolov8n.pt')
results = model('image.jpg')

# Get all detected class names
class_names = [results[0].names[int(box.cls[0])] for box in results[0].boxes]

# Count occurrences
counts = Counter(class_names)

print("Object counts:")
for obj, count in counts.items():
    print(f"{obj}: {count}")

This is perfect for inventory management, crowd counting, traffic analysis — anywhere you need to know “how many of X are in this image?”

Training YOLO on Custom Objects (Your Secret Weapon)

Pre-trained YOLO is great, but the real power is training it to detect YOUR specific objects. Maybe you’re detecting defects in manufacturing, identifying rare plants, or recognizing your dog’s toys.

Prepare Your Dataset

You need images with bounding box annotations. The format looks like this (YOLO format):

<class_id> <x_center> <y_center> <width> <height>

All values are normalized (0–1): x values are divided by the image width and y values by the image height. So a person (class 0) with corners at (100, 50) and (200, 300) in a 640x480 image becomes:

0 0.234375 0.364583 0.15625 0.520833
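
If you want to check that arithmetic (or convert boxes from another tool), here’s a minimal sketch of the conversion. The function name to_yolo_line is my own; annotation tools do this for you, so treat it as a way to understand the format rather than something you need in your pipeline.

```python
def to_yolo_line(class_id, box, image_size):
    """Convert a pixel box (x1, y1, x2, y2) into one YOLO-format label line."""
    x1, y1, x2, y2 = box
    img_w, img_h = image_size
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


print(to_yolo_line(0, (100, 50, 200, 300), (640, 480)))
# 0 0.234375 0.364583 0.156250 0.520833
```

The center is (150, 175), so 150/640 and 175/480 give the first two numbers, and the 100x250 box size gives the last two.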

Yeah, it’s tedious. Use annotation tools like:

Roboflow: Web-based, handles format conversion automatically
LabelImg: Desktop app, free and open source
CVAT: More advanced, good for large projects

IMO, Roboflow is worth paying for if you’re doing this seriously. It handles the annoying parts.

Train Your Model

Once you’ve got annotated images organized properly (images in one folder, labels in another, with each label file sharing its image’s base name and a .txt extension), training is straightforward:

python

from ultralytics import YOLO

# Load a pre-trained model to fine-tune
model = YOLO('yolov8n.pt')

# Train on your custom dataset
results = model.train(
    data='path/to/data.yaml',  # Config file pointing to your data
    epochs=100,
    imgsz=640,
    batch=16
)

The data.yaml file tells YOLO where your training/validation images are and what classes you're detecting. Example:

yaml

train: /path/to/train/images
val: /path/to/val/images

nc: 3  # number of classes
names: ['cat', 'dog', 'bird']
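
If your images aren’t split into training and validation sets yet, here’s one possible sketch for doing it reproducibly. The helper split_dataset, the 80/20 ratio, and the file names are all assumptions for illustration; tools like Roboflow can do the split for you.

```python
import random


def split_dataset(image_names, val_fraction=0.2, seed=42):
    """Shuffle reproducibly and split into (train, val) lists."""
    names = sorted(image_names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)
    num_val = max(1, int(len(names) * val_fraction))
    return names[num_val:], names[:num_val]


images = [f"img_{i:03d}.jpg" for i in range(10)]  # hypothetical file names
train, val = split_dataset(images)
print(f"{len(train)} training images, {len(val)} validation images")
```

Whatever you use to move the files, remember that each image’s label file has to end up in the matching split.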

Training takes time depending on your dataset size and hardware. On a decent GPU, 100 epochs on a small dataset might take 30 minutes. On CPU? Grab lunch (or dinner).

Improving Detection Performance

Your YOLO model isn’t perfect right out of the box. Here’s how to make it better:

Use a Larger Model

If nano isn’t cutting it, upgrade:

python

model = YOLO('yolov8s.pt')  # or m, l, x for even better accuracy

Accuracy improves but speed decreases. It’s always a trade-off.

Adjust Image Size

YOLO resizes images to a standard size (default 640x640). Larger sizes catch smaller objects:

python

results = model('image.jpg', imgsz=1280)

This doubles each side of the input, which means roughly four times as many pixels. Accuracy on small objects improves, but expect your FPS to drop by much more than half. Balance speed vs. accuracy based on your needs.
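
To put a rough number on what a different imgsz costs, here’s a tiny back-of-the-envelope sketch. It only compares pixel counts for square inputs; the helper relative_pixel_cost is my own, and real throughput also depends on your hardware and the model, so treat the output as a rule of thumb.

```python
def relative_pixel_cost(new_size, base_size=640):
    """How many times more pixels a square input has than the base size."""
    return (new_size / base_size) ** 2


for size in (320, 640, 1280):
    print(f"imgsz={size}: {relative_pixel_cost(size):.2f}x the pixels of 640")
```

Going from 640 to 1280 means 4x the pixels, and dropping to 320 means a quarter of them.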

Tune Confidence and IoU Thresholds

Play with these to optimize for your use case:

python

results = model('image.jpg', 
                conf=0.3,      # Lower = catch more objects
                iou=0.5)       # Intersection over Union for NMS

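IoU (Intersection over Union) sets the overlap threshold for non-maximum suppression (NMS). When two boxes overlap by more than this value, the lower-confidence one is dropped. Lower values are stricter and remove more overlapping boxes; higher values let more overlapping detections through.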
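
If IoU and NMS still feel abstract, here’s a minimal pure-Python sketch of both. It’s a simplified, class-agnostic greedy NMS written for illustration, not the implementation Ultralytics uses, and the boxes are made up.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold):
    """Greedy NMS over (confidence, box) pairs; returns the kept pairs."""
    kept = []
    for conf, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap a stronger kept box too much
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((conf, box))
    return kept


dets = [
    (0.9, (100, 100, 200, 200)),
    (0.8, (110, 110, 210, 210)),  # overlaps the first box with IoU of about 0.68
    (0.7, (300, 300, 400, 400)),  # separate object
]

print(len(nms(dets, iou_threshold=0.5)))  # 2: the overlapping box is suppressed
print(len(nms(dets, iou_threshold=0.9)))  # 3: the overlap is tolerated
```

With the threshold at 0.5 the second box counts as a duplicate of the first; at 0.9 it survives as its own detection.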

Use Test-Time Augmentation

TTA runs detection on several augmented versions of the image (for example, flipped and rescaled copies) and merges the results:

python

results = model('image.jpg', augment=True)

Slower but more robust. Good for when accuracy matters more than speed.

Common Problems (And How I Fixed Them)

“Detection is super slow on my laptop”

Welcome to the CPU life. Solutions:

Use the nano model (yolov8n.pt)
Reduce image size: imgsz=320
Process every Nth frame in videos
Consider Google Colab for free GPU access

“It’s detecting random stuff that’s obviously wrong”

Increase the confidence threshold: conf=0.6 or higher. Pre-trained models sometimes hallucinate objects—higher confidence helps.

“It’s missing objects that are clearly visible”

Try:

Lower confidence threshold: conf=0.3
Larger model: switch from nano to small or medium
Larger image size: imgsz=1280
Different model variant: sometimes YOLOv8s works where YOLOv8n fails

“Training my custom model but accuracy sucks”

Check these:

Data quality: Garbage annotations = garbage model
Dataset size: You need hundreds of examples per class minimum
Class balance: Don’t have 1000 examples of one class and 10 of another
Epochs: 100 might not be enough, try 200–300
Learning rate: The default usually works, but sometimes needs tuning
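
Class balance is the easiest item on that list to check with code. Here’s a sketch that counts boxes per class ID from YOLO-format labels. Both function names are my own, and the sample lines are made up; point count_classes_in_dir at your own labels folder to use it for real.

```python
from collections import Counter
from pathlib import Path


def count_classes(label_lines):
    """Count boxes per class ID from YOLO-format label lines."""
    counts = Counter()
    for line in label_lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[int(parts[0])] += 1
    return counts


def count_classes_in_dir(labels_dir):
    """Apply count_classes to every .txt label file in a directory."""
    counts = Counter()
    for label_file in sorted(Path(labels_dir).glob("*.txt")):
        counts += count_classes(label_file.read_text().splitlines())
    return counts


sample_lines = ["0 0.5 0.5 0.2 0.3", "0 0.1 0.2 0.1 0.1", "2 0.7 0.6 0.3 0.4", ""]
print(count_classes(sample_lines))
```

If one class shows up far less often than the rest, that’s usually the one to collect more examples of.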

Taking It to Production

Got YOLO working on your laptop? Here’s what you need to think about for real deployment:

Optimize for Speed

Export to ONNX or TensorRT for faster inference:

python

model = YOLO('yolov8n.pt')
model.export(format='onnx')  # or 'engine' for TensorRT

ONNX runs on most platforms through runtimes such as ONNX Runtime; TensorRT is NVIDIA-specific but blazing fast.

Handle Edge Cases

Real-world data is messy. Your code should handle:

Corrupted images/videos
No detections (empty results)
Multiple overlapping objects
Lighting/weather variations

Don’t just assume perfect inputs — test with garbage data and handle failures gracefully.
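
Here’s one possible shape for that kind of defensive wrapper. Everything in it is a sketch under my own assumptions: safe_detect is a hypothetical helper, and detect_fn stands in for whatever function wraps your model call. Catching a broad Exception is a deliberate simplification; in production you’d likely catch narrower errors and log them properly.

```python
from pathlib import Path


def safe_detect(detect_fn, image_path):
    """Run detect_fn(path) and always return a list, even on bad input.

    detect_fn is any callable that returns an iterable of detections;
    here it stands in for a wrapper around the model call.
    """
    path = Path(image_path)
    if not path.is_file():
        print(f"Skipping {path}: file not found")
        return []
    try:
        detections = list(detect_fn(str(path)))
    except Exception as exc:  # corrupted file, decode error, and so on
        print(f"Skipping {path}: {exc}")
        return []
    if not detections:
        print(f"No detections in {path}")
    return detections


# A stand-in detector; in practice this would wrap the model call
print(safe_detect(lambda path: [], "does/not/exist.jpg"))
```

The point is that callers always get a list back, so a bad frame or a missing file can’t crash the rest of your pipeline.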

Monitor Performance

Track your model’s accuracy over time. Real-world performance often degrades as conditions change. Plan to retrain periodically with new data.

Your Next Steps

You just learned to use one of the most powerful object detection frameworks in existence. That’s genuinely impressive.

Now build something with it. Don’t just follow tutorials — solve a real problem. Maybe:

Count cars in a parking lot from a webcam feed
Detect when your cat jumps on the counter
Track inventory on retail shelves
Monitor social distancing in public spaces
Identify wildlife in trail camera footage

The best way to master YOLO is using it on projects you actually care about. Pick something that excites you and start building this weekend.

YOLO gave you superpowers — what are you going to detect?

Sam Austin
