Object Detection: YOLO & Detectron2 Guide

What is Object Detection?

Object detection is a computer vision task that involves identifying and locating objects within images or videos. Unlike image classification (which tells you what's in an image), object detection tells you what objects are present and where they are.

For each detected object, the model outputs:

Class label: What the object is (e.g., "person", "car", "dog")
Bounding box: Rectangle coordinates (x, y, width, height) around the object
Confidence score: How certain the model is about the detection (0-1)

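Example: In a street photo, a detector might return three people, two cars, and one traffic light, each with its own bounding box and confidence score, for instance a person whose box starts at (50, 100) and a car whose box starts at (200, 150).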
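
The three outputs listed above can be held in a small record type. The sketch below is illustrative only: the `Detection` class, its field names, and the `filter_by_confidence` helper are hypothetical rather than part of any library, and the box values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical container, not a library type)."""
    label: str          # class label, e.g. "person"
    x: float            # left edge of the box, in pixels
    y: float            # top edge of the box, in pixels
    width: float        # box width, in pixels
    height: float       # box height, in pixels
    confidence: float   # model confidence in [0, 1]

    def area(self) -> float:
        return self.width * self.height


def filter_by_confidence(detections, threshold=0.5):
    """Keep detections whose confidence is at least `threshold`."""
    return [d for d in detections if d.confidence >= threshold]


# Illustrative values only
detections = [
    Detection("person", 50, 100, 40, 120, 0.91),
    Detection("car", 200, 150, 180, 90, 0.84),
    Detection("traffic light", 400, 50, 20, 60, 0.32),
]
confident = filter_by_confidence(detections, threshold=0.5)
print([d.label for d in confident])
```

Filtering on the confidence score is usually the first post-processing step; the right threshold depends on how costly false positives are compared with missed objects.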

Why Object Detection Matters

Object detection powers critical applications across industries:

Autonomous Vehicles: Detect pedestrians, vehicles, traffic signs, lanes
Surveillance & Security: Identify intruders, suspicious activities, abandoned objects
Retail: Track inventory, monitor shelves, analyze customer behavior
Healthcare: Detect tumors in medical scans, count cells, identify abnormalities
Manufacturing: Quality control, defect detection on assembly lines
Agriculture: Crop monitoring, disease detection, yield estimation
Sports Analytics: Player tracking, ball detection, tactical analysis
Augmented Reality: Recognize objects to overlay digital content

When to Use Object Detection

Choose object detection when you need to:

Find and count multiple objects in an image
Know the precise location of objects (not just their presence)
Process images or video streams in real-time
Track objects across video frames
Build systems that react to objects' positions (robotics, AR)

Note: If you only need to classify whether an object exists (not where), use image classification instead. If you need pixel-perfect boundaries, use instance segmentation.

Evolution of Object Detection

Traditional Methods (Pre-Deep Learning)

Sliding Windows: Move a window across the image at several scales and classify each patch (very slow)
HOG + SVM: Histogram of Oriented Gradients features with SVM classifier
DPM: Deformable Part Models for detecting object parts

Two-Stage Detectors

R-CNN (2014): Region proposals + CNN classification (slow but accurate)
Fast R-CNN (2015): Share CNN features across proposals (faster)
Faster R-CNN (2015): Neural network generates proposals (Region Proposal Network)
Mask R-CNN (2017): Adds instance segmentation masks

One-Stage Detectors (Real-Time)

YOLO (2016): "You Only Look Once" - extremely fast, single forward pass
SSD (2016): Single Shot Detector with multiple feature maps
RetinaNet (2017): Focal loss to handle class imbalance

Understanding Bounding Boxes

A bounding box is defined by 4 values:

x, y: Top-left corner coordinates
w, h: Width and height

Alternative format: (x_min, y_min, x_max, y_max) - top-left and bottom-right corners.
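
Because libraries disagree on which format they expect, converting between the two is a common source of bugs. A minimal sketch of the conversion, with hypothetical function names:

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def xyxy_to_xywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (x, y, width, height)."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


box = [50, 100, 40, 120]            # top-left at (50, 100), 40 wide, 120 tall
corners = xywh_to_xyxy(box)
print(corners)
print(xyxy_to_xywh(corners) == box)
```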

IoU (Intersection over Union)

IoU measures overlap between predicted and ground truth boxes:

IoU = Area of Overlap / Area of Union

IoU = 1.0: Perfect match
IoU ≥ 0.5: Commonly counted as a correct detection (the exact threshold depends on the benchmark)
IoU < 0.5: Usually counted as a miss

def calculate_iou(box1, box2):
    """Calculate IoU between two bounding boxes.
    Boxes format: [x, y, width, height]
    """
    # Get coordinates
    x1_min, y1_min = box1[0], box1[1]
    x1_max, y1_max = box1[0] + box1[2], box1[1] + box1[3]
    x2_min, y2_min = box2[0], box2[1]
    x2_max, y2_max = box2[0] + box2[2], box2[1] + box2[3]

    # Calculate intersection
    inter_x_min = max(x1_min, x2_min)
    inter_y_min = max(y1_min, y2_min)
    inter_x_max = min(x1_max, x2_max)
    inter_y_max = min(y1_max, y2_max)

    inter_area = max(0, inter_x_max - inter_x_min) * max(0, inter_y_max - inter_y_min)

    # Calculate union
    box1_area = box1[2] * box1[3]
    box2_area = box2[2] * box2[3]
    union_area = box1_area + box2_area - inter_area

    # Calculate IoU
    iou = inter_area / union_area if union_area > 0 else 0
    return iou
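
When many predictions must be compared against many ground-truth boxes, a vectorized version avoids the Python loop. The following is a sketch under two assumptions: NumPy is available, and boxes use the corner format (x_min, y_min, x_max, y_max) rather than the [x, y, width, height] format used above. The name `pairwise_iou` is hypothetical.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """IoU matrix for two sets of (x_min, y_min, x_max, y_max) boxes.

    Returns an array of shape (len(boxes_a), len(boxes_b)).
    """
    a = np.asarray(boxes_a, dtype=float).reshape(-1, 4)
    b = np.asarray(boxes_b, dtype=float).reshape(-1, 4)

    # Broadcast every box in a against every box in b
    top_left = np.maximum(a[:, None, :2], b[None, :, :2])
    bottom_right = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter_wh = np.clip(bottom_right - top_left, 0, None)
    inter = inter_wh[..., 0] * inter_wh[..., 1]

    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter

    # Degenerate (zero-area) pairs get IoU 0 instead of a division error
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


preds = [[0, 0, 10, 10], [20, 20, 30, 30]]
truths = [[5, 0, 15, 10]]
iou = pairwise_iou(preds, truths)
print(iou)
```

In this example the first prediction overlaps the ground-truth box over a 5x10 region, so its IoU is 50 / 150; the second prediction does not overlap at all.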

YOLO: You Only Look Once

YOLO reframed object detection as a single regression problem. Instead of running a classifier over many region proposals or sliding windows, one network predicts bounding boxes and class probabilities directly from the full image in a single forward pass.

How YOLO Works

Divide image into grid: E.g., 13x13 grid cells
Each cell predicts: Multiple bounding boxes (e.g., 2 in YOLOv1, 3 per scale in YOLOv3) and class probabilities
For each box: Predict x, y, width, height, confidence, and class scores
Non-Max Suppression: Remove duplicate detections
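
In the grid-based formulation described above, the cell that contains an object's center is the one responsible for predicting it. The sketch below shows that assignment, assuming a 13x13 grid over a 416x416 input; the function name `responsible_cell` is hypothetical, and later YOLO versions differ in the details.

```python
def responsible_cell(x_center, y_center, img_w, img_h, grid_size=13):
    """Return (row, col, dx, dy) for a box center given in pixels.

    (row, col) is the grid cell containing the center; (dx, dy) is the
    center's offset inside that cell, in units of one cell.
    """
    # Scale pixel coordinates to grid units
    gx = x_center / img_w * grid_size
    gy = y_center / img_h * grid_size
    # Clamp so a center on the right or bottom edge still maps to a valid cell
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)
    return row, col, gx - col, gy - row


row, col, dx, dy = responsible_cell(208, 104, img_w=416, img_h=416)
print(row, col, dx, dy)
```

Predicting the offset inside the cell, rather than an absolute position, keeps the regression targets in a small, bounded range.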

YOLO Versions

YOLOv1-v3: Original implementations by Joseph Redmon
YOLOv4: Improved accuracy with bag of tricks
YOLOv5: PyTorch implementation, easier to use
YOLOv7: Reported a state-of-the-art speed/accuracy trade-off among real-time detectors when released (2022)
YOLOv8: Ultralytics release (2023) with a user-friendly Python API

Using YOLOv8 with Ultralytics

# Install
# pip install ultralytics

from ultralytics import YOLO
import cv2
import matplotlib.pyplot as plt

# Load pre-trained model
model = YOLO('yolov8n.pt')  # n=nano, s=small, m=medium, l=large, x=xlarge

# Predict on image
results = model('street_scene.jpg')

# Access results
for result in results:
    boxes = result.boxes  # Bounding boxes
    for box in boxes:
        # Get box coordinates
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # Top-left, bottom-right (pixels)
        confidence = float(box.conf[0])
        class_id = int(box.cls[0])
        class_name = model.names[int(class_id)]

        print(f"Detected {class_name} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}] "
              f"with confidence {confidence:.2f}")

# Visualize results
annotated = results[0].plot()  # Draw boxes on image
plt.imshow(cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()

# Save results
results[0].save('output.jpg')

Real-Time Video Detection

import cv2
from ultralytics import YOLO

# Load model
model = YOLO('yolov8n.pt')

# Open video
cap = cv2.VideoCapture(0)  # 0 for webcam, or path to video file

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run detection
    results = model(frame, verbose=False)

    # Draw results
    annotated_frame = results[0].plot()

    # Display
    cv2.imshow('YOLO Detection', annotated_frame)

    # Press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Training Custom YOLO Model

from ultralytics import YOLO

# Load a model
model = YOLO('yolov8n.pt')  # Start with pre-trained weights

# Train on custom dataset
# Dataset should be in YOLO format with .yaml config
results = model.train(
    data='custom_dataset.yaml',  # Path to dataset config
    epochs=100,
    imgsz=640,
    batch=16,
    name='custom_detector',
    pretrained=True,
    optimizer='Adam',
    lr0=0.001,  # Initial learning rate; Ultralytics suggests ~1e-3 for Adam, 1e-2 for SGD
    device=0  # GPU device
)

# Validate
metrics = model.val()
print(f"mAP50: {metrics.box.map50}")
print(f"mAP50-95: {metrics.box.map}")

# Export for deployment
model.export(format='onnx')  # or 'torchscript', 'tflite', etc.

Dataset Format (YAML)

# custom_dataset.yaml
path: /path/to/dataset
train: images/train
val: images/val
test: images/test

# Classes
names:
  0: person
  1: car
  2: bicycle
  3: dog

Annotation Format

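Each image needs a corresponding .txt file with the same base name (for example, images/train/img1.jpg pairs with labels/train/img1.txt), containing one line per object: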

# class_id x_center y_center width height (all normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.15 0.2
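
Annotation tools often export pixel corner coordinates, which then have to be converted to this normalized center format. A minimal sketch of that conversion; the helper name `to_yolo_line` and the image size are assumptions for illustration:

```python
def to_yolo_line(class_id, box_xyxy, img_w, img_h):
    """Format one pixel-space (x_min, y_min, x_max, y_max) box as a YOLO label line."""
    x_min, y_min, x_max, y_max = box_xyxy
    # Center and size, each divided by the image dimension to land in [0, 1]
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 192x192-pixel box centered in a 640x480 image
line = to_yolo_line(0, (224, 144, 416, 336), img_w=640, img_h=480)
print(line)
```

The result corresponds to the first sample line above (class 0, centered, 0.3 wide and 0.4 tall), written with six decimal places.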

Detectron2: Facebook's Detection Framework

Detectron2 is a powerful library from Facebook AI Research (FAIR) supporting state-of-the-art detection and segmentation models.

Key Features

Faster R-CNN, Mask R-CNN, RetinaNet implementations
Instance segmentation support
Panoptic segmentation
Keypoint detection
Highly modular and customizable

Using Detectron2

# Install (requires PyTorch to be installed first; this builds Detectron2 from source)
# python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

import detectron2
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
import cv2

# Configure model
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # Confidence threshold
cfg.MODEL.DEVICE = "cuda"  # or "cpu"

# Create predictor
predictor = DefaultPredictor(cfg)

# Run inference
image = cv2.imread("image.jpg")
outputs = predictor(image)

# Access predictions
instances = outputs["instances"]
boxes = instances.pred_boxes  # Bounding boxes
scores = instances.scores      # Confidence scores
classes = instances.pred_classes  # Class IDs

# Visualize
v = Visualizer(
    image[:, :, ::-1],
    MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),
    scale=1.0
)
out = v.draw_instance_predictions(instances.to("cpu"))
cv2.imwrite("output.jpg", out.get_image()[:, :, ::-1])

Training with Detectron2

import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register your dataset (COCO format)
register_coco_instances(
    "my_dataset_train",
    {},
    "path/to/train_annotations.json",
    "path/to/train_images"
)
register_coco_instances(
    "my_dataset_val",
    {},
    "path/to/val_annotations.json",
    "path/to/val_images"
)

# Configure training
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
))
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ("my_dataset_val",)
cfg.DATALOADER.NUM_WORKERS = 2
# Start from COCO pre-trained weights
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
)
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4  # Your number of classes

# Train (checkpoints and logs are written to cfg.OUTPUT_DIR)
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

Non-Maximum Suppression (NMS)

NMS removes duplicate detections of the same object by keeping only the highest-confidence box among overlapping detections.

Algorithm

Sort boxes by confidence score (highest first)
Select box with highest score
Remove all boxes with IoU > threshold (e.g., 0.5) with selected box
Repeat for remaining boxes

import numpy as np

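def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS; returns the indices of the boxes to keep.

    boxes: array of [x, y, width, height]
    scores: confidence scores, one per box
    Relies on calculate_iou() defined earlier.
    """
    # Sort by score, highest first (accepts plain lists as well as arrays)
    order = np.asarray(scores).argsort()[::-1]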

    keep = []
    while order.size > 0:
        # Pick highest score box
        i = order[0]
        keep.append(i)

        # Calculate IoU with remaining boxes
        ious = np.array([calculate_iou(boxes[i], boxes[j]) for j in order[1:]])

        # Keep boxes with IoU < threshold
        remaining = np.where(ious < iou_threshold)[0]
        order = order[remaining + 1]

    return keep
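
In practice NMS is usually run separately for each class, so that a high-scoring "person" box does not suppress an overlapping "bicycle" box. The sketch below is one possible self-contained variant, assuming NumPy and corner-format boxes (x_min, y_min, x_max, y_max); `nms_xyxy` and `nms_per_class` are hypothetical names, not library functions.

```python
import numpy as np


def nms_xyxy(boxes, scores, iou_threshold=0.5):
    """Greedy NMS on (x_min, y_min, x_max, y_max) boxes; returns kept indices."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with all remaining boxes at once
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        union = areas[i] + areas[rest] - inter
        iou = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
        # Keep only boxes that overlap the selected box less than the threshold
        order = rest[iou < iou_threshold]
    return keep


def nms_per_class(boxes, scores, class_ids, iou_threshold=0.5):
    """Run NMS within each class so different classes never suppress each other."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    class_ids = np.asarray(class_ids)
    keep = []
    for c in np.unique(class_ids):
        idx = np.where(class_ids == c)[0]
        kept = nms_xyxy(boxes[idx], scores[idx], iou_threshold)
        keep.extend(int(idx[k]) for k in kept)
    return sorted(keep)


# Boxes 0 and 1 overlap heavily (same class); box 2 coincides with box 0 but is another class
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [0, 0, 10, 10], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7, 0.6]
class_ids = [0, 0, 1, 0]
kept = nms_per_class(boxes, scores, class_ids)
print(kept)
```

With class-agnostic NMS, box 2 would be suppressed by box 0; the per-class version keeps it because the two boxes carry different labels.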

Evaluation Metrics

Precision and Recall

Precision: Of detected objects, how many are correct?
Recall: Of all ground truth objects, how many did we detect?
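
To compute these, each detection first has to be labeled as a true or false positive by matching it to a ground-truth box. The sketch below shows one common greedy matching scheme for a single class in a single image, assuming corner-format boxes and an IoU threshold of 0.5; the function names are hypothetical, and benchmark toolkits differ in details such as how they handle crowded or difficult objects.

```python
def iou_xyxy(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred_boxes, pred_scores, gt_boxes, iou_threshold=0.5):
    """Greedy matching: each ground-truth box can be claimed by one detection."""
    # Visit detections from most to least confident
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    matched = set()
    tp = 0
    for i in order:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            iou = iou_xyxy(pred_boxes[i], gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
    fp = len(pred_boxes) - tp   # detections with no matching object
    fn = len(gt_boxes) - tp     # objects that were never detected
    precision = tp / (tp + fp) if len(pred_boxes) > 0 else 0.0
    recall = tp / (tp + fn) if len(gt_boxes) > 0 else 0.0
    return precision, recall


preds = [[0, 0, 10, 10], [100, 100, 110, 110], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
truths = [[1, 1, 11, 11], [20, 20, 30, 30], [200, 200, 210, 210], [300, 300, 310, 310]]
precision, recall = precision_recall(preds, scores, truths)
print(precision, recall)
```

Here two of the three detections match an object (precision 2/3) and two of the four objects are found (recall 1/2).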

Average Precision (AP)

Area under the Precision-Recall curve for a single class.
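
As an illustration, the sketch below computes AP with all-point interpolation, the scheme used by later PASCAL VOC evaluations; COCO instead samples precision at 101 fixed recall levels, so its numbers differ slightly. It assumes detections are already sorted by descending confidence and already labeled as true or false positives; `average_precision` is a hypothetical name.

```python
import numpy as np


def average_precision(is_true_positive, num_ground_truth):
    """AP for one class from TP/FP flags sorted by descending confidence."""
    flags = np.asarray(is_true_positive, dtype=bool)
    if num_ground_truth == 0 or flags.size == 0:
        return 0.0

    # Running totals as the confidence threshold is lowered
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    recall = tp / num_ground_truth
    precision = tp / (tp + fp)

    # Add end points, then make precision non-increasing from left to right
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([0.0], precision, [0.0]))
    precision = np.maximum.accumulate(precision[::-1])[::-1]

    # Sum rectangle areas at the points where recall increases
    changed = np.where(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[changed + 1] - recall[changed]) * precision[changed + 1]))


# Three detections (hit, miss, hit) against two ground-truth objects
ap = average_precision([True, False, True], num_ground_truth=2)
print(round(ap, 4))
```

The false positive in second place lowers the precision reached at full recall to 2/3, so the area is 0.5 * 1 + 0.5 * 2/3.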

Mean Average Precision (mAP)

Average of AP across all classes. It is the most widely reported summary metric for object detection.

mAP@0.5: IoU threshold of 0.5 (easier)
mAP@0.75: IoU threshold of 0.75 (stricter)
mAP@0.5:0.95: Average across IoU thresholds 0.5 to 0.95 (COCO metric)
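
The relationship between these three numbers can be sketched as follows. The per-class AP values in `ap_table` are invented purely for illustration, and `summarize_map` is a hypothetical helper; real values would come from an evaluation toolkit.

```python
import numpy as np

# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
iou_thresholds = np.linspace(0.5, 0.95, 10)

# Made-up AP values, one row per class and one column per IoU threshold
ap_table = np.array([
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],   # class A
    [0.8, 0.8, 0.7, 0.7, 0.6, 0.6, 0.4, 0.4, 0.2, 0.2],   # class B
])


def summarize_map(ap_table, iou_thresholds):
    """Average AP over classes, then report it at 0.5, at 0.75, and over all thresholds."""
    ap_table = np.asarray(ap_table, dtype=float)
    per_threshold = ap_table.mean(axis=0)   # mAP at each IoU threshold
    idx50 = int(np.argmin(np.abs(iou_thresholds - 0.5)))
    idx75 = int(np.argmin(np.abs(iou_thresholds - 0.75)))
    return {
        "mAP@0.5": float(per_threshold[idx50]),
        "mAP@0.75": float(per_threshold[idx75]),
        "mAP@0.5:0.95": float(per_threshold.mean()),
    }


summary = summarize_map(ap_table, iou_thresholds)
print(summary)
```

Because AP falls as the IoU requirement tightens, mAP@0.5:0.95 is typically well below mAP@0.5 for the same model.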

Best Practices

Data Quality: Accurate bounding box annotations are crucial
Data Augmentation: Use flipping, rotation, color jittering, mosaic augmentation
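Anchor Boxes: For anchor-based detectors (e.g., Faster R-CNN, YOLOv5), choose anchor sizes that match your object sizes
Image Size: Larger inputs (e.g., 1280 instead of 640) tend to improve accuracy, especially on small objects, but slow inference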
Batch Size: Larger batches stabilize training but need more GPU memory
Learning Rate: Use warmup and decay schedules
Class Imbalance: Address with focal loss or weighted sampling
Transfer Learning: Always start with pre-trained weights (COCO dataset)
Monitor mAP: Track mAP during training, not just loss
Test Time Augmentation: Run inference on flipped/scaled versions for better results
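
Geometric augmentations, whether applied during training or at test time, must transform the boxes together with the pixels. A minimal sketch for a horizontal flip, assuming NumPy, an image array of shape (height, width, channels), and corner-format boxes in pixels; `hflip_with_boxes` is a hypothetical helper, and libraries such as Ultralytics handle this internally.

```python
import numpy as np


def hflip_with_boxes(image, boxes_xyxy):
    """Flip an HxWxC image left-right and mirror its boxes to match."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = np.asarray(boxes_xyxy, dtype=float).reshape(-1, 4)
    out = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa
    out[:, 0] = width - boxes[:, 2]
    out[:, 2] = width - boxes[:, 0]
    return flipped, out


image = np.zeros((100, 200, 3), dtype=np.uint8)
image[:, :50] = 255                      # bright stripe on the left
flipped, boxes = hflip_with_boxes(image, [[0, 10, 50, 60]])
print(boxes.tolist())
```

After the flip, the stripe and its box both sit on the right-hand side of the image, which is a quick sanity check worth running on any custom augmentation.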

Comparison: YOLO vs Detectron2

Aspect	YOLO	Detectron2
Speed	Very fast (real-time)	Slower (more accurate)
Accuracy	Good	Excellent
Ease of Use	Very easy	Moderate
Segmentation	Limited (YOLOv8-seg)	Excellent (Mask R-CNN)
Best For	Real-time, embedded	Research, high accuracy

Complete Example: Custom Object Detector

from ultralytics import YOLO
import cv2
import numpy as np

class ObjectDetector:
    def __init__(self, model_path='yolov8n.pt', conf_threshold=0.5):
        self.model = YOLO(model_path)
        self.conf_threshold = conf_threshold

    def detect(self, image_path):
        """Detect objects in image."""
        results = self.model(image_path, conf=self.conf_threshold)
        return results[0]

    def detect_and_count(self, image_path):
        """Detect and count objects by class."""
        results = self.detect(image_path)
        counts = {}

        for box in results.boxes:
            class_name = self.model.names[int(box.cls[0])]
            counts[class_name] = counts.get(class_name, 0) + 1

        return counts

    def detect_specific_class(self, image_path, target_class):
        """Detect only specific class of objects."""
        results = self.detect(image_path)
        detections = []

        for box in results.boxes:
            class_name = self.model.names[int(box.cls[0])]
            if class_name == target_class:
                x1, y1, x2, y2 = box.xyxy[0]
                confidence = box.conf[0]
                detections.append({
                    'bbox': [int(x1), int(y1), int(x2), int(y2)],
                    'confidence': float(confidence)
                })

        return detections

    def track_video(self, video_path, output_path):
        """Track objects in video."""
        cap = cv2.VideoCapture(video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # Fall back to 30 if the source reports 0

        # Video writer
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Detect with tracking
            results = self.model.track(frame, persist=True, conf=self.conf_threshold)
            annotated = results[0].plot()

            out.write(annotated)

        cap.release()
        out.release()

# Usage
detector = ObjectDetector('yolov8n.pt', conf_threshold=0.5)

# Count objects
counts = detector.detect_and_count('street.jpg')
print(f"Object counts: {counts}")

# Detect only people
people = detector.detect_specific_class('crowd.jpg', 'person')
print(f"Found {len(people)} people")

# Track objects in video
detector.track_video('input.mp4', 'output_tracked.mp4')
