What is Object Detection?

Object detection is a computer vision task that involves identifying and locating objects within images or videos. Unlike image classification (which tells you what's in an image), object detection tells you what objects are present and where they are.

For each detected object, the model outputs:

  • Class label: What the object is (e.g., "person", "car", "dog")
  • Bounding box: Rectangle coordinates (x, y, width, height) around the object
  • Confidence score: How certain the model is about the detection (0-1)

Example: In a street photo, object detection might find 3 people, 2 cars, and 1 traffic light, each with its own bounding box and confidence score (for instance, one person whose box has its top-left corner at (50, 100)).
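
As a rough illustration of the three outputs listed above, one detection could be held in a small record like the one below. The `Detection` class, its field names, and all coordinates and scores are hypothetical and chosen only for this sketch; real libraries use their own result types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical container, not a library type)."""
    label: str         # class label, e.g. "person"
    x: float           # top-left corner, in pixels
    y: float
    width: float
    height: float
    confidence: float  # between 0.0 and 1.0

    @property
    def area(self) -> float:
        return self.width * self.height


# Made-up values, loosely following the street-photo example.
detections = [
    Detection("person", 50, 100, 40, 120, 0.92),
    Detection("car", 200, 150, 180, 90, 0.88),
    Detection("traffic light", 400, 50, 20, 60, 0.71),
]

# Keep only confident detections and count them per class.
confident = [d for d in detections if d.confidence >= 0.8]
counts = {}
for d in confident:
    counts[d.label] = counts.get(d.label, 0) + 1
print(counts)
```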

Why Object Detection Matters

Object detection powers critical applications across industries:

  • Autonomous Vehicles: Detect pedestrians, vehicles, traffic signs, lanes
  • Surveillance & Security: Identify intruders, suspicious activities, abandoned objects
  • Retail: Track inventory, monitor shelves, analyze customer behavior
  • Healthcare: Detect tumors in medical scans, count cells, identify abnormalities
  • Manufacturing: Quality control, defect detection on assembly lines
  • Agriculture: Crop monitoring, disease detection, yield estimation
  • Sports Analytics: Player tracking, ball detection, tactical analysis
  • Augmented Reality: Recognize objects to overlay digital content

When to Use Object Detection

Choose object detection when you need to:

  • Find and count multiple objects in an image
  • Know the precise location of objects (not just their presence)
  • Process images or video streams in real time
  • Track objects across video frames
  • Build systems that react to objects' positions (robotics, AR)

Note: If you only need to know whether an object is present (not where), use image classification instead. If you need pixel-level outlines rather than rectangles, use instance segmentation.

Evolution of Object Detection

Traditional Methods (Pre-Deep Learning)

  • Sliding Windows: Move a window across the image and classify each patch (very slow)
  • HOG + SVM: Histogram of Oriented Gradients features with SVM classifier
  • DPM: Deformable Part Models for detecting object parts
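
To see why the sliding-window approach is slow, it helps to count the patches. The sketch below is illustrative only: the function name, the 64x128 window, and the 16-pixel stride are assumptions, and a real detector would also repeat this over several image scales.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x, y, win_w, win_h) for every window that fits inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


# One window size at one scale on a 640x480 image.
windows = list(sliding_windows(640, 480, win_w=64, win_h=128, stride=16))
print(len(windows))  # 37 columns * 23 rows = 851 patches to classify
```

Each of those patches needs its own classifier call, which is the cost that later detectors avoid by sharing computation.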

Two-Stage Detectors

  • R-CNN (2014): Region proposals + CNN classification (slow but accurate)
  • Fast R-CNN (2015): Share CNN features across proposals (faster)
  • Faster R-CNN (2015): Neural network generates proposals (Region Proposal Network)
  • Mask R-CNN (2017): Adds instance segmentation masks

One-Stage Detectors (Real-Time)

  • YOLO (2016): "You Only Look Once" - extremely fast, single forward pass
  • SSD (2016): Single Shot Detector with multiple feature maps
  • RetinaNet (2017): Focal loss to handle class imbalance
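
The focal loss mentioned for RetinaNet down-weights examples the model already classifies well, so the many easy background boxes do not dominate training. Below is a minimal NumPy sketch of the binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name is hypothetical, and the defaults alpha=0.25 and gamma=2 follow the values commonly quoted from the RetinaNet paper.

```python
import numpy as np


def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-example binary focal loss.

    probs:   predicted probability of the positive class
    targets: 1 for positive examples, 0 for negatives
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    targets = np.asarray(targets, dtype=float)
    # p_t is the probability assigned to the true class.
    p_t = np.where(targets == 1, probs, 1 - probs)
    alpha_t = np.where(targets == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)


# An easy negative (p=0.01) versus a hard positive (p=0.1).
losses = focal_loss([0.01, 0.1], [0, 1])
print(losses)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy.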

Understanding Bounding Boxes

A bounding box is defined by 4 values:

  • x, y: Top-left corner coordinates
  • w, h: Width and height

Alternative format: (x_min, y_min, x_max, y_max) - top-left and bottom-right corners.
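
Mixing up the two formats is a common source of bugs, so a pair of small converters can help. This is a minimal sketch; the function names are hypothetical.

```python
def xywh_to_xyxy(box):
    """(x, y, width, height) -> (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """(x_min, y_min, x_max, y_max) -> (x, y, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


corners = xywh_to_xyxy((50, 100, 40, 120))
print(corners)                # (50, 100, 90, 220)
print(xyxy_to_xywh(corners))  # (50, 100, 40, 120)
```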

IoU (Intersection over Union)

IoU measures overlap between predicted and ground truth boxes:

IoU = Area of Overlap / Area of Union

  • IoU = 1.0: Perfect match
  • IoU > 0.5: Usually considered a good detection
  • IoU < 0.5: Poor detection

def calculate_iou(box1, box2):
    """Calculate IoU between two bounding boxes.
    Boxes format: [x, y, width, height]
    """
    # Get coordinates
    x1_min, y1_min = box1[0], box1[1]
    x1_max, y1_max = box1[0] + box1[2], box1[1] + box1[3]
    x2_min, y2_min = box2[0], box2[1]
    x2_max, y2_max = box2[0] + box2[2], box2[1] + box2[3]

    # Calculate intersection
    inter_x_min = max(x1_min, x2_min)
    inter_y_min = max(y1_min, y2_min)
    inter_x_max = min(x1_max, x2_max)
    inter_y_max = min(y1_max, y2_max)

    inter_area = max(0, inter_x_max - inter_x_min) * max(0, inter_y_max - inter_y_min)

    # Calculate union
    box1_area = box1[2] * box1[3]
    box2_area = box2[2] * box2[3]
    union_area = box1_area + box2_area - inter_area

    # Calculate IoU
    iou = inter_area / union_area if union_area > 0 else 0
    return iou

YOLO: You Only Look Once

YOLO revolutionized object detection by framing it as a single regression problem instead of running a classifier over many candidate regions. It predicts bounding boxes and class probabilities directly from the full image in a single forward pass.

How YOLO Works

  1. Divide image into grid: E.g., 13x13 grid cells
  2. Each cell predicts: Multiple bounding boxes (usually 2-3) and class probabilities
  3. For each box: Predict x, y, width, height, confidence, and class scores
  4. Non-Max Suppression: Remove duplicate detections
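
The grid steps above can be sketched as a decoding function that turns one cell's raw prediction into a pixel box. This assumes a simplified YOLOv1-style encoding, where the center is an offset inside the cell and the size is a fraction of the whole image. Later YOLO versions use anchors and different transforms, and the function name here is hypothetical.

```python
def decode_cell_prediction(col, row, tx, ty, tw, th, grid_size, image_w, image_h):
    """Convert one grid-cell prediction to (x, y, width, height) in pixels.

    col, row: index of the grid cell that made the prediction
    tx, ty:   box center as an offset within that cell, each in [0, 1]
    tw, th:   box size as a fraction of the image width and height
    """
    center_x = (col + tx) / grid_size * image_w
    center_y = (row + ty) / grid_size * image_h
    width = tw * image_w
    height = th * image_h
    # Return the top-left corner plus size, matching the (x, y, w, h) format.
    return (center_x - width / 2, center_y - height / 2, width, height)


# Center cell of a 13x13 grid on a 416x416 image (each cell is 32 pixels wide).
box = decode_cell_prediction(6, 6, 0.5, 0.5, 0.25, 0.5, 13, 416, 416)
print(box)  # (156.0, 104.0, 104.0, 208.0)
```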

YOLO Versions

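  • YOLOv1-v3 (2016-2018): Original implementations by Joseph Redmon
  • YOLOv4 (2020): Improved accuracy with a "bag of tricks" (training and architecture refinements)
  • YOLOv5 (2020): PyTorch implementation from Ultralytics, easier to use
  • YOLOv7 (2022): State-of-the-art speed/accuracy trade-off among real-time detectors at its release
  • YOLOv8 (2023): Ultralytics release with a user-friendly Python API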

Using YOLOv8 with Ultralytics

# Install
# pip install ultralytics

from ultralytics import YOLO
import cv2
import matplotlib.pyplot as plt

# Load pre-trained model
model = YOLO('yolov8n.pt')  # n=nano, s=small, m=medium, l=large, x=xlarge

# Predict on image
results = model('street_scene.jpg')

# Access results
for result in results:
    boxes = result.boxes  # Bounding boxes
    for box in boxes:
        # Get box coordinates as plain Python numbers (the raw values are tensors)
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # Top-left, bottom-right
        confidence = float(box.conf[0])
        class_id = int(box.cls[0])
        class_name = model.names[class_id]

        print(f"Detected {class_name} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}] "
              f"with confidence {confidence:.2f}")

# Visualize results
annotated = results[0].plot()  # Draw boxes on image
plt.imshow(cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()

# Save results
results[0].save('output.jpg')

Real-Time Video Detection

import cv2
from ultralytics import YOLO

# Load model
model = YOLO('yolov8n.pt')

# Open video
cap = cv2.VideoCapture(0)  # 0 for webcam, or path to video file

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run detection
    results = model(frame, verbose=False)

    # Draw results
    annotated_frame = results[0].plot()

    # Display
    cv2.imshow('YOLO Detection', annotated_frame)

    # Press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Training Custom YOLO Model

from ultralytics import YOLO

# Load a model
model = YOLO('yolov8n.pt')  # Start with pre-trained weights

# Train on custom dataset
# Dataset should be in YOLO format with .yaml config
results = model.train(
    data='custom_dataset.yaml',  # Path to dataset config
    epochs=100,
    imgsz=640,
    batch=16,
    name='custom_detector',
    pretrained=True,
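    optimizer='Adam',
    lr0=0.001,  # Adam usually needs a lower initial rate than the 0.01 used for SGD
    device=0  # GPU index; use 'cpu' if no GPU is available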
)

# Validate
metrics = model.val()
print(f"mAP50: {metrics.box.map50}")
print(f"mAP50-95: {metrics.box.map}")

# Export for deployment
model.export(format='onnx')  # or 'torchscript', 'tflite', etc.

Dataset Format (YAML)

# custom_dataset.yaml
path: /path/to/dataset
train: images/train
val: images/val
test: images/test

# Classes
names:
  0: person
  1: car
  2: bicycle
  3: dog

Annotation Format

Each image needs a corresponding .txt file with one line per object:

# class_id x_center y_center width height (all normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.15 0.2
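
Most annotation tools export pixel coordinates, so the values usually have to be normalized before they fit this format. Below is a minimal conversion sketch; the function name and the 640x480 image size are assumptions made for the example.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, image_w, image_h):
    """Convert a pixel-space corner box to one YOLO annotation line."""
    x_center = (x_min + x_max) / 2 / image_w
    y_center = (y_min + y_max) / 2 / image_h
    width = (x_max - x_min) / image_w
    height = (y_max - y_min) / image_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 192x192 pixel box centered in a 640x480 image.
line = to_yolo_line(0, 224, 144, 416, 336, image_w=640, image_h=480)
print(line)  # 0 0.500000 0.500000 0.300000 0.400000
```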

Detectron2: Facebook's Detection Framework

Detectron2 is a powerful library from Facebook AI Research (FAIR) supporting state-of-the-art detection and segmentation models.

Key Features

  • Faster R-CNN, Mask R-CNN, RetinaNet implementations
  • Instance segmentation support
  • Panoptic segmentation
  • Keypoint detection
  • Highly modular and customizable

Using Detectron2

# Install (built from source; prebuilt wheels cover only older PyTorch/CUDA versions)
# python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

import detectron2
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
import cv2

# Configure model
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # Confidence threshold
cfg.MODEL.DEVICE = "cuda"  # or "cpu"

# Create predictor
predictor = DefaultPredictor(cfg)

# Run inference
image = cv2.imread("image.jpg")
outputs = predictor(image)

# Access predictions
instances = outputs["instances"]
boxes = instances.pred_boxes  # Bounding boxes
scores = instances.scores      # Confidence scores
classes = instances.pred_classes  # Class IDs

# Visualize
v = Visualizer(
    image[:, :, ::-1],
    MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),
    scale=1.0
)
out = v.draw_instance_predictions(instances.to("cpu"))
cv2.imwrite("output.jpg", out.get_image()[:, :, ::-1])

Training with Detectron2

import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register your dataset (COCO format)
register_coco_instances(
    "my_dataset_train",
    {},
    "path/to/train_annotations.json",
    "path/to/train_images"
)
register_coco_instances(
    "my_dataset_val",
    {},
    "path/to/val_annotations.json",
    "path/to/val_images"
)

# Configure training
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
))
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ("my_dataset_val",)
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
)
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4  # Your number of classes

# Train (checkpoints and logs are written to cfg.OUTPUT_DIR, which must exist)
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

Non-Maximum Suppression (NMS)

NMS removes duplicate detections of the same object by keeping only the highest-confidence box among overlapping detections.

Algorithm

  1. Sort boxes by confidence score (highest first)
  2. Select box with highest score
  3. Remove all boxes with IoU > threshold (e.g., 0.5) with selected box
  4. Repeat for remaining boxes

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """
    boxes: array of [x, y, width, height]
    scores: confidence scores
    """
    # Sort by score
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        # Pick highest score box
        i = order[0]
        keep.append(i)

        # Calculate IoU with remaining boxes
        ious = np.array([calculate_iou(boxes[i], boxes[j]) for j in order[1:]])

        # Keep only boxes whose IoU with the selected box does not exceed the threshold
        remaining = np.where(ious <= iou_threshold)[0]
        order = order[remaining + 1]

    return keep
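
The loop above computes IoU one pair at a time, which gets slow with thousands of candidate boxes. A vectorized variant is sketched below as one possible alternative. It assumes corner-format boxes (x_min, y_min, x_max, y_max) rather than the [x, y, width, height] format used above, and the function name is hypothetical.

```python
import numpy as np


def nms_xyxy(boxes, scores, iou_threshold=0.5):
    """Greedy NMS on corner-format boxes; returns indices of kept boxes."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)

    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]

        # IoU of the selected box against all remaining boxes at once.
        inter_w = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        inter_h = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = inter_w * inter_h
        union = areas[i] + areas[rest] - inter
        iou = np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)

        # Drop boxes that overlap the selected one too much.
        order = rest[iou <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nms_xyxy(boxes, scores))  # [0, 2]: the second box overlaps the first (IoU 81/119)
```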

Evaluation Metrics

Precision and Recall

  • Precision: Of detected objects, how many are correct?
  • Recall: Of all ground truth objects, how many did we detect?
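
To make the two definitions concrete, the sketch below matches predictions to ground truth boxes and counts the outcomes. It uses a simplified greedy matching (each ground truth box can be claimed once, by the highest-ranked prediction that overlaps it enough). Benchmark toolkits differ in the details, and the function names are hypothetical.

```python
def iou_xyxy(a, b):
    """IoU for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred_boxes, gt_boxes, iou_threshold=0.5):
    """pred_boxes must already be sorted by descending confidence."""
    matched = set()  # indices of ground truth boxes already claimed
    true_positives = 0
    for pred in pred_boxes:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            iou = iou_xyxy(pred, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            true_positives += 1

    precision = true_positives / len(pred_boxes) if pred_boxes else 0.0
    recall = true_positives / len(gt_boxes) if gt_boxes else 0.0
    return precision, recall


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(1, 1, 11, 11), (50, 50, 60, 60), (100, 100, 110, 110)]
precision, recall = precision_recall(predictions, ground_truth)
print(precision, recall)  # 1 of 3 predictions correct, 1 of 2 objects found
```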

Average Precision (AP)

Area under the Precision-Recall curve for a single class.
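
One way to compute that area is sketched below, assuming each prediction has already been marked as a true or false positive. It uses all-point interpolation, which is one common convention; COCO-style evaluation samples the curve at fixed recall points instead. The function name is hypothetical.

```python
import numpy as np


def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    if num_ground_truth == 0:
        return 0.0
    order = np.argsort(scores)[::-1]  # rank predictions by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)

    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)

    # Make precision non-increasing from left to right (interpolation step).
    precision = np.maximum.accumulate(precision[::-1])[::-1]

    # Sum precision over each increase in recall.
    previous_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - previous_recall) * precision))


# Four ranked predictions for a class with two ground truth objects.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], num_ground_truth=2)
print(round(ap, 4))  # 0.5 * 1.0 + 0.5 * (2/3) = 0.8333
```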

Mean Average Precision (mAP)

Average of AP across all classes. It is the standard headline metric for object detection benchmarks.

  • mAP@0.5: IoU threshold of 0.5 (easier)
  • mAP@0.75: IoU threshold of 0.75 (stricter)
  • mAP@0.5:0.95: Average across IoU thresholds 0.5 to 0.95 (COCO metric)
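
The arithmetic behind these variants is just two levels of averaging, as the sketch below shows. The table layout, function names, and AP values are made up for illustration.

```python
def map_at(ap_table, iou_threshold):
    """Mean of per-class AP at a single IoU threshold."""
    return sum(per_thr[iou_threshold] for per_thr in ap_table.values()) / len(ap_table)


def map_over(ap_table, thresholds):
    """Mean of mAP over several IoU thresholds."""
    return sum(map_at(ap_table, t) for t in thresholds) / len(thresholds)


# The ten thresholds behind mAP@0.5:0.95: 0.50, 0.55, ..., 0.95.
coco_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical per-class AP values at two thresholds.
ap_table = {
    "person": {0.5: 0.9, 0.75: 0.7},
    "car": {0.5: 0.8, 0.75: 0.6},
}
print(map_at(ap_table, 0.5))             # mAP@0.5, about 0.85
print(map_at(ap_table, 0.75))            # mAP@0.75, about 0.65
print(map_over(ap_table, [0.5, 0.75]))   # average of the two, about 0.75
```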

Best Practices

  • Data Quality: Accurate bounding box annotations are crucial
  • Data Augmentation: Use flipping, rotation, color jittering, mosaic augmentation
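  • Anchor Boxes: For anchor-based detectors, choose anchor sizes based on your object sizes (anchor-free models such as YOLOv8 skip this step)
  • Image Size: Larger input sizes (e.g., 1280 instead of 640) can improve accuracy on small objects but slow inference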
  • Batch Size: Larger batches stabilize training but need more GPU memory
  • Learning Rate: Use warmup and decay schedules
  • Class Imbalance: Address with focal loss or weighted sampling
  • Transfer Learning: Start from pre-trained weights (e.g., trained on COCO) unless you have a very large dataset of your own
  • Monitor mAP: Track mAP during training, not just loss
  • Test Time Augmentation: Run inference on flipped/scaled versions for better results
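
The warmup-and-decay advice can be sketched as a plain function of the training step. This is one possible schedule (linear warmup followed by cosine decay), not the schedule of any particular framework; the function name and all default values are assumptions.

```python
import math


def learning_rate(step, total_steps, base_lr=0.01, warmup_steps=100, final_lr=0.0001):
    """Linear warmup to base_lr, then cosine decay down to final_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early updates on random heads stay small.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine


for step in (0, 50, 99, 550, 1000):
    print(step, learning_rate(step, total_steps=1000))
```

The rate rises during the first 100 steps, peaks at `base_lr`, and then falls smoothly toward `final_lr` by the last step.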

Comparison: YOLO vs Detectron2

Aspect       | YOLO                  | Detectron2
Speed        | Very fast (real-time) | Slower (more accurate)
Accuracy     | Good                  | Excellent
Ease of Use  | Very easy             | Moderate
Segmentation | Limited (YOLOv8-seg)  | Excellent (Mask R-CNN)
Best For     | Real-time, embedded   | Research, high accuracy

Complete Example: Custom Object Detector

from ultralytics import YOLO
import cv2
import numpy as np

class ObjectDetector:
    def __init__(self, model_path='yolov8n.pt', conf_threshold=0.5):
        self.model = YOLO(model_path)
        self.conf_threshold = conf_threshold

    def detect(self, image_path):
        """Detect objects in image."""
        results = self.model(image_path, conf=self.conf_threshold)
        return results[0]

    def detect_and_count(self, image_path):
        """Detect and count objects by class."""
        results = self.detect(image_path)
        counts = {}

        for box in results.boxes:
            class_name = self.model.names[int(box.cls[0])]
            counts[class_name] = counts.get(class_name, 0) + 1

        return counts

    def detect_specific_class(self, image_path, target_class):
        """Detect only specific class of objects."""
        results = self.detect(image_path)
        detections = []

        for box in results.boxes:
            class_name = self.model.names[int(box.cls[0])]
            if class_name == target_class:
                x1, y1, x2, y2 = box.xyxy[0]
                confidence = box.conf[0]
                detections.append({
                    'bbox': [int(x1), int(y1), int(x2), int(y2)],
                    'confidence': float(confidence)
                })

        return detections

    def track_video(self, video_path, output_path):
        """Track objects in video."""
        cap = cv2.VideoCapture(video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # Keep fractional rates; fall back to 30 if unknown

        # Video writer
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Detect with tracking
            results = self.model.track(frame, persist=True, conf=self.conf_threshold)
            annotated = results[0].plot()

            out.write(annotated)

        cap.release()
        out.release()

# Usage
detector = ObjectDetector('yolov8n.pt', conf_threshold=0.5)

# Count objects
counts = detector.detect_and_count('street.jpg')
print(f"Object counts: {counts}")

# Detect only people
people = detector.detect_specific_class('crowd.jpg', 'person')
print(f"Found {len(people)} people")

# Track objects in video
detector.track_video('input.mp4', 'output_tracked.mp4')
