YOLO-TLP: Tiny and Large-scale Precision Detection

YOLO-TLP (You Only Look Once - Tiny and Large-scale Precision) is a specialized object detection model optimized for small object detection in challenging real-world scenarios. Built upon the YOLO architecture, YOLO-TLP introduces novel modifications that preserve spatial information during downsampling, enabling superior detection of tiny objects (8-128 pixels) while maintaining real-time inference speeds.

PyTorch · Real-time Detection · Small Objects · 1.9M Parameters

Introduction

Object detection has made remarkable progress in recent years, with YOLO-series models achieving state-of-the-art performance on standard benchmarks. However, a persistent challenge remains: detecting small objects (objects occupying less than 32×32 pixels in an image). Traditional downsampling operations in convolutional neural networks cause significant information loss, particularly affecting small objects whose spatial features are already limited.

YOLO-TLP addresses this fundamental limitation through architectural innovations that prioritize spatial information preservation. The model achieves:

  • 39% faster inference compared to YOLOv12n (1.7ms vs 2.8ms per image)
  • Superior small object detection with 5-16% improvement on pedestrian, people, bicycle, and motorcycle classes
  • 25% parameter reduction (1.9M vs 2.5M parameters) for efficient deployment
  • Multi-scale detection at P2 (160×160), P3 (80×80), and P4 (40×40) feature levels

Problem Statement

Standard object detectors face several challenges with small objects:

  1. Information Loss: Traditional stride-2 convolutions discard 75% of spatial information during downsampling
  2. Limited Features: Small objects have fewer pixels, providing minimal discriminative features
  3. Scale Mismatch: Detection heads optimized for large objects struggle with tiny targets
  4. Low Resolution: Deep network layers operate on highly downsampled feature maps where small objects may occupy only 1-2 pixels

Real-world Impact: Small object detection failures can have serious consequences in applications like autonomous driving (missing pedestrians), surveillance (undetected intruders), and medical imaging (overlooked lesions).

Importance and Applications

Why Small Object Detection Matters

Small object detection is critical across numerous domains where objects of interest occupy minimal image area but carry significant importance:

| Domain | Application | Impact |
|--------|-------------|--------|
| Autonomous Systems | Pedestrian and cyclist detection | Safety-critical: preventing collisions with vulnerable road users |
| Surveillance | Crowd monitoring, intrusion detection | Security: identifying threats in crowded environments |
| Medical Imaging | Early lesion detection, cell counting | Healthcare: early disease diagnosis and treatment |
| Aerial Imagery | Vehicle/building detection from drones/satellites | Urban planning, disaster response, agriculture |
| Industrial Inspection | Defect detection, quality control | Manufacturing: reducing defective products |
| Robotics | Obstacle detection, manipulation | Navigation: avoiding collisions with small obstacles |

Performance Advantages

Benchmark Results on VisDrone Dataset:

  • Pedestrian detection: +5.0% mAP50 (0.357 vs 0.340)
  • People detection: +8.0% mAP50 (0.282 vs 0.261)
  • Bicycle detection: +12.4% mAP50 (0.080 vs 0.071)
  • Motorcycle detection: +2.6% mAP50 (0.357 vs 0.348)
  • Overall small objects: +7.1% average improvement

Deployment Benefits

YOLO-TLP's efficiency enables deployment in resource-constrained environments:

  • Edge Devices: Runs on embedded systems with limited GPU memory (NVIDIA Jetson, mobile devices)
  • Real-time Processing: 39% faster inference enables processing of high-frame-rate video streams
  • Scalability: Small model size (3.8MB) reduces bandwidth for distributed deployments
  • Cost-Effective: Lower computational requirements reduce power consumption and hardware costs

Novel Contributions

1. Space-to-Depth Convolution (SPDConv)

The core innovation of YOLO-TLP is the integration of SPDConv modules at critical downsampling stages (P2→P3 and P3→P4 transitions).

Problem with Traditional Downsampling

Standard stride-2 convolution:

```
Output = Conv(Input, stride=2)
```

This operation discards 75% of spatial information by sampling every other pixel. For a small object occupying 4×4 pixels, stride-2 downsampling reduces it to 2×2 pixels, losing critical details.
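A toy sketch makes the sampling pattern concrete. It uses a 1×1 kernel so the sampled positions are explicit; larger kernels blend neighboring pixels, but each 2×2 block still collapses to a single output location:

```python
import torch
import torch.nn.functional as F

# A stride-2 convolution evaluates its kernel only at every other row and
# column; with a 1x1 identity kernel, the sampled positions are explicit.
x = torch.arange(16.0).reshape(1, 1, 4, 4)  # a 4x4 patch with values 0..15
w = torch.ones(1, 1, 1, 1)                  # 1x1 identity kernel
y = F.conv2d(x, w, stride=2)
print(x.squeeze())  # all 16 values
print(y.squeeze())  # tensor([[ 0.,  2.], [ 8., 10.]]) -- 4 of 16 positions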

SPDConv Solution

Space-to-Depth operation rearranges spatial information into channels:

  1. Space-to-Depth: Convert H×W×C to (H/2)×(W/2)×4C by rearranging 2×2 spatial blocks into channels
  2. Convolution: Apply standard convolution to process the rearranged features
  3. Result: Zero information loss while achieving spatial downsampling

Mathematical Formulation:

Given input X ∈ ℝ^(H×W×C), SPDConv produces:

```
Y = Conv(Rearrange(X)) ∈ ℝ^(H/2×W/2×C')
```

where Rearrange preserves all spatial information by converting it to channel dimension.
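A minimal PyTorch sketch of the operation (module structure and layer choices are illustrative and may differ from the exact YOLO-TLP block):

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth downsampling: rearrange (H, W, C) -> (H/2, W/2, 4C),
    then apply a stride-1 convolution, so no spatial information is discarded."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # The convolution consumes the 4x channel expansion from space-to-depth.
        self.conv = nn.Conv2d(4 * in_channels, out_channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # Gather the four interleaved 2x2 sub-grids and stack them along channels.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))

# Example: the P2 -> P3 transition halves resolution but keeps every pixel.
p2 = torch.randn(1, 64, 160, 160)
print(SPDConv(64, 128)(p2).shape)  # torch.Size([1, 128, 80, 80])
```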

Impact

  • Gradient Flow: Better backpropagation through preserved spatial information
  • Feature Richness: More discriminative features for small objects
  • Detection Accuracy: 8-16% improvement on small object classes

2. P2-Level Detection Head

YOLO-TLP extends detection to the P2 feature level (stride 4), providing 4× higher spatial resolution than standard P3 (stride 8) detection.

Resolution Analysis

| Level | Resolution (640×640 input) | Stride | Best For |
|-------|----------------------------|--------|----------|
| P2/4  | 160×160 | 4  | Tiny objects (8-32px) |
| P3/8  | 80×80   | 8  | Small objects (16-64px) |
| P4/16 | 40×40   | 16 | Medium objects (32-128px) |

Benefits

  • Higher Recall: Detects objects that would be invisible at coarser scales
  • Precise Localization: Tighter bounding boxes with 16× more spatial detail
  • Scale Coverage: Multi-scale detection from 8px to 128px objects
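
To make the scale coverage concrete: dividing an object's pixel size by a level's stride gives the number of feature-map cells the object spans per side. The short sketch below tabulates this for the three levels:

```python
# How many feature-map cells a square object spans (per side) at each level.
levels = {"P2": 4, "P3": 8, "P4": 16}
for size_px in (8, 16, 32):
    spans = {lvl: size_px / stride for lvl, stride in levels.items()}
    print(f"{size_px:>2}px object:",
          ", ".join(f"{lvl} = {s:.1f} cells" for lvl, s in spans.items()))
# An 8px object covers 2x2 cells at P2 but only 0.5x0.5 at P4, which is why
# the P2 head is needed for the tiniest targets.
```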

3. Architectural Optimizations

Position-Sensitive Attention (PSA)

Integrated PSA module in the backbone enhances spatial feature representation:

  • Lightweight: Minimal parameter overhead (65K parameters)
  • Context-Aware: Captures long-range spatial dependencies
  • Selective: Emphasizes informative regions for small objects

Efficient Backbone Design

Streamlined architecture with C2fCIB and C3k2 blocks:

  • C2fCIB: CSP bottleneck with Conditional Identity Block for efficient feature extraction
  • C3k2: Cross Stage Partial network with 3×3 kernels for balanced receptive field
  • Layer Reduction: Optimized depth at each stage for speed-accuracy balance

4. FPN + PAN Architecture

Enhanced Feature Pyramid Network (FPN) with Path Aggregation Network (PAN):

Top-Down Path (FPN)

Propagates strong semantic features from deep layers to shallow layers through upsampling and concatenation.

Bottom-Up Path (PAN)

Propagates precise localization features from shallow layers back to deep layers through downsampling and concatenation, shortening the path between fine-grained spatial detail and the deeper detection heads.
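
A minimal PyTorch sketch of this fusion pattern over P2/P3/P4 (channel widths, layer names, and block choices are illustrative assumptions, not the exact YOLO-TLP layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FpnPanNeck(nn.Module):
    """Toy FPN + PAN fusion over P2/P3/P4 feature maps (illustrative only)."""
    def __init__(self, c2=64, c3=128, c4=256):
        super().__init__()
        # Top-down (FPN): 1x1 convs align channels before upsample + concat.
        self.lat4 = nn.Conv2d(c4, c3, 1)
        self.fuse3_td = nn.Conv2d(2 * c3, c3, 3, padding=1)
        self.lat3 = nn.Conv2d(c3, c2, 1)
        self.fuse2_td = nn.Conv2d(2 * c2, c2, 3, padding=1)
        # Bottom-up (PAN): stride-2 convs carry fine detail back down.
        self.down2 = nn.Conv2d(c2, c2, 3, stride=2, padding=1)
        self.fuse3_bu = nn.Conv2d(c2 + c3, c3, 3, padding=1)
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.fuse4_bu = nn.Conv2d(2 * c3, c4, 3, padding=1)

    def forward(self, p2, p3, p4):
        # Top-down: semantics flow deep -> shallow via upsampling.
        r4 = self.lat4(p4)
        t3 = self.fuse3_td(torch.cat([F.interpolate(r4, scale_factor=2.0), p3], 1))
        t2 = self.fuse2_td(torch.cat([F.interpolate(self.lat3(t3), scale_factor=2.0), p2], 1))
        # Bottom-up: localization detail flows shallow -> deep via downsampling.
        b3 = self.fuse3_bu(torch.cat([self.down2(t2), t3], 1))
        b4 = self.fuse4_bu(torch.cat([self.down3(b3), r4], 1))
        return t2, b3, b4  # fused P2/P3/P4 maps for the detection heads

# Example with the resolutions from the table above (640x640 input).
p2, p3, p4 = (torch.randn(1, c, s, s) for c, s in [(64, 160), (128, 80), (256, 40)])
for f in FpnPanNeck()(p2, p3, p4):
    print(f.shape)  # [1, 64, 160, 160], [1, 128, 80, 80], [1, 256, 40, 40]
```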

5. Training Strategy

Optimized training pipeline for small object detection:

  • Higher Resolution: Training on 1280×1280 images (vs standard 640×640) preserves small object details
  • Augmentation: Mosaic (0.9), Mixup (0.1), and scale augmentation (0.9) for robustness
  • Loss Weighting: Adjusted loss weights to emphasize small object detection
  • Optimizer: AdamW with learning rate 0.001 for stable convergence
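
As a sketch, the same recipe can be expressed through the Ultralytics Python API; the model YAML name below is a placeholder for your YOLO-TLP config, while the argument names are standard Ultralytics trainer options:

```python
from ultralytics import YOLO

# Hypothetical config path -- substitute your YOLO-TLP model YAML.
model = YOLO("yolo-tlp-n.yaml")

model.train(
    data="VisDrone.yaml",  # dataset config in YOLO format
    epochs=100,
    imgsz=1280,            # higher resolution preserves small-object detail
    optimizer="AdamW",
    lr0=0.001,             # initial learning rate
    mosaic=0.9,            # mosaic augmentation probability
    mixup=0.1,             # mixup augmentation probability
    scale=0.9,             # scale augmentation range
)
```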

Technical Specifications

Model Variants

| Model | Parameters | GFLOPs | Inference (ms) | mAP50 |
|-------|------------|--------|----------------|-------|
| YOLO-TLP-n | 1.9M  | 9.2   | 1.7  | 0.305 |
| YOLO-TLP-s | 7.1M  | 28.4  | 3.2  | TBD |
| YOLO-TLP-m | 16.8M | 67.3  | 6.8  | TBD |
| YOLO-TLP-l | 25.3M | 98.7  | 9.5  | TBD |
| YOLO-TLP-x | 39.7M | 154.2 | 14.3 | TBD |

Tested on NVIDIA RTX 3060 with FP32 precision

Architecture Diagram

YOLO-TLP Architecture

Performance on Small Objects (mAP50)

Small Object Detection Performance Comparison

Key Findings:

  • Pedestrian: +5.0% (0.357 vs 0.340) - Better person detection in crowds
  • People: +8.0% (0.282 vs 0.261) - Significant improvement on small people
  • Bicycle: +12.4% (0.080 vs 0.071) - Major boost for tiny vehicles
  • Awning-Tricycle: +7.3% (0.117 vs 0.109) - Better covered vehicle detection
  • Motor: +2.6% (0.357 vs 0.348) - Improved motorcycle detection

YOLO-TLP excels at detecting small objects because its SPDConv-based architecture preserves spatial information.

Strict Localization Performance (mAP50-95)

mAP50-95 Small Object Detection Performance

Key Findings:

  • People: +16.0% (0.109 vs 0.094) - Excellent localization accuracy
  • Bicycle: +14.0% (0.033 vs 0.029) - Better bounding box precision
  • Awning-Tricycle: +13.4% (0.076 vs 0.067) - Improved tight fit detection
  • Pedestrian: +9.1% (0.156 vs 0.143) - More accurate person boxes
  • Motor: +7.8% (0.152 vs 0.141) - Better motorcycle localization

mAP50-95 measures accuracy at stricter IoU thresholds. YOLO-TLP's superior performance shows it produces tighter, more accurate bounding boxes for small objects.
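
The sensitivity is easy to verify numerically: the same localization error costs a small box far more IoU than a large one. A minimal sketch in plain Python, with boxes in (x1, y1, x2, y2) format:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A 2px shift on a 10px box fails the stricter mAP50-95 IoU thresholds,
# while the same shift on a 100px box is almost harmless.
print(iou((0, 0, 10, 10), (2, 2, 12, 12)))      # ~0.47
print(iou((0, 0, 100, 100), (2, 2, 102, 102)))  # ~0.92
```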

Speed & Efficiency Comparison

Speed and Efficiency Comparison

Efficiency Analysis:

  • Speed Advantage: YOLO-TLP achieves 39% faster inference (1.7ms vs 2.8ms), crucial for real-time applications like surveillance and autonomous systems
  • Parameter Efficiency: 25% fewer parameters (1.9M vs 2.5M) enables deployment on edge devices with limited memory
  • Deployment Benefits: Smaller model size (3.8MB) reduces storage and bandwidth requirements
  • Trade-off: Higher GFLOPs (9.2 vs 5.9) due to SPDConv operations, but still delivers faster overall inference

YOLO-TLP achieves superior speed and efficiency through optimized architecture, making it ideal for resource-constrained environments requiring real-time small object detection.

All VisDrone Classes Comparison

YOLO-TLP Performs Better On:

  • Pedestrian: +5.0% (small person detection)
  • People: +8.0% (small/partial people)
  • Bicycle: +12.4% (tiny two-wheelers)
  • Car: +1.1% (common vehicles)
  • Awning-Tricycle: +7.3% (covered vehicles)
  • Motor: +2.6% (motorcycles/scooters)

6 out of 10 classes - Dominates in small object categories

YOLOv12n Performs Better On:

  • Van: -11.4% (medium vehicles)
  • Truck: -27.1% (large vehicles)
  • Tricycle: -14.8% (three-wheelers)
  • Bus: -20.4% (large public transport)

4 out of 10 classes - YOLO-TLP struggles with medium-large objects

YOLO-TLP is specifically optimized for small object detection, achieving superior performance on 6 out of 10 classes including all small objects (pedestrians, people, bicycles, motors). The trade-off is reduced performance on medium-large objects (van, truck, bus), resulting in a 5.6% lower overall mAP50 (0.305 vs 0.323). This makes YOLO-TLP ideal for applications where small object detection is the priority: surveillance, crowd monitoring, autonomous navigation, and similar use cases.

Precision vs Recall Analysis

  • Overall Precision: YOLOv12n (42.0%) vs YOLO-TLP (41.7%) - Nearly identical quality
  • Overall Recall: YOLOv12n (32.8%) vs YOLO-TLP (31.0%) - Both models conservative
  • Small Objects: YOLO-TLP shows better recall on pedestrians (37.3% vs 33.8%) and motors (37.8% vs 36.3%)
  • Large Objects: YOLOv12n has superior recall on trucks (31.3% vs 21.9%) and buses (44.6% vs 37.0%)
  • Balance: Both models prioritize precision over recall, avoiding false positives

Overall Model Capabilities Comparison

YOLOv12n Strengths

  • Higher overall mAP (0.323 vs 0.305)
  • Better medium-large object detection
  • Superior van detection (+11.4%)
  • Excellent truck detection (+27.1%)
  • Better bus detection (+20.4%)
  • Lower computational cost (5.9 vs 9.2 GFLOPs)
  • Fewer layers (159 vs 270)

YOLO-TLP Strengths

  • 39% faster inference (1.7ms vs 2.8ms)
  • 25% fewer parameters (1.9M vs 2.5M)
  • Superior small object detection
  • Better pedestrian detection (+5.0%)
  • Excellent people detection (+8.0%)
  • Best bicycle detection (+12.4%)
  • SPDConv zero information loss
  • P2-level detection for tiny objects

Choose YOLO-TLP For:

  • Surveillance & Security: Detecting people, pedestrians in crowds

  • Autonomous Navigation: Small obstacle detection (pedestrians, bicycles, motorcycles)

  • Crowd Monitoring: Real-time people counting and tracking

  • Traffic Analysis: Two-wheeler and small vehicle detection

  • Edge Deployment: Resource-constrained devices requiring speed

  • Real-time Applications: Where 39% faster inference matters

Choose YOLOv12n For:

  • General Object Detection: Balanced performance across all object sizes

  • Vehicle Detection: Cars, vans, trucks, buses (medium-large vehicles)

  • Logistics & Transport: Fleet monitoring, cargo detection

  • Parking Systems: Various vehicle types detection

  • Lower Computational Requirements: When GFLOPs matter (5.9 vs 9.2)


Qualitative comparison: YOLO-TLP results vs YOLOv12n results (side-by-side detection examples)


Downloads

Dataset

VisDrone Dataset (YOLO Format)

The VisDrone dataset has been preprocessed and converted to YOLO format for easy training and evaluation.

Download Dataset

Dataset Details:

  • Format: YOLO (txt annotations)
  • Images: 10,209 total
  • Classes: 10 object categories
  • Split: Train/Val/Test
  • Size: ~2.3GB

Pre-trained Weights

YOLO-TLP Model Weights

Download the pre-trained YOLO-TLP-n model weights trained on VisDrone dataset.

Download Weights

Training

Train from Scratch or Resume Training

```bash
yolo detect train \
  model=ultralytics/cfg/models/v12/yolov12.yaml \
  data=your_dataset.yaml \
  epochs=100 \
  imgsz=640 \
  batch=8 \
  device=0 \
  project=yoloTLP_runs \
  name=exp1
```

Inference

```bash
# Webcam inference
yolo predict model=weights/best.pt source=0 show=True
```

```bash
# Folder of images
yolo predict model=weights/best.pt source=test_images/ save=True
```

```bash
# RTSP stream
yolo predict model=weights/best.pt source=rtsp://your_stream_url
```
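
The same can be done through the Ultralytics Python API; a short sketch (weight and image paths are placeholders):

```python
from ultralytics import YOLO

# Load the trained weights (path is a placeholder).
model = YOLO("weights/best.pt")

# Run inference on a folder of images; annotated outputs are saved to disk.
results = model.predict(source="test_images/", save=True, imgsz=1280)

for r in results:
    # Each result carries boxes as (x1, y1, x2, y2), confidence, and class id.
    for box in r.boxes:
        print(box.xyxy, box.conf, box.cls)
```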


## References and Acknowledgments

### Built Upon

This work builds upon several excellent open-source projects:

- **[Ultralytics](https://github.com/ultralytics/ultralytics)** - YOLO framework and implementation

```bibtex
@software{ultralytics_yolo,
  author = {Glenn Jocher and others},
  title = {Ultralytics YOLO},
  year = {2023},
  url = {https://github.com/ultralytics/ultralytics}
}
```

- **[YOLOv12](https://github.com/sunsmarterjie/yolov12)** - Base architecture and improvements

```bibtex
@misc{yolov12,
  author = {sunsmarterjie},
  title = {YOLOv12},
  year = {2024},
  url = {https://github.com/sunsmarterjie/yolov12}
}
```

- **[SPD-Conv](https://github.com/LabSAINT/SPD-Conv)** - Space-to-Depth convolution module


```bibtex
@inproceedings{sunkara2022no,
  title = {No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects},
  author = {Sunkara, Raja and Luo, Tie},
  booktitle = {ECML PKDD},
  year = {2022}
}
```


### Special Thanks

We gratefully acknowledge the computer vision community for their contributions to open-source object detection research. Special thanks to:

- The YOLO community for continuous innovation
- VisDrone dataset creators for providing challenging small object detection benchmarks
- Contributors to PyTorch and related deep learning frameworks

---

## Citation

If you use YOLO-TLP in your research, please cite:
```bibtex
@article{yolotlp2025,
  title={YOLO-TLP: Enhanced Small Object Detection with Space-to-Depth Convolutions},
  author={Irfan Hussain},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```
