YOLO-TLP (You Only Look Once - Tiny and Large-scale Precision) is a specialized object detection model optimized for small object detection in challenging real-world scenarios. Built upon the YOLO architecture, YOLO-TLP introduces novel modifications that preserve spatial information during downsampling, enabling superior detection of tiny objects (8-128 pixels) while maintaining real-time inference speeds.
Object detection has made remarkable progress in recent years, with YOLO-series models achieving state-of-the-art performance on standard benchmarks. However, a persistent challenge remains: detecting small objects (objects occupying less than 32×32 pixels in an image). Traditional downsampling operations in convolutional neural networks cause significant information loss, particularly affecting small objects whose spatial features are already limited.
YOLO-TLP addresses this fundamental limitation through architectural innovations that prioritize spatial information preservation. The model achieves:
- 39% faster inference compared to YOLOv12n (1.7ms vs 2.8ms per image)
- Superior small object detection with 5-16% improvement on pedestrian, people, bicycle, and motorcycle classes
- 25% parameter reduction (1.9M vs 2.5M parameters) for efficient deployment
- Multi-scale detection at P2 (160×160), P3 (80×80), and P4 (40×40) feature levels
Standard object detectors face several challenges with small objects:
- Information Loss: Traditional stride-2 convolutions discard 75% of spatial information during downsampling
- Limited Features: Small objects have fewer pixels, providing minimal discriminative features
- Scale Mismatch: Detection heads optimized for large objects struggle with tiny targets
- Low Resolution: Deep network layers operate on highly downsampled feature maps where small objects may occupy only 1-2 pixels
Real-world Impact: Small object detection failures can have serious consequences in applications like autonomous driving (missing pedestrians), surveillance (undetected intruders), and medical imaging (overlooked lesions).
Small object detection is critical across numerous domains where objects of interest occupy minimal image area but carry significant importance:
| Domain | Application | Impact |
|---|---|---|
| Autonomous Systems | Pedestrian and cyclist detection | Safety-critical: preventing collisions with vulnerable road users |
| Surveillance | Crowd monitoring, intrusion detection | Security: identifying threats in crowded environments |
| Medical Imaging | Early lesion detection, cell counting | Healthcare: early disease diagnosis and treatment |
| Aerial Imagery | Vehicle/building detection from drones/satellites | Urban planning, disaster response, agriculture |
| Industrial Inspection | Defect detection, quality control | Manufacturing: reducing defective products |
| Robotics | Obstacle detection, manipulation | Navigation: avoiding collisions with small obstacles |
Benchmark Results on VisDrone Dataset:
- Pedestrian detection: +5.0% mAP50 (0.357 vs 0.340)
- People detection: +8.0% mAP50 (0.282 vs 0.261)
- Bicycle detection: +12.4% mAP50 (0.080 vs 0.071)
- Motorcycle detection: +2.6% mAP50 (0.357 vs 0.348)
- Overall small objects: +7.1% average improvement
YOLO-TLP's efficiency enables deployment in resource-constrained environments:
- Edge Devices: Runs on embedded systems with limited GPU memory (NVIDIA Jetson, mobile devices)
- Real-time Processing: 39% faster inference enables processing of high-frame-rate video streams
- Scalability: Small model size (3.8MB) reduces bandwidth for distributed deployments
- Cost-Effective: Lower computational requirements reduce power consumption and hardware costs
The core innovation of YOLO-TLP is the integration of SPDConv modules at critical downsampling stages (P2→P3 and P3→P4 transitions).
Standard stride-2 convolution: `Output = Conv(Input, stride=2)`
This operation discards 75% of spatial information by sampling every other pixel. For a small object occupying 4×4 pixels, stride-2 downsampling reduces it to 2×2 pixels, losing critical details.
Space-to-Depth operation rearranges spatial information into channels:
- Space-to-Depth: Convert H×W×C to (H/2)×(W/2)×4C by rearranging 2×2 spatial blocks into channels
- Convolution: Apply standard convolution to process the rearranged features
- Result: Zero information loss while achieving spatial downsampling
Mathematical Formulation:
Given input X ∈ ℝ^(H×W×C), SPDConv produces:
Y = Conv(Rearrange(X)) ∈ ℝ^((H/2)×(W/2)×C')
where Rearrange(X) ∈ ℝ^((H/2)×(W/2)×4C) preserves all spatial information by moving 2×2 spatial blocks into the channel dimension before the convolution.
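A minimal PyTorch sketch of this operation is shown below. It uses `torch.nn.PixelUnshuffle` as a convenient implementation of the Space-to-Depth rearrangement; the channel widths, normalization, and activation are illustrative assumptions, not necessarily the exact module used in YOLO-TLP.

```python
import torch
import torch.nn as nn


class SPDConv(nn.Module):
    """Space-to-Depth followed by a non-strided convolution.

    Sketch of the SPD-Conv idea (Sunkara & Luo, 2022): 2x2 spatial blocks are
    folded into the channel dimension, so downsampling discards no pixels; a
    stride-1 convolution then mixes the rearranged features.
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # (B, C, H, W) -> (B, 4C, H/2, W/2): lossless spatial-to-channel rearrangement
        self.space_to_depth = nn.PixelUnshuffle(downscale_factor=2)
        # Stride-1 convolution processes the rearranged features (no further downsampling)
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(self.space_to_depth(x))))


if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)   # e.g. a P3-level feature map
    y = SPDConv(64, 128)(x)
    print(y.shape)                   # torch.Size([1, 128, 40, 40])
```

Compared with a stride-2 convolution, this keeps every input pixel while halving spatial resolution, which yields: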
- Gradient Flow: Better backpropagation through preserved spatial information
- Feature Richness: More discriminative features for small objects
- Detection Accuracy: 8-16% improvement on small object classes
YOLO-TLP extends detection to the P2 feature level (stride 4), providing 4× higher spatial resolution than standard P3 (stride 8) detection.
| Level | Resolution (640×640 input) | Stride | Best For |
|---|---|---|---|
| P2/4 | 160×160 | 4 | Tiny objects (8-32px) |
| P3/8 | 80×80 | 8 | Small objects (16-64px) |
| P4/16 | 40×40 | 16 | Medium objects (32-128px) |
- Higher Recall: Detects objects that would be invisible at coarser scales
- Precise Localization: Tighter bounding boxes with 16× more spatial detail
- Scale Coverage: Multi-scale detection from 8px to 128px objects (see the sketch below)
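To make the scale coverage concrete, the small sketch below shows how many grid cells a 16-pixel object spans at each level of a 640×640 input. It is illustrative arithmetic only, not the model's label-assignment rule.

```python
import math


def grid_coverage(input_size: int = 640, strides=(4, 8, 16), object_px: int = 16) -> None:
    """Print grid resolution and per-side cell coverage of a square object at each level."""
    for stride in strides:
        grid = input_size // stride
        cells_per_side = object_px / stride
        print(f"P{int(math.log2(stride))}/{stride}: {grid}x{grid} grid, "
              f"a {object_px}px object spans ~{cells_per_side:.1f} cells per side")


grid_coverage()
# P2/4: 160x160 grid, a 16px object spans ~4.0 cells per side
# P3/8: 80x80 grid, a 16px object spans ~2.0 cells per side
# P4/16: 40x40 grid, a 16px object spans ~1.0 cells per side
```

At P4 a 16-pixel object collapses to roughly a single cell, which is why the added P2 head matters for tiny targets.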
Integrated PSA module in the backbone enhances spatial feature representation:
- Lightweight: Minimal parameter overhead (65K parameters)
- Context-Aware: Captures long-range spatial dependencies
- Selective: Emphasizes informative regions for small objects
Streamlined architecture with C2fCIB and C3k2 blocks:
- C2fCIB: CSP bottleneck with Compact Inverted Block (CIB) for efficient feature extraction
- C3k2: Cross Stage Partial network with 3×3 kernels for balanced receptive field
- Layer Reduction: Optimized depth at each stage for speed-accuracy balance
Enhanced Feature Pyramid Network (FPN) with Path Aggregation Network (PAN):
The top-down FPN path propagates strong semantic features from deep layers to shallow layers through upsampling and concatenation, while the bottom-up PAN path carries fine-grained localization features back toward the deeper detection levels.
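A minimal illustration of one top-down fusion step is given below; the shapes and channel counts are arbitrary and do not reflect the actual YOLO-TLP neck.

```python
import torch
import torch.nn.functional as F

# One top-down fusion step: upsample the deeper, semantically richer map
# and concatenate it with the shallower, higher-resolution map.
p4 = torch.randn(1, 256, 40, 40)   # deeper feature map (stride 16)
p3 = torch.randn(1, 128, 80, 80)   # shallower feature map (stride 8)

p4_up = F.interpolate(p4, scale_factor=2, mode="nearest")  # 40x40 -> 80x80
fused = torch.cat([p4_up, p3], dim=1)                      # concat along channels
print(fused.shape)                                         # torch.Size([1, 384, 80, 80])
```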
Optimized training pipeline for small object detection (see the example call after this list):
- Higher Resolution: Training on 1280×1280 images (vs standard 640×640) preserves small object details
- Augmentation: Mosaic (0.9), Mixup (0.1), and scale augmentation (0.9) for robustness
- Loss Weighting: Adjusted loss weights to emphasize small object detection
- Optimizer: AdamW with learning rate 0.001 for stable convergence
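A training call reflecting the settings above, using the Ultralytics Python API, is sketched below. The model config and dataset YAML paths are placeholders, and the exact recipe behind the reported results may differ.

```python
from ultralytics import YOLO

# Load the model configuration (placeholder path) and train with the
# small-object-oriented settings described above.
model = YOLO("yolo_tlp.yaml")  # hypothetical config name; substitute your own

model.train(
    data="VisDrone.yaml",   # dataset YAML (placeholder)
    epochs=100,
    imgsz=1280,             # higher resolution preserves small-object detail
    mosaic=0.9,             # mosaic augmentation probability
    mixup=0.1,              # mixup augmentation probability
    scale=0.9,              # scale augmentation range
    optimizer="AdamW",
    lr0=0.001,              # initial learning rate
)
```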
| Model | Parameters | GFLOPs | Inference (ms) | mAP50 |
|---|---|---|---|---|
| YOLO-TLP-n | 1.9M | 9.2 | 1.7 | 0.305 |
| YOLO-TLP-s | 7.1M | 28.4 | 3.2 | TBD |
| YOLO-TLP-m | 16.8M | 67.3 | 6.8 | TBD |
| YOLO-TLP-l | 25.3M | 98.7 | 9.5 | TBD |
| YOLO-TLP-x | 39.7M | 154.2 | 14.3 | TBD |
Tested on NVIDIA RTX 3060 with FP32 precision
- Pedestrian: +5.0% (0.357 vs 0.340) - Better person detection in crowds
- People: +8.0% (0.282 vs 0.261) - Significant improvement on small people
- Bicycle: +12.4% (0.080 vs 0.071) - Major boost for tiny vehicles
- Awning-Tricycle: +7.3% (0.117 vs 0.109) - Better covered vehicle detection
- Motor: +2.6% (0.357 vs 0.348) - Improved motorcycle detection
YOLO-TLP excels at detecting small objects thanks to the SPDConv architecture, which preserves spatial information during downsampling.
- People: +16.0% (0.109 vs 0.094) - Excellent localization accuracy
- Bicycle: +14.0% (0.033 vs 0.029) - Better bounding box precision
- Awning-Tricycle: +13.4% (0.076 vs 0.067) - Improved tight fit detection
- Pedestrian: +9.1% (0.156 vs 0.143) - More accurate person boxes
- Motor: +7.8% (0.152 vs 0.141) - Better motorcycle localization
mAP50-95 measures accuracy at stricter IoU thresholds. YOLO-TLP's superior performance shows it produces tighter, more accurate bounding boxes for small objects.
- Speed Advantage: YOLO-TLP achieves 39% faster inference (1.7ms vs 2.8ms), crucial for real-time applications like surveillance and autonomous systems
- Parameter Efficiency: 25% fewer parameters (1.9M vs 2.5M) enables deployment on edge devices with limited memory
- Deployment Benefits: Smaller model size (3.8MB) reduces storage and bandwidth requirements
- Trade-off: Higher GFLOPs (9.2 vs 5.9) due to SPDConv operations, but still delivers faster overall inference
YOLO-TLP achieves superior speed and efficiency through optimized architecture, making it ideal for resource-constrained environments requiring real-time small object detection.
- Pedestrian: +5.0% (small person detection)
- People: +8.0% (small/partial people)
- Bicycle: +12.4% (tiny two-wheelers)
- Car: +1.1% (common vehicles)
- Awning-Tricycle: +7.3% (covered vehicles)
- Motor: +2.6% (motorcycles/scooters)
6 out of 10 classes - Dominates in small object categories
- Van: -11.4% (medium vehicles)
- Truck: -27.1% (large vehicles)
- Tricycle: -14.8% (three-wheelers)
- Bus: -20.4% (large public transport)
4 out of 10 classes - Struggles with medium-large objects
YOLO-TLP is specifically optimized for small object detection, achieving superior performance on 6 out of 10 classes including all small objects (pedestrians, people, bicycles, motors). The trade-off is reduced performance on medium-large objects (van, truck, bus), resulting in a 5.6% lower overall mAP50 (0.305 vs 0.323). This makes YOLO-TLP ideal for applications where small object detection is the priority: surveillance, crowd monitoring, autonomous navigation, and similar use cases.
- Overall Precision: YOLOv12n (42.0%) vs YOLO-TLP (41.7%) - Nearly identical quality
- Overall Recall: YOLOv12n (32.8%) vs YOLO-TLP (31.0%) - Both models conservative
- Small Objects: YOLO-TLP shows better recall on pedestrians (37.3% vs 33.8%) and motors (37.8% vs 36.3%)
- Large Objects: YOLOv12n has superior recall on trucks (31.3% vs 21.9%) and buses (44.6% vs 37.0%)
- Balance: Both models prioritize precision over recall, avoiding false positives
YOLOv12n strengths:
- Higher overall mAP (0.323 vs 0.305)
- Better medium-large object detection
- Superior van detection (+11.4%)
- Excellent truck detection (+27.1%)
- Better bus detection (+20.4%)
- Lower computational cost (5.9 vs 9.2 GFLOPs)
- Fewer layers (159 vs 270)
YOLO-TLP strengths:
- 39% faster inference (1.7ms vs 2.8ms)
- 25% fewer parameters (1.9M vs 2.5M)
- Superior small object detection
- Better pedestrian detection (+5.0%)
- Excellent people detection (+8.0%)
- Best bicycle detection (+12.4%)
- SPDConv zero information loss
- P2-level detection for tiny objects
Choose YOLO-TLP for:
- Surveillance & Security: Detecting people and pedestrians in crowds
- Autonomous Navigation: Small obstacle detection (pedestrians, bicycles, motorcycles)
- Crowd Monitoring: Real-time people counting and tracking
- Traffic Analysis: Two-wheeler and small vehicle detection
- Edge Deployment: Resource-constrained devices requiring speed
- Real-time Applications: Where 39% faster inference matters

Choose YOLOv12n for:
- General Object Detection: Balanced performance across all object sizes
- Vehicle Detection: Cars, vans, trucks, buses (medium-large vehicles)
- Logistics & Transport: Fleet monitoring, cargo detection
- Parking Systems: Detection of various vehicle types
- Lower Computational Requirements: When GFLOPs matter (5.9 vs 9.2)
YOLO-TLP vs YOLOv12n
VisDrone Dataset (YOLO Format)
The VisDrone dataset has been preprocessed and converted to YOLO format for easy training and evaluation.
Dataset Details:
- Format: YOLO (txt annotations)
- Images: 10,209 total
- Classes: 10 object categories
- Split: Train/Val/Test
- Size: ~2.3GB
YOLO-TLP Model Weights
Download the pre-trained YOLO-TLP-n model weights trained on VisDrone dataset.
```bash
# Train on a custom dataset
yolo detect train \
  model=ultralytics/cfg/models/v12/yolov12.yaml \
  data=your_dataset.yaml \
  epochs=100 \
  imgsz=640 \
  batch=8 \
  device=0 \
  project=yoloTLP_runs \
  name=exp1
```
```bash
# Webcam inference
yolo predict model=weights/best.pt source=0 show=True
```
```bash
# Folder of images
yolo predict model=weights/best.pt source=test_images/ save=True
```
```bash
# RTSP stream
yolo predict model=weights/best.pt source=rtsp://your_stream_url
```
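The same predictions can be run from Python with the Ultralytics API (the weight and image paths below are placeholders):

```python
from ultralytics import YOLO

# Load trained weights (placeholder path) and run inference on a folder of images.
model = YOLO("weights/best.pt")
results = model.predict(source="test_images/", save=True)

for r in results:
    print(r.boxes.xyxy)  # predicted boxes in (x1, y1, x2, y2) format
```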
## References and Acknowledgments
### Built Upon
This work builds upon several excellent open-source projects:
- **[Ultralytics](https://github.com/ultralytics/ultralytics)** - YOLO framework and implementation
```bibtex
@software{ultralytics_yolo,
  author = {Glenn Jocher and others},
  title = {Ultralytics YOLO},
  year = {2023},
  url = {https://github.com/ultralytics/ultralytics}
}
```
- **[YOLOv12](https://github.com/sunsmarterjie/yolov12)** - Base architecture and improvements
```bibtex
@misc{yolov12,
  author = {sunsmarterjie},
  title = {YOLOv12},
  year = {2024},
  url = {https://github.com/sunsmarterjie/yolov12}
}
```
- **[SPD-Conv](https://github.com/LabSAINT/SPD-Conv)** - Space-to-Depth convolution module
```bibtex
@inproceedings{sunkara2022no,
  title = {No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects},
  author = {Sunkara, Raja and Luo, Tie},
  booktitle = {ECML PKDD},
  year = {2022}
}
```
### Special Thanks
We gratefully acknowledge the computer vision community for their contributions to open-source object detection research. Special thanks to:
- The YOLO community for continuous innovation
- VisDrone dataset creators for providing challenging small object detection benchmarks
- Contributors to PyTorch and related deep learning frameworks
---
## Citation
If you use YOLO-TLP in your research, please cite:
```bibtex
@article{yolotlp2025,
  title = {YOLO-TLP: Enhanced Small Object Detection with Space-to-Depth Convolutions},
  author = {Irfan Hussain},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year = {2024}
}
```