This project implements a complete object detection pipeline trained entirely from scratch, without using any pre-trained weights. A Faster R-CNN–style two-stage detector is built and trained on a filtered subset of the PASCAL VOC 2012 dataset, focusing on five object categories.
The objective of this project is not to maximize benchmark performance, but to gain a deep, end-to-end understanding of modern object detection systems, including region proposal generation, multi-task loss optimization, and training dynamics of two-stage detectors.
- PASCAL VOC 2012
- Downloaded from Kaggle
- Original annotations in VOC XML format
The dataset is filtered to include only the following five classes: ["person", "car", "cat", "aeroplane", "bicycle"]
All images that do not contain at least one instance of these classes are removed.
Official VOC ImageSets are used to create the training and validation splits.
| Split | Original Images | Images After Filtering |
|---|---|---|
| Train | 5,717 | 3,377 |
| Val | 5,823 | 3,458 |
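After filtering, roughly 59% of each split remains (3,377 of 5,717 training images and 3,458 of 5,823 validation images), and every retained image contains at least one object from the selected classes.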
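The filtering rule above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the helper names `target_objects` and `keep_image` and the inline sample annotation are hypothetical, and the sketch assumes the usual VOC XML layout (`object` → `name`, `bndbox`).

```python
import xml.etree.ElementTree as ET

TARGET_CLASSES = {"person", "car", "cat", "aeroplane", "bicycle"}

def target_objects(xml_text, classes=TARGET_CLASSES):
    """Return (name, [xmin, ymin, xmax, ymax]) for every annotated object in `classes`."""
    root = ET.fromstring(xml_text)
    kept = []
    for obj in root.iter("object"):
        name = obj.findtext("name", default="").strip()
        if name not in classes:
            continue  # drop objects outside the five selected classes
        box = obj.find("bndbox")
        coords = [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        kept.append((name, coords))
    return kept

def keep_image(xml_text):
    """An image survives filtering only if it has at least one target object."""
    return len(target_objects(xml_text)) > 0

# Hypothetical annotation used only to exercise the helpers.
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <object><name>dog</name><bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox></object>
  <object><name>person</name><bndbox><xmin>50</xmin><ymin>60</ymin><xmax>150</xmax><ymax>300</ymax></bndbox></object>
</annotation>"""

print(keep_image(SAMPLE_XML), target_objects(SAMPLE_XML))
```

Whether non-target objects (the `dog` here) are dropped from retained images or kept as unlabeled background is a design choice the sketch makes for illustration only.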
This project implements a custom Faster R-CNN architecture trained from scratch, consisting of the following components:
- Custom CNN feature extractor
- No ImageNet or external pretraining
- Multiple convolutional blocks for hierarchical feature learning
- Operates on the convolutional feature map shared with the backbone
- Anchor-based proposal generation
- Binary classification (object vs background)
- Bounding box regression for anchor refinement
- Region of Interest (ROI) pooling
- Fully connected layers for:
  - Multi-class classification
  - Bounding box regression
The entire model is trained end-to-end using a combined multi-task loss.
The following section explains the motivation behind these design choices and provides an overview of how the architecture is implemented in code.
Faster R-CNN was selected as the core detection architecture because it cleanly separates region proposal generation from object classification, which makes it particularly suitable for studying the internal mechanics of object detection systems. Unlike single-stage detectors (e.g., YOLO or SSD), which predict boxes and classes in a single pass, Faster R-CNN dedicates a separate Region Proposal Network (RPN) to objectness, making it easier to inspect how candidate regions are generated, refined, and classified. This architectural clarity suits educational and experimental settings where interpretability and correctness are prioritized over raw inference speed.
The model is trained entirely from scratch, without ImageNet or COCO pretraining, ensuring that all learned representations arise solely from the target dataset. While this choice negatively impacts final accuracy, it provides a more faithful understanding of optimization challenges in deep detection models, such as unstable RPN training, slow convergence, and sensitivity to dataset size and class imbalance. Training from scratch also highlights the importance of dataset filtering, anchor design, and loss balancing in two-stage detectors.
A custom CNN backbone is used instead of standard architectures such as ResNet or VGG to maintain full control over architectural complexity and parameter count. This avoids implicit performance gains from well-engineered pretrained backbones and keeps the focus on core detection principles rather than architectural shortcuts.
The Region Proposal Network operates on shared convolutional feature maps produced by the backbone. It predicts objectness scores and bounding box offsets for a dense set of anchors at each spatial location. Anchors are refined through bounding box regression, and high-confidence proposals are forwarded to the ROI head. This shared-feature design significantly reduces computational overhead compared to earlier R-CNN variants while maintaining strong localization performance.
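To make the anchor mechanism concrete, here is a small pure-Python sketch of dense anchor generation and of applying regression offsets to an anchor. The stride, scales, and aspect ratios are illustrative assumptions (the project's actual anchor configuration is not stated here), and `generate_anchors` and `decode` are hypothetical helper names. The offset parameterization follows the one commonly used for Faster R-CNN: center shifts relative to anchor size, and log-scale width and height.

```python
import math

def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return (x1, y1, x2, y2) anchors in image coordinates.

    One anchor per (scale, ratio) pair is centered on every feature-map cell.
    `ratio` is height / width; each anchor keeps an area of scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def decode(anchor, deltas):
    """Refine an anchor with predicted (dx, dy, dw, dh) offsets."""
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)

# Assumed values: a 2x3 feature map, stride 16, two scales, three ratios.
anchors = generate_anchors(2, 3, stride=16, scales=[64, 128], ratios=[0.5, 1.0, 2.0])
print(len(anchors))                         # 2 * 3 cells * 6 anchors per cell
print(decode(anchors[1], (0.1, 0.0, 0.0, 0.0)))  # shift the box right by 10% of its width
```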
The RPN contributes two loss terms during training:
- A binary classification loss for object vs background discrimination
- A localization loss for anchor box refinement
These losses are optimized jointly with downstream Fast R-CNN losses in an end-to-end manner.
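The two RPN terms can be sketched as follows. The text above does not name the exact loss functions, so this example assumes binary cross-entropy for objectness and smooth L1 for box offsets, the choices made in the original Faster R-CNN paper. The label convention (1 = object, 0 = background, -1 = ignored) and the normalization are also assumptions, and `rpn_losses` is a hypothetical helper.

```python
import math

def binary_cross_entropy(prob, label):
    """Cross-entropy between a predicted objectness probability and a 0/1 label."""
    eps = 1e-7
    p = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def smooth_l1(pred, target, beta=1.0):
    """Quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta

def rpn_losses(obj_probs, labels, pred_deltas, target_deltas):
    """Return (classification loss, localization loss) over a set of anchors.

    Classification averages over all non-ignored anchors; localization is
    computed on positive anchors only and averaged over their count.
    """
    sampled = [i for i, lab in enumerate(labels) if lab >= 0]
    positives = [i for i in sampled if labels[i] == 1]
    cls = sum(binary_cross_entropy(obj_probs[i], labels[i]) for i in sampled)
    cls /= max(len(sampled), 1)
    loc = sum(smooth_l1(p, t)
              for i in positives
              for p, t in zip(pred_deltas[i], target_deltas[i]))
    loc /= max(len(positives), 1)
    return cls, loc

# Three anchors: one positive, one negative, one ignored.
cls_loss, loc_loss = rpn_losses(
    obj_probs=[0.9, 0.2, 0.5],
    labels=[1, 0, -1],
    pred_deltas=[(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    target_deltas=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
)
print(cls_loss, loc_loss)
```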
Proposed regions from the RPN are passed through ROI pooling to produce fixed-size feature representations, regardless of proposal size. These features are processed by fully connected layers that output:
- Class probabilities over the target object categories
- Bounding box regression offsets for class-specific localization
This separation lets proposal generation and final classification specialize independently, which tends to improve localization accuracy at the cost of additional computation.
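The key property of ROI pooling, a fixed-size output for any proposal size, can be shown on a toy single-channel feature map. This is a simplified sketch with integer ROI coordinates in feature-map cells; `roi_max_pool` is a hypothetical helper, and the 2x2 output size is an assumption chosen for readability, not the project's setting.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_max_pool(feature, roi, out_size=2):
    """Max-pool one ROI of a 2D feature map into an out_size x out_size grid.

    feature: list of rows (H x W), single channel.
    roi: (x1, y1, x2, y2) in feature-map cells, end-exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_w, roi_h = x2 - x1, y2 - y1
    pooled = []
    for i in range(out_size):
        # Row range of bin i: floor for the start, ceil for the end,
        # so every bin covers at least one cell.
        r0 = y1 + (i * roi_h) // out_size
        r1 = y1 + ceil_div((i + 1) * roi_h, out_size)
        row_out = []
        for j in range(out_size):
            c0 = x1 + (j * roi_w) // out_size
            c1 = x1 + ceil_div((j + 1) * roi_w, out_size)
            row_out.append(max(feature[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
        pooled.append(row_out)
    return pooled

# Toy 4x6 feature map whose value encodes its position.
feature = [[r * 6 + c for c in range(6)] for r in range(4)]

print(roi_max_pool(feature, (1, 0, 5, 4)))  # 4x4 ROI -> [[8, 10], [20, 22]]
print(roi_max_pool(feature, (0, 0, 6, 3)))  # 6x3 ROI -> [[8, 11], [14, 17]]
```

Both proposals produce a 2x2 output despite their different shapes, which is what lets the fully connected layers that follow use a fixed input size.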
The training pipeline is implemented in a modular fashion to reflect the logical structure of Faster R-CNN. The dataset loader handles image parsing, annotation loading, bounding box normalization, and data augmentation. During training, each batch passes through the backbone, RPN, and ROI head sequentially, with intermediate outputs used to compute multi-task losses.
Key implementation details include:
- Batch size of 1, consistent with standard Faster R-CNN training practices due to variable image sizes and memory constraints
- Gradient accumulation, used to stabilize updates and simulate larger effective batch sizes
- MultiStep learning rate scheduling, enabling coarse-to-fine optimization
- Explicit tracking of individual loss components for debugging and analysis
The training loop is intentionally designed to remain transparent and debuggable, prioritizing clarity over abstraction. This makes it easier to analyze failure modes such as poor proposal quality, class imbalance effects, and bounding box regression instability.
While Faster R-CNN is computationally heavier and slower than single-stage detectors, it typically offers stronger localization accuracy and is easier to inspect stage by stage. The architecture is therefore well-suited for scenarios where detection quality and understanding model behavior matter more than real-time inference constraints. This project shows that, even without pretraining, a carefully trained Faster R-CNN provides a workable baseline for accuracy-focused object detection tasks.
- GPU: NVIDIA RTX 4080
- Training Time: ~3.5 minutes per epoch
- Total Epochs: 20
- Optimizer: SGD
- Momentum: 0.9
- Weight Decay: 5e-4
- Learning Rate Scheduler: MultiStepLR
- Batch Size: 1
- Gradient Accumulation: Enabled
The total training loss is the sum of four components:
- RPN classification loss
- RPN localization (bounding box regression) loss
- Fast R-CNN classification loss
- Fast R-CNN localization loss
This closely follows the original Faster R-CNN multi-task training objective.
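Written out, the objective described above is an unweighted sum. The symbol names below are introduced here for illustration only; whether the individual terms are normalized or weighted inside the implementation is not stated, so unit weights are an assumption.

```latex
L_{\text{total}}
  = L^{\text{RPN}}_{\text{cls}}
  + L^{\text{RPN}}_{\text{loc}}
  + L^{\text{RoI}}_{\text{cls}}
  + L^{\text{RoI}}_{\text{loc}}
```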
Data augmentation is applied selectively during training to improve robustness while maintaining stability when training the model from scratch.
- Random horizontal flip (50% probability), with bounding boxes adjusted accordingly
- Color jitter, including:
  - Brightness jitter
  - Saturation jitter
- Gaussian blur applied with low probability to simulate mild image degradation
All augmentations are applied only during training and are chosen such that bounding box geometry remains valid.
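Of the augmentations listed, only the horizontal flip changes box coordinates. A minimal sketch of that adjustment follows, assuming boxes are `[xmin, ymin, xmax, ymax]` in continuous pixel coordinates and the image is a list of pixel rows; `hflip_boxes` and `random_hflip` are hypothetical helpers, not the project's transforms.

```python
import random

def hflip_boxes(boxes, image_width):
    """Mirror boxes across the vertical center line of the image.

    The old right edge becomes the new left edge, so xmin <= xmax still holds.
    """
    return [[image_width - xmax, ymin, image_width - xmin, ymax]
            for xmin, ymin, xmax, ymax in boxes]

def random_hflip(image_rows, boxes, p=0.5, rng=random):
    """Flip image and boxes together with probability p; otherwise return them unchanged."""
    if rng.random() < p:
        width = len(image_rows[0])
        flipped = [row[::-1] for row in image_rows]
        return flipped, hflip_boxes(boxes, width)
    return image_rows, boxes

boxes = [[10, 20, 50, 80]]
print(hflip_boxes(boxes, image_width=100))  # [[50, 20, 90, 80]]
```

Applying the flip twice returns the original boxes, which is a convenient sanity check when debugging the data loader.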
Note: More aggressive geometric augmentations (e.g., random cropping, rotation, perspective transforms) were intentionally avoided, as they require complex bounding box remapping and can destabilize training in two-stage detectors trained from scratch.
| Class | AP |
|---|---|
| Aeroplane | 0.4127 |
| Bicycle | 0.2639 |
| Car | 0.1725 |
| Cat | 0.4753 |
| Person | 0.3141 |
| mAP | 0.3277 |
| Metric | Value |
|---|---|
| Average inference time | 35.39 ms |
| Inference FPS | 28.25 FPS |
| Device | NVIDIA RTX 4080 |
| Metric | Value |
|---|---|
| Model size (disk) | 167.36 MB |
| Number of parameters | 43.87 M |
Inference examples with predicted bounding boxes and class confidences are shown below:
A real-time inference video demonstrating bounding box stability and detection speed is provided below:
Note: The video is recorded directly from notebook inference and demonstrates real-time detection performance without post-processing.
These qualitative results demonstrate bounding box localization quality, class predictions, and real-time inference behavior.
The model achieves an mAP of 0.3277, which is consistent with expectations for a Faster R-CNN trained entirely from scratch on a limited dataset. Higher AP is observed for visually distinctive classes such as cat and aeroplane, while classes with greater intra-class variation (e.g., car and bicycle) exhibit lower AP.
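For reference, the reported mAP is the unweighted mean of the five per-class AP values, and the IoU = 0.5 criterion decides whether a prediction counts as a match. The sketch below shows both; `iou` is a hypothetical helper, and the exact AP interpolation used by the project's evaluation is not stated here, so only the final averaging step is reproduced.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping half of a same-sized ground-truth box
# has IoU 1/3 and would not count as a match at the 0.5 threshold.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) >= 0.5)  # False

# Per-class AP values from the results table.
per_class_ap = {
    "aeroplane": 0.4127,
    "bicycle": 0.2639,
    "car": 0.1725,
    "cat": 0.4753,
    "person": 0.3141,
}
mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(round(mean_ap, 4))  # 0.3277
```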
On the RTX 4080, the model runs at ~28 FPS (35.39 ms per image), a reasonable runtime for a two-stage detector. Inference speed depends on the architecture and implementation, not on whether the weights were pretrained.
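The reported runtime and size figures can be cross-checked with simple arithmetic. The sketch assumes weights are stored as 32-bit floats and that the reported "MB" is measured in units of 2^20 bytes; neither is stated in the results, so treat the comparison as a plausibility check rather than a specification.

```python
avg_latency_ms = 35.39   # reported average inference time
num_params = 43.87e6     # reported parameter count

# Throughput implied by the average latency.
fps = 1000.0 / avg_latency_ms

# Checkpoint size implied by the parameter count (assumption: float32 weights).
bytes_per_param = 4
size_mib = num_params * bytes_per_param / 2**20

print(f"{fps:.2f} FPS, {size_mib:.2f} MiB")
```

Both values land within about 0.01 of the reported 28.25 FPS and 167.36 MB, so the table entries are mutually consistent under these assumptions.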
| Aspect | Faster R-CNN |
|---|---|
| Mean Average Precision (IoU = 0.5) | 0.3277 |
| Inference Speed | 28.25 FPS |
| Average Inference Time | 35.39 ms |
| Model Size (Disk) | 167.36 MB |
| Number of Parameters | 43.87 M |
| Training Epochs | 20 |
| Training Time per Epoch | ~3.5 minutes |
Download the PASCAL VOC 2012 dataset from Kaggle and extract it locally.
Update dataset and output paths in the configuration file.
Run the provided notebook: `Faster_RCNN_from_scratch.ipynb`
The notebook handles:
- Dataset loading and filtering
- Model initialization
- Training
- Checkpoint saving
- Inference visualization
- End-to-end object detection can be implemented without pretrained models
- Two-stage detectors are sensitive to training stability
- Dataset filtering significantly impacts convergence
- Faster R-CNN remains a strong baseline for accuracy-focused detection tasks






