Building on the work by Bugra Tekin, Sudipta N. Sinha, and Pascal Fua in "Real-Time Seamless Single Shot 6D Object Pose Prediction" (CVPR 2018), this implementation enhances the original framework to better address everyday object detection tasks.
"We've integrated fine-tuning across diverse datasets, ranging from custom-labeled data to standard benchmarks, with seamless conversion into a unified labeling format. The system supports multiple input types, including images, videos, and even webcam feeds, and is optimized for robust multi-object and multi-class inference. These enhancements make the method highly adaptable and effective for a wide range of real-world applications."
- Utilizing Various Datasets: Includes `parcel3d`, `AIHUB`, and other manually labeled custom datasets.
- Omitting Reprojection Process: Streamlined the pipeline by removing the unnecessary reprojection step.
- Generating Inference Code: Easy-to-use inference code.
- Adding Multi-Object Inference: Enhanced capabilities for detecting multiple objects simultaneously.
- Introducing Anchors: Improved detection accuracy through the use of anchors.
Download the repository, including the necessary datasets:
```bash
git clone https://github.com/jungarden/Flex3D-bbox.git
```
Ensure your environment meets the following requirements:
- Python: 3.6
- CUDA: 11.1
- cuDNN: 8
- Docker image: `nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04`
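If you work inside the Docker image, a container can be started roughly as follows (a minimal sketch; the mount path and working directory are illustrative, and the repository's dependencies still need to be installed inside the container):
```bash
# Start a GPU-enabled container from the listed image and mount the cloned repo.
# The mount path /workspace/Flex3D-bbox is illustrative; adjust to your checkout.
docker run --gpus all -it --rm \
    -v "$(pwd)/Flex3D-bbox":/workspace/Flex3D-bbox \
    -w /workspace/Flex3D-bbox \
    nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04 /bin/bash
```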
Install the required libraries as follows:
- PyTorch:
```bash
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
```
- OpenCV:
```bash
pip install opencv-python
# Alternatively, install a specific version:
# pip install opencv-contrib-python==4.1.0.25
```
- SciPy:
```bash
pip install scipy==1.2.0
```
- Pillow:
```bash
pip install pillow==8.2.0
```
- tqdm:
```bash
pip install tqdm==4.64.1
```
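After installation, a quick check (not part of the repository) verifies that the CUDA build of PyTorch and the other libraries import correctly:
```python
# Quick environment check: prints versions and CUDA availability.
import torch
import cv2
import scipy
import PIL
import tqdm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("opencv:", cv2.__version__, "| scipy:", scipy.__version__)
print("pillow:", PIL.__version__, "| tqdm:", tqdm.__version__)
```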
Before training, ensure your dataset labels are correctly formatted using the `making_txt_labels.py` script. This script parses and converts your dataset's labeling information into the format required for training (see the parsing sketch below). Make sure to select the appropriate labeling method for your dataset, whether it is manually labeled or follows the AIHUB dataset format.
```bash
python3 making_txt_labels.py
```
- `glove00` folder structure
- `glove00.data`
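For reference, the original SingleShotPose labels store one line per object with 21 normalized values: a class id, nine (x, y) control points (the centroid plus the eight projected corners of the 3D bounding box), and the object's x/y extent. Assuming the conversion script targets that convention, such a line could be read with a hypothetical helper like this:
```python
# Hypothetical parser for a SingleShotPose-style label line (assumption:
# 21 values -- class id, then 9 control points as (x, y) pairs normalized
# to [0, 1], then the object's x and y extent).
def parse_label_line(line):
    values = [float(v) for v in line.split()]
    class_id = int(values[0])
    control_points = [(values[i], values[i + 1]) for i in range(1, 19, 2)]
    x_range, y_range = values[19], values[20]
    return class_id, control_points, x_range, y_range
```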
To train the model on multiple objects across datasets, use the following command:
```bash
python3 train_multi.py \
    --datacfg data/occlusion.data \
    --modelcfg cfg/yolo-pose-multi.cfg \
    --initweightfile cfg/darknet19_448.conv.23 \
    --pretrain_num_epochs 15
```
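The file passed to `--datacfg` (here `data/occlusion.data`, or `glove00.data` above) is a plain key = value configuration in the style of the original SingleShotPose `.data` files. The sketch below is hypothetical; the keys shown follow the original convention and may differ from the ones this repository actually reads:
```
# Hypothetical .data layout (SingleShotPose-style keys; verify against data/)
train  = data/glove00/train.txt
valid  = data/glove00/test.txt
backup = backup/glove00
name   = glove00
gpus   = 0
width  = 640
height = 480
```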
For fine-tuning on a custom dataset, run:
```bash
python3 train.py \
    --datacfg data/trainbox.data \
    --modelcfg cfg/yolo-pose.cfg \
    --initweightfile backup/parcel3d/model.weights \
    --pretrain_num_epochs 5
```
To perform inference on a video file, execute:
```bash
python3 inference.py \
    --datacfg data/occlusion.data \
    --modelcfg cfg/yolo-pose-multi.cfg \
    --initweightfile backup_multi/model.weights \
    --file video.mp4
```
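Video files and webcam feeds are both consumed through the standard OpenCV capture loop. The snippet below illustrates only that generic pattern (it is not the repository's inference code; plug the model call in where indicated):
```python
import cv2

# Generic OpenCV capture loop, as used for video/webcam inference.
# Pass a file path for video, or a device index such as 0 for a webcam.
cap = cv2.VideoCapture("video.mp4")
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break  # end of stream
    # ... run pose inference and draw boxes/classes on `frame` here ...
    cv2.imshow("pose", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```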
Below is an example of the detection results (multi-class detection):
- Original Source: Microsoft SingleShotPose
- Other Source: MISOChallenge-3Dobject
- Repository Structure:
  - `baseline/`: Single-object detection
  - `multi/`: Multi-object detection
  - `dataset/`: Contains various datasets
  - `utils/`: Contains utility functions (e.g., `get_anchors.py`; see the sketch below)
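`get_anchors.py` itself is not documented here; a common way to derive anchor priors, and a plausible sketch of what such a utility does, is to k-means-cluster the labeled box sizes (`kmeans_anchors` is an illustrative name, not the repository's API):
```python
import numpy as np

# Hypothetical sketch of a get_anchors.py-style utility: cluster the
# (width, height) of labeled boxes with a simple k-means to obtain
# anchor priors for the .cfg file.
def kmeans_anchors(box_wh, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = box_wh[rng.choice(len(box_wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest anchor (L1 distance on w/h).
        dists = np.abs(box_wh[:, None, :] - centers[None, :, :]).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for i in range(k):
            if np.any(assign == i):
                centers[i] = box_wh[assign == i].mean(axis=0)
    return centers
```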
- train.py: Removed internal parameters, rotation matrices, and reprojection variables.
- utils.py: Created `build_target_anchors` to handle anchors in single-object detection (`baseline`) and modified `get_region_boxes` to account for anchors.
- image.py & dataset.py: Updated paths for custom datasets.
- yolo-pose.cfg: Adjusted the number of filters for anchors and classes (see the helper after this list).
- inference.py: Added visualization for bounding boxes and classes.
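Regarding the filter adjustment in `yolo-pose.cfg`: under the YOLOv2-style region-layer layout assumed here (nine (x, y) control points, one confidence score, and the class probabilities per anchor), the convolutional layer before the region layer needs `num_anchors * (2 * 9 + 1 + num_classes)` filters. A small illustrative helper makes the arithmetic explicit:
```python
# Assumption: each anchor predicts 9 control points (x, y), one confidence
# score, and one probability per class, as in a YOLOv2-style region layer.
def region_filters(num_anchors, num_classes, num_points=9):
    return num_anchors * (2 * num_points + 1 + num_classes)

# e.g. 5 anchors and 13 classes -> 5 * (18 + 1 + 13) = 160 filters
print(region_filters(num_anchors=5, num_classes=13))
```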