Object detection, a quintessential task in perceptual computing, can be tackled with a generative methodology. In this study, we introduce a novel framework that formulates object detection as a denoising diffusion process operating on perturbed bounding boxes of annotated entities. The framework, termed ConsistencyDet, leverages an innovative denoising concept known as the Consistency Model. The hallmark of this model is its self-consistency property: distorted information from any temporal step can be mapped directly back to its pristine state, realizing "one-step denoising". This attribute markedly raises the model's operational efficiency and sets it apart from the conventional Diffusion Model. During training, ConsistencyDet initiates the diffusion sequence with noise-perturbed boxes derived from the ground-truth annotations and trains the model to perform the denoising task. During inference, the model employs a denoising sampling strategy that starts from bounding boxes randomly drawn from a normal distribution and, through iterative refinement, transforms these arbitrary boxes into the final detections. Comprehensive evaluations on standard benchmarks such as MS-COCO and LVIS confirm that ConsistencyDet surpasses other leading detectors in performance.
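Concretely, the "one-step denoising" idea can be sketched as below. This is a minimal illustration, not the repository's actual API: `f_theta`, its signature, and the noise schedule are assumptions, and the real model additionally conditions on image features and predicts class scores.

```python
import torch

def consistency_denoise(f_theta, image_feats, num_boxes=300, steps=1,
                        sigma_max=80.0, sigma_min=0.002):
    """Consistency-style sampling for box denoising (illustrative only).

    `f_theta(noisy_boxes, sigma, image_feats)` is assumed to return an
    estimate of the clean boxes; the actual detection head differs.
    """
    # Start from boxes drawn from a normal distribution, as described above.
    x = torch.randn(num_boxes, 4) * sigma_max
    sigmas = torch.linspace(sigma_max, sigma_min, steps)
    for i, sigma in enumerate(sigmas):
        # Self-consistency: any noise level maps straight to a clean estimate.
        x = f_theta(x, sigma, image_feats)
        if i + 1 < steps:
            # Optional refinement: re-noise to the next (lower) level and repeat.
            x = x + torch.randn_like(x) * sigmas[i + 1]
    return x  # predicted boxes, to be scored and filtered by the detection head
```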
- We conceptualize object detection as a generative denoising process and propose a novel methodological approach. In contrast to the established paradigm in DiffusionDet, which employs an equal number of iterations for noise addition and removal, our method requires far fewer denoising iterations, substantially improving the efficiency of the detection task.
- In the proposed ConsistencyDet, we engineer a noise addition and removal paradigm that imposes no specific architectural constraints, thereby allowing flexible parameterization with a variety of neural network structures. This design choice significantly augments the model's practicality and adaptability for diverse applications.
- In crafting the loss function for ConsistencyDet, we aggregate the individual loss values at time steps $t$ and $t-1$ after the model's predictions to compute the total loss (see the sketch following this list). This guarantees that the mapping of any pair of adjacent points along the temporal dimension to the axis origin remains maximally consistent, mirroring the self-consistency principle at the heart of the Consistency Model.
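A minimal sketch of the third point follows. All names here are assumptions: `f_theta`, the `sigmas` schedule, and the simplified `detection_loss` stand in for the repository's actual components, which in practice involve set matching with box and classification terms.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, gt_boxes):
    # Placeholder for the full detection loss (matching, box and class terms).
    return F.l1_loss(pred_boxes, gt_boxes)

def consistency_training_loss(f_theta, gt_boxes, image_feats, sigmas, t):
    """Sum the losses of predictions made at adjacent time steps t and t-1."""
    noise = torch.randn_like(gt_boxes)
    # Perturb the ground-truth boxes at two adjacent noise levels (shared noise).
    x_t = gt_boxes + sigmas[t] * noise
    x_prev = gt_boxes + sigmas[t - 1] * noise
    # Map both noisy states back toward the clean boxes.
    pred_t = f_theta(x_t, sigmas[t], image_feats)
    pred_prev = f_theta(x_prev, sigmas[t - 1], image_feats)
    # Summing both terms drives adjacent points on the same trajectory to the
    # same origin, i.e. the self-consistency property of the Consistency Model.
    return detection_loss(pred_t, gt_boxes) + detection_loss(pred_prev, gt_boxes)
```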
| Method | Best Box AP | Download |
| --- | --- | --- |
| COCO-Res50 | 46.9 | model |
| COCO-Res101 | 47.2 | model |
| COCO-SwinBase | 53.0 | model |
| LVIS-Res50 | 32.2 | model |
| LVIS-Res101 | 33.1 | model |
| LVIS-SwinBase | 42.4 | model |
1. Install Anaconda and create a conda environment:

```bash
conda create -n yourname python=3.8
```
2. Install PyTorch ≥ 1.9.0 and a torchvision version that matches the PyTorch installation. You can install them together from pytorch.org to ensure compatibility.
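For example, one possible pin that satisfies the requirement (check pytorch.org for the build matching your CUDA version):

```bash
pip install torch==1.9.0 torchvision==0.10.0
```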
3. Install Detectron2:

```bash
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
```
4. Install the other dependency libraries:

```bash
pip3 install -r requirements.txt
```
Prepare the datasets by linking your local copies of COCO and LVIS (adjust the paths):

```bash
mkdir -p datasets/coco
mkdir -p datasets/lvis
ln -s /path_to_coco_dataset/annotations datasets/coco/annotations
ln -s /path_to_coco_dataset/train2017 datasets/coco/train2017
ln -s /path_to_coco_dataset/val2017 datasets/coco/val2017
ln -s /path_to_lvis_dataset/lvis_v1_train.json datasets/lvis/lvis_v1_train.json
ln -s /path_to_lvis_dataset/lvis_v1_val.json datasets/lvis/lvis_v1_val.json
```
Download the pretrained backbone weights:

```bash
mkdir models
cd models
# ResNet-101
wget https://github.com/ShoufaChen/DiffusionDet/releases/download/v0.1/torchvision-R-101.pkl
# Swin-Base
wget https://github.com/ShoufaChen/DiffusionDet/releases/download/v0.1/swin_base_patch4_window7_224_22k.pkl
```
Train, for example on COCO with a ResNet-50 backbone:

```bash
python train_net.py --num-gpus 4 \
    --config-file configs/diffdet.coco.res50.yaml
```
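If the script retains Detectron2's default argument parser (an assumption, based on the DiffusionDet codebase it builds on), an interrupted run can be resumed from the latest checkpoint in the output directory:

```bash
python train_net.py --num-gpus 4 \
    --config-file configs/diffdet.coco.res50.yaml \
    --resume
```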
Evaluate a trained model with `--eval-only`:

```bash
python train_net.py --num-gpus 4 \
    --config-file configs/diffdet.yourdataset.yourbackbone.yaml \
    --eval-only MODEL.WEIGHTS path/to/model.pth
```
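For instance, to evaluate the COCO ResNet-50 model from the table above (the checkpoint filename here is hypothetical; point `MODEL.WEIGHTS` at the file you actually downloaded):

```bash
python train_net.py --num-gpus 4 \
    --config-file configs/diffdet.coco.res50.yaml \
    --eval-only MODEL.WEIGHTS models/coco_res50.pth
```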
```bibtex
@misc{jiang2024consistencydet,
  title={ConsistencyDet: Robust Object Detector with Denoising Paradigm of Consistency Model},
  author={Lifan Jiang and Zhihui Wang and Changmiao Wang and Ming Li and Jiaxu Leng and Xindong Wu},
  year={2024},
  eprint={2404.07773},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
A large part of the code is borrowed from DiffusionDet and Consistency Models; thanks for their work.
```bibtex
@inproceedings{chen2023diffusiondet,
  title={DiffusionDet: Diffusion Model for Object Detection},
  author={Chen, Shoufa and Sun, Peize and Song, Yibing and Luo, Ping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={19830--19843},
  year={2023}
}

@article{song2023consistency,
  title={Consistency Models},
  author={Song, Yang and Dhariwal, Prafulla and Chen, Mark and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2303.01469},
  year={2023}
}
```