ConsistencyTrack is a multi-object tracking framework that improves noise resistance and reduces ID switches through a denoising diffusion process, outperforming DiffusionTrack in inference speed and overall performance.
Lifan Jiang, Zhihui Wang, Siqi Yin, Guangxiao Ma, Peng Zhang, Boxi Wu
Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existing MOT methods excel at accurately tracking multiple objects in real time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose ConsistencyTrack, a novel joint detection and tracking (JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During the training phase, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and the model then learns to detect and track by reversing this process. During inference, the model refines randomly generated boxes into detection and tracking results in a minimal number of denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms the compared methods, notably surpassing DiffusionTrack in both inference speed and other performance metrics.
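As a rough illustration of the perturbation described in the abstract, the sketch below (hypothetical, not the repository's code) diffuses normalized ground-truth boxes toward a random distribution with a cosine signal schedule; `T` and `scale` are assumed values.

```python
import math
import torch

def diffuse_boxes(gt_boxes: torch.Tensor, t: int, T: int = 1000,
                  scale: float = 2.0) -> torch.Tensor:
    """Illustrative forward process: normalized (cx, cy, w, h) boxes drift
    toward pure Gaussian noise as t approaches T; the network is trained
    to map any such noised state back to the clean boxes."""
    # Cosine signal level: ~1 at t = 0 (clean boxes), ~0 at t = T (noise).
    alpha = torch.tensor(math.cos(t / T * math.pi / 2) ** 2)
    noise = torch.randn_like(gt_boxes)
    return alpha.sqrt() * (gt_boxes * scale) + (1.0 - alpha).sqrt() * noise
```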
- ConsistencyTrack conceptualizes object tracking as a generative denoising process and introduces a novel denoising paradigm. In contrast to the established paradigm of DiffusionTrack, our method requires only a very small number of iterations for noise addition and removal, a substantial advancement in the efficiency of the MOT task.
- In crafting the loss function of ConsistencyTrack, we sum the individual losses at adjacent time steps (t-1, t) after the model's predictions to obtain the total loss. This guarantees that any pair of adjacent points along the temporal dimension is mapped to the axis origin with maximal consistency, mirroring the self-consistency principle central to consistency models (a simplified sketch of this paired-step loss follows after this list).
- We design a novel target association strategy, distinct from that of DiffusionTrack within the JDT paradigm. This strategy emphasizes matching low-confidence detection boxes with existing trajectories, significantly improving robustness to occlusion and markedly improving the resulting performance metrics (a generic sketch of this two-stage matching idea is also given below).
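The paired-step objective above can be made concrete with a minimal sketch. This is not the repository's actual loss: the `model(boxes, t)` interface, the `sigmas` noise schedule, and the smooth-L1 box loss are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def paired_consistency_loss(model, gt_boxes, t, sigmas):
    """Illustrative paired-step loss: the same ground-truth boxes are
    noised at adjacent levels t-1 and t, and the two per-step losses
    are summed so both predictions stay consistent with the clean boxes."""
    noise = torch.randn_like(gt_boxes)
    pred_t = model(gt_boxes + sigmas[t] * noise, t)            # noisier input
    pred_tm1 = model(gt_boxes + sigmas[t - 1] * noise, t - 1)  # cleaner input
    # Summing both terms drives adjacent points on the same noise
    # trajectory toward the same origin (the self-consistency property).
    return (F.smooth_l1_loss(pred_t, gt_boxes)
            + F.smooth_l1_loss(pred_tm1, gt_boxes))
```

The occlusion-oriented association strategy can likewise be illustrated with a generic two-stage matching sketch, in which low-confidence detections get a second chance to extend existing trajectories. Thresholds and helper functions below are illustrative placeholders, not the exact strategy implemented in this repository.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between track and detection boxes in (x1, y1, x2, y2)."""
    ious = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((t[2] - t[0]) * (t[3] - t[1])
                     + (d[2] - d[0]) * (d[3] - d[1]) - inter)
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious

def associate(tracks, dets, scores, high_thr=0.6, iou_thr=0.3):
    """Two-stage association: match high-confidence detections first, then
    give the still-unmatched tracks a second pass over low-confidence
    detections, which helps keep IDs alive through partial occlusion."""
    high = [i for i, s in enumerate(scores) if s >= high_thr]
    low = [i for i, s in enumerate(scores) if s < high_thr]
    matches, used_tracks = [], set()
    for det_pool in (high, low):
        free = [ti for ti in range(len(tracks)) if ti not in used_tracks]
        if not free or not det_pool:
            continue
        cost = 1.0 - iou_matrix([tracks[ti] for ti in free],
                                [dets[di] for di in det_pool])
        for r, c in zip(*linear_sum_assignment(cost)):
            if 1.0 - cost[r, c] >= iou_thr:  # accept only sufficient overlap
                matches.append((free[r], det_pool[c]))
                used_tracks.add(free[r])
    return matches
```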
3.1 Install Anaconda and create a conda environment:
conda create -n yourname python=3.8
3.2 Install requirements for ConsistencyTrack.
conda activate yourname
git clone https://github.com/Tankowa/ConsistencyTrack.git
cd <ConsistencyTrack_HOME>
pip3 install -r requirements.txt
python3 setup.py develop
3.3 Install PyTorch ≥ 1.9.0 and a torchvision version that matches the PyTorch installation (you can install them together from pytorch.org to make sure of this), then install pycocotools:
pip3 install cython; pip3 install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
3.4 Install detectron2
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
3.5 Others
pip3 install cython_bbox
Download MOT17, MOT20, CrowdHuman, CityPersons, ETHZ and DanceTrack, and put them under <ConsistencyTrack_HOME>/datasets in the following structure:
datasets
|——————mot
| └——————train
| └——————test
└——————crowdhuman
| └——————Crowdhuman_train
| └——————Crowdhuman_val
| └——————annotation_train.odgt
| └——————annotation_val.odgt
└——————MOT20
| └——————train
| └——————test
└——————dancetrack
| └——————train
| └——————test
└——————Cityscapes
| └——————images
| └——————labels_with_ids
└——————ETHZ
└——————eth01
└——————...
└——————eth07
Then, you need to convert the datasets to COCO format and mix the different training data:
cd <ConsistencyTrack_HOME>
python3 tools/convert_mot17_to_coco.py
python3 tools/convert_dancetrack_to_coco.py
python3 tools/convert_mot20_to_coco.py
python3 tools/convert_crowdhuman_to_coco.py
python3 tools/convert_cityperson_to_coco.py
python3 tools/convert_ethz_to_coco.py
Before mixing different datasets, you need to follow the operations in mix_xxx.py to create the data folders and links. Finally, you can mix the training data:
cd <ConsistencyTrack_HOME>
python3 tools/mix_data_ablation.py
python3 tools/mix_data_test_mot17.py
python3 tools/mix_data_test_mot20.py
You can download our model weights from our model zoo.
The pretrained YOLOX models can be downloaded from their model zoo. After downloading the pretrained models, put them under <ConsistencyTrack_HOME>/pretrained.
Train MOT17 test model (MOT17 train)
cd <ConsistencyTrack_HOME>
python3 tools/train.py -f exps/example/mot/yolox_x_consistency_det_mot17.py -d 8 -b 16 -o -c pretrained/bytetrack_x_mot17.pth.tar
python3 tools/train.py -f exps/example/mot/yolox_x_consistency_track_mot17.py -d 8 -b 16 -o -c pretrained/consistencytrack_mot17_det.pth.tar
Train DanceTrack test model (DanceTrack train)
cd <ConsistencyTrack_HOME>
python3 tools/train.py -f exps/example/mot/yolox_x_consistency_det_dancetrack.py -d 8 -b 16 -o -c pretrained/bytetrack_x_mot17.pth.tar
python3 tools/train.py -f exps/example/mot/yolox_x_consistency_track_dancetrack.py -d 8 -b 16 -o -c pretrained/consistencytrack_dancetrack_det.pth.tar
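These commands follow the YOLOX training interface: -f selects the experiment description file, -d the number of GPU devices, -b the total batch size, -o pre-occupies GPU memory for faster training, and -c the checkpoint used for initialization; adjust -d and -b to match your hardware.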
- Test on MOT17
cd <ConsistencyTrack_HOME>
python3 tools/track.py -f exps/example/mot/yolox_x_consistency_track_mot17.py -c pretrained/mot-best_ckpt.pth.tar -b 1 -d 1 --fuse
- Test on Dancetrack
cd <ConsistencyTrack_HOME>
python3 tools/track.py -f exps/example/mot/yolox_x_consistency_track_dancetrack.py -c pretrained/dancetrack-best_ckpt.pth.tar -b 1 -d 1 --fuse
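Here -b 1 -d 1 runs evaluation on a single GPU, and --fuse merges convolution and batch-normalization layers to speed up inference; these are standard YOLOX evaluation options.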
Results on MOT17:

Method | MOTA↑ | IDF1↑ | HOTA↑ | MT↑ | ML↓ | FN↓ | DetA↑ |
---|---|---|---|---|---|---|---|
Tracktor++2 | 56.3 | 55.1 | / | 498 | 831 | 235449 | / |
TubeTK* | 63.0 | 58.6 | 48.0 | 735 | 468 | 177483 | 51.4 |
CTTrack17 | 67.8 | 64.7 | 52.2 | 816 | 579 | 160332 | 53.8 |
CJTracker40 | 58.7 | 58.2 | 48.4 | 621 | 909 | 197790 | 49.1 |
TrajE | 67.4 | 61.2 | 49.7 | 820 | 587 | 161347 | 53.5 |
Sp_Con | 61.5 | 63.3 | 50.5 | 622 | 754 | 200655 | 49.2 |
PCL | 58.8 | 61.2 | 49.0 | 612 | 837 | 218912 | 47.2 |
UTM | 63.5 | 65.1 | 52.5 | 881 | 635 | 170352 | 52.2 |
ConsistencyTrack | 69.9 | 65.7 | 54.4 | 907 | 428 | 142145 | 58.2 |
Results on DanceTrack:

Method | HOTA↑ | DetA↑ | AssA↑ | MOTA↑ | IDF1↑ |
---|---|---|---|---|---|
TransTrack | 41.8 | 78.1 | 22.6 | 86.8 | 35.7 |
CenterTrack | 39.7 | 66.7 | 23.8 | 82.2 | 40.8 |
ConsistencyTrack | 42.3 | 76.4 | 25.4 | 87.8 | 41.2 |
Inference speed (FPS) under different numbers of sampling steps (n_ss):

n_ss | DiffusionTrack (FPS) | ConsistencyTrack (FPS) |
---|---|---|
1 | / | 10.53 |
2 | 2.50 | 10.51 |
4 | 1.25 | 10.39 |
6 | 0.84 | 10.27 |
- Screenshots of sampled tracking results of the proposed ConsistencyTrack on the MOT17 and DanceTrack datasets.
- The comparison of the visual reasoning process, with one typical sampling step, between ConsistencyTrack and DiffusionTrack.
- Performance comparison between ConsistencyTrack and DiffusionTrack on the MOT17 val-half set.
A large part of the code is borrowed from ConsistencyDet and DiffusionTrack. Thanks for their wonderful work.
@misc{jiang2024consistencydet,
      title={ConsistencyDet: Robust Object Detector with Denoising Paradigm of Consistency Model},
      author={Lifan Jiang and Zhihui Wang and Changmiao Wang and Ming Li and Jiaxu Leng and Xindong Wu},
      year={2024},
      eprint={2404.07773},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{luo2023diffusiontrack,
      title={DiffusionTrack: Diffusion Model For Multi-Object Tracking},
      author={Run Luo and Zikai Song and Lintao Ma and Jinlin Wei and Wei Yang and Min Yang},
      year={2023},
      eprint={2308.09905},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}