This repository offers the official implementation of DiaNA in PyTorch.
If you are interested, check out our related papers:
- 【AAAI 2024】 An Empirical Study of CLIP for Text-based Person Search [paper | code]
- 【ACM MM 2023】 Text-based Person Search without Parallel Image-Text Data [paper]
- 【IJCAI 2023】 RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search [paper | code]
- 【ICASSP 2022】 Learning Semantic-Aligned Feature Representation for Text-based Person Search [paper | code]
DiaNA is a novel dialogue-refined cross-modal framework for chat-based person retrieval that leverages two adaptive attribute refiner modules to bottleneck the conversational and visual information for fine-grained cross-modal alignment.
- ✅ Release code
- ✅ Release checkpoints
- ✅ Release dataset
- MALS, a large-scale synthetic TPR dataset with 1.5M image-text pairs
- Download images from CUHK-PEDES
- Download ChatPedes annotation files from here
- Organize the dataset as follows:
<ROOT>/ChatPedes
- train_reid.json
- test_reid.json
- imgs00
  - cam_a
  - cam_b
  - ...
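Before launching training, it can help to confirm that the files are where the scripts expect them. The following is a minimal sketch, not part of the released code: the function name `missing_entries` and its parameters are hypothetical, and the image folder name is passed in by the caller rather than hard-coded, so it should be set to whatever the extracted CUHK-PEDES image folder is called on your machine.

```python
from pathlib import Path

# Annotation file names taken from the layout listed above.
REQUIRED_FILES = ("train_reid.json", "test_reid.json")


def missing_entries(root, image_dir, cameras=("cam_a", "cam_b")):
    """Return the relative paths that are absent under <root>/ChatPedes.

    `image_dir` is the name of the image folder inside ChatPedes; only the
    camera sub-folders named in `cameras` are checked (assumption: the
    remaining sub-folders follow the same pattern).
    """
    base = Path(root) / "ChatPedes"
    expected = [Path(name) for name in REQUIRED_FILES]
    expected += [Path(image_dir) / cam for cam in cameras]
    # Report every expected entry that does not exist on disk.
    return [rel.as_posix() for rel in expected if not (base / rel).exists()]
```

An empty list means every checked entry was found; otherwise the returned paths show what still needs to be downloaded or moved.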
- Image Encoder: Swin Transformer v2-B
- Dialogue Encoder: Llama 3.2-1B
Run Pretraining:
cd DiaNA/train
bash shell/pretrain.sh
Run Fine-tuning:
cd DiaNA/train
bash shell/finetune.sh
Run Evaluation:
cd DiaNA/eval
bash shell/eval.sh
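If you prefer to launch the three stages from a single entry point, the commands above can be wrapped in a small driver. This is only a sketch under stated assumptions: the names `STAGES`, `plan`, and `run` are hypothetical and do not exist in the repository, the stages are assumed to run in the order pretraining, fine-tuning, evaluation, and the working directories are resolved relative to the folder that contains the `DiaNA` checkout.

```python
import subprocess
from pathlib import Path

# (working directory, script) pairs copied from the commands above.
STAGES = {
    "pretrain": ("DiaNA/train", "shell/pretrain.sh"),
    "finetune": ("DiaNA/train", "shell/finetune.sh"),
    "eval": ("DiaNA/eval", "shell/eval.sh"),
}


def plan(stages=("pretrain", "finetune", "eval"), repo_parent="."):
    """Return (cwd, argv) pairs for the requested stages, in order."""
    steps = []
    for name in stages:
        workdir, script = STAGES[name]
        steps.append((Path(repo_parent) / workdir, ["bash", script]))
    return steps


def run(stages=("pretrain", "finetune", "eval"), repo_parent=".", dry_run=True):
    """Print each stage and, unless dry_run is set, execute it."""
    for cwd, argv in plan(stages, repo_parent):
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a failed stage stops the remaining ones from running.
            subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run()` with the default `dry_run=True` only prints the commands; pass `dry_run=False` to execute them, or pass a shorter `stages` tuple such as `("eval",)` to run a single stage.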
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@InProceedings{bai2025chat,
author = {Bai, Yang and Ji, Yucheng and Cao, Min and Wang, Jinqiao and Ye, Mang},
title = {Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages = {3952--3962},
month = {June},
year = {2025}
}
This code is distributed under the MIT License.
