🔥【CVPR 2025】Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment

This repository provides the official PyTorch implementation of DiaNA.

If you are interested, check out our related papers:

【AAAI 2024】 An Empirical Study of CLIP for Text-based Person Search [paper | code]
【ACM MM 2023】 Text-based Person Search without Parallel Image-Text Data [paper]
【IJCAI 2023】 RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search [paper | code]
【ICASSP 2022】 Learning Semantic-Aligned Feature Representation for Text-based Person Search [paper | code]

📖 Overview

DiaNA is a dialogue-refined cross-modal framework for chat-based person retrieval. It uses two adaptive attribute refiner modules to bottleneck conversational and visual information, enabling fine-grained cross-modal alignment.

📌 TODO

✅ Release code
✅ Release checkpoints
✅ Release dataset

🗂️ Data Preparation

🔹 Pretraining Dataset

MALS is a large-scale synthetic text-based person retrieval (TPR) dataset with 1.5M image-text pairs.

🔹 Fine-tuning Dataset: ChatPedes

Download images from CUHK-PEDES
Download ChatPedes annotation files from here
Organize the dataset as follows:

<ROOT>/ChatPedes
    - train_reid.json
    - test_reid.json
    - imgs00
        - cam_a
        - cam_b
        - ...
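Before launching training, it can help to confirm that the dataset folder matches the layout above. The following is a minimal sketch, not part of the repository: `missing_chatpedes_entries` is a hypothetical helper, and the entry names are copied from the listing above on the assumption that they are literal file and folder names.

```python
from pathlib import Path

# Entry names copied from the directory listing in this README.
# Assumption: they are literal names; adjust them if your copy differs.
REQUIRED_FILES = ("train_reid.json", "test_reid.json")
REQUIRED_DIRS = ("imgs00",)


def missing_chatpedes_entries(root):
    """Return the expected entries that are absent under <root>/ChatPedes."""
    base = Path(root) / "ChatPedes"
    # Annotation files must exist as regular files.
    missing = [name for name in REQUIRED_FILES if not (base / name).is_file()]
    # Image folders must exist as directories.
    missing += [name for name in REQUIRED_DIRS if not (base / name).is_dir()]
    return missing
```

Calling `missing_chatpedes_entries("<ROOT>")` with your own dataset root returns an empty list when every expected entry is present, and otherwise lists the names that still need to be downloaded or moved.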

🏋️‍♂️ Training

🔹 Stage 1: Pretraining on MALS

Image Encoder: Swin Transformer v2-B
Dialogue Encoder: Llama 3.2-1B

Run Pretraining:

cd DiaNA/train
bash shell/pretrain.sh

Resources:

🤗 Pretrained Checkpoint
📜 Training Log

🔹 Stage 2: Fine-tuning on ChatPedes

Run Fine-tuning:

cd DiaNA/train
bash shell/finetune.sh

Resources:

🤗 Fine-tuned Checkpoint
📜 Training Log

🎯 Evaluation

Run Evaluation:

cd DiaNA/eval
bash shell/eval.sh

Resources:

📜 Evaluation Log
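The pretraining, fine-tuning, and evaluation commands above run in a fixed order, so they can be chained from a single script. The sketch below is illustrative only: `STAGES`, `build_command`, and `run_stages` are hypothetical names that do not exist in the repository, and it assumes the script is started from the directory that contains the `DiaNA` checkout. The working directories and script paths are the ones given in this README.

```python
import subprocess
from pathlib import Path

# (working directory, script) pairs copied from the commands in this README.
STAGES = {
    "pretrain": ("DiaNA/train", "shell/pretrain.sh"),
    "finetune": ("DiaNA/train", "shell/finetune.sh"),
    "eval": ("DiaNA/eval", "shell/eval.sh"),
}


def build_command(stage):
    """Return (cwd, argv) for one stage without running anything."""
    cwd, script = STAGES[stage]
    return Path(cwd), ["bash", script]


def run_stages(stages=("pretrain", "finetune", "eval"), dry_run=True):
    """Print each stage in order and, unless dry_run is set, execute it."""
    for stage in stages:
        cwd, argv = build_command(stage)
        print(f"[{stage}] cd {cwd} && {' '.join(argv)}")
        if not dry_run:
            # check=True raises CalledProcessError, so later stages are skipped on failure.
            subprocess.run(argv, cwd=cwd, check=True)
```

With the default `dry_run=True`, `run_stages()` only prints the commands, which is a cheap way to check the plan before committing GPU time. Pass `dry_run=False` to execute them.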

✨ Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@InProceedings{bai2025chat,
    author    = {Bai, Yang and Ji, Yucheng and Cao, Min and Wang, Jinqiao and Ye, Mang},
    title     = {Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {3952--3962},
    month     = {June},
    year      = {2025}
}

⚖️ License

This code is distributed under the MIT License.

