This is the implementation of our ECCV'24 paper "SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation", based on Qwen2-VL.
Download each dataset from its website:
- ADE20K: https://groups.csail.mit.edu/vision/datasets/ADE20K/
- PACO-LVIS: https://github.com/facebookresearch/paco/tree/main
- Part-ImageNet: https://github.com/TACJu/PartImageNet
- RefCOCO: https://github.com/lichengunc/refer
- GRES: https://github.com/henghuiding/ReLA
Put all of them under the dataset directory so that you get the following layout:
dataset/ (put your dataset)
| ├──ADE20K_2021_17_01/
| | ├── images
| ├──PACO/
| | ├──paco_lvis_v1_test.json
| | ├──paco_lvis_v1_train.json
| | ├──paco_lvis_v1_val.json
| ├──Part-ImageNet/
| | ├── annotations
| | ├── images
| ├──RefCOCO/
| | ├── refcoco
| | | ├── instances.json
| | | ├── refs(unc).p
| | ├── refcoco+
| | | ├── instances.json
| | | ├── refs(unc).p
| | ├── refcocog
| | | ├── instances.json
| | | ├── refs(unc).p
| ├──GRES/
| | ├── grefs(unc).json
| | ├── instances.json
| ├──COCO/
| | ├── train2017
| | ├── val2017
qwen2vl-SAM4MLLM/
| ├── data/
| | ├── sam_checkpoint/
| | | ├── effvit_xl1_decoder_coco_ft.pt
| | | ├── xl1.pt (download from Google Drive [1])
| | ├── ade20k_ref_data.json (generated by ./data/ade20k.ipynb)
| | ├── paco_ref_data.json (generated by ./data/paco_lvis.ipynb)
| | ├── refcoco_gres.json (generated by ./data/refcoco_gres.ipynb)
| | ├── partimagenet_ref_data.json (generated by ./data/part_imagenet.ipynb)
| | ├── convs_ep1.json (generated by ./to_chat_format.ipynb)
| ├── LLaMA-Factory/
| | ├── data/
| | | ├── sam4mllm-qwen2vl.json (download from Google Drive [1])
[1] Google Drive
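The preparation notebooks expect the directory names shown above. Below is a minimal sanity-check sketch that simply mirrors that layout; DATASET_ROOT and REPO_ROOT are placeholders you should point at your actual paths:

```python
# check_layout.py -- quick check that the expected files/directories exist.
# The path list mirrors the layout above; adjust the two roots to your setup.
from pathlib import Path

DATASET_ROOT = Path("dataset")
REPO_ROOT = Path("qwen2vl-SAM4MLLM")

expected = [
    DATASET_ROOT / "ADE20K_2021_17_01/images",
    DATASET_ROOT / "PACO/paco_lvis_v1_train.json",
    DATASET_ROOT / "Part-ImageNet/annotations",
    DATASET_ROOT / "RefCOCO/refcoco/instances.json",
    DATASET_ROOT / "GRES/grefs(unc).json",
    DATASET_ROOT / "COCO/train2017",
    REPO_ROOT / "data/sam_checkpoint/effvit_xl1_decoder_coco_ft.pt",
    REPO_ROOT / "data/sam_checkpoint/xl1.pt",
]

missing = [p for p in expected if not p.exists()]
for p in missing:
    print(f"MISSING: {p}")
print("Layout OK" if not missing else f"{len(missing)} path(s) missing")
```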
- Create a virtual environment and install the requirements
conda create -n sam4mllm python=3.11
conda activate sam4mllm
pip install -r requirements.txt
- LLaMA-Factory
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
- Flash-attn
pip install flash-attn --no-build-isolation
- Efficient-ViT
git clone https://github.com/mit-han-lab/efficientvit.git
cd efficientvit
pip install -U -r requirements.txt
- Transformers
Please refer to this PR to modify the code in transformers: Link
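After installation, a quick import check can catch a broken flash-attn or CUDA setup early. This is a generic sketch, nothing repo-specific:

```python
# env_check.py -- verify the core dependencies import and CUDA is visible.
import torch
import transformers
import flash_attn  # fails here if flash-attn was not built for this torch/CUDA combo

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("flash-attn:", flash_attn.__version__)
```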
Run the following notebooks to arrange the data (remember to set the data path in each notebook):
- ./data/ade20k.ipynb
- ./data/refcoco_gres.ipynb
- ./data/paco_lvis.ipynb
- ./data/part_imagenet.ipynb
Then generate the dialogue-format training data:
- ./to_chat_format.ipynb
Note: to_chat_format.ipynb needs the JSON files ade20k_ref_data.json, paco_ref_data.json, refcoco_gres.json, and partimagenet_ref_data.json in order to generate convs_ep1.json, the dialogue-format training data. Alternatively, you can download convs_ep1.json directly from Google Drive [1] and place it in ./data.
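If you generate the JSON files yourself, the sketch below just loads each one and reports its entry count. It assumes each file is a top-level JSON list, which may not match the actual format, so treat it only as a quick smoke test:

```python
# inspect_ref_data.py -- load the generated JSON files and report entry counts.
# Assumes each file is a JSON array of samples; adjust if your files differ.
import json
from pathlib import Path

DATA_DIR = Path("data")  # i.e. qwen2vl-SAM4MLLM/data
files = [
    "ade20k_ref_data.json",
    "paco_ref_data.json",
    "refcoco_gres.json",
    "partimagenet_ref_data.json",
    "convs_ep1.json",
]

for name in files:
    path = DATA_DIR / name
    if not path.exists():
        print(f"{name}: not found")
        continue
    with open(path) as f:
        data = json.load(f)
    print(f"{name}: {len(data)} entries")
```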
Then generate the Qwen2-VL training data for LLaMA-Factory (remember to set DATA_PATH and LLaMA_Factory_PATH in the script):
python convert_llava_qwen2.py
- Train
cd LLaMA-Factory
llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_sam4mllm.yaml
- Export Model
llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft_sam4mllm.yaml
- Web Chat / API
llamafactory-cli webchat examples/inference/qwen2_vl_sam4mllm.yaml
llamafactory-cli api examples/inference/qwen2_vl_sam4mllm.yaml
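The api command serves an OpenAI-compatible endpoint. The sketch below assumes the default address http://localhost:8000 and the /v1/chat/completions route; the model name, image file, and prompt are placeholders to adapt to your deployment:

```python
# query_api.py -- example request against the LLaMA-Factory API server.
# Assumes the default http://localhost:8000 and the OpenAI-compatible
# /v1/chat/completions route; adjust the payload to your setup.
import base64
import requests

with open("example.jpg", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "sam4mllm-qwen2vl",  # placeholder; the server answers with the loaded model
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Point to the person wearing a red jacket."},  # example referring expression
            ],
        }
    ],
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```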