QWen2vl-SAM4MLLM

This is the implementation of our ECCV'24 paper "SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation", based on QWen2VL.

Data

Download each dataset from its website:

Put all of them under the dataset directory so that you end up with the following structure:

    dataset/ (put your dataset)
    |  ├──ADE20K_2021_17_01/
    |  |  ├── images
    |  ├──PACO/
    |  |  ├──paco_lvis_v1_test.json
    |  |  ├──paco_lvis_v1_train.json
    |  |  ├──paco_lvis_v1_val.json
    |  ├──Part-ImageNet/
    |  |  ├── annotations
    |  |  ├── images
    |  ├──RefCOCO/
    |  |  ├── refcoco
    |  |  |  ├── instances.json
    |  |  |  ├── refs(unc).p
    |  |  ├── refcoco+
    |  |  |  ├── instances.json
    |  |  |  ├── refs(unc).p
    |  |  ├── refcocog
    |  |  |  ├── instances.json
    |  |  |  ├── refs(unc).p
    |  ├──GRES/
    |  |  ├── grefs(unc).json
    |  |  ├── instances.json
    |  ├──COCO/
    |  |  ├── train2017
    |  |  ├── val2017
    qwen2vl-SAM4MLLM/
    ├──data/
    |  ├──sam_checkpoint/
    |  |  ├── effvit_xl1_decoder_coco_ft.pt
    |  |  ├── xl1.pt (download from Google Drive [1])
    |  ├──ade20k_ref_data.json (generated by ./data/ade20k.ipynb)
    |  ├──paco_ref_data.json (generated by ./data/paco_lvis.ipynb)
    |  ├──refcoco_gres.json (generated by ./data/refcoco_gres.ipynb)
    |  ├──partimagenet_ref_data.json (generated by ./data/part_imagenet.ipynb)
    |  ├──convs_ep1.json (generated by ./to_chat_formate.ipynb)
    ├──LLaMA-Factory/
    |  ├── data/
    |  |  ├── sam4mllm-qwen2vl.json (download from Google Drive [1])

[1] Google Drive
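A quick way to sanity-check the RefCOCO layout before preprocessing is to load the annotation files directly. The snippet below is a minimal sketch, assuming it is run from the directory that contains dataset/; it only prints counts and the keys of one pickled referring-expression record.

    import json
    import pickle

    # Paths follow the dataset/ layout above (assumption: run next to dataset/).
    refs_path = "dataset/RefCOCO/refcoco/refs(unc).p"
    insts_path = "dataset/RefCOCO/refcoco/instances.json"

    # refs(unc).p is a pickled list of referring-expression records.
    with open(refs_path, "rb") as f:
        refs = pickle.load(f)

    # instances.json is COCO-style (images / annotations / categories).
    with open(insts_path, "r") as f:
        instances = json.load(f)

    print(len(refs), "referring expressions,", len(instances["annotations"]), "annotations")
    print("example record keys:", sorted(refs[0].keys()))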

Requirements

  • Create a virtual environment and install the requirements
conda create -n sam4mllm python==3.11
conda activate sam4mllm
pip install -r requirements.txt
  • LLaMA-Factory
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
  • Flash-attn
pip install flash-attn --no-build-isolation
  • Efficient-ViT
git clone https://github.com/mit-han-lab/efficientvit.git
cd efficientvit
pip install -U -r requirements.txt
  • Transformers

Please refer to this PR to modify the code in transformers. Link
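After installing the requirements and applying the transformers patch, you can sanity-check that Qwen2-VL loads with FlashAttention 2. This is only an illustrative check, not part of the pipeline; the model ID below is the public Qwen/Qwen2-VL-7B-Instruct checkpoint and may differ from the base model referenced in your training config.

    import torch
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    # Assumption: the public 7B instruct checkpoint; swap in the base model from your config.
    model_id = "Qwen/Qwen2-VL-7B-Instruct"

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # requires the flash-attn install above
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    print("loaded", model.config.model_type)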

Pre-processing Data

Run the following notebooks to arrange the data (remember to set the data path in each notebook); a sketch for running them non-interactively follows the list:

  • ./data/ade20k.ipynb
  • ./data/refcoco_gres.ipynb
  • ./data/paco_lvis.ipynb
  • ./data/part_imagenet.ipynb
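If you prefer to run the preprocessing non-interactively, each notebook can be executed in place with nbconvert once its data path is set. A minimal sketch, assuming jupyter/nbconvert is installed in the sam4mllm environment:

    import subprocess

    # Execute each preprocessing notebook in place (data paths must already be set inside them).
    notebooks = [
        "./data/ade20k.ipynb",
        "./data/refcoco_gres.ipynb",
        "./data/paco_lvis.ipynb",
        "./data/part_imagenet.ipynb",
    ]
    for nb in notebooks:
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
            check=True,
        )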

Then generate the dialogue-format training data:

  • ./to_chat_format.ipynb

Note: You should prepare the JSON files ade20k_ref_data.json, paco_ref_data.json, refcoco_gres.json, and partimagenet_ref_data.json before generating convs_ep1.json, the dialogue-format training data, with to_chat_formate.ipynb. Alternatively, you can download convs_ep1.json directly from Google Drive and place it in ./data.

Then generate the Qwen2 training data (remember to set DATA_PATH and LLaMA_Factory_PATH):

python convert_llava_qwen2.py 
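The exact schema of sam4mllm-qwen2vl.json is defined by convert_llava_qwen2.py, but LLaMA-Factory's multimodal SFT data generally follows its sharegpt-style layout: a messages list plus an images list per sample. The record below is purely illustrative (the prompt, answer, and image path are made up); it is not the output of the conversion script.

    import json

    # Illustrative sharegpt-style multimodal record (assumption: convert_llava_qwen2.py
    # emits something equivalent; all field contents here are made up).
    record = {
        "messages": [
            {"role": "user", "content": "<image>Where is the red mug on the table?"},
            {"role": "assistant", "content": "<answer text produced by the conversion script>"},
        ],
        "images": ["dataset/COCO/train2017/<image_id>.jpg"],
    }

    with open("example_record.json", "w") as f:
        json.dump([record], f, indent=2)

For LLaMA-Factory to pick the file up, it typically also needs a matching entry in LLaMA-Factory/data/dataset_info.json with "formatting": "sharegpt" and columns mapping messages and images.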

Training

  • Train with LoRA
cd LLaMA-Factory
llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_sam4mllm.yaml
  • Export Model
llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft_sam4mllm.yaml

Inference

  • Web chat demo
llamafactory-cli webchat examples/inference/qwen2_vl_sam4mllm.yaml
  • API server
llamafactory-cli api examples/inference/qwen2_vl_sam4mllm.yaml
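Once llamafactory-cli api is running, it exposes an OpenAI-compatible chat endpoint (port 8000 by default, configurable via API_PORT). A minimal text-only request sketch, assuming the default host/port and a placeholder model name:

    import requests

    # Assumption: llamafactory-cli api is serving on localhost:8000 (the default).
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "sam4mllm-qwen2vl",  # placeholder; use the name your config serves
            "messages": [
                {"role": "user", "content": "Describe the referring expression task in one sentence."}
            ],
        },
        timeout=60,
    )
    print(resp.json()["choices"][0]["message"]["content"])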
