
Conversation

@BloomBerry commented Mar 5, 2025

Add support for ColQwen2VL model
Description
This PR adds support for the ColQwen2VL model to vLLM. ColQwen2VL is an efficient document retrieval vision language model based on Qwen2VL, as described in the paper "ColPali: Efficient Document Retrieval with Vision Language Models". The model is designed to generate embeddings rather than text outputs, making it suitable for document retrieval applications.
Key implementation details:
Extended the existing Qwen2VL implementation for ColQwen2VL compatibility
Implemented custom text projection layer and L2 normalization for embedding generation (see the sketch below)
Added appropriate processing utilities for image and video inputs
Overrode forward, compute_logits and sample methods to optimize for embedding output
This implementation enables users to leverage ColQwen2VL's multimodal document retrieval capabilities through vLLM's efficient serving infrastructure.
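
As a rough illustration of the embedding head described above, here is a minimal sketch (not the PR's actual code; the 128-dimensional output and the module name are assumptions based on the ColPali paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ColQwen2VLEmbeddingHead(nn.Module):
    # Hypothetical sketch: project LM hidden states down to a small retrieval dim
    def __init__(self, hidden_size: int, embed_dim: int = 128):
        super().__init__()
        self.custom_text_proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_size) -> (num_tokens, embed_dim)
        emb = self.custom_text_proj(hidden_states)
        # L2-normalize each token embedding so dot products act as cosine scores
        return F.normalize(emb, p=2, dim=-1)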
Testing
Tested with sample image inputs
Verified embedding output format and dimensions
Confirmed compatibility with HuggingFace ColQwen2VL models

FIX #19381

github-actions bot commented Mar 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the documentation (Improvements or additions to documentation) label Mar 5, 2025
@DarkLight1337 (Member)

Thanks for implementing this! Can you update the following files as well?

  • Supported Models page
  • Test registry tests/models/test_registry.py
  • Model correctness tests tests/models/embedding/vision_language
  • Processor correctness tests tests/models/multimodal/processing/test_common.py

DarkLight1337 changed the title from "add colqwen2_vl code & inference" to "[Model] add colqwen2_vl code & inference" Mar 5, 2025
Signed-off-by: BloomBerry <jyjang1090@gmail.com>
@mgoin (Member) commented May 25, 2025

Hey @BloomBerry, I'm working on reviving this PR since it has drifted away from the refactors on main and needs some more testing. Would you like me to push to this PR myself, or should I start a new one?

It seems to require this Transformers PR huggingface/transformers#35778

mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @BloomBerry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Jun 4, 2025
mgoin mentioned this pull request Jun 9, 2025
mergify bot added the qwen (Related to Qwen models) label Jun 19, 2025
@issahammoud

Hi, is there an estimate of when the PR will be merged?

mergify bot added the new-model (Requests to new models) label Jul 11, 2025
@SMAntony commented Sep 2, 2025

Is anyone working on this?

@issahammoud

I was able to serve ColQwen2.5-VL 3B (https://huggingface.co/Metric-AI/ColQwen2.5-3b-multilingual-v1.0) with vLLM by making some modifications to the source code.

The idea is to use Qwen2.5-VL with the ALL pooling type so it outputs all token embedding vectors for late interaction.

Here is a git patch you can apply to the vLLM source code (tested with v0.11.0).
colqwen.patch

I am using it with the local weights of Metric-AI/ColQwen2.5-3b-multilingual-v1.0 (with the base config from vidore/colqwen2.5-base).

You just need to change the architecture name in config.json from ColQwen2_5 to ColQwen2_5_VLForConditionalGeneration and add the following modules.json file:

[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  }
]

I am running the OpenAI-compatible server via Docker Compose as follows:

entrypoint: ["vllm", "serve"]
command:
      - "/root/.cache/huggingface/hub/models--Metric-AI--ColQwen2.5-3b-multilingual-v1.0/snapshots/e2a1c05d053dcf4ad6e39b6c48ced9d6a81071f0"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--runner"
      - "pooling"
      - "--convert"
      - "embed"
      - "--dtype"
      - "bfloat16"
      - "--max-model-len"
      - "1024"
      - "--gpu-memory-utilization"
      - "0.8"
      - "--trust-remote-code"
      - "--quantization"
      - "bitsandbytes"
      - "--override-pooler-config"
      - '{"pooling_type":"ALL","normalize":true}'
      - "--served-model-name"
      - "anyname"

It is working well with high throughput on an 8GB GPU. Hope it helps.
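
As a follow-up on the late-interaction idea: since the server returns one embedding vector per token (pooling_type ALL), you score a query against a document with ColBERT/ColPali-style MaxSim. A minimal sketch, assuming both sides are already L2-normalized tensors of shape (num_tokens, dim):

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    # Cosine similarity between every query token and every document token
    sim = query_emb @ doc_emb.T  # (query_tokens, doc_tokens)
    # Keep the best-matching document token for each query token, then sum
    return sim.max(dim=1).values.sum().item()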

@HoangTung-Vu

Does your patch support multimodal (image) embeddings?

@issahammoud

Does your patch support multimodal (image) embeddings?

@HoangTung-Vu Yes indeed.

You should follow the same query structure as colpali-engine:

payload = {
    "model": "my_model_name",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|><|endoftext|>"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}}
        ],
    }],
    "encoding_format": "float",
}

resp = requests.post(embedding_url, json=payload)

However, you cannot use the OpenAI client code because it does not support multimodal embeddings.

@HoangTung-Vu commented Oct 13, 2025

I already used requests directly instead of the OpenAI client code, but I encountered a 400 Bad Request error.
Did you add any config to the model?

If I comment out the image part, it works:

embedding_url = "http://50.175.95.210:50168/v1/embeddings/"

payload={
    "model": "colqwen",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|><|endoftext|>"},
            # {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
        ],
    }],
    "encoding_format": "float",
}

resp = requests.post(embedding_url, json=payload)
print(resp)       

@issahammoud commented Oct 13, 2025

@HoangTung-Vu I need more context to understand why this happened to you.

Could you tell me the exact steps you took, and share the full error message?

@HoangTung-Vu

I applied your patch using Git commands, but it raised some errors, so I manually integrated the changes instead.
I cloned the vllm repository and applied the modifications on the main branch (currently at version v0.11.0).

For the model, I cloned OpenGVLab/colqwen2_5-3b-base, added the modules.json file as in your implementation, and updated the model class in config.json.

However, when sending a request to the model, I still receive a 400 Bad Request response.

@issahammoud

@HoangTung-Vu Make sure that vLLM is loading the correct model. It happened to me that it loaded a default model because it could not load the local one.
In addition, when cloning vLLM and adding the changes, you obviously need to build it from source so the changes take effect. This step can take a lot of time (up to multiple hours depending on your configuration).

I installed the Docker version built for my specific hardware, so it was faster.
So I suggest you make sure it is loading your model and not a default one, and confirm that you installed vLLM from source and that you are actually using that build.
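
As a quick sanity check (standard OpenAI-compatible endpoint; the host and port here assume the compose file below), you can ask the running server which model it actually loaded:

import requests

# Lists the model(s) served by the running vLLM instance
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())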

Here is my docker compose for an RTX 3070:


embedding:
    build:
      context: vllm
      dockerfile: docker/Dockerfile
      target: vllm-openai
      args:
        - max_jobs=8
        - nvcc_threads=2
        - torch_cuda_arch_list=8.6
        - VLLM_USE_PRECOMPILED=1
    environment:
      - DOCKER_BUILDKIT=1
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_HOST=0.0.0.0
      - VLLM_PORT=8000
      - CUDA_HOME=/usr/local/cuda-12.8
      - CUDACXX=/usr/local/cuda-12.8/bin/nvcc
      - LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64
      - HF_HUB_OFFLINE=1
      - TORCH_CUDA_ARCH_LIST=8.6
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    entrypoint: ["vllm", "serve"]
    command:
      - "/root/.cache/huggingface/hub/models--Metric-AI--ColQwen2.5-3b-multilingual-v1.0/snapshots/e2a1c05d053dcf4ad6e39b6c48ced9d6a81071f0"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--runner"
      - "pooling"
      - "--convert"
      - "embed"
      - "--dtype"
      - "bfloat16"
      - "--max-model-len"
      - "1024"
      - "--gpu-memory-utilization"
      - "0.8"
      - "--trust-remote-code"
      - "--quantization"
      - "bitsandbytes"
      - "--override-pooler-config"
      - '{"pooling_type":"ALL","normalize":true}'
      - "--served-model-name"
      - "my-model-name"

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 60s
      timeout: 300s
      retries: 3
    restart: unless-stopped
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      

@HoangTung-Vu

I ran my tests on a cloud instance from Vast.ai. Since it is a virtual container environment, I was not able to use Docker Compose as in your setup.

For the model (ColQwen), I cloned it directly from Hugging Face. I chose the base model so that I could edit the model_class field in config.json. The fine-tuned variants only include adapter configurations, so they were not suitable for this purpose.

When running vLLM, I pointed directly to the local model directory, so I assume it correctly loaded the intended model.

Regarding vLLM itself, I installed it from source using:

pip install -e .

I suspect that the 400 Bad Request error might be caused by an incorrect configuration of the ColQwen model on my side. I’ll review the model setup again to ensure it matches your patch specifications.

@issahammoud

@HoangTung-Vu
I recommend setting HF_HUB_OFFLINE=1 so it will not try to download another model.
Also check the .cache directory to see if there are models you are not aware of.

@HoangTung-Vu

I have rechecked the configuration and reinstalled everything.
However, with the message template above, it works when the image is provided via a URL, but not when using a base64 string.
Do you know why this might be happening? Thank you very much!

@DarkLight1337 (Member) commented Oct 15, 2025

What does your base64 URL look like? Make sure it is in the correct format.

@issahammoud

@HoangTung-Vu
Check the base64 format; I convert a PIL image as follows:

import io
import base64

buffer = io.BytesIO()
img.save(buffer, format="png")
buffer.seek(0)
img_base64 = base64.b64encode(buffer.read()).decode("utf-8")
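
The resulting string is what goes after the data:image/png;base64, prefix in the image_url field shown earlier:

image_url = f"data:image/png;base64,{img_base64}"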


Labels

  • documentation (Improvements or additions to documentation)
  • needs-rebase
  • new-model (Requests to new models)
  • qwen (Related to Qwen models)


Development

Successfully merging this pull request may close these issues.

[New Model]: Support ColQwen2VL

6 participants