
Conversation

@BloomBerry commented Mar 5, 2025

Add support for ColQwen2VL model
Description
This PR adds support for the ColQwen2VL model to vLLM. ColQwen2VL is an efficient document retrieval vision language model based on Qwen2VL, as described in the paper "ColPali: Efficient Document Retrieval with Vision Language Models". The model is designed to generate embeddings rather than text outputs, making it suitable for document retrieval applications.
Key implementation details:
Extended the existing Qwen2VL implementation for ColQwen2VL compatibility
Implemented custom text projection layer and L2 normalization for embedding generation (see the sketch below)
Added appropriate processing utilities for image and video inputs
Overrode forward, compute_logits and sample methods to optimize for embedding output
This implementation enables users to leverage ColQwen2VL's multimodal document retrieval capabilities through vLLM's efficient serving infrastructure.
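
As a rough illustration of the embedding head described above, here is a minimal sketch (not the PR's actual code; the 128-dimensional output and the module name are assumptions based on the ColPali paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ColQwen2VLEmbeddingHead(nn.Module):
    # Hypothetical sketch: project LM hidden states down to a small retrieval dim
    def __init__(self, hidden_size: int, embed_dim: int = 128):
        super().__init__()
        self.custom_text_proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_size) -> (num_tokens, embed_dim)
        emb = self.custom_text_proj(hidden_states)
        # L2-normalize each token embedding so dot products act as cosine scores
        return F.normalize(emb, p=2, dim=-1)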
Testing
Tested with sample image inputs
Verified embedding output format and dimensions
Confirmed compatibility with HuggingFace ColQwen2VL models

FIX #19381

github-actions bot commented Mar 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the documentation (Improvements or additions to documentation) label Mar 5, 2025
@DarkLight1337 (Member)

Thanks for implementing this! Can you update the following files as well?

  • Supported Models page
  • Test registry tests/models/test_registry.py
  • Model correctness tests tests/models/embedding/vision_language
  • Processor correctness tests tests/models/multimodal/processing/test_common.py

DarkLight1337 changed the title from "add colqwen2_vl code & inference" to "[Model] add colqwen2_vl code & inference" Mar 5, 2025
Signed-off-by: BloomBerry <jyjang1090@gmail.com>
@mgoin (Member) commented May 25, 2025

Hey @BloomBerry, I'm working on reviving this PR since it has drifted away from the refactors on main and needs some more testing. Would you like me to push to this PR myself, or should I start a new one?

It seems to require this Transformers PR huggingface/transformers#35778

mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @BloomBerry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Jun 4, 2025
mgoin mentioned this pull request Jun 9, 2025
mergify bot added the qwen (Related to Qwen models) label Jun 19, 2025
@issahammoud

Hi, is there an estimate of when the PR will be merged?

mergify bot added the new-model (Requests to new models) label Jul 11, 2025
@SMAntony commented Sep 2, 2025

Is anyone working on this?

@issahammoud

I was able to serve ColQwen2.5-VL 3B (https://huggingface.co/Metric-AI/ColQwen2.5-3b-multilingual-v1.0) with vLLM by making some modifications to the source code.

The idea is to use Qwen2.5-VL with the ALL pooling type so it outputs all token embedding vectors for late interaction.

Here is a git patch you can apply to the vLLM source code (tested with v0.11.0).
colqwen.patch

I am using it with the local weights of Metric-AI/ColQwen2.5-3b-multilingual-v1.0 (with the base config from vidore/colqwen2.5-base).

You just need to change the architecture name in config.json from ColQwen2_5 to ColQwen2_5_VLForConditionalGeneration and add the following modules.json file:

[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  }
]

I am running the OpenAI-compatible server via Docker Compose as follows:

entrypoint: ["vllm", "serve"]
command:
      - "/root/.cache/huggingface/hub/models--Metric-AI--ColQwen2.5-3b-multilingual-v1.0/snapshots/e2a1c05d053dcf4ad6e39b6c48ced9d6a81071f0"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--runner"
      - "pooling"
      - "--convert"
      - "embed"
      - "--dtype"
      - "bfloat16"
      - "--max-model-len"
      - "1024"
      - "--gpu-memory-utilization"
      - "0.8"
      - "--trust-remote-code"
      - "--quantization"
      - "bitsandbytes"
      - "--override-pooler-config"
      - '{"pooling_type":"ALL","normalize":true}'
      - "--served-model-name"
      - "anyname"

It is working well with high throughput on an 8GB GPU. Hope it helps.
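
As a follow-up on the late-interaction idea: since the server returns one embedding vector per token (pooling_type ALL), you score a query against a document with ColBERT/ColPali-style MaxSim. A minimal sketch, assuming both sides are already L2-normalized tensors of shape (num_tokens, dim):

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    # Cosine similarity between every query token and every document token
    sim = query_emb @ doc_emb.T  # (query_tokens, doc_tokens)
    # Keep the best-matching document token for each query token, then sum
    return sim.max(dim=1).values.sum().item()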

@HoangTung-Vu

Does your patch support multimodal (image) embeddings?

@issahammoud

Does your patch support multimodal (image) embeddings?

@HoangTung-Vu Yes indeed.

You should follow the same query structure as colpali-engine:

payload = {
    "model": "my_model_name",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|><|endoftext|>"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}}
        ],
    }],
    "encoding_format": "float",
}

resp = requests.post(embedding_url, json=payload)

However, you cannot use the OpenAI client code because it does not support multimodal embeddings.

@HoangTung-Vu commented Oct 13, 2025

I already used requests directly instead of the OpenAI client code, but I encountered a 400 Bad Request error.
Did you add any config to the model?

If I comment out the image part, it works:

embedding_url = "http://50.175.95.210:50168/v1/embeddings/"

payload={
    "model": "colqwen",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|><|endoftext|>"},
            # {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
        ],
    }],
    "encoding_format": "float",
}

resp = requests.post(embedding_url, json=payload)
print(resp)       

@issahammoud commented Oct 13, 2025

@HoangTung-Vu I need more context to understand why this happened to you.

Could you tell me the exact steps you took, and share the full error message?

@HoangTung-Vu

I applied your patch using Git commands, but it raised some errors, so I manually integrated the changes instead.
I cloned the vllm repository and applied the modifications on the main branch (currently at version v0.11.0).

For the model, I cloned OpenGVLab/colqwen2_5-3b-base, added the modules.json file as in your implementation, and updated the model class in config.json.

However, when sending a request to the model, I still receive a 400 Bad Request response.

@issahammoud

@HoangTung-Vu Make sure that vLLM is loading the correct model. It happened to me that it loaded a default model because it could not load the local one.
In addition, when cloning vLLM and adding the changes, you obviously need to build it from source so the changes take effect. This step can take a lot of time (up to multiple hours depending on your configuration).

I installed the Docker version built for my specific hardware, so it was faster.
So I suggest you make sure it is loading your model and not a default one, and confirm that you installed vLLM from source and that you are actually using that build.
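
As a quick sanity check (standard OpenAI-compatible endpoint; the host and port here assume the compose file below), you can ask the running server which model it actually loaded:

import requests

# Lists the model(s) served by the running vLLM instance
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())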

Here is my docker compose for an RTX 3070:


embedding:
    build:
      context: vllm
      dockerfile: docker/Dockerfile
      target: vllm-openai
      args:
        - max_jobs=8
        - nvcc_threads=2
        - torch_cuda_arch_list=8.6
        - VLLM_USE_PRECOMPILED=1
    environment:
      - DOCKER_BUILDKIT=1
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_HOST=0.0.0.0
      - VLLM_PORT=8000
      - CUDA_HOME=/usr/local/cuda-12.8
      - CUDACXX=/usr/local/cuda-12.8/bin/nvcc
      - LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64
      - HF_HUB_OFFLINE=1
      - TORCH_CUDA_ARCH_LIST=8.6
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    entrypoint: ["vllm", "serve"]
    command:
      - "/root/.cache/huggingface/hub/models--Metric-AI--ColQwen2.5-3b-multilingual-v1.0/snapshots/e2a1c05d053dcf4ad6e39b6c48ced9d6a81071f0"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--runner"
      - "pooling"
      - "--convert"
      - "embed"
      - "--dtype"
      - "bfloat16"
      - "--max-model-len"
      - "1024"
      - "--gpu-memory-utilization"
      - "0.8"
      - "--trust-remote-code"
      - "--quantization"
      - "bitsandbytes"
      - "--override-pooler-config"
      - '{"pooling_type":"ALL","normalize":true}'
      - "--served-model-name"
      - "my-model-name"

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 60s
      timeout: 300s
      retries: 3
    restart: unless-stopped
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      

@HoangTung-Vu

I ran my tests on a cloud instance from Vast.ai. Since it is a virtual container environment, I was not able to use Docker Compose as in your setup.

For the model (ColQwen), I cloned it directly from Hugging Face. I chose the base model so that I could edit the model_class field in config.json. The fine-tuned variants only include adapter configurations, so they were not suitable for this purpose.

When running vLLM, I pointed directly to the local model directory, so I assume it correctly loaded the intended model.

Regarding vLLM itself, I installed it from source using:

pip install -e .

I suspect that the 400 Bad Request error might be caused by an incorrect configuration of the ColQwen model on my side. I’ll review the model setup again to ensure it matches your patch specifications.

@issahammoud

@HoangTung-Vu
I recommend setting HF_HUB_OFFLINE=1 so it will not try to download another model.
Also check the .cache directory to see if there are models you are not aware of.

@HoangTung-Vu

I have rechecked the configuration and reinstalled everything.
However, with the message template above, it works when the image is provided via a URL, but not when using a base64 string.
Do you know why this might be happening? Thank you very much!

@DarkLight1337 (Member) commented Oct 15, 2025

What does your base64 URL look like? Make sure it is in the correct format.

@issahammoud

@HoangTung-Vu
Check the base64 format; I convert a PIL image as follows:

import io
import base64

buffer = io.BytesIO()
img.save(buffer, format="png")
buffer.seek(0)
img_base64 = base64.b64encode(buffer.read()).decode("utf-8")
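
The resulting string is what goes after the data:image/png;base64, prefix in the image_url field shown earlier:

image_url = f"data:image/png;base64,{img_base64}"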


Labels

  • documentation (Improvements or additions to documentation)
  • needs-rebase
  • new-model (Requests to new models)
  • qwen (Related to Qwen models)


Development

Successfully merging this pull request may close these issues.

[New Model]: Support ColQwen2VL

6 participants