
Conversation

@david6666666
Contributor

@david6666666 david6666666 commented Aug 19, 2025

Purpose

Add an option to run the GLM-4.5V vision encoder in a data-parallel manner while the main model runs with tensor parallelism (TP). This can be enabled with the flag --mm-encoder-tp-mode "data".

FIX #23877
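
For offline inference, the same option should be reachable through the Python API. The following is a minimal sketch under the assumption that the LLM constructor forwards mm_encoder_tp_mode to the multimodal config the same way the CLI flag does:

from vllm import LLM, SamplingParams

# Minimal sketch (assumption: LLM() accepts mm_encoder_tp_mode like the CLI flag).
llm = LLM(
    model="zai-org/GLM-4.5V",
    tensor_parallel_size=4,
    mm_encoder_tp_mode="data",  # run the vision encoder data-parallel
)
outputs = llm.chat(
    [{"role": "user", "content": "What is the result of 111 * 5?"}],
    SamplingParams(temperature=0.7, max_tokens=500),
)
print(outputs[0].outputs[0].text)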

Test Plan

TP4 (baseline):

vllm serve zai-org/GLM-4.5V \
     --tensor-parallel-size 4 \
     --tool-call-parser glm45 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --allowed-local-media-path / \
     --media-io-kwargs '{"video": {"num_frames": -1}}'

DP4 (vision encoder in data-parallel mode):

vllm serve zai-org/GLM-4.5V \
     --tensor-parallel-size 4 \
     --tool-call-parser glm45 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --allowed-local-media-path / \
     --media-io-kwargs '{"video": {"num_frames": -1}}' \
     --mm-encoder-tp-mode "data"
1. Run the GLM-4.5V accuracy test with mistral-evals (https://github.com/ywang96/mistral-evals/tree/modify):
python -m eval.run eval_vllm \
        --model_name zai-org/GLM-4.5V \
        --url http://0.0.0.0:8000 \
        --output_dir /glm4_5v \
        --eval_name "mmmu"
cd glm4_5v
python3 parse_result.py
TP4:

==================================
Total questions: 900
Correctly answered: 671
Accuracy: 74.56%
==================================

DP4:

==================================
Total questions: 900
Correctly answered: 674
Accuracy: 74.89%
==================================
2. Run the serving benchmark on GLM-4.5V:
python3 benchmarks/benchmark_serving.py  \
--backend openai-chat   \
--model zai-org/GLM-4.5V   \
--endpoint /v1/chat/completions   \
--dataset-name hf   \
--dataset-path lmarena-ai/VisionArena-Chat   \
--hf-split train   \
--num-prompts 1000 \
--max-concurrency 64

Test Result

TP4:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Maximum request concurrency:             64        
Benchmark duration (s):                  113.64    
Total input tokens:                      90524     
Total generated tokens:                  127011    
Request throughput (req/s):              8.80      
Output token throughput (tok/s):         1117.61   
Total Token throughput (tok/s):          1914.17   
---------------Time to First Token----------------
Mean TTFT (ms):                          1318.74   
Median TTFT (ms):                        1289.14   
P99 TTFT (ms):                           3283.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.46     
Median TPOT (ms):                        45.21     
P99 TPOT (ms):                           63.08     
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.39     
Median ITL (ms):                         26.94     
P99 ITL (ms):                            602.48    
==================================================

DP4:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Maximum request concurrency:             64        
Benchmark duration (s):                  104.64    
Total input tokens:                      90524     
Total generated tokens:                  127136    
Request throughput (req/s):              9.56      
Output token throughput (tok/s):         1215.00   
Total Token throughput (tok/s):          2080.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          1140.35   
Median TTFT (ms):                        984.99    
P99 TTFT (ms):                           5460.47   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.27     
Median TPOT (ms):                        43.30     
P99 TPOT (ms):                           56.11     
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.88     
Median ITL (ms):                         25.77     
P99 ITL (ms):                            486.12    
==================================================

Single request:
Text:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.5V",
    "messages": [
      {
        "role": "user",
        "content": "What is the result of 111 * 5?"
      }
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'

{"id":"chatcmpl-5ecb8dc8dbab4c0f9cbf78b59bb921fe","object":"chat.completion","created":1756366966,"model":"zai-org/GLM-4.5V","choices":[{"index":0,"message":{"role":"assistant","content":"\nTo calculate \\(111 \\times 5\\), you can break it down using the distributive property:  \n\\[\n111 \\times 5 = (100 + 10 + 1) \\times 5 = 100 \\times 5 + 10 \\times 5 + 1 \\times 5 = 500 + 50 + 5 = 555.\n\\]  \nAlternatively, multiplying digit by digit:  \n- \\(1 \\times 5 = 5\\) (units place),  \n- \\(1 \\times 5 = 5\\) (tens place),  \n- \\(1 \\times 5 = 5\\) (hundreds place","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":"I need to calculate 111 multiplied by 5. Let me think about how to do this step by step. First, I know that multiplying by 5 is the same as multiplying by 10 and then dividing by 2, but maybe a simpler way is to just do the multiplication directly.\n\nSo, 111 times 5. I can break it down: 100 times 5 is 500, 10 times 5 is 50, and 1 times 5 is 5. Then add those together: 500 + 50 is 550, plus 5 is 555. That seems right.\n\nAlternatively, I can think of it as 111 * 5. Let's do the multiplication digit by digit. Starting from the right: 1 * 5 is 5. Then the next digit is 1, so 1 * 5 is 5, and the last digit is 1, so 1 * 5 is 5. So putting it together, it's 555. Yeah, that matches what I got before.\n\nI could also use the distributive property: 111 * 5 = (100 + 10 + 1) * 5 = 100*5 + 10*5 + 1*5 = 500 + 50 + 5 = 555. Same result.\n\nI think that's correct. Let me just verify with another method. If I add 111 five times: 111 + 111 is 222, plus another 111 is 333, plus another 111 is 444, plus the last 111 is 555. Yep, that works too.\n\nSo all methods lead to 555. I'm confident that's the answer."},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":17,"total_tokens":517,"completion_tokens":500,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Image:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.5V",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What do you see in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

{"id":"chatcmpl-1eb19d594b184209b0b287992ee35dd1","object":"chat.completion","created":1756366983,"model":"zai-org/GLM-4.5V","choices":[{"index":0,"message":{"role":"assistant","content":"\nIn the image, there is a duck swimming on a body of water. The duck has a vibrant green head, a bright yellow beak, and its body features a mix of brown, white, and gray feathers. The water is a deep blue with gentle ripples, and the duck’s reflection is visible on the surface. The overall scene captures the duck in a natural aquatic environment.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":"Got it, let's see. The image shows a duck swimming in water. First, I need to describe the duck's features. The duck has a green head, a yellow beak, and its body is a mix of brown, white, and maybe some other colors. The water is blue with ripples, and there's a reflection of the duck in the water. So I should mention the duck's appearance, the water, and the reflection. Let me structure that."},"logprobs":null,"finish_reason":"stop","stop_reason":151336,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":6092,"total_tokens":6271,"completion_tokens":179,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Video:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.5V",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What activities are happening in this video?"
          },
          {
            "type": "video_url",
            "video_url": {
              "url": "https://content.pexels.com/videos/free-videos.mp4"
            }
          }
        ]
      }
    ],
    "max_tokens": 500
  }'

{"id":"chatcmpl-1bf2c80ca6234144912915fc28d528cb","object":"chat.completion","created":1756367006,"model":"zai-org/GLM-4.5V","choices":[{"index":0,"message":{"role":"assistant","content":"The video shows several distinct activities across different scenes:\n\n1.  **Ocean Waves Crashing on Rocks:** The video begins with aerial footage of turquoise ocean waves crashing against dark, jagged rocks along a coastline.\n2.  **Driving on a Winding Road:** It then shows an aerial view of cars driving on a winding road that cuts through lush, green, terraced fields, likely tea plantations.\n3.  **Using a Smartphone:** A close-up shot shows a person holding and using a red smartphone.\n4.  **Driving Through a Desert:** The video includes aerial footage of vehicles driving on a paved road through a vast, sandy desert landscape.\n5.  **Playing Basketball:** The final scenes feature a person in a basketball jersey on an outdoor court, holding a basketball and appearing to be in the middle of a game or practice.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":151336,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":29728,"total_tokens":29900,"completion_tokens":172,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

(Optional) Documentation Update


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337
Member

DarkLight1337 commented Aug 19, 2025

We are in the process of merging #22742 which is quite similar in idea to this, except that DP is applied on the whole vision encoder. Perhaps you could adapt your PR to use the helper functions from that PR.

@david6666666
Contributor Author

We are in the process of merging #22742 which is quite similar in idea to this, except that DP is applied on the whole vision encoder. Perhaps you could adapt your PR to use the helper functions from that PR.

Thank you. I will adapt my PR later.

@david6666666 david6666666 force-pushed the dp_vit_glm45v branch 3 times, most recently from e06086f to fbc3bf3 Compare August 20, 2025 02:19
@david6666666 david6666666 marked this pull request as ready for review August 20, 2025 02:26

@DarkLight1337
Member

The code looks good, I'll merge the PR in once you have posted the benchmark results. (Please ping me if you have done this!)

@david6666666 david6666666 force-pushed the dp_vit_glm45v branch 3 times, most recently from 4fab348 to daf1936 Compare August 20, 2025 07:21
@david6666666 david6666666 force-pushed the dp_vit_glm45v branch 2 times, most recently from b76be08 to b87b83b Compare August 21, 2025 08:00
@david6666666
Contributor Author

We are in the process of merging #22742 which is quite similar in idea to this, except that DP is applied on the whole vision encoder. Perhaps you could adapt your PR to use the helper functions from that PR.

run_dp_sharded_mrope_vision_model does not match GLM-4.5V; the ViT's grid_thw type needs some time for adaptation and verification.

@david6666666 david6666666 requested a review from ywang96 as a code owner August 22, 2025 07:56
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Aug 22, 2025
@david6666666 david6666666 changed the title [Model] Support dp on ViT on GLM-4.5V [WIP][Model] Support dp on ViT on GLM-4.5V Aug 22, 2025
@david6666666 david6666666 requested a review from hmellor as a code owner August 22, 2025 08:12
@mergify mergify bot added the documentation Improvements or additions to documentation label Aug 22, 2025
@david6666666 david6666666 changed the title [WIP][Model] Support dp on ViT on GLM-4.5V [Model] Support dp on ViT on GLM-4.5V Aug 28, 2025
@david6666666 david6666666 changed the title [Model] Support dp on ViT on GLM-4.5V [WIP][Model] Support dp on ViT on GLM-4.5V Aug 28, 2025
Member

I think it might actually be faster to use Python built-ins here instead of torch.Tensor, because grid_thw_list is pretty small. But you should profile this.

Contributor Author

The Glm4vVisionTransformer forward signature is:

    def forward(
        self,
        x: torch.Tensor,
        grid_thw: torch.Tensor,
    ) -> torch.Tensor:

so I process it directly as a torch.Tensor without converting it.

Member

Can you try calling .tolist() before passing it into this method and see if it improves the performance?
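
For reference, here is a small self-contained sketch of the two approaches being compared; the grid_thw values and merge_size are made up for illustration:

import math
import torch

# Made-up example: grid_thw holds (t, h, w) for two items.
grid_thw = torch.tensor([[1, 34, 52], [1, 24, 36]])
merge_size = 2

# Tensor path: reduce on the tensor, convert to a Python list at the end.
sizes_tensor = (grid_thw.prod(-1) // merge_size // merge_size).tolist()

# Built-ins path: call .tolist() once up front, then use plain Python.
grid_thw_list = grid_thw.tolist()
sizes_builtin = [math.prod(thw) // (merge_size * merge_size) for thw in grid_thw_list]

assert sizes_tensor == sizes_builtin  # [442, 216] in this example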

Contributor Author

ok, I will try.

Contributor Author

Can you try calling .tolist() before passing it into this method and see if it improves the performance?

done

@david6666666 david6666666 changed the title [WIP][Model] Support dp on ViT on GLM-4.5V [Model] Support dp on ViT on GLM-4.5V Aug 29, 2025
@david6666666 david6666666 changed the title [Model] Support dp on ViT on GLM-4.5V [WIP][Model] Support dp on ViT on GLM-4.5V Sep 2, 2025
Comment on lines 1491 to 1494:

    # Split concatenated embeddings for each video item.
    merge_size = self.visual.spatial_merge_size
    sizes = grid_thw.prod(-1) // merge_size // merge_size
    return video_embeds.split(sizes.tolist())
Member

This part can be factored out to be more similar to the original code (since the branch with use_data_parallel returns early anyway).
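
One possible shape for that refactor, sketched as a hypothetical helper method on the model class (the name and exact placement are illustrative, not the PR's actual code):

def _split_by_grid(self, embeds: torch.Tensor,
                   grid_thw: torch.Tensor) -> tuple[torch.Tensor, ...]:
    # Hypothetical helper: split the concatenated embeddings into one chunk
    # per item, using the merged-patch count derived from grid_thw.
    merge_size = self.visual.spatial_merge_size
    sizes = grid_thw.prod(-1) // merge_size // merge_size
    return embeds.split(sizes.tolist())

Both the image and video paths could call such a helper, so the data-parallel branch can still return early without duplicating the split logic.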

@david6666666 david6666666 changed the title [WIP][Model] Support dp on ViT on GLM-4.5V [Model] Support dp on ViT on GLM-4.5V Sep 2, 2025
Member

@DarkLight1337 DarkLight1337 left a comment

LGTM, do you have further changes to make?

@david6666666
Contributor Author

LGTM, do you have further changes to make?

No, that's it.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 2, 2025 08:44
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 2, 2025
@DarkLight1337 DarkLight1337 merged commit 2f0bab3 into vllm-project:main Sep 2, 2025
50 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025