[V1] Support Deepseek MTP #18435
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
@YaoJiayi Thanks for updating the PR! It looks good to me overall.
Left some comments. Please take a look.
@YaoJiayi LGTM except for the minor issue above. Could you please run the DeepSeek model locally and check that it generates reasonable output with a reasonable acceptance rate?
@WoosukKwon I tested on DeepSeek-R1 with 10 simple prompts. The outputs are reasonable and the acceptance rates are 30-70%.
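For readers unfamiliar with the metric, here is a minimal sketch of how an acceptance rate like the one quoted above could be computed. It assumes the rate means accepted draft tokens divided by proposed draft tokens. The `SpecDecodeCounts` class, both function names, and the sample numbers are hypothetical illustrations, not part of vLLM or of the test run reported in this thread.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SpecDecodeCounts:
    """Hypothetical per-prompt counters; not a vLLM data structure."""
    num_draft_tokens: int     # tokens proposed by the draft (MTP) module
    num_accepted_tokens: int  # proposed tokens the target model accepted


def acceptance_rate(counts: SpecDecodeCounts) -> float:
    """Fraction of proposed draft tokens that were accepted."""
    if counts.num_draft_tokens == 0:
        return 0.0  # nothing was proposed, so report zero rather than divide by zero
    return counts.num_accepted_tokens / counts.num_draft_tokens


def aggregate_acceptance_rate(per_prompt: Iterable[SpecDecodeCounts]) -> float:
    """Token-weighted acceptance rate across several prompts."""
    items = list(per_prompt)
    total = SpecDecodeCounts(
        num_draft_tokens=sum(c.num_draft_tokens for c in items),
        num_accepted_tokens=sum(c.num_accepted_tokens for c in items),
    )
    return acceptance_rate(total)


if __name__ == "__main__":
    # Made-up counts for two prompts, chosen only to illustrate the arithmetic.
    runs = [SpecDecodeCounts(10, 3), SpecDecodeCounts(10, 7)]
    for run in runs:
        print(f"per-prompt acceptance rate: {acceptance_rate(run):.0%}")
    print(f"aggregate acceptance rate: {aggregate_acceptance_rate(runs):.0%}")
```

Summing the counters before dividing weights each prompt by how many tokens were proposed for it, which may differ from a plain average of per-prompt rates when prompts have different lengths.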
@YaoJiayi Great! Thanks for the amazing work!
PTAL at the failing V1 test.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Co-authored-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
<shanea@allenai.org> Co-authored-by: aws-elaineyz <elaineyz@amazon.com> Co-authored-by: Shashwat Srijan <sssrijan@amazon.com> Co-authored-by: Aakash Shetty <sheaak@amazon.com> Co-authored-by: Tailin Pan <tailinpa@amazon.com> Co-authored-by: Rishabh Rajesh <rishyraj@amazon.com> Co-authored-by: Yishan McNabb <yishanm@amazon.com> Co-authored-by: Patrick Lange <patlange@amazon.com> Co-authored-by: Maxwell Goldberg <mgld@amazon.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: lkchen <github@lkchen.net> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: CYJiang <86391540+googs1025@users.noreply.github.com> Co-authored-by: Bowen Wang <abmfy@icloud.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com> Co-authored-by: David Xia <david@davidxia.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com> Co-authored-by: Kai Wu <kaiwu@meta.com> Co-authored-by: Sanger Steel <sangersteel@gmail.com> Co-authored-by: rasmith <Randall.Smith@amd.com> Co-authored-by: Chenheli Hua <huachenheli@outlook.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Teruaki Ishizaki <tell.ishi@gmail.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: RonaldBXu <72748153+RonaldBXu@users.noreply.github.com> Co-authored-by: cascade <cascade812@outlook.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Yuqi Zhang <zhangyuqi94@gmail.com> Co-authored-by: Yuqi Zhang <yuqizhang@google.com> Co-authored-by: Madeesh Kannan 
<shadeMe@users.noreply.github.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Tristan Leclercq <49700633+tristanleclercq@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Jiayi Yao <82156730+YaoJiayi@users.noreply.github.com> Co-authored-by: Rui Qiao <ruisearch42@gmail.com> Co-authored-by: Huy Do <huydhn@gmail.com> Co-authored-by: Pavani Majety <pmajety@nvidia.com> Co-authored-by: Feng XiaoLong <79261065+Crucifixion-Fxl@users.noreply.github.com> Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Mathieu Borderé <mathieu@bordere.org> Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com> Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com> Co-authored-by: Yuanhao WU <Nalkey@users.noreply.github.com> Co-authored-by: ztang2370 <ztang2370@gmail.com> Co-authored-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Co-authored-by: Chenguang Li <757486878@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: AlexZhao <zhaohaidao2008@hotmail.com> Co-authored-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com> Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com> Co-authored-by: Naveassaf <55059536+Naveassaf@users.noreply.github.com> Co-authored-by: Łukasz Durejko <lukasz.durejko@intel.com> Co-authored-by: dylan <xuhao296@qq.com> Co-authored-by: almersawi <43927639+almersawi@users.noreply.github.com> Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai> Co-authored-by: Łukasz Durejko <ldurejko@habana.ai> Co-authored-by: maobaolong <baoloongmao@tencent.com> Co-authored-by: Shawn Huang <57223022+huangyuxiang03@users.noreply.github.com> Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com> Co-authored-by: chunxiaozheng <55471457+chunxiaozheng@users.noreply.github.com>
@YaoJiayi Hi, could you please tell me how to use this feature? Is it compatible with pipeline parallelism? Can it be enabled simply by passing a speculative config to the serve command, like this:
vllm serve deepseek-ai/DeepSeek-R1 \
--max-num-seqs=80 \
--max-model-len=8192 \
--max-num-batched-tokens=16384 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--enable-expert-parallel \
--enable-chunked-prefill \
--enable-prefix-caching \
--disable-log-requests \
--distributed-executor-backend ray \
--swap-space=64 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--trust-remote-code \
--served-model-name deepseek-r1 \
--speculative-config='{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
Thanks~ |
PP should be supported. And yes, there's no need to load the MTP module separately: the DeepSeek model weights already contain the MTP layer. |
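For anyone scripting this launch, here is a minimal sketch of building the same `--speculative-config` value programmatically, so the JSON quoting cannot go wrong. It only reuses flags and values that appear in the question above; the helper name `build_serve_command` is hypothetical, and nothing here checks whether a given vLLM build accepts these options.

```python
import json
import shlex


def build_serve_command(model: str, num_speculative_tokens: int = 1) -> list[str]:
    """Assemble an argv list for `vllm serve` with MTP speculative decoding.

    Hypothetical helper for illustration only; the flags are copied from the
    command in the question, not checked against any vLLM release.
    """
    # No separate draft model is named: per the reply above, the MTP layer
    # ships inside the DeepSeek weights.
    speculative_config = {
        "method": "deepseek_mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "8",
        "--trust-remote-code",
        # json.dumps guarantees valid JSON; shlex.join below handles shell quoting
        "--speculative-config", json.dumps(speculative_config),
    ]


argv = build_serve_command("deepseek-ai/DeepSeek-R1")
print(shlex.join(argv))
```

The list form can be passed to `subprocess.run` directly, while the printed string is safe to paste into a shell.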
But I got this error:
2025-05-29T12:13:57.588522996+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] Error executing method 'determine_available_memory'. This might cause deadlock
2025-05-29T12:13:57.588527790+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] Traceback (most recent call last):
2025-05-29T12:13:57.588532997+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", li
2025-05-29T12:13:57.588537808+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.588549892+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.588554744+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in r
2025-05-29T12:13:57.588559691+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] return func(*args, **kwargs)
2025-05-29T12:13:57.588564615+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.588569423+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", li
2025-05-29T12:13:57.588574175+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] return func(*args, **kwargs)
2025-05-29T12:13:57.588579216+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.588584871+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py",
2025-05-29T12:13:57.588589687+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] self.model_runner.profile_run()
2025-05-29T12:13:57.588605486+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner
2025-05-29T12:13:57.588610630+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.588616214+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591285563+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", li
2025-05-29T12:13:57.591293605+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] return func(*args, **kwargs)
2025-05-29T12:13:57.591299061+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591304189+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner
2025-05-29T12:13:57.591309957+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.591315505+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] ^^^^^^^^^^^^
2025-05-29T12:13:57.591321524+08:00 (RayWorkerWrapper pid=1024) ERROR 05-29 12:13:57 [worker_base.py:620] AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.591327328+08:00 ERROR 05-29 12:13:57 [core.py:500] EngineCore failed to start.
2025-05-29T12:13:57.591334275+08:00 ERROR 05-29 12:13:57 [core.py:500] Traceback (most recent call last):
2025-05-29T12:13:57.591339733+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
2025-05-29T12:13:57.591344574+08:00 ERROR 05-29 12:13:57 [core.py:500] engine_core = EngineCoreProc(*args, **kwargs)
2025-05-29T12:13:57.591355508+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591360667+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in __init__
2025-05-29T12:13:57.591365460+08:00 ERROR 05-29 12:13:57 [core.py:500] super().__init__(vllm_config, executor_class, log_stats,
2025-05-29T12:13:57.591370143+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 78, in __init__
2025-05-29T12:13:57.591375203+08:00 ERROR 05-29 12:13:57 [core.py:500] self._initialize_kv_caches(vllm_config)
2025-05-29T12:13:57.591379975+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 137, in _initialize_kv_caches
2025-05-29T12:13:57.591384939+08:00 ERROR 05-29 12:13:57 [core.py:500] available_gpu_memory = self.model_executor.determine_available_memory()
2025-05-29T12:13:57.591389521+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591394666+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
2025-05-29T12:13:57.591399473+08:00 ERROR 05-29 12:13:57 [core.py:500] output = self.collective_rpc("determine_available_memory")
2025-05-29T12:13:57.591404495+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591409200+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
2025-05-29T12:13:57.591414538+08:00 ERROR 05-29 12:13:57 [core.py:500] return self._run_workers(method, *args, **(kwargs or {}))
2025-05-29T12:13:57.591420322+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591424970+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 521, in _run_workers
2025-05-29T12:13:57.591429786+08:00 ERROR 05-29 12:13:57 [core.py:500] ray_worker_outputs = ray.get(ray_worker_outputs)
2025-05-29T12:13:57.591434222+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591439178+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
2025-05-29T12:13:57.591444878+08:00 ERROR 05-29 12:13:57 [core.py:500] return fn(*args, **kwargs)
2025-05-29T12:13:57.591449410+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591454210+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
2025-05-29T12:13:57.591459385+08:00 ERROR 05-29 12:13:57 [core.py:500] return func(*args, **kwargs)
2025-05-29T12:13:57.591464533+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591469293+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2822, in get
2025-05-29T12:13:57.591474195+08:00 ERROR 05-29 12:13:57 [core.py:500] values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
2025-05-29T12:13:57.591488500+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591493791+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 930, in get_objects
2025-05-29T12:13:57.591504489+08:00 ERROR 05-29 12:13:57 [core.py:500] raise value.as_instanceof_cause()
2025-05-29T12:13:57.591509720+08:00 ERROR 05-29 12:13:57 [core.py:500] ray.exceptions.RayTaskError(AttributeError): ray::RayWorkerWrapper.execute_method() (pid=1024, ip=172.20.91.178,
2025-05-29T12:13:57.591514680+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591519756+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591524496+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.591530368+08:00 ERROR 05-29 12:13:57 [core.py:500] raise e
2025-05-29T12:13:57.591535183+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.591539943+08:00 ERROR 05-29 12:13:57 [core.py:500] return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.591544724+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591549499+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.591554189+08:00 ERROR 05-29 12:13:57 [core.py:500] return func(*args, **kwargs)
2025-05-29T12:13:57.591559063+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591563847+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.591568729+08:00 ERROR 05-29 12:13:57 [core.py:500] return func(*args, **kwargs)
2025-05-29T12:13:57.591573414+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591578069+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.591583058+08:00 ERROR 05-29 12:13:57 [core.py:500] self.model_runner.profile_run()
2025-05-29T12:13:57.591587810+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.591592674+08:00 ERROR 05-29 12:13:57 [core.py:500] hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.591597298+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591602346+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.591607406+08:00 ERROR 05-29 12:13:57 [core.py:500] return func(*args, **kwargs)
2025-05-29T12:13:57.591612166+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.591616595+08:00 ERROR 05-29 12:13:57 [core.py:500] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1744, in _dummy_run
2025-05-29T12:13:57.591622957+08:00 ERROR 05-29 12:13:57 [core.py:500] assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.591627701+08:00 ERROR 05-29 12:13:57 [core.py:500] ^^^^^^^^^^^^
2025-05-29T12:13:57.591632667+08:00 ERROR 05-29 12:13:57 [core.py:500] AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.591975948+08:00 Process EngineCore_0:
2025-05-29T12:13:57.593798795+08:00 Traceback (most recent call last):
2025-05-29T12:13:57.594594089+08:00 File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
2025-05-29T12:13:57.594601413+08:00 self.run()
2025-05-29T12:13:57.594607309+08:00 File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
2025-05-29T12:13:57.594612477+08:00 self._target(*self._args, **self._kwargs)
2025-05-29T12:13:57.594684775+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 504, in run_engine_core
2025-05-29T12:13:57.594715490+08:00 raise e
2025-05-29T12:13:57.595203986+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
2025-05-29T12:13:57.595215486+08:00 engine_core = EngineCoreProc(*args, **kwargs)
2025-05-29T12:13:57.595220354+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595226738+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in __init__
2025-05-29T12:13:57.595231680+08:00 super().__init__(vllm_config, executor_class, log_stats,
2025-05-29T12:13:57.595246137+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 78, in __init__
2025-05-29T12:13:57.595251314+08:00 self._initialize_kv_caches(vllm_config)
2025-05-29T12:13:57.595256293+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 137, in _initialize_kv_caches
2025-05-29T12:13:57.595262209+08:00 available_gpu_memory = self.model_executor.determine_available_memory()
2025-05-29T12:13:57.595267479+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595279188+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
2025-05-29T12:13:57.595284617+08:00 output = self.collective_rpc("determine_available_memory")
2025-05-29T12:13:57.595290016+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595300546+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
2025-05-29T12:13:57.595305444+08:00 return self._run_workers(method, *args, **(kwargs or {}))
2025-05-29T12:13:57.595311025+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595315847+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 521, in _run_workers
2025-05-29T12:13:57.595320512+08:00 ray_worker_outputs = ray.get(ray_worker_outputs)
2025-05-29T12:13:57.595325247+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595335067+08:00 File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
2025-05-29T12:13:57.595341027+08:00 return fn(*args, **kwargs)
2025-05-29T12:13:57.595346425+08:00 ^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595351406+08:00 File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
2025-05-29T12:13:57.595357326+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.595362430+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595374546+08:00 File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2822, in get
2025-05-29T12:13:57.595379950+08:00 values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
2025-05-29T12:13:57.595385016+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595398036+08:00 File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 930, in get_objects
2025-05-29T12:13:57.595402849+08:00 raise value.as_instanceof_cause()
2025-05-29T12:13:57.595415493+08:00 ray.exceptions.RayTaskError(AttributeError): ray::RayWorkerWrapper.execute_method() (pid=1024, ip=172.20.91.178, actor_id=527d362287d5fc8d5ed1643a01
2025-05-29T12:13:57.595420509+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595424949+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595429653+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.595435028+08:00 raise e
2025-05-29T12:13:57.595439668+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.595444635+08:00 return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.595449329+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595453954+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.595458834+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.595463488+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595468384+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.595473007+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.595477574+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595482348+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.595487088+08:00 self.model_runner.profile_run()
2025-05-29T12:13:57.595492032+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.595496803+08:00 hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.595501407+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595506781+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.595511962+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.595516572+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.595521711+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1744, in _dummy_run
2025-05-29T12:13:57.595526426+08:00 assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.595531067+08:00 ^^^^^^^^^^^^
2025-05-29T12:13:57.595536347+08:00 AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.596537258+08:00 2025-05-29 12:13:57,596 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method()
2025-05-29T12:13:57.596548114+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.596553193+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.596559937+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.596565809+08:00 raise e
2025-05-29T12:13:57.596570777+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.596575647+08:00 return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.596604554+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.596610041+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.596615000+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.596620424+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.596625604+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.596630466+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.596635228+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.596639975+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.596644987+08:00 self.model_runner.profile_run()
2025-05-29T12:13:57.596650140+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.596655183+08:00 hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.596659993+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.596665295+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.596669814+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.596674754+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.596679779+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1744, in _dummy_run
2025-05-29T12:13:57.596684525+08:00 assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.596689314+08:00 ^^^^^^^^^^^^
2025-05-29T12:13:57.596694670+08:00 AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.597112916+08:00 2025-05-29 12:13:57,596 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method()
2025-05-29T12:13:57.597125387+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597131236+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597136868+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.597142612+08:00 raise e
2025-05-29T12:13:57.597147699+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.597152731+08:00 return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.597157853+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597163551+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.597170215+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.597175002+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597180322+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.597184891+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.597189854+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597194906+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.597199526+08:00 self.model_runner.profile_run()
2025-05-29T12:13:57.597204506+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.597219241+08:00 hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.597223991+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597230911+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.597236049+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.597241193+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597245984+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1744, in _dummy_run
2025-05-29T12:13:57.597250649+08:00 assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.597255616+08:00 ^^^^^^^^^^^^
2025-05-29T12:13:57.597260689+08:00 AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.597602408+08:00 2025-05-29 12:13:57,597 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method()
2025-05-29T12:13:57.597610664+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597616112+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597621375+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.597626682+08:00 raise e
2025-05-29T12:13:57.597631775+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.597636681+08:00 return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.597641383+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597646447+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.597651414+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.597656214+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597660974+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.597665990+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.597670994+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597675975+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.597680761+08:00 self.model_runner.profile_run()
2025-05-29T12:13:57.597685515+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.597690408+08:00 hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.597695248+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597699926+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.597704614+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.597709527+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.597714629+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1744, in _dummy_run
2025-05-29T12:13:57.597720845+08:00 assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.597725699+08:00 ^^^^^^^^^^^^
2025-05-29T12:13:57.597730745+08:00 AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.598035854+08:00 2025-05-29 12:13:57,597 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method()
2025-05-29T12:13:57.598047441+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598053119+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598058813+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.598064353+08:00 raise e
2025-05-29T12:13:57.598069644+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.598074720+08:00 return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.598079473+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598084609+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.598090270+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.598095248+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598100132+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.598105212+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.598109603+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598114310+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.598119468+08:00 self.model_runner.profile_run()
2025-05-29T12:13:57.598124232+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.598129184+08:00 hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.598134369+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598139231+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.598144135+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.598148847+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598153878+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1744, in _dummy_run
2025-05-29T12:13:57.598158655+08:00 assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.598163402+08:00 ^^^^^^^^^^^^
2025-05-29T12:13:57.598168458+08:00 AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.598421272+08:00 2025-05-29 12:13:57,598 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method()
2025-05-29T12:13:57.598427722+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598432472+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598437268+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.598442580+08:00 raise e
2025-05-29T12:13:57.598447611+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.598452504+08:00 return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.598457500+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598468259+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.598473043+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.598477731+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598482775+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.598487285+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.598491801+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598496710+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.598501602+08:00 self.model_runner.profile_run()
2025-05-29T12:13:57.598506295+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.598510814+08:00 hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.598515260+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598519914+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.598524422+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.598529529+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598534489+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1744, in _dummy_run
2025-05-29T12:13:57.598539505+08:00 assert isinstance(self.drafter, EagleProposer)
2025-05-29T12:13:57.598544599+08:00 ^^^^^^^^^^^^
2025-05-29T12:13:57.598549333+08:00 AttributeError: 'GPUModelRunner' object has no attribute 'drafter'
2025-05-29T12:13:57.598954455+08:00 2025-05-29 12:13:57,598 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method()
2025-05-29T12:13:57.598962051+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598967084+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598971847+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
2025-05-29T12:13:57.598977192+08:00 raise e
2025-05-29T12:13:57.598982175+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
2025-05-29T12:13:57.598987147+08:00 return run_method(self, method, args, kwargs)
2025-05-29T12:13:57.598992302+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.598997809+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
2025-05-29T12:13:57.599003786+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.599008862+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.599013518+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.599018102+08:00 return func(*args, **kwargs)
2025-05-29T12:13:57.599022837+08:00 ^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.599029281+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
2025-05-29T12:13:57.599055767+08:00 self.model_runner.profile_run()
2025-05-29T12:13:57.599060481+08:00 File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
2025-05-29T12:13:57.599065427+08:00 hidden_states = self._dummy_run(self.max_num_tokens)
2025-05-29T12:13:57.599078461+08:00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-29T12:13:57.599083201+08:00 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-05-29T12:13:57.599088076+08:00 return func(*args, **kwargs)
|
Is the configuration for MTP spec is |
@YaoJiayi Hi, I have a problem. My running command is:
I got the error 'RuntimeError: Worker failed with error ''GPUModelRunner' object has no attribute 'attn_metadata_builder'', please check the stack trace above for the root cause' when I sent a request to the vLLM server. I am running the server on 8 x H20 with vllm==0.9.0.1 installed. My request looks like this: curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "deepseek-r1", "prompt": "China is", "max_tokens": 30, "temperature": 0, "stream": true }' The server logged: INFO 06-04 10:14:43 [async_llm.py:261] Added request cmpl-e8f26a6c69014834ab68bc10a77149f1-0. Thanks~ |
I hit the same problem as @mahaocong90. Does anybody know why? |
Got the same error here with DeepSeek MTP + V1 engine, using the MLA attention backend. |
|
I tried changing line 144 in eagle.py to self.runner.attn_metadata_builders[0].build, but got another error. Does it support MTP size > 1 for now, or only MTP size = 1? |
For now, it should be 1 because the number of MTP layers is 1. I haven't tested MTP size > 1, but it should be easy to support. |
Thanks! With MTP size = 1 and the quick fix above, it works fine for now, but I guess a formal patch to align the syntax is still needed. @rain7996 @mahaocong90 |
Yes, it works. Thanks~ |
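The workaround discussed above (pointing the call in eagle.py at `attn_metadata_builders[0]` instead of `attn_metadata_builder`) could be sketched as a small compatibility lookup. This is only an illustration under stated assumptions, not the actual vLLM patch: the two attribute names come from the error message and the comment above, while `resolve_attn_metadata_builder`, `SingleBuilderRunner`, and `MultiBuilderRunner` are hypothetical names made up for this example, and picking index 0 simply mirrors what was tried in the thread.

```python
from typing import Any


def resolve_attn_metadata_builder(runner: Any, index: int = 0) -> Any:
    """Return an attention metadata builder from a runner-like object.

    Hypothetical helper: it accepts either the singular attribute
    (`attn_metadata_builder`) or the plural list (`attn_metadata_builders`),
    which are the two names mentioned in the thread.
    """
    builder = getattr(runner, "attn_metadata_builder", None)
    if builder is not None:
        return builder
    builders = getattr(runner, "attn_metadata_builders", None)
    if builders:
        # Index 0 mirrors the workaround tried above; whether that is
        # always the right builder is an assumption.
        return builders[index]
    raise AttributeError(
        "runner exposes neither 'attn_metadata_builder' "
        "nor a non-empty 'attn_metadata_builders'"
    )


class SingleBuilderRunner:
    """Hypothetical stand-in for a runner with one builder attribute."""

    def __init__(self, builder: Any) -> None:
        self.attn_metadata_builder = builder


class MultiBuilderRunner:
    """Hypothetical stand-in for a runner with a list of builders."""

    def __init__(self, builders: list) -> None:
        self.attn_metadata_builders = list(builders)


if __name__ == "__main__":
    print(resolve_attn_metadata_builder(SingleBuilderRunner("single")))
    print(resolve_attn_metadata_builder(MultiBuilderRunner(["first", "second"])))
```

A lookup like this only hides the naming mismatch; as noted above, a formal fix would still need to align the proposer with whatever attribute the model runner actually defines.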
Tested on Deepseek R1 with (1) TP=8 and (2) TP=4 * PP=2.
TODOs: