Automatically configure KV cache size #6

WoosukKwon · 2023-03-03T10:05:40Z

This PR adds an OPT memory analyzer to the system and uses it to automatically determine the KV cache size.
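As a rough illustration of the idea (not the code in this PR), the cache size can be derived by subtracting weight and activation memory from the usable device memory and dividing what is left by the size of one KV block. Everything named here is hypothetical: `OPTShape`, the helper functions, the `utilization` factor, and the crude activation bound. The OPT-13B dimensions are what I believe the published config uses, so treat them as an assumption too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OPTShape:
    """Hypothetical container for the model dimensions the estimate needs."""
    num_layers: int
    hidden_size: int
    ffn_dim: int
    vocab_size: int
    max_position: int = 2048
    dtype_bytes: int = 2  # fp16


def param_bytes(m: OPTShape) -> int:
    """Approximate weight memory of a decoder-only OPT-style model."""
    h, f = m.hidden_size, m.ffn_dim
    embeddings = (m.vocab_size + m.max_position) * h
    attention = 4 * (h * h + h)        # q, k, v, out projections with bias
    mlp = (h * f + f) + (f * h + h)    # two linear layers with bias
    layer_norms = 2 * 2 * h            # two LayerNorms (weight + bias)
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * h
    return (embeddings + m.num_layers * per_layer + final_norm) * m.dtype_bytes


def kv_block_bytes(m: OPTShape, block_size: int) -> int:
    """One cache block: keys and values, all layers, block_size tokens."""
    return 2 * m.num_layers * block_size * m.hidden_size * m.dtype_bytes


def activation_bytes(m: OPTShape, max_num_batched_tokens: int) -> int:
    """Crude bound on transient activations (an assumption, not measured)."""
    per_token = 4 * m.hidden_size + m.ffn_dim
    return max_num_batched_tokens * per_token * m.dtype_bytes


def num_gpu_blocks(m: OPTShape, gpu_bytes: int, block_size: int = 16,
                   max_num_batched_tokens: int = 2560,
                   utilization: float = 0.95) -> int:
    """Blocks that fit after weights and activations are accounted for."""
    budget = (int(gpu_bytes * utilization)
              - param_bytes(m)
              - activation_bytes(m, max_num_batched_tokens))
    return max(budget // kv_block_bytes(m, block_size), 0)


def num_cpu_blocks(m: OPTShape, swap_bytes: int, block_size: int = 16) -> int:
    """Blocks that fit in a given amount of host swap space."""
    return swap_bytes // kv_block_bytes(m, block_size)


# Assumed OPT-13B dimensions; 40 GiB is just an example device size.
OPT_13B = OPTShape(num_layers=40, hidden_size=5120, ffn_dim=20480, vocab_size=50272)
GIB = 1 << 30
print(num_gpu_blocks(OPT_13B, 40 * GIB), num_cpu_blocks(OPT_13B, 20 * GIB))
```

An analytical estimate like this is only as good as its activation bound; profiling a real forward pass is the safer way to obtain that term.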

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100

* finish changing scheduler
* finish merge
* fix model
* Fix (vllm-project#5)
* fix problems
* fix
* delete unused params
* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>

* print numforward
* print forward context
* print forward context
* print forward context
* print forward context
* synchronize after Ulysses in graph capture
* synchronize after Ulysses in graph capture
* synchronize NCCL operations (all-to-all) for graph capture

### What this PR does / why we need it?

This PR refactors the model runner to decouple it from the classes designed specifically for GPU.

The changes to the model runner are shown below:

![iShot_2025-01-20_21 32 37](https://github.com/user-attachments/assets/e7e14e5f-5367-42cf-bc82-abff35cd73b9)

**Other changes:** I have removed the code for `cuda`, `lora` and `prompt adapter`, because NPU doesn't support them now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I used `AI-ModelScope/gpt2` to test `examples/offline_inference_npu.py`, and the results showed that it worked well. The test logs are shown below:

```bash
INFO 02-05 09:08:46 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 02-05 09:08:46 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 02-05 09:08:46 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-05 09:08:46 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-05 09:08:46 __init__.py:44] plugin ascend loaded.
INFO 02-05 09:08:46 __init__.py:177] Platform plugin ascend is activated
INFO 02-05 09:08:48 config.py:2383] Downcasting torch.float32 to torch.float16.
INFO 02-05 09:08:59 config.py:542] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-05 09:08:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+gb3a0d01) with config: model='/home/sss/models/AI-ModelScope/gpt2', speculative_config=None, tokenizer='/home/sss/models/AI-ModelScope/gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/sss/models/AI-ModelScope/gpt2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 02-05 09:09:01 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 02-05 09:09:01 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.18it/s]
INFO 02-05 09:09:11 executor_base.py:110] # CPU blocks: 98557, # CPU blocks: 7281
INFO 02-05 09:09:11 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 1539.95x
INFO 02-05 09:09:12 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.13 seconds
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.53it/s, est. speed input: 8.41 toks/s, output: 152.97 toks/s]
Prompt: 'Hello, my name is', Generated text: " John. I'm a writer, and I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm"
Prompt: 'The president of the United States is', Generated text: ' States president. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United'
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic, and the capital of the French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.'
Prompt: 'The future of AI is', Generated text: '\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future'
```

---------

Signed-off-by: Shanshan Shen <467638484@qq.com>
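The `Maximum concurrency` figure in the test log is consistent with a simple ratio: total cache tokens (blocks times tokens per block) divided by the maximum sequence length. The log does not print the block size, so the 16 tokens per block used below is an assumption, and `max_concurrency` is a hypothetical helper rather than a vLLM function. Under that assumption, the first block count in the log (98557) reproduces the reported value.

```python
def max_concurrency(num_blocks: int, block_size: int, max_seq_len: int) -> float:
    """How many full-length requests the cache could hold at once."""
    return num_blocks * block_size / max_seq_len


# 98557 blocks and max_seq_len=1024 come from the log; block_size=16 is assumed.
ratio = max_concurrency(num_blocks=98557, block_size=16, max_seq_len=1024)
print(f"{ratio:.2f}x")  # 1539.95x
```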

commit 3a24897 Author: ApostaC <yihua98@uchicago.edu> Date: Thu Apr 10 18:31:47 2025 -0700 [Fix] memory leak problem by proper clean up Signed-off-by: ApostaC <yihua98@uchicago.edu>
commit 406d6bf Author: Rémi Delacourt <54138269+Flechman@users.noreply.github.com> Date: Fri Apr 11 00:47:40 2025 +0200 Add MLA support for v1 disagg connector (#6) Signed-off-by: remi <remi@mistral.ai>
commit 1d8415d Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Thu Apr 10 21:59:54 2025 +0000 rename Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 9c4159c Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Thu Apr 10 21:41:20 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 54e1491 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Thu Apr 10 20:31:35 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 8e1eadc Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Thu Apr 10 20:26:37 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 05349a5 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 22:10:50 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 7f57f3c Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 17:13:31 2025 +0000 update lifecycle Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 7c31e29 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 17:03:55 2025 +0000 nits Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 74af233 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 16:44:01 2025 +0000 done with nits Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit e64f745 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 16:28:51 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 40e5d81 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 16:25:04 2025 +0000 refactor Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 25c9592 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 16:20:41 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit fc58dd5 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 16:13:39 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 20decdf Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 16:06:15 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 5145566 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:52:03 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 62e1421 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:47:40 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 689379e Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:39:40 2025 +0000 updaed Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit b1310fd Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:36:37 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 7b64acb Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:32:47 2025 +0000 clean up code Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 7766ca5 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:31:33 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit b0629bd Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:27:58 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit eca7a49 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:25:24 2025 +0000 cleaning Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 1881aa5 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:24:55 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 7833645 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:11:18 2025 +0000 updared Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit e72e5e4 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:08:34 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 48c2eb2 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:07:37 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit de1e487 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 15:01:14 2025 +0000 fix nit Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 0163070 Merge: e2ecc14 8b3f606 Author: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Date: Wed Apr 9 10:39:46 2025 -0400 Merge pull request #4 from robertgshaw2-redhat/rob-changes Rob changes
commit 8b3f606 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 14:29:26 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 90e8c53 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 14:29:17 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit da019df Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 14:23:47 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 4ebcc3e Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 13:44:41 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 00df670 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 13:20:14 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit a73721a Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Wed Apr 9 13:19:17 2025 +0000 updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 31d807e Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Tue Apr 8 20:58:28 2025 +0000 stash Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
commit 5accb53 Author: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Date: Tue Apr 8 16:00:29 2025 +0000 stash Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>

Signed-off-by: ApostaC <yihua98@uchicago.edu>

WoosukKwon added 17 commits March 3, 2023 04:16

Fix a bug in 1D shape

e5a1fa8

Minor

342275f

Minor

b91a2fa

[WIP] Add memory analyzer

d78e2fb

Automatically config GPU/CPU blocks

2649eb5

Remove TODO

1ae7420

Merge branch 'main' into autoconfig

6654b34

Merge branch 'main' into autoconfig

fcbf027

Add max_num_batched_tokens argument

350ed27

Minor

6f5b41b

Minor

2d03918

Refactor model utils

8ec00fe

Re-implement memory analyzer

84203fc

Fix __init__

96b216c

Use memory analyzer in server.py

c89d440

Add psutil to README

f5d1e2c

Fix comment

cc63c24

WoosukKwon merged commit e9d3f2f into main Mar 12, 2023

WoosukKwon deleted the autoconfig branch March 12, 2023 07:23

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add memory analyzer & utomatically configure KV cache size (vllm-proj…

de10960

…ect#6)

slyalin pushed a commit to slyalin/vllm that referenced this pull request Mar 21, 2024

Merge pull request vllm-project#6 from mzegla/extended_requirements

2922b06

Add missing Python requirements

mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024

dtype (vllm-project#6)

00bce1f

Co-authored-by: Mor Zusman <morz@ai21.com>

dtrifiro referenced this pull request in dtrifiro/vllm Apr 26, 2024

Merge pull request #6 from z103cb/ibm_main_docker_ubi_updates

91e4a51

[CI/Build] Dockerfile.ubi : Remove test stage

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024

Merge pull request vllm-project#6 from wenxcs/wenxh/fp8-on-a100

4e56e27

FP8 on A100 for PHIMOE

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

heheda12345 added a commit to heheda12345/vllm that referenced this pull request Sep 25, 2024

Merge pull request vllm-project#6 from vllm-project/chang/num_prompt_…

9b931bf

…tokens [Bugfix] Include encoder_prompt_tokens in num_prompt_tokensin UsageInfo

Clint-chan mentioned this pull request Sep 29, 2024

[Bug]: Vllm0.6.2 UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown #8933

Open

1 task

SpaceHunterInf mentioned this pull request Sep 30, 2024

[Bug]: Bus error (core dumped) #8974

Closed

1 task

This was referenced Oct 12, 2024

[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered #6976

Closed

[Bug]: Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered #9306

Open

xxzhang0927 mentioned this pull request Oct 30, 2024

[Bug]: Engine iteration timed out. This should never happen! #9839

Open

1 task

hteeyeoh mentioned this pull request Dec 6, 2024

[Bug]: Not able to install/compile vllm using alpine linux base image #10924

Closed

1 task

HelenaSak mentioned this pull request Feb 19, 2025

[Bug]: watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered #8177

Open

1 task

lsy323 pushed a commit to lsy323/vllm that referenced this pull request Mar 31, 2025

Merge pull request vllm-project#6 from lsy323/lsiyuan/update-torch-whl

afbf9aa

update torchwhl

alokkrsahu mentioned this pull request Apr 9, 2025

[Bug]: meta-llama/Llama-4-Scout-17B-16E-Instruct compatibility #16330

Closed

1 task

jifa513 mentioned this pull request Apr 10, 2025

[Bug]: Error: Failed to initialize the TMA descriptor 700 #13961

Open

1 task

qiuhaining mentioned this pull request Apr 10, 2025

[Bug]: corrupted double-linked list (not small) Aborted #16412

Closed

1 task

southfreebird mentioned this pull request Apr 11, 2025

[Bug]: Medusa speculation hangs when tp > 1 #16477

Closed

1 task

QuanhuiGuan mentioned this pull request Apr 14, 2025

[Bug]: CUDA error: an illegal memory access was encountered #16398

Open

1 task

maobaolong pushed a commit to maobaolong/vllm that referenced this pull request Apr 14, 2025

Add MLA support for v1 disagg connector (vllm-project#6)

406d6bf

Signed-off-by: remi <remi@mistral.ai>

hao-cold mentioned this pull request May 13, 2025

[Bug]: CUDA error: an illegal instruction was encountered #18045

Open

1 task

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

1 task

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaocode337317439 mentioned this pull request Jun 27, 2025

[Bug]:RuntimeError: CUDA error: an illegal memory access was encountered #20170

Open

1 task

Chris113113 mentioned this pull request Jul 10, 2025

[Bug]: [V1][gpu_model_runner.py] CUDA memory error #19415

Open

1 task

shrijayan mentioned this pull request Jul 12, 2025

vLLM hangs after 10 minutes without any error message #1492

Closed

tyxiong23 mentioned this pull request Jul 30, 2025

[Bug]: GLM-4.1V-Thinking ValueError #21811

Closed

1 task

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open

1 task

devops724 mentioned this pull request Aug 3, 2025

[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made #15483

Open

1 task

fernandaspets mentioned this pull request Aug 8, 2025

[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 #22479

Open
