# cached tokens completions #22
- Fix `dockerfilegraph` pre-commit hook (vllm-project/vllm#17698)
- [Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py` (vllm-project/vllm#16839)
- [Misc] Use `apply_rotary_emb` from vllm_flash_attn for Qwen2-VL vision RoPE (vllm-project/vllm#17726)
- Fix and simplify `deprecated=True` CLI `kwarg` (vllm-project/vllm#17781)
- [BugFix] Avoid secondary missing `MultiprocExecutor.workers` error (vllm-project/vllm#17811)
- Don't call the venv `vllm` (vllm-project/vllm#17810)
- [BugFix] Fix `--disable-log-stats` in V1 server mode (vllm-project/vllm#17600)
- [Bugfix] Fix `use_fast` failing to be propagated to Qwen2-VL image processor (vllm-project/vllm#17838)
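For context on the `use_fast` fix above: a minimal sketch of passing the flag through `mm_processor_kwargs`, which is assumed here to be the path the fix makes work; the model name is illustrative and the exact kwarg plumbing is not taken from the PR.

```python
# Hypothetical sketch: forwarding use_fast to the HF image processor.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-2B-Instruct",  # illustrative model
    # use_fast selects the fast (Rust-backed) HF image processor;
    # the bugfix ensures this kwarg actually reaches it.
    mm_processor_kwargs={"use_fast": True},
)
```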
- [V1][Structured Output] Update llguidance (`>= 0.7.11`) to avoid AttributeError (no `StructTag`) (vllm-project/vllm#17839)
- Fix Whisper crash caused by invalid `max_num_batched_tokens` config (vllm-project/vllm#17853)
- Change `top_k` to be disabled with `0` (still accept `-1` for now) (vllm-project/vllm#17773)
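A minimal sketch of the new `top_k` convention via the standard `SamplingParams` entry point: `0` now means "disabled", while `-1` is still accepted for backward compatibility.

```python
from vllm import SamplingParams

# top_k=0 now disables top-k filtering (the new convention).
params_new = SamplingParams(temperature=0.8, top_k=0)

# top_k=-1 is still accepted for now, for backward compatibility.
params_legacy = SamplingParams(temperature=0.8, top_k=-1)
```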
- Handle error when `str` passed to `/v1/audio/transcriptions` (vllm-project/vllm#17909)
- Don't default construct `ModelConfig` when default constructing `VllmConfig` (vllm-project/vllm#17943)
- [Bugfix] Add revision to `transformers.Auto*.from_pretrained` processors (vllm-project/vllm#17948)
- [BugFix] [ROCm]: Bugfix and handle addition case of input for `rocm_aiter_rms_norm` (vllm-project/vllm#17857)
- [Fix] Benchmark `"EngineClient" has no attribute "model_config"` (vllm-project/vllm#17976)
- Construct `KVTransferConfig` properly from Python instead of using JSON blobs without CLI (vllm-project/vllm#17994)
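A sketch of what constructing `KVTransferConfig` from Python (rather than a JSON blob) can look like; the field values below are assumptions based on common disaggregated-prefill examples, not taken from this PR, so check your vLLM version for the exact schema.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Assumed field values; the connector name is hypothetical.
ktc = KVTransferConfig(
    kv_connector="SharedStorageConnector",
    kv_role="kv_both",
)
llm = LLM(model="facebook/opt-125m", kv_transfer_config=ktc)
```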
- Remove noisy warnings from `SchedulerConfig` (vllm-project/vllm#17995)
- [Feature][V1] Support `tool_choice: required` when using Xgrammar as the `StructuredOutputBackend` (vllm-project/vllm#17845)
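On the OpenAI-compatible server, `tool_choice: "required"` forces the model to call at least one of the supplied tools; a minimal request sketch, where the tool schema and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # illustrative model
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
    tool_choice="required",  # a tool call is guaranteed, backed by xgrammar
)
print(resp.choices[0].message.tool_calls)
```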
- Convert `.buildkite` to `ruff format` (vllm-project/vllm#17656)
- Update deprecated type hinting in `model_executor/layers` (vllm-project/vllm#18056)
- Update deprecated type hinting in `vllm/profiler` (vllm-project/vllm#18057)
- Update deprecated type hinting in `vllm/transformers_utils` (vllm-project/vllm#18058)
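The "deprecated type hinting" entries here and below refer to moving off the `typing.List`/`typing.Optional`-style aliases onto builtin generics (PEP 585) and union syntax (PEP 604), roughly:

```python
# Before: deprecated typing aliases
from typing import Dict, List, Optional

def old(batch: List[int], table: Dict[str, int], tag: Optional[str]) -> None: ...

# After: builtin generics and the | union operator
def new(batch: list[int], table: dict[str, int], tag: str | None) -> None: ...
```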
- Convert `benchmarks` to `ruff format` (vllm-project/vllm#18068)
- Update deprecated type hinting in `vllm/compilation` (vllm-project/vllm#18072)
- Update deprecated type hinting in `vllm/adapter_commons` (vllm-project/vllm#18073)
- [P/D] Add some more debug logs to `NixlConnector` (vllm-project/vllm#18102)
- [Bugfix] fix moe marlin `topk_weight` loading (vllm-project/vllm#18080)
- Update deprecated type hinting in `vllm/lora` (vllm-project/vllm#18128)
- Update deprecated type hinting in `vllm/device_allocator` and `vllm/distributed` (vllm-project/vllm#18126)
- Update deprecated type hinting in `platform`, `plugins`, `triton_utils`, `vllm_flash_attn` (vllm-project/vllm#18129)
- Add support for loading torchao models with `AOPerModuleConfig` (vllm-project/vllm#17826)
- [Bugfix]: make most of `test_openai_schema.py` pass (vllm-project/vllm#17664)
- Update deprecated type hinting in `models` (vllm-project/vllm#18132)
- [CI] don't skip fixed `test_kv_cache_events()` (vllm-project/vllm#18183)
- Update deprecated type hinting in `model_loader` (vllm-project/vllm#18130)
- [Bugfix][ROCm] Use `chunked_prefill_paged_decode` as fallback for V1 attention on ROCm (vllm-project/vllm#18093)
- [Fix] Fix typo in `resolve_hf_chat_template` (vllm-project/vllm#18259)
- [Bugfix] fix `an illegal memory access was encountered` of marlin kernel + act_order (vllm-project/vllm#18245)
- [BugFix] [Vul] Add missing `usedforsecurity=False` in MD5 hashing to enable FIPS (vllm-project/vllm#18319)
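For context on the FIPS fix above: on FIPS-enabled Python builds, a plain `hashlib.md5(...)` call can raise, so non-cryptographic uses must declare themselves via `usedforsecurity=False` (available since Python 3.9).

```python
import hashlib

# usedforsecurity=False marks the digest as non-cryptographic
# (e.g. a cache key or content fingerprint), which keeps it legal
# under FIPS where plain MD5 is rejected.
digest = hashlib.md5(b"prompt-cache-key", usedforsecurity=False).hexdigest()
print(digest)
```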
- [Misc] Allow `AutoWeightsLoader` to skip loading weights with specific substr in name (vllm-project/vllm#18358)
- [Frontend] deprecate `--device` arg (vllm-project/vllm#18399)
- [Misc] Update deprecation message for `--enable-reasoning` (vllm-project/vllm#18404)
- [BugFix][DP] Send DP wave completion only from `dp_rank==0` (vllm-project/vllm#18502)
- [Misc] Call `ndarray.tobytes()` directly instead of `ndarray.data.tobytes()` (vllm-project/vllm#18347)
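The `ndarray.tobytes()` change above is a small simplification: for a contiguous array both forms produce the same bytes, and the direct method call avoids the intermediate memoryview.

```python
import numpy as np

arr = np.arange(4, dtype=np.int32)

# Equivalent for a contiguous array; the direct call skips the
# memoryview exposed by arr.data.
via_memoryview = arr.data.tobytes()
direct = arr.tobytes()
assert via_memoryview == direct
```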
- [Bugfix] make `test_openai_schema.py` pass (vllm-project/vllm#18224)
- [Bugfix] Set `KVTransferConfig.engine_id` in post_init (vllm-project/vllm#18576)
- [Misc] Replace `cuda` hard code with `current_platform` (vllm-project/vllm#16983)
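The idea behind the `current_platform` cleanups (here and in the Ray entry below) is to route device decisions through vLLM's platform abstraction instead of hard-coding `"cuda"`. A rough sketch of the pattern; the `device_type` attribute is an assumption about the abstraction, so check your version.

```python
import torch
from vllm.platforms import current_platform  # vLLM's platform abstraction

# Anti-pattern being removed:
#   x = torch.zeros(8, device="cuda")

# Platform-aware version; device_type is assumed to resolve to the
# backend's device string ("cuda", "cpu", ...).
x = torch.zeros(8, device=current_platform.device_type)
```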
- [Doc] Update quickstart and install for cu128 using `--torch-backend=auto` (vllm-project/vllm#18505)
- [Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to `requirements/cpu.txt` (vllm-project/vllm#18542)
- Replace `{func}` with mkdocs style links (vllm-project/vllm#18610)
- [CI/Build] `chmod +x` to `cleanup_pr_body.sh` (vllm-project/vllm#18650)
- [Misc] Replace `cuda` hard code with `current_platform` in Ray (vllm-project/vllm#14668)
- Speed up the `kernels/quantization/` tests (vllm-project/vllm#18669)
- [CI/Build][Doc] Update `gte-Qwen2-1.5B-instruct` usage (vllm-project/vllm#18683)
- [CI/Build] Replace `math.isclose` with `pytest.approx` (vllm-project/vllm#18703)
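`pytest.approx` reports expected vs. actual values on failure (where a bare `assert math.isclose(...)` is opaque) and compares whole containers element-wise; roughly, the migration looks like this:

```python
import math
import pytest

def test_sum_old():
    # Before: failure output is just "assert False".
    assert math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-6)

def test_sum_new():
    # After: pytest.approx shows both values on failure
    # and works element-wise on sequences.
    assert 0.1 + 0.2 == pytest.approx(0.3)
    assert [0.1 + 0.2, 1.0 / 3.0] == pytest.approx([0.3, 0.3333333], rel=1e-6)
```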
- Convert `examples` to `ruff-format` (vllm-project/vllm#18400)
- [Doc] Convert Sphinx directives (`{class}`, `{meth}`, `{attr}`, ...) to MkDocs format for better documentation linking (vllm-project/vllm#18663)
- [CI/Build] Remove imports of built-in `re` (vllm-project/vllm#18750)
- `vllm bench serve` and sync with benchmark_[serving,datasets].py (#18566)
- `get_dummy_text` and `get_dummy_mm_data` (#18796)
- `async_timeout` (#18792)
- `None` for fields which should never be `None` (#17985)
- `model` when initializing `LLM` (#18802)
- `base` (#18914)
- `Qwen2EmbeddingModel` (#18913)
- `packed_modules_mapping` for VLM with arbitrary components (#18987)
- `WeightsMapper` for qwen2-vl/qwen2.5-vl (#19054)
- `_Backend` enums (#19081)
- `scaled_fp8_quant` by increasing vectorization (#18844)
- `Optional` and `Annotated` in CLI typing (#19093)
- `compressed-tensors` (#19217)
- `generate()` to handle Generator exits (#19225)
- `inputs` arg fallback in Engine classes (#18799)
- `max_model_len` in V0 (#19348)
- `kv_sharing_target_layer_name` argument to cutlass_mla backend (#19374)
- `use_irope` (#19134)