[Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers #19029
This PR fixes two bugs:

1. If a model contains both sliding window attention and full attention layers, the sliding window layers are treated as full attention layers when allocating the KV cache, but `check_enough_kv_cache_memory` still treats them as sliding window layers. This PR fixes it by swapping the order of `unify_hybrid_kv_cache_specs` and `check_enough_kv_cache_memory`, so that sliding window layers are first converted to full attention layers and the memory check is performed afterwards (see the first sketch below).
2. The max concurrency estimation in `_get_kv_cache_config_uniform_type` did not consider models with sliding window layers. This PR fixes it (see the second sketch below).
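For illustration, here is a self-contained toy sketch of the reordering in bug 1. The function names mirror the real `unify_hybrid_kv_cache_specs` and `check_enough_kv_cache_memory`, but `LayerSpec` and the simplified bodies are made up for this example and are not the actual vLLM implementation or signatures.

```python
# Toy sketch only: simplified stand-ins for the vLLM KV cache spec logic.
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerSpec:
    page_size_bytes: int                    # bytes per KV cache block for this layer
    block_size: int                         # tokens per block
    sliding_window: Optional[int] = None    # None => full attention layer


def unify_hybrid_kv_cache_specs(specs: dict[str, LayerSpec]) -> None:
    """If the model mixes sliding window and full attention layers, treat every
    layer as full attention, matching how the KV cache is actually allocated."""
    has_full = any(s.sliding_window is None for s in specs.values())
    has_sw = any(s.sliding_window is not None for s in specs.values())
    if has_full and has_sw:
        for spec in specs.values():
            spec.sliding_window = None


def check_enough_kv_cache_memory(specs: dict[str, LayerSpec],
                                 max_model_len: int,
                                 available_memory: int) -> None:
    """Require enough memory to serve a single request of max_model_len tokens."""
    needed = 0
    for spec in specs.values():
        tokens = (max_model_len if spec.sliding_window is None
                  else min(spec.sliding_window, max_model_len))
        needed += math.ceil(tokens / spec.block_size) * spec.page_size_bytes
    if needed > available_memory:
        raise ValueError("Not enough memory for the KV cache of one request.")


def get_kv_cache_config(specs, max_model_len, available_memory):
    # The fix: unify first, then check, so the memory check sees the same
    # (full attention) specs that allocation will use.
    unify_hybrid_kv_cache_specs(specs)
    check_enough_kv_cache_memory(specs, max_model_len, available_memory)
    # ... proceed to build the actual KV cache config ...
```

With the unification done first, the memory check accounts for the same full-attention page usage that allocation will actually reserve, instead of the smaller window-sized budget.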
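And a hedged sketch of the window-aware concurrency estimate in bug 2. `estimate_max_concurrency`, its parameters, and the one extra block allowed for the partially-evicted edge of the window are illustrative assumptions, not the exact formula used in `_get_kv_cache_config_uniform_type`.

```python
# Toy sketch only: rough concurrency estimate for a model whose layers all
# share one KV cache spec (the "uniform type" case).
import math
from typing import Optional


def estimate_max_concurrency(num_blocks: int, block_size: int,
                             max_model_len: int,
                             sliding_window: Optional[int] = None) -> float:
    """Upper bound on how many max_model_len requests fit in the cache at once."""
    if sliding_window is not None:
        # A sliding window layer only keeps the last `sliding_window` tokens,
        # plus (at most) one extra partially-evicted block.
        tokens = min(sliding_window, max_model_len)
        blocks_per_request = min(math.ceil(tokens / block_size) + 1,
                                 math.ceil(max_model_len / block_size))
    else:
        blocks_per_request = math.ceil(max_model_len / block_size)
    return num_blocks / blocks_per_request


# Example with made-up numbers (not measured from starcoder2-3b).
print(estimate_max_concurrency(num_blocks=50_000, block_size=16,
                               max_model_len=16_384, sliding_window=4_096))
```

When the window is much shorter than the maximum model length, dividing by the window-sized block count gives a much higher, and more realistic, concurrency estimate than dividing by `ceil(max_model_len / block_size)` as if every layer were full attention.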
`vllm serve bigcode/starcoder2-3b` (a model whose layers all use sliding window attention; result changed)

Main branch:
This PR:
`vllm serve meta-llama/Llama-3.1-8B-Instruct` (a model whose layers all use full attention; result not changed)

Main branch:
This PR: