
[UPDATED] - Large Block_size solution #21123


Open
wants to merge 9 commits into base: main

Conversation

nadathurv

@nadathurv nadathurv commented Jul 17, 2025

This PR contains the updated and current work on this issue. It is related to the To-Do item

Original Problem

Hybrid models were using extremely large block sizes (~400 tokens) because of per-layer constraints: each attention layer was padded so that its kv_hidden_size * block_size exceeded the mamba state size of a single layer, leading to inefficient memory usage.

Solution

Implement an aggregate constraint instead of per-layer constraints:

Before: each attention layer individually satisfies the mamba state requirement.
After: the combined memory of all attention layers satisfies the mamba state requirement.
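To see why the aggregate constraint shrinks block sizes, here is a toy calculation. All of the numbers below are made up for illustration; the real values depend on the model and are not taken from this PR:

```python
# Hypothetical sizes for illustration only; real values depend on the model.
mamba_state_bytes = 400_000     # state size of one mamba layer
per_token_kv_bytes = 1_000      # KV-cache bytes per token for one attention layer
num_attention_layers = 10

# Before: each attention layer alone must cover the mamba state.
block_size_individual = mamba_state_bytes // per_token_kv_bytes
print(block_size_individual)    # 400 tokens per block

# After: all attention layers share the requirement, shrinking the block size.
block_size_aggregate = mamba_state_bytes // (num_attention_layers * per_token_kv_bytes)
print(block_size_aggregate)     # 40 tokens per block
```

With ten attention layers sharing the constraint, the required block size drops by a factor of ten in this toy setup.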

Key Changes

  1. kv_cache_coordinator.py: Add calculate_optimal_block_size() method

    • Implements aggregate constraint calculation: max_mamba_state / (num_attention_layers * min_per_token_bytes)
    • Provides fallback to OPTIMAL_BLOCK_FALLBACK when calculation fails
    • Includes cached version with LRU cache for performance optimization
  2. kv_cache_utils.py: Add _get_kv_cache_config_optimal_block_size() integration

    • Deep copies all specs to prevent mutation of original configurations
    • Applies calculated optimal block size uniformly across all layer specs
    • Wraps the calculation in try/except with a fallback to the existing uniform page size logic
    • Integrates with existing get_kv_cache_config() flow for hybrid models
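A minimal sketch of how these two pieces could fit together. The function and constant names `calculate_optimal_block_size` and `OPTIMAL_BLOCK_FALLBACK` come from the PR description; everything else (signatures, the fallback value of 16, the shape of the spec objects, the helper name `apply_optimal_block_size`) is an assumption for illustration and differs from the actual vLLM code:

```python
import copy
from functools import lru_cache

OPTIMAL_BLOCK_FALLBACK = 16  # assumed fallback value; the real constant lives in vLLM


def calculate_optimal_block_size(max_mamba_state_bytes: int,
                                 num_attention_layers: int,
                                 min_per_token_bytes: int) -> int:
    """Aggregate constraint: all attention layers together cover the mamba state."""
    try:
        return max(1, max_mamba_state_bytes
                   // (num_attention_layers * min_per_token_bytes))
    except ZeroDivisionError:
        # Degenerate configuration (no attention layers / zero-size tokens).
        return OPTIMAL_BLOCK_FALLBACK


@lru_cache(maxsize=None)
def calculate_optimal_block_size_cached(max_mamba_state_bytes: int,
                                        num_attention_layers: int,
                                        min_per_token_bytes: int) -> int:
    """Cached variant: the inputs are hashable ints, so lru_cache applies directly."""
    return calculate_optimal_block_size(max_mamba_state_bytes,
                                        num_attention_layers,
                                        min_per_token_bytes)


def apply_optimal_block_size(kv_cache_specs: dict, block_size: int) -> dict:
    """Apply one uniform block size to deep-copied specs, never mutating the input."""
    specs = copy.deepcopy(kv_cache_specs)
    for spec in specs.values():
        spec.block_size = block_size
    return specs
```

The deep copy mirrors the PR's stated goal of leaving the caller's original configurations untouched when the optimal-block-size path is taken.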

cc @heheda12345 @tlrmchlsmth

Outdated links: Original Work

nadathurv and others added 7 commits July 16, 2025 11:52
Signed-off-by: nadathurv <work.vnadathur@gmail.com>
Signed-off-by: Srreyansh Sethi <srreyansh.sethi@gmail.com>
Co-Authored-By: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com>
Co-Authored-By: nadathurv <218520480+nadathurv@users.noreply.github.com>
Signed-off-by: nadathurv <work.vnadathur@gmail.com>
Signed-off-by: Srreyansh Sethi <srreyansh.sethi@gmail.com>
Co-Authored-By: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com>
Co-Authored-By: nadathurv <218520480+nadathurv@users.noreply.github.com>
Signed-off-by: nadathurv <work.vnadathur@gmail.com>
Signed-off-by: Srreyansh Sethi <srreyansh.sethi@gmail.com>
Co-Authored-By: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com>
Co-Authored-By: nadathurv <work.vnadathur@gmail.com>
Signed-off-by: nadathurv <work.vnadathur@gmail.com>
Signed-off-by: Srreyansh Sethi <srreyansh.sethi@gmail.com>
Co-Authored-By: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com>
Co-Authored-By: nadathurv <work.vnadathur@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jul 17, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an intelligent way to calculate the optimal block size for hybrid models, which should improve memory efficiency. The core logic for the calculation in kv_cache_coordinator.py is robust and handles edge cases well. The integration in kv_cache_utils.py is also well-structured.

I've identified one high-severity issue regarding error handling. The use of a broad, silent except Exception could mask bugs and should be updated to include logging for better maintainability and easier debugging. Other than that, the changes look good.
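The pattern the reviewer is asking for might look something like the sketch below. The wrapper name, logger setup, and fallback constant are all hypothetical; only the idea of logging instead of silently swallowing exceptions comes from the review comment:

```python
import logging

logger = logging.getLogger(__name__)

FALLBACK_BLOCK_SIZE = 16  # assumed fallback constant for illustration


def safe_optimal_block_size(compute):
    """Run a block-size calculation, logging (not swallowing) any failure."""
    try:
        return compute()
    except Exception:
        # logger.exception records the full traceback at ERROR level, so a
        # bug masked by the fallback still leaves evidence in the logs.
        logger.exception("Optimal block size calculation failed; "
                         "falling back to %d", FALLBACK_BLOCK_SIZE)
        return FALLBACK_BLOCK_SIZE
```

This keeps the fallback behavior the PR already has while making any masked failure visible during debugging.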

Signed-off-by: nadathurv <work.vnadathur@gmail.com>
Co-Authored-By: nadathurv <work.vnadathur@gmail.com>