[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). #19825
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @pavanimajety, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement by integrating a new Flashinfer TRTLLM backend for the decode attention path, specifically optimized for Blackwell (SM100) architectures. This aims to provide a more performant decoding kernel, which can be toggled via an environment variable. The changes involve extending existing data structures, implementing conditional execution logic for the new backend, and adding a dedicated benchmark to validate its performance.
Highlights
- New TRTLLM Decode Backend Integration: I've integrated the trtllm_batch_decode_with_kv_cache function from Flashinfer, providing an alternative, potentially faster, decode attention kernel for Blackwell (SM100) architectures.
- Feature Flag Control: A new environment variable, VLLM_USE_TRTLLM_DECODE_ATTENTION, lets users explicitly enable or disable the new TRTLLM decode path, giving control over which backend is used.
- FlashInfer Metadata Extension: The FlashInferMetadata class has been extended with the additional parameters required by the TRTLLM decode function, such as max_seq_len, seq_lens, block_table_tensor, and workspace_buffer.
- Conditional Execution Logic: The _plan and forward methods of the FlashInferBackend now invoke either the new TRTLLM decode kernel or the existing Flashinfer decode, depending on the VLLM_USE_TRTLLM_DECODE_ATTENTION environment variable (see the sketch after this list). This includes handling the specific KV cache layout expected by TRTLLM.
- Dedicated Benchmark Test: A new test file (test_flashinfer_trtllm_decode_attention.py) benchmarks the trtllm_batch_decode_with_kv_cache function across various sequence counts, so the new integration can be properly evaluated.
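As a rough illustration of the toggle described above, here is a minimal sketch of how the decode path could branch on the environment variable. The helper names run_trtllm_decode and run_flashinfer_decode are placeholders, not the actual vLLM functions; the real logic lives inside the FlashInfer backend's _plan/forward methods and calls trtllm_batch_decode_with_kv_cache or the BatchDecodePagedKVCacheWrapper directly.

```python
import os

# Sketch only: vLLM reads this flag through vllm.envs; run_trtllm_decode and
# run_flashinfer_decode below are hypothetical stand-ins for the real calls.
USE_TRTLLM_DECODE = os.environ.get("VLLM_USE_TRTLLM_DECODE_ATTENTION", "0") == "1"

def decode_forward(query, kv_cache, metadata,
                   run_trtllm_decode, run_flashinfer_decode):
    if USE_TRTLLM_DECODE:
        # TRTLLM kernel path: relies on the extra metadata added in this PR
        # (max_seq_len, seq_lens, block_table_tensor, workspace_buffer) and on
        # the KV-cache layout that the TRTLLM kernel expects.
        return run_trtllm_decode(query, kv_cache, metadata)
    # Default path: the existing FlashInfer paged-KV decode wrapper.
    return run_flashinfer_decode(query, kv_cache, metadata)
```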
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds a Flashinfer TRTLLM backend for the Flashinfer decode path, targeting the SM100 architecture. The changes touch the attention backend, environment variables, and a new test file: a new environment variable enables the TRTLLM backend, which is integrated into the existing Flashinfer attention implementation, and the test file benchmarks the new backend's performance. There are several areas where the code could be improved, including hardcoded values, redundant calculations, and missing documentation.
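For context on how such a kernel benchmark is typically structured, here is a hedged sketch (not the PR's actual test code) of timing a decode call with CUDA events across several sequence counts; run_decode_for is a hypothetical factory for whichever decode call is being measured.

```python
import torch

def time_decode(run_decode, num_iters: int = 100, warmup: int = 10) -> float:
    """Return the mean latency of run_decode() in milliseconds."""
    for _ in range(warmup):
        run_decode()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(num_iters):
        run_decode()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / num_iters

# Hypothetical sweep over sequence counts, mirroring what the test file does:
# for num_seqs in (1, 4, 16, 64, 256):
#     print(num_seqs, time_decode(lambda: run_decode_for(num_seqs)))
```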
Force-pushed from 36ca48a to 03c31c5.
Force-pushed from 7cdac4d to 8e10c86.
Kernel benchmark:
Force-pushed from 0b52d41 to 2f8bc21.
Review comments on tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py (outdated, resolved).
Head branch was pushed to by a user without write access.
…0). (vllm-project#19825)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored by @wenscarl
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose
Adds decode kernels for paged GQA with kv-cache-dtype="auto". A follow-up PR will add FA3-style support for Q=FP8 and KV=FP8.
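To make the toggle concrete, a hedged usage sketch follows. The model name and tensor-parallel size are placeholders, and it assumes a Blackwell (SM100) GPU with the FlashInfer attention backend selected; exporting the variables in the shell before launch is equivalent and sidesteps any question of when vLLM reads them.

```python
import os

# Set before vLLM constructs the attention backend. VLLM_ATTENTION_BACKEND is
# an existing vLLM env var; VLLM_USE_TRTLLM_DECODE_ATTENTION is the one added
# by this PR. Model and TP size below are placeholders.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_USE_TRTLLM_DECODE_ATTENTION"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)
out = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```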
Test Plan
Test Result
Llama 3.3 70B FP8 Benchmarking results:
(Optional) Documentation Update
Introduces VLLM_USE_TRTLLM_DECODE_ATTENTION for switching between the Flashinfer BatchDecodePagedKVCacheWrapper wrapper and the trtllm_batch_decode_with_kv_cache API.
Kernel-level benchmarks: see comments.
Test results: