
Added FP8 quantization support to DualChunkFlashAttentionBackend #19420


Closed

Conversation

@ExtReMLapin commented Jun 10, 2025

Essential Elements of an Effective PR Description Checklist

Purpose

Fixed missing FP8 quantization support in DualChunkFlashAttentionBackend.

Test Plan

Run Qwen 2.5 1M with FP8 quantization:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN CUDA_VISIBLE_DEVICES=1,2 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --max-model-len 140000 --max-num-seqs 1 --port 2483 --enforce-eager --gpu-memory-utilization 0.57 --quantization fp8
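
Once the server is up, a quick sanity check can be run against the OpenAI-compatible completions endpoint that vllm serve exposes (a minimal sketch; the port and model name are taken from the command above):

```python
# Minimal smoke test for the server started above. The port (2483) and model
# name are assumptions taken from the serve command; adjust if you changed them.
import requests

resp = requests.post(
    "http://localhost:2483/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct-1M",
        "prompt": "Summarize the benefit of FP8 KV caches in one sentence.",
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```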

Test Result

With this change, the command above runs successfully and the model is served with FP8 quantization.

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (Contributor, bot) left a comment

Summary of Changes

Hello @ExtReMLapin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for FP8 quantization within the DualChunkFlashAttention backend. It adds the necessary checks, quantizes the query, key, and value tensors before attention computation and caching, and ensures the KV cache is viewed correctly for FP8 operations. This enables running models with FP8 quantization using this specific attention backend.

Highlights

  • FP8 Quantization Support: Added logic to check if FP8 attention is enabled based on kv_cache_dtype.
  • Query Tensor Quantization: Implemented FP8 quantization for all query variants (query, query_succ, query_inter, query_succ_critical, query_inter_critical) using ops.scaled_fp8_quant and the layer's query scale.
  • Key/Value Tensor Quantization: Implemented FP8 quantization for input key and value tensors using ops.scaled_fp8_quant and the layer's key/value scales (a minimal sketch of this quantization step follows the list).
  • KV Cache View: Added logic to convert the KV cache and its key/value components to a torch.float8_e4m3fn view when FP8 attention is enabled.
  • FlashAttention Integration: Passed the FP8 attention flag, FlashAttention version, and placeholder descale tensors to the underlying flash_attn_varlen_func and flash_attn_with_kvcache calls.
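
As referenced in the highlights above, here is a minimal sketch of the query/key/value quantization step. It assumes vLLM's ops.scaled_fp8_quant helper (2-D input, returns a (quantized, scale) tuple) and the per-layer _q_scale/_k_scale/_v_scale attributes mentioned in this summary; the exact reshapes in the PR may differ.

```python
import torch
from vllm import _custom_ops as ops


def quantize_for_fp8_attention(tensor: torch.Tensor,
                               scale: torch.Tensor) -> torch.Tensor:
    """Quantize a (num_tokens, num_heads, head_size) tensor to FP8.

    The same helper would be applied to query, query_succ, query_inter, the
    critical variants, and to key/value with their respective layer scales.
    """
    num_tokens, num_heads, head_size = tensor.shape
    # scaled_fp8_quant expects a 2-D tensor and returns (quantized, scale).
    quantized, _ = ops.scaled_fp8_quant(
        tensor.reshape(num_tokens, num_heads * head_size).contiguous(),
        scale,
    )
    return quantized.reshape(num_tokens, num_heads, head_size)
```

On the cache side, the KV cache tensor would be reinterpreted with kv_cache.view(torch.float8_e4m3fn) before writing, matching the last highlight above.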
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request adds FP8 quantization support to the DualChunkFlashAttentionBackend. The changes involve quantizing query, key, and value tensors to FP8 format and updating the attention mechanism to handle these FP8 tensors.

Key areas of feedback:

  1. FP8 Descaling Factors: There's a critical concern regarding the initialization of the q_descale, k_descale, and v_descale parameters for FlashAttention functions. They are currently set to torch.ones, but comments in the code and the nature of FP8 quantization suggest that the actual scaling factors from the layer (layer._q_scale, etc.) should be used. Using torch.ones will likely lead to incorrect attention outputs if the true scales are not 1.0 (a sketch of the intended wiring follows this review).
  2. Query Tensor Updates: The logic for updating query tensors after quantization could be refactored for better maintainability and conciseness.

Overall, the addition of FP8 support is a valuable enhancement. Addressing the points above, especially the handling of scaling factors, is crucial for the correctness of the implementation.
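
For concreteness, here is a minimal sketch of the descale wiring this review asks for, built from the names already used in the discussion (layer._q_scale/_k_scale/_v_scale and the q_descale/k_descale/v_descale keyword arguments of flash_attn_varlen_func and flash_attn_with_kvcache); the exact shapes expected by the kernels are an assumption.

```python
def build_descale_kwargs(layer, num_seqs: int, num_kv_heads: int,
                         fp8_attention: bool) -> dict:
    """Build descale kwargs from the layer's FP8 scales instead of torch.ones.

    FlashAttention-3-style kernels take one descale value per (sequence,
    kv_head); expanding the scalar per-layer scale keeps memory usage flat.
    """
    if not fp8_attention:
        return {}
    descale_shape = (num_seqs, num_kv_heads)
    return {
        "q_descale": layer._q_scale.expand(descale_shape),
        "k_descale": layer._k_scale.expand(descale_shape),
        "v_descale": layer._v_scale.expand(descale_shape),
    }
```

The returned dict can then be unpacked into the flash_attn_varlen_func and flash_attn_with_kvcache calls so that no torch.ones placeholder is ever passed.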

@ExtReMLapin (Author) commented Jun 10, 2025

This PR is very largely inspired by how things work in the FlashAttention backend implementation.

This PR also includes #19084

Right now, Qwen 2.5 1M (it could be the model, or something else in DCA) is broken on HEAD and on the latest GitHub release, so I tested it on 9.0.1.

@ExtReMLapin (Author) commented:

I cleaned up the code a lot; how can the LLM bot re-review it?


mergify bot commented Jun 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ExtReMLapin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jun 12, 2025

@houseroad (Collaborator) commented:
/gemini review

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request adds FP8 quantization support to the DualChunkFlashAttentionBackend. The changes primarily involve:

  1. Initializing and checking for FP8 support via flash_attn_supports_fp8 and get_flash_attn_version.
  2. Propagating the layer object through various internal attention functions to access FP8 quantization scales (_q_scale, _k_scale, _v_scale).
  3. Setting up descaling parameters (q_descale, k_descale, v_descale) for calls to flash_attn_varlen_func and flash_attn_with_kvcache when FP8 attention is active.
  4. Ensuring KV cache tensors are viewed as torch.float8_e4m3fn when FP8 is enabled.

The main concern identified is the use of placeholder torch.ones for descaling factors in the _dual_chunk_flash_attn_decoding_with_exp_sums function, which needs to be rectified by using the actual scales from the layer object. Other changes appear to correctly implement the FP8 support logic.
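
A minimal sketch of points 1 and 4 above, i.e. the FP8 enablement check and the KV-cache view. The helper names come from the review (flash_attn_supports_fp8, get_flash_attn_version); the import path and the startswith("fp8") check are assumptions about the surrounding code.

```python
import torch

# Import path is an assumption based on where vLLM currently keeps these helpers.
from vllm.attention.utils.fa_utils import (flash_attn_supports_fp8,
                                           get_flash_attn_version)


class DualChunkFp8Setup:
    """Sketch of the FP8 setup the review describes, not the PR's actual class."""

    def __init__(self, kv_cache_dtype: str) -> None:
        self.fa_version = get_flash_attn_version()
        self.fp8_attention = kv_cache_dtype.startswith("fp8")
        if self.fp8_attention and not flash_attn_supports_fp8():
            raise NotImplementedError(
                "FP8 attention requires a FlashAttention build that supports it.")

    def view_kv_cache(self, kv_cache: torch.Tensor) -> torch.Tensor:
        # Point 4: reinterpret the byte-backed cache as float8_e4m3fn when enabled.
        return kv_cache.view(torch.float8_e4m3fn) if self.fp8_attention else kv_cache
```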

mergify bot added the qwen (Related to Qwen models) label and removed the needs-rebase label on Jun 12, 2025
@ExtReMLapin (Author) commented Jun 25, 2025

As of today, FP8 emulation is working with Marlin kernels.
It's broken on the RTX 5090 GPU. #20052

@ExtReMLapin (Author) commented:

/gemini review

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request adds FP8 quantization support to the DualChunkFlashAttentionBackend. I've identified a few areas that need attention to ensure correctness and robustness, particularly around handling different FP8 formats and ensuring correct tensor shapes for scaling factors.
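
To make the "different FP8 formats" and "tensor shapes for scaling factors" concerns concrete, here is a small, hedged sketch in plain PyTorch: hard-coding torch.float8_e4m3fn covers NVIDIA GPUs, while some ROCm generations use the e4m3fnuz variant, and the descale tensors must be broadcast to the shape the kernel expects.

```python
import torch


def fp8_view_dtype() -> torch.dtype:
    """Pick the FP8 storage format for the KV-cache view.

    NVIDIA GPUs use e4m3fn; some AMD (ROCm) generations use e4m3fnuz, which is
    why hard-coding float8_e4m3fn only covers the CUDA case.
    """
    return torch.float8_e4m3fnuz if torch.version.hip else torch.float8_e4m3fn


def broadcast_scale(scale: torch.Tensor, num_seqs: int,
                    num_kv_heads: int) -> torch.Tensor:
    """Expand a scalar per-layer scale to the (num_seqs, num_kv_heads) shape
    typically expected for the q/k/v descale arguments."""
    return scale.expand(num_seqs, num_kv_heads)
```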

Labels: qwen (Related to Qwen models)

2 participants