refactor: separate SM100 and legacy TRT-LLM comm modules #1259
base: main
Conversation
Summary of Changes
Hello @EmilienM, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the TensorRT-LLM communication module's compilation and loading mechanism to significantly improve hardware compatibility and portability. My changes enable a single FlashInfer build to support a wider range of NVIDIA GPUs, from older architectures to the latest SM100 (Hopper+) GPUs, by dynamically selecting the appropriate communication module at runtime.
Highlights
- **Module Separation:** I've introduced a clear separation for the TensorRT-LLM communication module compilation. There's now a dedicated `trtllm_comm_legacy` module for older GPU architectures, complementing the existing `trtllm_comm` module, which is optimized for SM100 (Hopper+) GPUs.
- **Dynamic Module Loading:** The `get_trtllm_comm_module` function has been enhanced to dynamically detect the runtime GPU's compute capability. Based on this detection, it will now load either the SM100-optimized or the legacy communication module, ensuring broader hardware compatibility and graceful handling of CPU-only environments (see the sketch after this list).
- **Build System Adaptation:** The `gen_trtllm_comm_module` function has been modified to accept an `sm100` boolean flag. This flag dictates whether the module should be compiled with SM100-specific optimizations or as a generic legacy module, allowing for flexible compilation based on target hardware.
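To make the dispatch concrete, here is a minimal sketch of how the compute-capability check could look. The helper name and the capability threshold are assumptions for illustration; only `trtllm_comm`, `trtllm_comm_legacy`, and `get_trtllm_comm_module` come from this PR.

```python
# Minimal sketch of the runtime dispatch described above. The helper name and
# the capability threshold are assumptions; only the module names come from the PR.
import torch


def _use_sm100_comm_module() -> bool:
    """Decide whether to load trtllm_comm (SM100) or trtllm_comm_legacy."""
    if not torch.cuda.is_available():
        # CPU-only environment: fall back gracefully to the legacy module.
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 10  # assumed threshold for SM100-class devices


if __name__ == "__main__":
    name = "trtllm_comm" if _use_sm100_comm_module() else "trtllm_comm_legacy"
    print(f"would load: {name}")
```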
Code Review
This pull request refactors the TensorRT-LLM communication module compilation to improve hardware compatibility by creating separate SM100-optimized and legacy modules. The changes are well-structured, but there are a few issues to address. There's a critical bug in `flashinfer/comm/trtllm_ar.py` where an undefined function is called, which will lead to a runtime error. Additionally, the ahead-of-time compilation script in `flashinfer/aot.py` doesn't fully achieve build portability, as it only compiles one of the two modules, and there's an unused import. I've provided suggestions to fix these issues.
```diff
@@ -325,7 +327,8 @@ def gen_all_modules(
     jit_specs.append(gen_fused_moe_sm100_module())
     jit_specs.append(gen_fp4_quantization_sm100_module())
     jit_specs.append(gen_gemm_sm100_module())
-    jit_specs.append(gen_trtllm_comm_module())
+    jit_specs.append(gen_trtllm_comm_module(sm100=has_sm100))
```
The current logic compiles only one version of the `trtllm_comm` module (either SM100 or legacy) based on the build environment. For a truly portable build that works across different GPU generations without runtime compilation, both modules should be compiled if the build environment supports SM100.
To achieve this, you should always compile the legacy module and conditionally compile the SM100-optimized one. A possible implementation would be:
```python
# in aot.py
# Always compile the legacy module
jit_specs.append(gen_trtllm_comm_module(sm100=False))
if has_sm100:
    # Other sm100 modules...
    jit_specs.append(gen_trtllm_comm_module(sm100=True))
```
This would ensure that a pre-built package is fully portable and avoids the need for a compiler toolchain at runtime on the user's machine.
```python
jit_specs.append(gen_trtllm_comm_module(sm100=False))
if has_sm100:
    jit_specs.append(gen_trtllm_comm_module(sm100=True))
```
I don't think this suggestion is valid, because it's a matter of CUDA CFLAGS being used when building.
I'll let the reviewers bring their opinion.
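For context on the CFLAGS point, here is a hedged sketch of how the `sm100` flag could map to different NVCC code-generation flags at build time. The specific `-gencode` values and the helper name are assumptions, not FlashInfer's actual build configuration.

```python
# Hypothetical illustration of the "CUDA CFLAGS" point above: the sm100 flag
# selects which architectures the comm module is compiled for. The flag values
# below are assumptions, not FlashInfer's real build settings.
def _trtllm_comm_nvcc_flags(sm100: bool) -> list[str]:
    if sm100:
        # SM100-specific code generation (assumed flag set).
        return ["-gencode=arch=compute_100,code=sm_100"]
    # Generic/legacy code generation for older architectures (assumed flag set).
    return [
        "-gencode=arch=compute_80,code=sm_80",
        "-gencode=arch=compute_90,code=sm_90",
    ]


# Building both variants in one AOT pass would then mean invoking the compiler
# twice, once per flag set, which is what the suggestion above implies.
```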
📌 Description
Restructure the compilation of the TensorRT-LLM communication module to
improve hardware compatibility and portability.
Previously, the module was compiled with SM100-specific flags only if a
compatible GPU was detected during the build process.
This made a single build non-portable across different GPU generations.
This change introduces two distinct modules:

- `trtllm_comm`: compiled with SM100 optimizations for Hopper+ GPUs.
- `trtllm_comm_legacy`: a fallback version for older GPU architectures.

At runtime, `get_trtllm_comm_module` now detects the GPU's compute capability and dynamically loads the appropriate module. This allows a single FlashInfer build to support a wider range of NVIDIA GPUs and gracefully handles CPU-only environments.
🔍 Related Issues
Fixes #1256
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes