[feat!] Disable FSDP inference by default #996
Conversation
Summary of Changes

Hello @XOR-op, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a performance bottleneck in inference by changing the default behavior of FSDP (Fully Sharded Data Parallel) usage. Previously enabled by default, FSDP inference was found to introduce significant overhead in certain single-GPU scenarios. The core change disables FSDP inference by default to prioritize speed, while still providing guidance and options for users who need it for VRAM optimization. This ensures better out-of-the-box performance for most users and clearer instructions for memory-constrained environments.
Code Review
This pull request disables FSDP inference by default, which is a sensible change to prioritize performance over memory savings out-of-the-box. The changes are consistently applied across all example scripts, and the documentation is updated accordingly. I've found a few minor issues: a typo in the documentation, a stylistic issue in one of the examples, and a bug in a help string that would lead to a malformed word. My review includes suggestions to fix these.
The snippet under review:

```python
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce"
+
" weight transfer overhead depending on the specific setup. Enable if run out of memory.",
```
The string concatenation here is missing a space between 'introduce' and 'weight', which will result in 'introduceweight' in the help message. This is a bug. I've added the space to fix it.
As a minor style point, using + for string literal concatenation is not necessary, as Python automatically concatenates adjacent string literals. You could remove the + and the line it's on for more idiomatic code.
Suggested change (the original lines are shown above; the fix moves the space to the end of the first literal):

```python
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce "
+
"weight transfer overhead depending on the specific setup. Enable if run out of memory.",
```
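As an aside, a minimal standalone sketch of the implicit-concatenation style suggested above (the enclosing argument definition is omitted here):

```python
# Adjacent string literals are joined at compile time, so no "+" is needed.
help_text = (
    "Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce "
    "weight transfer overhead depending on the specific setup. Enable if run out of memory."
)
assert "introduce weight" in help_text  # the missing-space bug is gone
```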
```diff
 # FastVideo will automatically handle distributed setup
 num_gpus=1,
-use_fsdp_inference=True,
+use_fsdp_inference=False,  # set to True if GPU is out of memory
```
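For context, a sketch of how the updated example scripts read end to end, assuming FastVideo's `VideoGenerator.from_pretrained` entry point; the model ID below is a placeholder:

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "FastVideo/FastHunyuan-diffusers",  # placeholder model ID
    # FastVideo will automatically handle distributed setup
    num_gpus=1,
    use_fsdp_inference=False,  # set to True if GPU is out of memory
)
```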
FSDP H2D is async, but you may write an explicit transfer controller to reduce the dispatch overhead as in sgl-project/sglang#15511. |
@Ohm-Rishabh already ported that sglang PR to fastvideo here: #966. This PR should just be cleaning up the default settings and our example scripts. Since similar to sglang,
@Edenzzzz In the current implementation, the H2D copy is on the same CUDA stream as the main one and is not prefetched, so it exhibits synchronous behavior even when invoked as async. That could explain why I hit the performance issue.
If so, that's a really poor implementation and worth raising an issue, if you have any profiling results showing they are on the same stream.
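For reference (a hypothetical sketch, not FastVideo's or sglang's actual code), this is the kind of explicit transfer control being discussed: prefetching a weight on a dedicated CUDA stream so the H2D copy can overlap with compute on the default stream. All names here are made up.

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for H2D weight copies

def prefetch_weight(cpu_weight: torch.Tensor) -> torch.Tensor:
    # Pinned host memory is required for the copy to be truly asynchronous.
    pinned = cpu_weight.pin_memory()
    with torch.cuda.stream(copy_stream):
        return pinned.to("cuda", non_blocking=True)

def consume_weight(gpu_weight: torch.Tensor) -> torch.Tensor:
    # The default stream must wait for the copy before using the weight;
    # record_stream tells the caching allocator about the cross-stream use.
    torch.cuda.current_stream().wait_stream(copy_stream)
    gpu_weight.record_stream(torch.cuda.current_stream())
    return gpu_weight.sum()  # stand-in for the real computation
```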
Closing this PR for now, since some tests encounter OOM issues. Another PR will be opened which will enable


Originally, fastvideo leaves `use_fsdp_inference` on by default. However, FSDP as a VRAM-saving approach can bring performance overhead: in the latest commit with 1 GPU, FSDP slows down DiT denoising by 30% due to the synchronous H2D weight transfer.

This PR 1) explicitly disables FSDP in the inference examples, with further instructions; 2) changes the default value of `use_fsdp_inference` from `True` to `False` as a BREAKING CHANGE; and 3) adds more documentation on this option.

This breaking change will only affect inference scripts, since in training mode this flag is ignored.