
Conversation

@XOR-op (Collaborator) commented Jan 7, 2026

Originally, fastvideo left use_fsdp_inference on by default. However, FSDP, as a VRAM-saving approach, can bring performance overhead. With the latest commit on a single GPU, FSDP slows DiT denoising down by 30% due to the synchronous H2D weight transfer.

This PR 1) disables FSDP explicitly in the inference examples, with further instructions; 2) changes the default value of use_fsdp_inference from True to False as a BREAKING CHANGE; 3) adds more documentation on this option.

This breaking change only affects inference scripts, since the flag is ignored in training mode.
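
For illustration, here is a minimal sketch of what an updated basic inference example might look like after this change. The model id, prompt, and the VideoGenerator/generate_video API surface are assumptions for this sketch; only the two keyword arguments shown in this PR's diff snippets are confirmed by the diff.

from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",  # placeholder model id
    num_gpus=1,
    # New default after this PR: FSDP inference is off for speed.
    # Set to True only if the model weights do not fit in VRAM.
    use_fsdp_inference=False,
)
generator.generate_video("A cat surfing a wave at sunset")  # placeholder prompt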

@XOR-op XOR-op requested a review from SolitaryThinker January 7, 2026 23:34
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @XOR-op, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a performance bottleneck in inference by changing the default behavior of FSDP (Fully Sharded Data Parallel) usage. Previously enabled by default, FSDP inference was found to introduce significant overhead in certain single-GPU scenarios. The core change disables FSDP inference by default to prioritize speed, while still providing guidance and options for users who need it for VRAM optimization. This ensures better out-of-the-box performance for most users and clearer instructions for memory-constrained environments.

Highlights

  • Default FSDP Inference Disabled: The use_fsdp_inference flag is now False by default, reversing its previous default True state. This is a breaking change affecting inference scripts.
  • Performance Improvement: This change aims to mitigate a 30% performance slowdown observed with FSDP on single-GPU DiT denoising due to synchronous H2D weight transfer.
  • Updated Inference Examples: All basic inference examples have been updated to explicitly set use_fsdp_inference to False, with a comment indicating when to enable it.
  • Enhanced Documentation: The documentation for use_fsdp_inference has been updated to clarify its purpose, potential performance overhead, and when it should be enabled (i.e., for VRAM saving).


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request disables FSDP inference by default, which is a sensible change to prioritize performance over memory savings out-of-the-box. The changes are consistently applied across all example scripts, and the documentation is updated accordingly. I've found a few minor issues: a typo in the documentation, a stylistic issue in one of the examples, and a bug in a help string that would lead to a malformed word. My review includes suggestions to fix these.

Comment on lines +434 to +436
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce"
+
" weight transfer overhead depending on the specific setup. Enable if run out of memory.",

Severity: high

The string concatenation here is missing a space between 'introduce' and 'weight', which will result in 'introduceweight' in the help message. This is a bug. I've added the space to fix it.

As a minor style point, using + for string literal concatenation is not necessary, as Python automatically concatenates adjacent string literals. You could remove the + and the line it's on for more idiomatic code.

Suggested change:

Before:
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce"
+
" weight transfer overhead depending on the specific setup. Enable if run out of memory.",

After:
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce "
+
"weight transfer overhead depending on the specific setup. Enable if run out of memory.",

  # FastVideo will automatically handle distributed setup
  num_gpus=1,
- use_fsdp_inference=True,
+ use_fsdp_inference=False, # set to True if GPU is out of memory

Severity: medium

There's a trailing whitespace at the end of this comment. It should be removed for code cleanliness.

Suggested change (the only difference is removal of the trailing space):

- use_fsdp_inference=False, # set to True if GPU is out of memory  <-- trailing space
+ use_fsdp_inference=False, # set to True if GPU is out of memory

@Edenzzzz (Collaborator) commented Jan 8, 2026

FSDP H2D is async, but you may write an explicit transfer controller to reduce the dispatch overhead as in sgl-project/sglang#15511.
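
For context, here is a minimal sketch of the side-stream prefetch idea behind such a transfer controller. This is generic PyTorch, not the sgl-project/sglang#15511 implementation nor fastvideo's actual code, and prefetch_layer/run_layer are invented names:

import torch

copy_stream = torch.cuda.Stream()

def prefetch_layer(cpu_params):
    # Issue H2D copies on a dedicated stream so they can overlap with
    # compute on the default stream. non_blocking only actually overlaps
    # if the host tensors live in pinned memory.
    gpu_params = []
    with torch.cuda.stream(copy_stream):
        for p in cpu_params:
            gpu_params.append(p.to("cuda", non_blocking=True))
    return gpu_params

def run_layer(layer_fn, gpu_params):
    # Make the compute stream wait on the copy stream instead of doing a
    # device-wide synchronization.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return layer_fn(gpu_params)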

@SolitaryThinker (Collaborator) commented:

@Ohm-Rishabh already ported that sglang PR to fastvideo here #966

This PR should just be cleaning up the default settings and our example scripts, since, similar to sglang, use_fsdp_inference and this layerwise offloading are mutually exclusive.
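
As a hypothetical illustration of that constraint (the argument names come from this thread; the validation helper itself is invented and not part of fastvideo):

def validate_offload_flags(use_fsdp_inference: bool,
                           dit_layerwise_offload: bool) -> None:
    # Both options trade VRAM for weight-transfer overhead, so enabling
    # them together is not meaningful.
    if use_fsdp_inference and dit_layerwise_offload:
        raise ValueError(
            "use_fsdp_inference and dit_layerwise_offload are mutually "
            "exclusive; choose one VRAM-saving strategy."
        )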

@XOR-op (Collaborator, Author) commented Jan 8, 2026

@Edenzzzz In the current implementation, the H2D transfer is on the same CUDA stream as the main one and is not prefetched, so it exhibits synchronous behavior even when invoked as async. That could explain why I hit the performance issue.

@Edenzzzz (Collaborator) commented Jan 8, 2026

> @Edenzzzz In the current implementation, the H2D transfer is on the same CUDA stream as the main one and is not prefetched, so it exhibits synchronous behavior even when invoked as async. That could explain why I hit the performance issue.

If so, that's a really poor implementation and worth raising an issue, if you have any profiling results showing they are on the same stream.

@XOR-op (Collaborator, Author) commented Jan 8, 2026

> If so, that's a really poor implementation and worth raising an issue, if you have any profiling results showing they are on the same stream.

This is when use_fsdp_inference is enabled:
[profiler trace screenshot]

This is when use_fsdp_inference is disabled and dit_layerwise_offload is enabled:
[profiler trace screenshot]

Note that dit_layerwise_offload is currently only added for Wan models. It's feasible to add support for other models, but that should be done in another PR. Also, that is only CPU offloading, not all-gather prefetch. I agree that we should raise an issue once we can build some minimal reproducible code.
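
For anyone wanting to capture a comparable trace, here is a minimal torch.profiler sketch; run_denoising_step is a placeholder standing in for whatever pipeline call is being measured:

import torch
from torch.profiler import profile, ProfilerActivity

def run_denoising_step():
    # Placeholder: substitute the real DiT denoising call here.
    x = torch.randn(1, 64, 64, device="cuda")
    w = torch.randn(64, 64, device="cuda")
    return x @ w

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_denoising_step()
    torch.cuda.synchronize()

# Open the trace in chrome://tracing or Perfetto; the stream lane of the
# "Memcpy HtoD" events shows whether weight transfers share the main
# compute stream.
prof.export_chrome_trace("fsdp_inference_trace.json")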

@Edenzzzz Edenzzzz added the go Trigger Buildkite CI label Jan 8, 2026
@XOR-op (Collaborator, Author) commented Jan 8, 2026

Closing this PR for now, since some tests encounter OOM issues. Another PR will be opened that enables --dit-layerwise-offload by default once it is supported for all models. Layerwise offloading has far less overhead than torch's FSDP CPU offloading.
