
Conversation

@XOR-op (Collaborator) commented Jan 7, 2026

Originally, fastvideo left use_fsdp_inference on by default. However, FSDP, as a VRAM-saving approach, can bring performance overhead. With the latest commit on a single GPU, FSDP slows DiT denoising down by 30% due to the synchronous H2D weight transfer.

This PR 1) disables FSDP explicitly in the inference examples, with further instructions; 2) changes the default value of use_fsdp_inference from True to False as a BREAKING CHANGE; 3) adds more documentation on this option.

This breaking change only affects inference scripts, since the flag is ignored in training mode.
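
For illustration, here is a minimal sketch of what an updated basic inference example might look like after this change. The model id, prompt, and the VideoGenerator/generate_video API surface are assumptions for this sketch; only the two keyword arguments shown in this PR's diff snippets are confirmed by the diff.

from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",  # placeholder model id
    num_gpus=1,
    # New default after this PR: FSDP inference is off for speed.
    # Set to True only if the model weights do not fit in VRAM.
    use_fsdp_inference=False,
)
generator.generate_video("A cat surfing a wave at sunset")  # placeholder prompt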

@XOR-op XOR-op requested a review from SolitaryThinker January 7, 2026 23:34
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @XOR-op, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a performance bottleneck in inference by changing the default behavior of FSDP (Fully Sharded Data Parallel) usage. Previously enabled by default, FSDP inference was found to introduce significant overhead in certain single-GPU scenarios. The core change disables FSDP inference by default to prioritize speed, while still providing guidance and options for users who need it for VRAM optimization. This ensures better out-of-the-box performance for most users and clearer instructions for memory-constrained environments.

Highlights

  • Default FSDP Inference Disabled: The use_fsdp_inference flag is now False by default, reversing its previous default True state. This is a breaking change affecting inference scripts.
  • Performance Improvement: This change aims to mitigate a 30% performance slowdown observed with FSDP on single-GPU DiT denoising due to synchronous H2D weight transfer.
  • Updated Inference Examples: All basic inference examples have been updated to explicitly set use_fsdp_inference to False, with a comment indicating when to enable it.
  • Enhanced Documentation: The documentation for use_fsdp_inference has been updated to clarify its purpose, potential performance overhead, and when it should be enabled (i.e., for VRAM saving).


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request disables FSDP inference by default, which is a sensible change to prioritize performance over memory savings out-of-the-box. The changes are consistently applied across all example scripts, and the documentation is updated accordingly. I've found a few minor issues: a typo in the documentation, a stylistic issue in one of the examples, and a bug in a help string that would lead to a malformed word. My review includes suggestions to fix these.

Comment on lines +434 to +436
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce"
+
" weight transfer overhead depending on the specific setup. Enable if run out of memory.",

Severity: high

The string concatenation here is missing a space between 'introduce' and 'weight', which will result in 'introduceweight' in the help message. This is a bug. I've added the space to fix it.

As a minor style point, using + for string literal concatenation is not necessary, as Python automatically concatenates adjacent string literals. You could remove the + and the line it's on for more idiomatic code.

Suggested change:

Before:
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce"
+
" weight transfer overhead depending on the specific setup. Enable if run out of memory.",

After:
"Use FSDP for inference by sharding the model weights. FSDP helps reduce GPU memory usage but may introduce "
+
"weight transfer overhead depending on the specific setup. Enable if run out of memory.",

  # FastVideo will automatically handle distributed setup
  num_gpus=1,
- use_fsdp_inference=True,
+ use_fsdp_inference=False, # set to True if GPU is out of memory

Severity: medium

There's a trailing whitespace at the end of this comment. It should be removed for code cleanliness.

Suggested change (the only difference is removal of the trailing space):

- use_fsdp_inference=False, # set to True if GPU is out of memory  <-- trailing space
+ use_fsdp_inference=False, # set to True if GPU is out of memory

@Edenzzzz (Collaborator) commented Jan 8, 2026

FSDP H2D is async, but you may write an explicit transfer controller to reduce the dispatch overhead as in sgl-project/sglang#15511.
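
For context, here is a minimal sketch of the side-stream prefetch idea behind such a transfer controller. This is generic PyTorch, not the sgl-project/sglang#15511 implementation nor fastvideo's actual code, and prefetch_layer/run_layer are invented names:

import torch

copy_stream = torch.cuda.Stream()

def prefetch_layer(cpu_params):
    # Issue H2D copies on a dedicated stream so they can overlap with
    # compute on the default stream. non_blocking only actually overlaps
    # if the host tensors live in pinned memory.
    gpu_params = []
    with torch.cuda.stream(copy_stream):
        for p in cpu_params:
            gpu_params.append(p.to("cuda", non_blocking=True))
    return gpu_params

def run_layer(layer_fn, gpu_params):
    # Make the compute stream wait on the copy stream instead of doing a
    # device-wide synchronization.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return layer_fn(gpu_params)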

@SolitaryThinker (Collaborator) commented:

@Ohm-Rishabh already ported that sglang PR to fastvideo here #966

This PR should just be cleaning up the default settings and our example scripts, since, similar to sglang, use_fsdp_inference and this layerwise offloading are mutually exclusive.
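
As a hypothetical illustration of that constraint (the argument names come from this thread; the validation helper itself is invented and not part of fastvideo):

def validate_offload_flags(use_fsdp_inference: bool,
                           dit_layerwise_offload: bool) -> None:
    # Both options trade VRAM for weight-transfer overhead, so enabling
    # them together is not meaningful.
    if use_fsdp_inference and dit_layerwise_offload:
        raise ValueError(
            "use_fsdp_inference and dit_layerwise_offload are mutually "
            "exclusive; choose one VRAM-saving strategy."
        )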

@XOR-op (Collaborator, Author) commented Jan 8, 2026

@Edenzzzz In the current implementation, the H2D transfer is on the same CUDA stream as the main one and is not prefetched, so it exhibits synchronous behavior even when invoked as async. That could explain why I hit the performance issue.

@Edenzzzz (Collaborator) commented Jan 8, 2026

> @Edenzzzz In the current implementation, the H2D transfer is on the same CUDA stream as the main one and is not prefetched, so it exhibits synchronous behavior even when invoked as async. That could explain why I hit the performance issue.

If so, that's a really poor implementation and worth raising an issue, if you have any profiling results showing they are on the same stream.

@XOR-op (Collaborator, Author) commented Jan 8, 2026

> If so, that's a really poor implementation and worth raising an issue, if you have any profiling results showing they are on the same stream.

This is when use_fsdp_inference is enabled:
[profiler trace screenshot]

This is when use_fsdp_inference is disabled and dit_layerwise_offload is enabled:
[profiler trace screenshot]

Note that dit_layerwise_offload is currently only added for Wan models. It's feasible to add support for other models, but that should be done in another PR. Also, that is only CPU offloading, not all-gather prefetch. I agree that we should raise an issue once we can build some minimal reproducible code.
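
For anyone wanting to capture a comparable trace, here is a minimal torch.profiler sketch; run_denoising_step is a placeholder standing in for whatever pipeline call is being measured:

import torch
from torch.profiler import profile, ProfilerActivity

def run_denoising_step():
    # Placeholder: substitute the real DiT denoising call here.
    x = torch.randn(1, 64, 64, device="cuda")
    w = torch.randn(64, 64, device="cuda")
    return x @ w

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_denoising_step()
    torch.cuda.synchronize()

# Open the trace in chrome://tracing or Perfetto; the stream lane of the
# "Memcpy HtoD" events shows whether weight transfers share the main
# compute stream.
prof.export_chrome_trace("fsdp_inference_trace.json")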

@Edenzzzz Edenzzzz added the go Trigger Buildkite CI label Jan 8, 2026
@XOR-op (Collaborator, Author) commented Jan 8, 2026

Closing this PR for now, since some tests encounter OOM issues. Another PR will be opened that enables --dit-layerwise-offload by default once it is supported for all models. Layerwise offloading has far less overhead than torch's FSDP CPU offloading.
