float8 rowwise training: add FSDP workaround #1629


Merged: 1 commit merged into main on Jan 31, 2025

Conversation

@vkuzo (Contributor) commented Jan 27, 2025

Summary:

Adds the workaround from pytorch/pytorch#141881 to the torchao float8 rowwise recipe, to reduce memory usage when FSDP is enabled.

Test Plan: tested in torchtitan on Llama 3 8B training with 8 H100s; rowwise peak memory decreased from 67 GiB to 59 GiB.
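For readers unfamiliar with the recipe, a minimal sketch of enabling torchao float8 training with rowwise scaling on a model might look like the following. This is illustrative only and is not the code changed by this PR; the "rowwise" recipe string and the exact config API are assumptions that may differ across torchao versions.

```python
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy model standing in for the transformer being trained.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

# Build the rowwise-scaling recipe config; the "rowwise" identifier is an
# assumption -- check Float8LinearConfig.from_recipe_name in the installed
# torchao version for the supported recipe names.
config = Float8LinearConfig.from_recipe_name("rowwise")

# Swap eligible nn.Linear modules for float8 linears in place.
convert_to_float8_training(model, config=config)
```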



pytorch-bot (bot) commented Jan 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1629

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 066f889 with merge base 47f96f1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jan 27, 2025
@vkuzo added the topic: performance label and removed the CLA Signed label on Jan 27, 2025
vkuzo added a commit to pytorch/torchtitan that referenced this pull request Jan 27, 2025
Summary:

This is an example of how to call float8 training with rowwise scaling
from torchao.

TODO: finalize the API in torchao, decide how to expose it in torchtitan, and optimize performance.

```
// baseline (bf16 + compile)
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.compile
...
step: 20  loss:  8.4931  memory: 47.65GiB(50.16%)  tps: 5,760  mfu: 33.73%

// experiment (rowwise float8 + compile)
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.compile
...
// torchao main branch
step: 40  loss:  7.3818  memory: 66.81GiB(70.33%)  tps: 6,412  mfu: 37.55%
// torchao with pytorch/ao#1629
step: 20  loss:  8.3823  memory: 58.55GiB(61.63%)  tps: 6,424  mfu: 37.62%

// for comparison, tensorwise float8 with float8 all-gather (on main branch)
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.compile --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp
...
step: 20  loss:  8.4258  memory: 47.32GiB(49.81%)  tps: 7,186  mfu: 42.08%

```

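As a rough sketch of the setup those benchmark flags drive, the rowwise-float8 conversion is typically combined with FSDP sharding and torch.compile roughly as below. The module paths, especially the FSDP2 fully_shard entry point, are assumptions and vary by torch version; the actual torchtitan wiring differs.

```python
import torch
# FSDP2 entry point; older releases expose it under torch.distributed._composable.fsdp.
from torch.distributed.fsdp import fully_shard
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

def prepare_model(model: torch.nn.Module) -> torch.nn.Module:
    # 1. Convert nn.Linear layers to float8 with the rowwise-scaling recipe
    #    ("rowwise" recipe name is an assumption; see your torchao version).
    convert_to_float8_training(model, config=Float8LinearConfig.from_recipe_name("rowwise"))
    # 2. Shard parameters and gradients across data-parallel ranks.
    fully_shard(model)
    # 3. Compile so the float8 casting/scaling ops fuse into surrounding kernels.
    return torch.compile(model)
```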
@facebook-github-bot added the CLA Signed label on Jan 27, 2025
@vkuzo merged commit 3eb18e7 into main on Jan 31, 2025
23 checks passed
vkuzo added commits to pytorch/torchtitan that referenced this pull request on Feb 7, Feb 16, Feb 20, Feb 26, and Feb 27, 2025, each with the same commit message as the Jan 27 commit above.
Labels: CLA Signed, topic: performance