
switch from just-in-time scaling to delayed scaling #18

Merged 1 commit into main on Aug 7, 2023
Conversation

vkuzo
Contributor

@vkuzo commented on Aug 7, 2023

Summary:

Before: all scaling was done just-in-time
After:

  1. scaling is done in a delayed fashion with an amax history of 1 (see the sketch below)
  2. there is special logic to populate the initial amaxes (TE doesn't have this)

A future PR will add a windowed amax calculation.
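
For context, here is a minimal sketch of the delayed-scaling flow described above, assuming a PyTorch build that exposes the float8 dtypes. The class and method names are illustrative, not this repo's actual API:

```python
import torch

# Sketch of delayed scaling with an amax history of 1: the scale for
# the current iteration is derived from the amax observed on the
# previous iteration, rather than computed just-in-time.
class DelayedScaler:
    def __init__(self, float8_dtype=torch.float8_e4m3fn):
        self.float8_dtype = float8_dtype
        self.amax = None  # history of 1: a single stored amax value

    def scale(self, x: torch.Tensor) -> torch.Tensor:
        if self.amax is None:
            # first iteration: there is no history yet, so populate
            # the initial amax just-in-time from the current tensor
            self.amax = x.abs().max()
        fp8_max = torch.finfo(self.float8_dtype).max
        # derive the scale from the previous iteration's amax
        s = fp8_max / torch.clamp(self.amax, min=1e-12)
        # record the current amax for use on the next iteration
        self.amax = x.abs().max()
        return s
```

With a history of 1 the stored amax is simply overwritten each step; the windowed calculation planned for a future PR would instead keep the last N amaxes and reduce over them.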

Test Plan:

```
with-proxy ./tests/test_everything.sh
```

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot added the CLA Signed label on Aug 7, 2023
@vkuzo merged commit 145f31a into main on Aug 7, 2023
vkuzo added a commit that referenced this pull request on Aug 14, 2023
Summary:

In #18, the
MNIST finetuning script broke because the casts were not saturated.

By default, casts to float8 are not saturated. With delayed scaling, the scale is derived from a stale amax, so the current tensor can exceed the representable range, and we need to saturate to avoid `NaN`s everywhere. For now, the saturation logic is written in eager mode (sketched below).

In the future we would ideally lower this to a hardware-accelerated saturated cast via PT2.0.
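
As an illustration, such a saturated cast can be sketched in eager mode as follows (hypothetical helper name; `float8_e4m3fn` is an assumed target dtype, not necessarily what this commit uses):

```python
import torch

# Saturated cast sketch: values outside the float8 representable range
# are clamped to the finite max/min before conversion, instead of
# overflowing to NaN on the unsaturated cast.
def saturated_cast_fp8(x: torch.Tensor, float8_dtype=torch.float8_e4m3fn):
    max_val = torch.finfo(float8_dtype).max  # 448.0 for e4m3fn
    return x.clamp(min=-max_val, max=max_val).to(float8_dtype)
```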

Test Plan:

```
# loss now converges again
with-proxy python finetune/mnist.py --batch-size 4096 --use-pt-fp8
```

Reviewers:

Subscribers:

Tasks:

Tags: