[float8nocompile] Simplified Float8Linear implementation which only supports dynamic tensorwise scaling #1429
Conversation
CI: ✅ No failures as of commit 450f140 with merge base a5a53a2. Test artifacts and rendered results: hud.pytorch.org/pr/pytorch/ao/1429
convert_to_float8_nocompile_training(m)
print("finished convert_to_float8_nocompile_training")

for i in range(10):
This makes sense; I'd recommend adding a stronger numerical equivalency check (can be in a future PR):
- create an example input
- create (a) a reference model and (b) your model
- feed the example input through both and run backwards
- compare that the model outputs, weight gradients, and input gradients are equivalent across (a) and (b) (see the sketch after this list)
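A minimal sketch of what such a check could look like, assuming the converter entry point shown in the example script; the import path and tolerances below are illustrative, not part of this PR:

```python
import copy

import torch
import torch.nn as nn

# Illustrative import path; use wherever convert_to_float8_nocompile_training
# is defined in this PR.
from torchao.prototype.float8nocompile.float8nocompile_linear_utils import (
    convert_to_float8_nocompile_training,
)


def check_equivalence(m_ref: nn.Module, example_input: torch.Tensor) -> None:
    # (a) reference model, (b) a converted deep copy of the same model.
    m_fp8 = copy.deepcopy(m_ref)
    convert_to_float8_nocompile_training(m_fp8)

    x_ref = example_input.detach().clone().requires_grad_(True)
    x_fp8 = example_input.detach().clone().requires_grad_(True)

    # Feed the same input through both models and run backwards.
    out_ref = m_ref(x_ref)
    out_fp8 = m_fp8(x_fp8)
    out_ref.sum().backward()
    out_fp8.sum().backward()

    # Tolerances must be loose because of float8 quantization error;
    # an SQNR-style comparison is another option.
    torch.testing.assert_close(out_fp8, out_ref, atol=1e-1, rtol=1e-1)
    torch.testing.assert_close(x_fp8.grad, x_ref.grad, atol=1e-1, rtol=1e-1)
    for p_ref, p_fp8 in zip(m_ref.parameters(), m_fp8.parameters()):
        torch.testing.assert_close(p_fp8.grad, p_ref.grad, atol=1e-1, rtol=1e-1)
```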
Makes sense, will do this in a follow-up PR.
""" | ||
|
||
# Amax scales should always be kept as float32. | ||
self.always_float32_buffers = set() |
can delete, buffers are only needed for delayed scaling
ah yes I noticed buffers weren't needed for dynamic scaling but missed this one somehow - thanks!
emulate = config.emulate
super().__init__(*args, **kwargs)

# Defines the scaling behavior of input, weight, grad_output
can delete, this is only needed for supporting multiple types of scaling
self.scaling_type_grad_output = config.cast_config_grad_output.scaling_type

self.config = config
self.is_amax_initialized = not self.config.enable_amax_init
can delete, this is only needed for delayed scaling
def forward(self, input: torch.Tensor) -> torch.Tensor:
    # TODO(danielvegamyhre): modify to support FSDP once dependencies are implemented
    output = self.forward_fp8_matmul(input)
    if self.bias is not None:
feel free to not support bias at all to simplify, as our target model uses linear without bias
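For illustration, a minimal sketch of how forward could look with bias support dropped entirely, reusing the forward_fp8_matmul helper from this diff (the assertion message is an assumption):

```python
def forward(self, input: torch.Tensor) -> torch.Tensor:
    # Bias is intentionally unsupported to keep the implementation minimal,
    # since the target model uses nn.Linear without bias.
    assert self.bias is None, "Float8LinearNoCompile does not support bias"
    return self.forward_fp8_matmul(input)
```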
def get_weight_scale(self, weight: torch.Tensor) -> Optional[torch.Tensor]:
    # TODO(danielvegamyhre): replace scale calculation with triton kernel
    if tensor_already_casted_to_fp8(weight):
feel free to remove this to simplify
def from_float(
    cls,
    mod,
    config: Optional[Float8LinearConfig] = None,
nit: feel free to only support default settings and not give an option to modify the config
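One way a defaults-only constructor could look, sketched here as an assumption rather than the PR's final API (no config parameter; dynamic tensorwise scaling is implied, and the class is assumed to subclass nn.Linear):

```python
@classmethod
def from_float(cls, mod: torch.nn.Linear):
    # Only the default recipe (dynamic tensorwise scaling) is supported,
    # so no config argument is exposed.
    assert mod.bias is None, "bias is not supported"
    with torch.device("meta"):
        new_mod = cls(mod.in_features, mod.out_features, bias=False)
    new_mod.weight = mod.weight
    return new_mod
```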
    axiswise_dim: if axiswise granularity is used, defines the dim to scale across
"""
# TODO(danielvegamyhre): replace this torch implementation with custom triton kernel
if tensor_already_casted_to_fp8(hp_tensor):
nit: can remove to simplify
# TODO(danielvegamyhre): replace this torch implementation with custom triton kernel
if tensor_already_casted_to_fp8(hp_tensor):
    return hp_tensor
scale = tensor_to_scale(
I'd recommend just inlining the right code here for just the default settings instead of calling into util functions, it will be easier to compare that to handwritten kernels
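As a rough illustration of that suggestion (not code from this PR), the inlined default path for dynamic tensorwise scaling to e4m3 could look roughly like this; the clamp epsilon and dtype choices are assumptions:

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
EPS = 1e-12  # guards against division by zero for all-zero tensors

def tensorwise_dynamic_scale(hp_tensor: torch.Tensor) -> torch.Tensor:
    # One scale for the whole tensor, derived from the current abs-max
    # and kept in float32.
    amax = hp_tensor.abs().max().to(torch.float64)
    scale = FP8_E4M3_MAX / torch.clamp(amax, min=EPS)
    return scale.to(torch.float32)
```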
    scaling_granularity,
    axiswise_dim,
)
return hp_tensor_and_scale_to_float8(
I'd recommend just inlining the right code here for just the default settings instead of calling into util functions, it will be easier to compare that to handwritten kernels
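Similarly, a sketch of the inlined cast given a precomputed scale; note that the util it would replace additionally wraps the raw float8 data together with its scale in a tensor subclass, which is omitted here for brevity:

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

def cast_with_scale(hp_tensor: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Scale into the representable e4m3 range, clamp, then cast to float8.
    scaled = hp_tensor.to(torch.float32) * scale
    clamped = torch.clamp(scaled, min=-FP8_E4M3_MAX, max=FP8_E4M3_MAX)
    return clamped.to(torch.float8_e4m3fn)
```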
lgtm, feel free to address changes in this PR or future PRs
Sounds good, I addressed all your comments except the extension to the test script, which I'll do in a follow-up PR - thanks!
Force-pushed from d1e79a1 to 450f140.
[float8nocompile] Simplified Float8Linear implementation which only supports dynamic tensorwise scaling (#1429)
* float8nocompile: add simplified implementation of float8linear which only supports dynamic tensorwise scaling
* address comments
Co-authored-by: Daniel Vega-Myhre <danvm@fb.com>
Summary:
This PR adds a simplified implementation of Float8Linear, dubbed Float8LinearNoCompile, which only supports dynamic tensorwise scaling. I've used TODOs to mark the places where the torch-based logic needs to be replaced with custom Triton kernels in order to improve eager-mode performance. Once those kernels have been implemented, I'll benchmark the performance and do some profiling to identify bottlenecks and additional optimization opportunities.

The purpose of starting with this is to have a simple implementation that works end to end for float8 training (as shown in the test plan section below), so that as I start replacing torch logic with new Triton kernels, I can use the end-to-end training example as a basic test to validate that it's still working.
Test plan:
Run examples/example.py