Initial prototype of differentiable _scaled_grouped_mm function #1969
Conversation
""" | ||
group_sizes = [] | ||
start_idx = 0 | ||
for end_idx in offs.tolist(): |
This is causing device-host synchronization, and we should avoid it, relying on upstream ops to create suitable inputs, or in the worst case, on the _scaled_grouped_mm implementation itself to throw an assert.
Good catch, removed this assertion for now and will rely on the kernel-side assert, to avoid the device-host sync.
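For context, here is a minimal sketch of the sync in question and one way to express the same computation on the device; the helper names and the use of torch.diff are illustrative, not taken from the PR:

```python
import torch

def group_sizes_with_sync(offs: torch.Tensor) -> list:
    # offs.tolist() copies the (possibly CUDA) tensor to the host, which
    # forces a device-host synchronization before the Python loop can run.
    sizes, start_idx = [], 0
    for end_idx in offs.tolist():
        sizes.append(end_idx - start_idx)
        start_idx = end_idx
    return sizes

def group_sizes_on_device(offs: torch.Tensor) -> torch.Tensor:
    # The same group sizes computed with tensor ops, staying on the device;
    # validating them is left to upstream ops or the kernel-side assert.
    zero = torch.zeros(1, dtype=offs.dtype, device=offs.device)
    return torch.diff(offs, prepend=zero)
```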
# Store what we need for backward.
ctx.save_for_backward(A, B)
ctx.float8_config = float8_config
ctx.offs = offs
offs is also a tensor, so it's better to use save_for_backward for it.
Done
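For illustration, a minimal autograd.Function sketch of the change; the class name and placeholder outputs are illustrative, not the PR's code:

```python
import torch

class _GroupedMMSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, B, offs):
        # Pass every tensor through save_for_backward (instead of stashing
        # offs as ctx.offs) so autograd tracks it like the other saved tensors.
        ctx.save_for_backward(A, B, offs)
        return A.new_empty(0)  # placeholder output for the sketch

    @staticmethod
    def backward(ctx, grad_output):
        A, B, offs = ctx.saved_tensors
        # Placeholder grads; the real backward uses offs for the grouped GEMMs.
        return torch.zeros_like(A), torch.zeros_like(B), None
```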
# Convert B to non-transposed, float8, column-major for right operand of grouped GEMM
# needed for grad_A: grad_output @ B.
# Since B was transposed before entry to forward, we need to transpose it back here for this.
B_non_transposed_col_major = B.contiguous().transpose(-2, -1)
It might be better to scale and transpose B in the forward, and store only the quantized version (to minimize memory).
Good idea, done. Hopefully torch.compile can do some fusion here and read B once while writing both outputs simultaneously (float transposed column-major, float8 non-transposed column-major).
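A rough sketch of the idea using plain torch ops for the rowwise float8 cast; the helper below and the choice of reduction dim are illustrative, not torchao's API:

```python
import torch

def to_fp8_rowwise(t: torch.Tensor, dim: int = -1):
    # Absmax scaling along `dim`, followed by a cast to float8_e4m3fn.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    amax = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-12).float()
    scale = fp8_max / amax
    t_fp8 = (t.float() * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return t_fp8, scale.reciprocal()  # fp8 data plus dequantization scales

# In forward, both fp8 views of B (the layout used by the forward GEMM and the
# one needed for grad_A in backward) could be produced here and saved in place
# of the high-precision tensor; under torch.compile the two casts can hopefully
# be fused so B is read only once.
```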
start_idx = 0
next_scale_idx = 0
for end_idx in offs.tolist():
Here it would also be better to have a Triton kernel computing scales that could read offs on the device, to avoid syncs.
Yeah, this implementation has more room for perf optimization; my first goal was to get accurate numerics. As a follow-up I can write a Triton kernel to avoid this device-host sync and for loop.
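Not the planned Triton kernel, but as a sketch of the general idea (reading offs on the device instead of iterating offs.tolist() on the host), a per-group reduction can also be expressed with pure torch ops; the helper name and the choice of an amax reduction here are assumptions for illustration:

```python
import torch

def per_group_rowwise_amax(x: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
    # x: (total_rows, K); offs: cumulative end indices of each group.
    # Map each row to its group id without leaving the device, then reduce.
    row_idx = torch.arange(x.shape[0], device=x.device, dtype=offs.dtype)
    group_ids = torch.searchsorted(offs, row_idx, right=True)  # (total_rows,)
    out = torch.zeros(offs.numel(), x.shape[1], device=x.device, dtype=x.dtype)
    out.scatter_reduce_(
        0,
        group_ids.unsqueeze(-1).expand_as(x),
        x.abs(),
        reduce="amax",
        include_self=False,
    )
    return out  # one reduced row per group, with no host-device sync
```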
B_col_major = B

# Fetch float8 config from specified recipe name.
float8_config = Float8LinearConfig.from_recipe_name(
It's surprising to see a config created here, why not just inline the logic you need without worrying about configs? IMO dealing with configs would be for when this API is about to be productionized.
I agree. I used it here for this prototype because in the test code I need to use matmul_with_hp_or_float8_args (which requires a Float8 config) to compute the reference forward/backward, and getting the outputs/grads to match was already pretty tricky, so to start I wanted to reference the same Float8LinearConfig.from_recipe_name(Float8LinearRecipeName.ROWWISE) everywhere, to minimize room for accidental differences in how a particular tensor is quantized in the test code vs the implementation.
I've now updated the implementation to inline everything, and in the test code I compare against the float8 rowwise recipe to verify correctness.
A (bf16/float32 torch.Tensor): The first high-precision input tensor, which must be a 2D tensor of shape (M * num_groups, K).
B (bf16/float32 torch.Tensor): The second high-precision input tensor which must be 3D, which must be shape (B, K, N).
offs (int32 torch.Tensor): The offsets to use to mark the starting index of each group in the input tensor of shape.
float8_recipe (Float8LinearRecipeName): The recipe to use for dynamic float8 quantization.
remove?
Args:
A (bf16/float32 torch.Tensor): The first high-precision input tensor, which must be a 2D tensor of shape (M * num_groups, K).
B (bf16/float32 torch.Tensor): The second high-precision input tensor which must be 3D, which must be shape (B, K, N).
offs (int32 torch.Tensor): The offsets to use to mark the starting index of each group in the input tensor of shape.
can we clarify if this is for A, B or both? A, right?
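Assuming the answer is A (the offsets index into A's first dimension, per the (M * num_groups, K) shape above), the docstring line could be tightened to something like the following suggestion:

```python
offs (int32 torch.Tensor): Group offsets into A's first dimension: entry i is
    the cumulative end row (exclusive) of group i, so group i spans rows
    [offs[i-1], offs[i]).
```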
offs (int32 torch.Tensor): The offsets to use to mark the starting index of each group in the input tensor of shape.
float8_recipe (Float8LinearRecipeName): The recipe to use for dynamic float8 quantization.
out_dtype (Optional[torch.dtype]): The dtype of the output tensor. Currently only torch.bfloat16 is supported.
use_fast_accum (bool): Whether to use fast accumulation or not. Default is False.
remove?
assert A.ndim == 2, "A must be 2D"
assert B.ndim == 3, "B must be 3D"

assert (
Are these assertions redundant with what is there in torch._scaled_grouped_mm? Ideally we only assert in one place for each condition.
Some are specific to this implementation (only supporting 2D A and 3D B for now, the primary use case), but some are redundant, yes.
I found the device-side assertions to be opaque sometimes; I often had to read the kernel code to see the exact condition that was failing when the error message was ambiguous. So my goal here was to make the requirements more transparent and easier to debug.
(I think since scaled grouped mm is a kernel executing on the GPU, if a check fails, the error message on the CPU side can't show the actual line of code with the condition that failed; it just points to the entrypoint torch._scaled_grouped_mm.)
The ndim assertions are not device-side, so they have a correct stack trace. Device-side assertions are put in when it's impossible to check the same thing on the host without introducing a host-device sync, so they happen for a reason and unfortunately can't be improved. To get a correct stack trace, run with CUDA_LAUNCH_BLOCKING=1.
), f"shape {A.shape} and {B.shape} are not compatible for _scaled_grouped_mm" | ||
|
||
# Due to hardware requirements, the right operand in a scaled grouped GEMM must be column-major. | ||
if not _is_column_major(B): |
nit: I'd prefer letting the caller do this instead, and this function can just assert that the layout is what is needed for the kernel
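A sketch of that split, with names assumed rather than taken from the PR:

```python
import torch

def to_column_major(x: torch.Tensor) -> torch.Tensor:
    # Caller-side conversion: same values, column-major strides on the last two dims.
    return x.transpose(-2, -1).contiguous().transpose(-2, -1)

def scaled_grouped_mm_entry(A: torch.Tensor, B: torch.Tensor, offs: torch.Tensor):
    # The wrapper only validates the layout instead of silently converting it.
    assert B.stride(-2) == 1 and B.stride(-1) > 1, (
        "B must be column-major; convert at the call site, e.g. with to_column_major"
    )
    ...
```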
# low precision B tensor instead of the high precision B tensor.
# In the backward this is needed for grad_A: grad_output @ B.
# Since B was transposed before entry to forward, we need to transpose it back here for this.
B_non_transposed_col_major = B.contiguous().transpose(-2, -1)
This naming is confusing: if the original variable is B, then this looks like B_transposed. IMO it would be cleanest to do something like:
- input is B
- B transposed is B_t

or

- input is B_t
- B_t transposed is B
Yeah, I agree; I considered this as well. The problem is I'm trying to make the naming consistent with torch._scaled_grouped_mm, which calls the tensors A and B, but it checks that B must be transposed - although what it's really enforcing is column-major format, so it's a bit confusing.
For now I changed the naming to option 2 above, which makes the Python code here clearer, with the trade-off that it will no longer be consistent with the kernel naming. I think that's fine though; I doubt many people will be diving into the kernel code.
Returns:
A boolean indicating whether the input tensor is column-major.
"""
return x.stride(-2) == 1 and x.stride(-1) > 1
Does this work for 4D/5D/etc tensors? If not, maybe assert that rank is 3?
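One possible tightening along the lines of the question above (a sketch, not the PR's code; the rank check here covers the 2D/3D cases this prototype handles):

```python
import torch

def _is_column_major(x: torch.Tensor) -> bool:
    # The stride check only inspects the last two dims, so restrict it to the
    # ranks this prototype actually supports rather than silently accepting 4D+.
    assert x.ndim in (2, 3), f"expected 2D or 3D tensor, got {x.ndim}D"
    return x.stride(-2) == 1 and x.stride(-1) > 1
```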
Summary
The _grouped_scaled_mm function in torchao is an initial prototype of a differentiable _scaled_grouped_mm with dynamic float8 rowwise quantization of the inputs. Note this prototype only handles A=2D, B=3D.
Test plan
Example usage
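A rough usage sketch; the import path, exact signature, and shapes below are assumptions based on the docstring quoted above, not verified against the final code:

```python
import torch

# Hypothetical import path for the prototype; adjust to wherever the function
# actually lands in torchao.
from torchao.prototype.scaled_grouped_mm import _scaled_grouped_mm

num_groups, M, K, N = 4, 64, 256, 128

# A is 2D of shape (M * num_groups, K); B is 3D of shape (num_groups, K, N).
A = torch.randn(M * num_groups, K, dtype=torch.bfloat16, device="cuda", requires_grad=True)
B = torch.randn(num_groups, K, N, dtype=torch.bfloat16, device="cuda", requires_grad=True)

# offs holds the cumulative end row of each group in A (int32, on device).
offs = torch.arange(M, M * num_groups + 1, M, dtype=torch.int32, device="cuda")

out = _scaled_grouped_mm(A, B, offs, out_dtype=torch.bfloat16)
out.sum().backward()
```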