
Conversation

jerryzh168 (Contributor) commented Aug 26, 2025

Stacked PRs:


Float8Tensor per row quantization pass bias to fbgemm kernel

Summary:
Previously, bias was not passed to the fbgemm kernel for float8 per-row quantization; this PR adds it.

The difference is a faster float8 per-row quantized kernel path, without changing numerics or other behavior.
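As a rough illustration (not the PR's exact code), the change amounts to something like the sketch below, assuming fbgemm_gpu's rowwise float8 GEMM op accepts an optional bias argument; the tensor names are illustrative:

```python
import torch

# Illustrative tensors: xq/wq are the float8 row-quantized activation/weight,
# x_scale/w_scale are their per-row scales, bias is the linear bias.

# Before: bias was applied as a separate elementwise add after the quantized
# GEMM, which shows up as an extra pointwise kernel in compiled code.
out = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
out = out + bias

# After: bias is passed directly to the kernel, so the add is handled inside
# the GEMM with no extra kernel launch and no change in numerics.
out = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale, bias=bias)
```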

Test Plan:

python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_kernel_preference_numerical_equivalence
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_expected_gpu_kernel_fbgemm

Reviewers:

Subscribers:

Tasks:

Tags:


pytorch-bot bot commented Aug 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2884

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV; if your PR is affected, please review it.

✅ No Failures

As of commit 4d70152 with merge base fbe3df9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jerryzh168 added a commit that referenced this pull request on Aug 26, 2025

Summary:
Previously, bias was not passed to the fbgemm kernel for float8 per-row quantization; this PR adds it.

The difference is a faster float8 per-row quantized kernel path, without changing numerics or other behavior.

Test Plan:
```
python test/dtypes/test_affine_quantized_float.py -k test_expected_kernels_on_gpu
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py
```

Reviewers:

Subscribers:

Tasks:

Tags:

stack-info: PR: #2884, branch: jerryzh168/stack/60
jerryzh168 force-pushed the jerryzh168/stack/59 branch from 6935cc8 to bacbe8c on August 26, 2025 22:22
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 2c1e5da to 613585d on August 26, 2025 22:22
meta-cla bot added the CLA Signed label on Aug 26, 2025
jerryzh168 added the topic: improvement label on Aug 26, 2025
jerryzh168 requested a review from vkuzo on August 26, 2025 22:44
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 26, 2025 23:19
jerryzh168 added a commit that referenced this pull request on Aug 26, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 613585d to b98440e on August 26, 2025 23:19
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 26, 2025 23:19
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 27, 2025 23:47
jerryzh168 force-pushed the jerryzh168/stack/60 branch from b98440e to 6acc102 on August 27, 2025 23:47
jerryzh168 added a commit that referenced this pull request on Aug 27, 2025
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 27, 2025 23:50
vkuzo (Contributor) commented Aug 28, 2025

Looks reasonable. Can we add a test that covers the new path?

jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 00:19
jerryzh168 added a commit that referenced this pull request on Aug 28, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 6acc102 to d5465ea on August 28, 2025 00:19
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 00:19
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 00:27
jerryzh168 added a commit that referenced this pull request on Aug 28, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from d5465ea to 951a49c on August 28, 2025 00:27
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 00:27
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 00:35
jerryzh168 added a commit that referenced this pull request on Aug 28, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 951a49c to 5860000 on August 28, 2025 00:35
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 00:35
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 00:37
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 20:24
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 20:28
jerryzh168 added a commit that referenced this pull request on Aug 28, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from d94d663 to 0c70dee on August 28, 2025 20:28
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 20:28
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 22:17
jerryzh168 added a commit that referenced this pull request on Aug 28, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 0c70dee to 2b5046b on August 28, 2025 22:17
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 22:17
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 22:35
jerryzh168 added a commit that referenced this pull request on Aug 28, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 2b5046b to 091ad9e on August 28, 2025 22:35
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 22:35
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 28, 2025 22:43
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 091ad9e to 94aec1a on August 28, 2025 22:43
jerryzh168 added a commit that referenced this pull request on Aug 28, 2025
jerryzh168 changed the base branch from main to jerryzh168/stack/59 on August 28, 2025 22:43
jerryzh168 requested a review from drisspg on August 29, 2025 17:07
jerryzh168 added a commit that referenced this pull request on Aug 29, 2025
jerryzh168 force-pushed the jerryzh168/stack/60 branch from 94aec1a to 5f8d5e2 on August 29, 2025 19:03
jerryzh168 changed the base branch from jerryzh168/stack/59 to main on August 29, 2025 19:03
drisspg (Contributor) left a comment

Should we have a test? This seems like it was a bug that wasn't caught.

jerryzh168 (Contributor, Author) commented Sep 2, 2025

@drisspg yeah, the test is this one:

python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_expected_gpu_kernel_fbgemm

but it's probably better to have a more exhaustive one that covers all kernel preference options (auto and torch) as well.

I think it is a performance "bug": the effect is that we were not using the most efficient path when there is a bias.
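For context, a hedged sketch of how that effect can be observed, assuming a float8-quantized linear module m and input x (names illustrative); run_and_get_code returns the inductor-generated source alongside the result:

```python
import torch
from torch._inductor.utils import run_and_get_code

compiled = torch.compile(m)
out, code = run_and_get_code(compiled, x)
# Without bias passed into the kernel, the generated source contains a
# separate pointwise add kernel (e.g. triton_poi_fused_add_0) after the
# fbgemm GEMM; with this PR, that extra launch should be gone.
print("triton_poi_fused_add_0" in code[0])
```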

"torch.ops.triton.quantize_fp8_row.default", 1
).check_count("torch.ops.fbgemm.f8f8bf16_rowwise.default", 1).run(code[0])
).check_count("torch.ops.fbgemm.f8f8bf16_rowwise.default", 1).check_not(
"triton_poi_fused_add_0"
jerryzh168 (Contributor, Author) commented:

cc @drisspg, we explicitly test that triton_poi_fused_add_0 is not generated, to make sure there is no additional res + bias add in this code path.

vkuzo (Contributor) commented:

It's not safe to depend on inductor-generated kernel names such as triton_poi_fused_add_0. I think it would be better to ensure that no additional kernels are called after torch.ops.fbgemm.f8f8bf16_rowwise.default is called.
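A minimal sketch of that kind of assertion, assuming the generated source is captured in code[0] (e.g. via torch._inductor.utils.run_and_get_code); the trailing check_not pattern is an assumption, matching inductor's pointwise-kernel name prefix rather than one specific generated name:

```python
from torch.testing import FileCheck

# Exactly one row-quantize kernel and one fbgemm GEMM, and no inductor
# pointwise kernel after the GEMM call (a trailing check_not scans from the
# previous match to the end of the input).
FileCheck().check_count(
    "torch.ops.triton.quantize_fp8_row.default", 1, exactly=True
).check_count(
    "torch.ops.fbgemm.f8f8bf16_rowwise.default", 1, exactly=True
).check_not(
    "triton_poi_fused"  # assumed prefix for inductor pointwise kernels
).run(code[0])
```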

jerryzh168 (Contributor, Author) commented:

OK, updated.

jerryzh168 merged commit e7b310b into main on Sep 5, 2025. 18 checks passed.
andrewor14 pushed a commit that referenced this pull request on Sep 5, 2025

Labels

CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
topic: improvement: Use this tag if this PR is an improvement (doesn't fit into any of the other categories).



3 participants