fp8 bwd #108

Draft · wants to merge 29 commits into main_perf

Conversation

micmelesse (Collaborator)

No description provided.

alexkranias-amd and others added 29 commits December 9, 2024 10:09
feat: added fp32 output to input_helper
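As a rough illustration of what that change might look like, here is a minimal sketch of an input helper that returns fp32 copies alongside the low-precision tensors so the reference path can run in full precision. The name `input_helper`, its signature, and the shapes are assumptions for illustration, not the PR's actual code.

```python
import torch

def input_helper(Z, H, N_CTX, D_HEAD, dtype, device="cuda"):
    """Hypothetical sketch: build q/k/v in the requested (possibly fp8) dtype and
    also return fp32 originals for the full-precision reference computation."""
    q = torch.randn((Z, N_CTX, H, D_HEAD), device=device, dtype=torch.float32)
    k = torch.randn((Z, N_CTX, H, D_HEAD), device=device, dtype=torch.float32)
    v = torch.randn((Z, N_CTX, H, D_HEAD), device=device, dtype=torch.float32)
    # Low-precision copies feed the Triton kernel; fp32 originals feed the torch reference.
    return q.to(dtype), k.to(dtype), v.to(dtype), q, k, v
```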

passing

feat: fp8 tests. small amount of error

added fp8e5m2 type

note: RuntimeError: "abs_cuda" not implemented for 'Float8_e4m3fnuz'
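The error above comes from calling an elementwise op such as `abs` directly on a `Float8_e4m3fnuz` tensor, which the PyTorch build in use does not implement. A common workaround (an assumption here, not necessarily what the commit does) is to upcast before the op:

```python
import torch

x = torch.randn(16, device="cuda").to(torch.float8_e4m3fnuz)

# x.abs() raises: RuntimeError: "abs_cuda" not implemented for 'Float8_e4m3fnuz'
# Workaround: upcast to fp32 for the elementwise op, cast back only if needed.
x_abs = x.to(torch.float32).abs()
```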

enabled fp8 GEMMs
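For context, an fp8 GEMM inside a Triton kernel typically loads fp8 tiles and lets `tl.dot` accumulate in fp32, casting one operand to the other's element type when it was produced in higher precision (the same pattern as the `acc += tl.dot(p.to(v.type.element_ty), v)` line quoted in a later commit). The block shapes, layout, and fp8 `tl.dot` support on the target hardware below are assumptions, not code from this PR.

```python
import triton
import triton.language as tl

@triton.jit
def fp8_gemm_tile(a_ptr, b_ptr, c_ptr,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Load one contiguous fp8 tile of A (M x K) and B (K x N); the pointers are
    # assumed to come from torch float8 tensors.
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])
    # fp8 x fp8 dot with an fp32 accumulator (tl.dot's default); the cast is a
    # no-op here but matters when 'a' is computed in fp32 inside the kernel.
    acc = tl.dot(a.to(b.type.element_ty), b)
    tl.store(c_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :], acc)
```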

fix: error down to < 0.1

added another fp8 dtype

best accuracy is with no scaling

improved accuracy to < 0.02; issue related to torch-side casting
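One way to read "torch-side casting": the fp32 reference should see the same quantized values the fp8 kernel receives, otherwise quantization error gets counted against the kernel. A quantize/dequantize round trip on the torch side achieves that; this is an interpretation of the commit message, not the exact change.

```python
import torch

fp8 = torch.float8_e4m3fnuz
q = torch.randn(4, 1024, 6, 32, device="cuda")

# Round-trip through fp8 so the fp32 reference operates on the values the kernel
# actually sees; the comparison then measures kernel error, not casting error.
q_ref = q.to(fp8).to(torch.float32)
```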

fix: passes if we allow v to be fp16 instead of fp8. otherwise we have error < 0.1

all error is < 0.07

feat: added per head scaling tensors

progress towards implementing scaling tensors in kernel
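The two commits above add per-head scaling tensors on the host and start threading them into the kernel. Below is a minimal host-side sketch of per-(batch, head) scaling, assuming the usual amax-based recipe and an e4m3fnuz target; the tensor layout and names are illustrative, not taken from the PR.

```python
import torch

def per_head_scale(x, fp8_dtype=torch.float8_e4m3fnuz):
    """Sketch: one scale per (batch, head) so each head uses the full fp8 range.
    x is fp32 with shape (Z, H, N, D); returns (x_fp8, scale) with scale (Z, H)."""
    fp8_max = torch.finfo(fp8_dtype).max
    amax = x.abs().amax(dim=(-2, -1))            # per-(batch, head) max magnitude
    scale = fp8_max / amax                       # multiply before casting to fp8
    x_fp8 = (x * scale[..., None, None]).to(fp8_dtype)
    return x_fp8, scale                          # kernel divides by 'scale' to descale
```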

save

issue: error caused by acc += tl.dot(p.to(v.type.element_ty), v)
Error:
UnboundLocalError: local variable 'q_scale_stride_z' referenced before assignment.

Fix:
Initialize 'q_scale_stride_z' and 'kv_scale_stride_z' before the conditional assignment (sketched after the test output below).
Warning: I don't know if this is the correct thing to do.
Warning: 2 test cases are failing due to this change:
AssertionError: Tensor-likes are not close!

FAILED test.py::test_op_prefill_fwd_impl[False-dtype1-True-bshd-0.0-False-4-6-6-1024-1023-32]
Mismatched elements: 1 / 786432 (0.0%)
Greatest absolute difference: 0.14855387806892395 at index (0, 309, 2, 18) (up to 0.1009 allowed)
Greatest relative difference: 0.28865116834640503 at index (0, 309, 2, 18) (up to 0.09128 allowed)

FAILED test.py::test_op_prefill_fwd_impl[False-dtype1-False-bshd-0.0-False-4-6-6-1024-1023-32]
Mismatched elements: 1 / 786432 (0.0%)
Greatest absolute difference: 0.14855387806892395 at index (0, 309, 2, 18) (up to 0.1009 allowed)
Greatest relative difference: 0.28865116834640503 at index (0, 309, 2, 18) (up to 0.09128 allowed)
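A minimal sketch of the shape of the fix described above: give the scale-stride locals a default before the branch that conditionally assigns them, so the non-fp8 path never reads an unassigned variable. The helper name, the `is_fp8` flag, and the surrounding call structure are assumptions for illustration.

```python
def scale_strides(q_scale, k_scale, is_fp8):
    """Hypothetical helper mirroring the fix: initialize the scale strides so
    every code path has a defined value, avoiding the UnboundLocalError."""
    q_scale_stride_z = 0
    kv_scale_stride_z = 0
    if is_fp8:
        q_scale_stride_z = q_scale.stride(0)    # per-batch stride of q's scale tensor
        kv_scale_stride_z = k_scale.stride(0)   # per-batch stride of k/v's scale tensor
    return q_scale_stride_z, kv_scale_stride_z
```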
Two tests are still failing.
* Do not track gradients for scale factors.
* Handle a maximum absolute value equal to zero in the per batch / head scaling method (both fixes are sketched below).
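Both bullets can be folded into the earlier per-head scaling sketch: compute the scales without tracking gradients, and fall back to a scale of 1 when a head is all zeros so the scale never becomes inf. This is a sketch under those assumptions, not the PR's implementation.

```python
import torch

def safe_per_head_scale(x, fp8_dtype=torch.float8_e4m3fnuz, eps=1e-12):
    """Sketch: per-(batch, head) fp8 scaling with the two fixes above applied."""
    fp8_max = torch.finfo(fp8_dtype).max
    with torch.no_grad():                        # scale factors carry no gradient
        amax = x.abs().amax(dim=(-2, -1))
        # An all-zero head gives amax == 0 and an inf scale; fall back to 1 instead.
        scale = torch.where(amax > eps, fp8_max / amax.clamp_min(eps),
                            torch.ones_like(amax))
    return (x * scale[..., None, None]).to(fp8_dtype), scale
```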