Add fp8-fused gemm kernel #5764
Conversation
One thing that needs to be resolved before merging this: this kernel requires triton 2.3.0.
Hi @jeffra. To clarify, does this kernel require exactly triton 2.3.0?
@sfc-gh-reyazda would know better; I am not sure we've tested with a triton newer than 2.3.0. I have not personally tested that, at least.
It needs that specific version. Unfortunately, triton keeps changing/improving and its APIs change too, so it is hard to track properly. That's another motivation to move to cutlass soon and have a more solid implementation that works independently of other libraries. On the other hand, Triton gives us the flexibility to run on various hardware, so it is always a tradeoff. I think we need to have some more discussion on such dependencies later, in a separate discussion.
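One lightweight way to make such a pin explicit is an import-time version guard before registering the kernel. This is a minimal illustrative sketch, not part of the PR; only the 2.3.0 pin comes from the thread above:

```python
# Hypothetical import-time guard for a triton-version-pinned kernel.
# The 2.3.0 pin comes from the discussion above; the guard itself is
# illustrative, not part of the PR.
from packaging import version

import triton

REQUIRED_TRITON = "2.3.0"


def check_triton_version():
    # Triton's kernel-facing APIs change between releases, so refuse to
    # run against anything other than the validated version.
    if version.parse(triton.__version__) != version.parse(REQUIRED_TRITON):
        raise RuntimeError(
            f"The fp8-fused GEMM kernel is only validated against "
            f"triton=={REQUIRED_TRITON}; found {triton.__version__}."
        )
```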
This is a refresh of `OptimizedLinear` with the following features to improve performance and usability:

* More efficient sharing of base weights using `all_gather_into_tensor`
* Flattened sharded weights
* Selective offloading of frozen weights to CPU
* `deepspeed.linear.Init`, which allows injecting OptimizedLinear during model construction (similar to zero.Init); see the usage sketch after this message
* Support for loading a state dict directly into OptimizedLinear, which allows loading HF model weights correctly into sharded params
* Various bug fixes for the LoRA implementation introduced previously
* Several new unit tests

Builds on top of @RezaYazdaniAminabadi's previous FP8 updates (#5764) to support dense-model fp8 quantization.

Example usage of this to fine-tune llama-3.1-405B on a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main/training/llama3.1

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Reza Yazdani <152926435+sfc-gh-reyazda@users.noreply.github.com>
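A rough usage sketch of the `deepspeed.linear.Init` flow described above. `Init`, `LoRAConfig`, and `QuantizationConfig` are the names used by `deepspeed.linear`, but the specific fields and keyword arguments shown here are assumptions for illustration, not the verified API:

```python
# Hedged sketch of construction-time OptimizedLinear injection, analogous
# to zero.Init. Config field names and Init's keyword arguments are
# assumptions based on the PR description, not the verified API.
import deepspeed.linear
from transformers import AutoModelForCausalLM

lora_cfg = deepspeed.linear.LoRAConfig(lora_r=64, lora_alpha=16)
quant_cfg = deepspeed.linear.QuantizationConfig(q_bits=8)

# Linear layers created inside this context are replaced with
# OptimizedLinear at construction time, so weights are sharded (and
# optionally quantized) as the model is built rather than patched later.
with deepspeed.linear.Init(lora_config=lora_cfg, quant_config=quant_cfg):
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-405B")

# Per the description above, OptimizedLinear's load-state-dict support
# lets HF checkpoint weights load correctly into the sharded params.
```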
This PR adds a new fused kernel for dense GEMM using fp8-quantized weights.
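For intuition, here is a minimal PyTorch emulation of a GEMM with an fp8-quantized weight, not the PR's triton kernel: the weight is stored in fp8 (e4m3) with a per-tensor scale and dequantized to the activation dtype before the matmul. The real kernel fuses the dequantization into the GEMM itself. This assumes PyTorch >= 2.1 for the `torch.float8_e4m3fn` dtype:

```python
import torch


def quantize_fp8(w: torch.Tensor):
    """Per-tensor fp8 (e4m3) quantization: scale so |w|max maps to fp8 max."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = w.abs().max().clamp(min=1e-12) / fp8_max
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale


def fp8_weight_gemm(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor):
    # The fused kernel would perform this dequantize-and-multiply inside
    # a single triton GEMM; here it is emulated in two steps for clarity.
    w = w_fp8.to(x.dtype) * scale
    return x @ w.t()


x = torch.randn(4, 512, dtype=torch.bfloat16)
w = torch.randn(1024, 512, dtype=torch.bfloat16)
w_fp8, scale = quantize_fp8(w.float())
y = fp8_weight_gemm(x, w_fp8, scale.to(torch.bfloat16))  # shape (4, 1024)
```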