Fp8 e4m3_fnuz support for rocm #2588
Conversation
self.input_scale,
self.activation_scale_ub,
bias,
self.dtype,
)


class Fp8Linear(torch.nn.Module):
Would it be cleaner to have a separate Fp8LinearRocm?
Maybe, it depends a bit on how much conditional code we end up with. We did separate FP8 Marlin for this reason.
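For context, a minimal sketch of the two layouts under discussion (class names from the thread, constructor arguments purely illustrative):

```python
import torch


class Fp8Linear(torch.nn.Module):
    # Option kept in this PR: a single class, with platform-specific code
    # behind conditionals. Constructor arguments here are illustrative only.
    def __init__(self, qweight, scale, bias=None):
        super().__init__()
        self.qweight = qweight
        self.scale = scale
        self.bias = bias


class Fp8LinearRocm(Fp8Linear):
    # Alternative raised above: a dedicated ROCm layer, analogous to how FP8
    # Marlin was split out, so the e4m3fn -> e4m3fnuz handling would live
    # here instead of behind `if SYSTEM == "rocm"` branches.
    def __init__(self, qweight, scale, bias=None):
        # Hypothetical: weights would be normalized to e4m3fnuz here.
        super().__init__(qweight, scale, bias=bias)
```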
@@ -92,9 +123,17 @@ def get_weights(self, weights: "Weights", prefix: str):
    .reshape(-1)
    .expand(w.shape[0])
)
try:
    input_scale = weights.get_tensor(
Weights also has _has_tensor, maybe we should make it public and use it here?
Same for the try: [...] get_tensor below.
Updated to use has_tensor.
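A minimal sketch of the resulting pattern, assuming `weights` is the loader's Weights instance and `prefix` the current weight prefix (both taken from the surrounding diff):

```python
# Probe for the optional per-tensor input scale with the now-public
# has_tensor helper instead of catching the failure from get_tensor.
input_scale = None
if weights.has_tensor(f"{prefix}.input_scale"):
    input_scale = weights.get_tensor(f"{prefix}.input_scale")
```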
@@ -72,6 +99,10 @@ def fp8_quantize(
    # as both required as inputs to torch._scaled_mm
    qweight = qweight.to(qdtype)
    scale = scale.float().reciprocal()

    if SYSTEM == "rocm":
        qweight, scale, _ = normalize_e4m3fn_to_e4m3fnuz(qweight, scale)
We should wire up scale at some point for CUDA as well.
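For reference, a sketch of what such a conversion helper typically does, modeled on the common ROCm FP8 pattern (e.g. as used in vLLM); the exact implementation in this PR may differ:

```python
from typing import Optional, Tuple

import torch


def normalize_e4m3fn_to_e4m3fnuz(
    weight: torch.Tensor,
    weight_scale: torch.Tensor,
    input_scale: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
    assert weight.dtype == torch.float8_e4m3fn
    # The bit pattern 0b10000000 is -0 in e4m3fn but NaN in e4m3fnuz,
    # so map it to zero before reinterpreting the storage.
    as_int8 = weight.view(torch.int8)
    as_int8[as_int8 == -128] = 0
    weight = as_int8.view(torch.float8_e4m3fnuz)
    # For identical bits, an e4m3fnuz value is half the e4m3fn value, so the
    # dequantization scales are doubled to keep dequantized values unchanged.
    weight_scale = weight_scale * 2.0
    if input_scale is not None:
        input_scale = input_scale * 2.0
    return weight, weight_scale, input_scale
```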
    bias=self.bias,
)

if type(output) is tuple and len(output) == 2:
Did this change between torch versions or is output for AMD different?
Suggested change:
- if type(output) is tuple and len(output) == 2:
+ if isinstance(output, tuple) and len(output) == 2:
Done
This is a common change for torch 2.5. https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/Blas.cpp#L1175
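A small compatibility shim along those lines; the helper name is hypothetical:

```python
def unwrap_scaled_mm_output(output):
    # torch._scaled_mm returned an (output, amax) tuple before torch 2.5 and
    # returns the output tensor directly from 2.5 onwards; accept both forms.
    if isinstance(output, tuple) and len(output) == 2:
        return output[0]
    return output
```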
@@ -62,7 +62,7 @@ def from_unquant(cls, weight, bias, dtype):
        return cls(qweight=qweight, scales=scales.to(dtype), bias=bias)

    @classmethod
-   def from_fp8(cls, weight, scale, _input_scale, bias, dtype):
+   def from_fp8(cls, weight, scale, _input_scale, _scale_upper_bound, bias, dtype):
Type.
These arguments get a bit messy. It's easy to mix up a tensor or a float (which was already happening here?). Maybe we should switch these to kwargs-only so that the call sites need to be explicit (+ type annotations).
Converted them to kwargs and added type hints.
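A sketch of the keyword-only signature with type hints; only the argument names come from the diff, the minimal constructor is hypothetical:

```python
from typing import Optional

import torch


class Fp8Linear(torch.nn.Module):
    def __init__(
        self,
        qweight: torch.Tensor,
        scale: torch.Tensor,
        input_scale: Optional[torch.Tensor] = None,
        scale_upper_bound: Optional[torch.Tensor] = None,
        bias: Optional[torch.Tensor] = None,
        dtype: torch.dtype = torch.float16,
    ) -> None:
        super().__init__()
        self.qweight = qweight
        self.scale = scale
        self.input_scale = input_scale
        self.scale_upper_bound = scale_upper_bound
        self.bias = bias
        self.dtype = dtype

    @classmethod
    def from_fp8(
        cls,
        *,
        weight: torch.Tensor,
        scale: torch.Tensor,
        input_scale: Optional[torch.Tensor] = None,
        scale_upper_bound: Optional[torch.Tensor] = None,
        bias: Optional[torch.Tensor] = None,
        dtype: torch.dtype,
    ) -> "Fp8Linear":
        # Keyword-only arguments force call sites to be explicit, so a tensor
        # scale cannot silently end up where a float upper bound is expected.
        return cls(
            qweight=weight,
            scale=scale,
            input_scale=input_scale,
            scale_upper_bound=scale_upper_bound,
            bias=bias,
            dtype=dtype,
        )
```

Call sites then read, for example, `Fp8Linear.from_fp8(weight=w, scale=s, dtype=torch.bfloat16)`.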
@@ -342,22 +342,19 @@ def get_model(
    model_type = config_dict.get("model_type", None)

    quantization_config = config_dict.get("quantization_config", None)
    compression_config = config_dict.get("compression_config", None)
@danieldk the config key was renamed to quantization_config.
I think we should check for both keys, at least for the time being. Some customers/users may have checkpoints that still have compression_config. Maybe with a comment that compression_config is for backwards compatibility?
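A sketch of the backwards-compatible lookup being suggested; the helper name is hypothetical:

```python
def get_quantization_config(config_dict: dict):
    # Prefer the new key name...
    quantization_config = config_dict.get("quantization_config", None)
    if quantization_config is None:
        # ...but keep reading `compression_config` for backwards
        # compatibility with checkpoints written before the rename.
        quantization_config = config_dict.get("compression_config", None)
    return quantization_config
```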
@@ -125,9 +164,24 @@ def get_weights_col_packed(
    )
    scale = scale.reshape(-1).expand(w.shape[0])

    input_scale = None
    if weights.get_tensor(f"{prefix}.input_scale"):
Suggested change:
- if weights.get_tensor(f"{prefix}.input_scale"):
+ if weights.has_tensor(f"{prefix}.input_scale"):
?
input_scale = [
    _load_scalar_or_matrix_scale(weights, f"{p}.input_scale", shape)
    for p, shape in zip(prefixes, shapes)
    if weights.has_tensor(f"{p}.input_scale")
Given this conditional, we probably need an assertion like:
assert len(input_scale) == 0 or len(input_scale) == len(prefixes)
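A sketch with that assertion in place; `weights`, `prefixes`, `shapes` and `_load_scalar_or_matrix_scale` are the names from the diff, while the wrapper name and the final concatenation are illustrative:

```python
from typing import Optional

import torch


def _load_packed_input_scales(weights, prefixes, shapes) -> Optional[torch.Tensor]:
    input_scale = [
        _load_scalar_or_matrix_scale(weights, f"{p}.input_scale", shape)
        for p, shape in zip(prefixes, shapes)
        if weights.has_tensor(f"{p}.input_scale")
    ]
    # Either every packed prefix carries an input scale or none of them do;
    # a partial set would silently misalign scales and columns.
    assert len(input_scale) == 0 or len(input_scale) == len(prefixes)
    if not input_scale:
        return None
    return torch.cat(input_scale, dim=0)
```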
What does this PR do?
Fixes # (issue)

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.