Add support for exl2-quantized models #1965
Conversation
Force-pushed from f3e8eac to 8e03024.
Things seem to be working.
In general, though, I feel the PR is currently adding more indirection than necessary; every piece of logic should live in its appropriate file and nowhere else:

* `layers/tensor_parallel` is about the actual parallel logic. It can have some slight variations based on quantization, but only to change the higher-order loading logic.
* `weights.py` is all about creating the actual individual tensors required by the model. This one knows about quantization and how to shard tensors.
* `layers/gptq` holds anything else that is GPTQ specific: not loading/sharding tensors, but the init phase (scratch buffers) and the actual forwards.
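As a rough illustration of that split (hypothetical names, not the actual TGI code): the tensor-parallel layer only asks for a column-sharded weight, while the weights loader decides how each tensor is materialized and sharded per quantization scheme.

```python
import torch


class Weights:
    """Sketch of weights.py: decides how tensors are materialized and sharded,
    including quantizer-specific layouts (names here are illustrative)."""

    def get_weights_col(self, prefix: str, quantize: str) -> torch.Tensor:
        if quantize in ("gptq", "exl2"):
            # A quantizer-specific tensor set would be loaded and sharded here.
            return torch.empty(0)
        # Dense path: a plain sharded weight.
        return torch.empty(0)


class TensorParallelColumnLinear:
    """Sketch of layers/tensor_parallel: only the higher-order loading call
    varies with quantization; sharding details stay inside Weights."""

    def __init__(self, weight: torch.Tensor):
        self.weight = weight

    @classmethod
    def load(cls, prefix: str, weights: Weights, quantize: str):
        return cls(weights.get_weights_col(prefix, quantize))
```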
revision="3.0bpw", | ||
# Set max input length to avoid OOM due to extremely large | ||
# scratch buffer. | ||
max_input_length=1024, |
Do we still need that? I thought you fixed it.
I fixed it so that it no longer unconditionally uses a 4096 input length, but instead the value given by warmup. The default is probably still too much. E.g., in Llama 3 8B, for the output layer:
4096 length * 16 batch size * 128,256 pieces * 2 bytes (sizeof(float16)) ≈ 15.7 GiB
The scratch buffer is allocated for the worst case.
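Spelling out that worst-case arithmetic (the numbers are the ones from the example above, not constants from the code):

```python
# Worst-case scratch size for the output projection in the example above.
max_input_length = 4096   # default warmup input length
batch_size = 16
vocab_size = 128_256      # output "pieces"
bytes_per_element = 2     # sizeof(float16)

scratch_bytes = max_input_length * batch_size * vocab_size * bytes_per_element
print(f"{scratch_bytes / 2**30:.1f} GiB")  # -> 15.7 GiB
```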
q_invperm: Optional[torch.Tensor] = None

@property
def device(self) -> torch.device:
Let's remove that. `weights.qweight.device` is just as easy to read and clearer IMHO.
w = Exl2Weight(**tensors)
w.q_scale_max /= 256
w.q_perm = w.q_perm.short()
Why?
Can't we make the dataclass immutable (or at least treat it as such)?
IMHO, once loaded, nothing in the tensors should be modified; they should always just be passed as-is. Nothing should be required at runtime.
Moved this to a proper dataclass constructor + `__post_init__`.
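A minimal sketch of what that looks like, assuming field names along the lines of the quoted diff (the real class in the PR has more fields):

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class Exl2Weight:
    q_weight: torch.Tensor
    q_scale_max: torch.Tensor
    q_perm: torch.Tensor
    q_invperm: Optional[torch.Tensor] = None

    def __post_init__(self):
        # Normalize once at construction time so nothing needs to mutate the
        # loaded tensors later at runtime.
        self.q_scale_max /= 256
        self.q_perm = self.q_perm.short()
```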
# Find the size of the scratch space.
for layer in LAYERS:
    FIXED_BYTES = max(
        FIXED_BYTES, layer.scratch_space_fixed(max_input_len=max_total_tokens)
I'm wondering why we need two loops.
Do you have a link to that logic in the original repo?
We need to know the maximum memory use across all layers to allocate the scratch buffer, before we can pass (slices of) the scratch buffer to the layers in their post-init.
Also replaced the loop with a more readable for-comprehension now.
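A sketch of that two-pass structure; `LAYERS` and `max_total_tokens` come from the quoted hunk, while the `post_init` hook, the buffer dtype, and the device are placeholders, not the actual TGI API:

```python
import torch

# Pass 1: worst-case fixed scratch size across all quantized layers, written
# as a single max() over a generator expression instead of an explicit loop.
FIXED_BYTES = max(
    layer.scratch_space_fixed(max_input_len=max_total_tokens) for layer in LAYERS
)

# Pass 2: allocate one shared scratch buffer of that size, then hand (slices
# of) it to every layer in its post-initialization step.
scratch_buffer = torch.empty(FIXED_BYTES, dtype=torch.uint8, device="cuda")
for layer in LAYERS:
    layer.post_init(scratch_buffer)  # hypothetical hook name
```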
Force-pushed from d14c046 to 4057345.
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
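For the first point, the wrapper could be something as simple as the following sketch (the class and field names are illustrative, not necessarily the ones used in the PR):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantizerConfig:
    # Illustrative replacement for an untyped (bits, groupsize, ...) tuple
    # that previously had to be repacked into a dict at call sites.
    bits: int
    groupsize: int
    desc_act: bool
    checkpoint_format: str
    sym: bool
```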
This test fails somewhat regularly due to non-determinism; it primarily verifies that we correctly load a model that doesn't have `float16` as its default dtype.
LGTM
What does this PR do?
Add support for exl2 quantization
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
Draft: needs a rebase; the exllama kernels seem non-deterministic, so logprobs sometimes change slightly?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.