Add support for repacking AWQ weights for GPTQ-Marlin #2278

danieldk · 2024-07-22T17:46:57Z

What does this PR do?

So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now, enabling AWQ using Marlin requires running TGI with `--quantize gptq`.
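
To illustrate why zero-point support matters, here is a minimal pure-Python sketch of 4-bit quantization with and without a stored zero point. It is only a toy under stated assumptions: the function names and sample weights are hypothetical, and none of this is the GPTQ-Marlin kernel code or the repacking logic itself.

```python
# Illustrative sketch only: toy 4-bit quantizers in pure Python.
# All names here are hypothetical; this is not the GPTQ-Marlin kernel code.


def quantize_asymmetric(values, bits=4):
    """Map floats to [0, 2**bits - 1] using a scale and a stored zero point."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant input
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def quantize_symmetric(values, bits=4):
    """Map floats to [0, 2**bits - 1] around a fixed mid-range zero point."""
    qmax = 2**bits - 1
    zero_point = 2 ** (bits - 1)  # fixed, so it does not need to be stored
    max_abs = max(abs(v) for v in values)
    scale = max_abs / (qmax - zero_point) or 1.0
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from quantized integers."""
    return [(x - zero_point) * scale for x in q]


def max_error(values, restored):
    """Largest absolute difference between original and restored values."""
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [-0.2, 0.0, 0.5, 1.3]  # skewed range, mostly positive
asym = quantize_asymmetric(weights)
sym = quantize_symmetric(weights)
asym_err = max_error(weights, dequantize(*asym))
sym_err = max_error(weights, dequantize(*sym))
```

With a skewed weight range like the one above, the asymmetric variant uses the full 0..15 range, while the fixed zero point leaves part of the range unused. A kernel that only handles a fixed zero point cannot consume checkpoints that store per-group zero points, which is the gap this PR closes.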

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Narsil

LGTM.

Shouldn't we make that the default for AWQ models ?
Also shouldn't we add an integration test for this variant execution on AWQ tests ? (Making sure results are the same)

danieldk · 2024-07-23T10:04:13Z

> Shouldn't we make that the default for AWQ models ?
> Also shouldn't we add an integration test for this variant execution on AWQ tests ? (Making sure results are the same)

Starting with the second commit in this PR, it's the default. As a result, the existing AWQ test uses the GPTQ-Marlin kernel, and the snapshots match (the first commit had a separate test, but I removed it since the existing test now covers it).

paulcx · 2024-07-24T02:22:38Z

Regardless, the results using GPTQ-Marlin are not reproducible even with seed control. How do we use the original kernel though? @danieldk @Narsil

Narsil · 2024-07-29T08:28:03Z

> the results using GPTQ-Marlin are not reproducible even with seed control

What results ? Can you provide a bit more context about the issue, maybe create an issue for it ?

paulcx · 2024-07-30T00:20:36Z

> > the results using GPTQ-Marlin are not reproducible even with seed control
>
> What results ? Can you provide a bit more context about the issue, maybe create an issue for it ?

I made two API (/generate) requests with temperature=null and seed=42, yet received distinct outputs. Can I disable the Marlin kernel via the launch command? By the way, when I switch to an older image and do the same thing, there is no such issue.

Update: I just confirmed that this issue is gone in the latest image, sha-0b95693.

Update on 15/8: I found a slightly different response (controlled by seed and 0 temp) with sha-0b95693. Is there any workaround I can try to use the original kernel instead of the Marlin kernel? @Narsil
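
For anyone who wants to reproduce the check described above, a minimal sketch could look like the following. It assumes a TGI server reachable at a base URL you supply and the `/generate` request shape with `inputs` and `parameters`; the helper names and defaults are hypothetical.

```python
import json
import urllib.request


def build_payload(prompt, seed=42, max_new_tokens=20):
    """Build a /generate request body with a fixed seed and temperature=null."""
    return {
        "inputs": prompt,
        "parameters": {
            "seed": seed,
            "temperature": None,  # serialized as JSON null
            "max_new_tokens": max_new_tokens,
        },
    }


def generate(base_url, payload):
    """POST the payload to <base_url>/generate and return the generated text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]


def all_identical(outputs):
    """Return True when every output string is the same."""
    return len(set(outputs)) <= 1


def check_reproducible(base_url, prompt, runs=2):
    """Send the same seeded request several times and compare the outputs."""
    payload = build_payload(prompt)
    return all_identical([generate(base_url, payload) for _ in range(runs)])
```

Calling `check_reproducible("http://localhost:8080", "What is deep learning?")` against a running server (the address is an assumption) returns `False` when two identically seeded requests disagree, which is the behaviour reported here.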

danieldk marked this pull request as ready for review July 23, 2024 07:04

Enable Marlin for supported AWQ configurations by default

712729b

This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.

Narsil approved these changes Jul 23, 2024

danieldk merged commit 9935720 into main Jul 23, 2024
9 checks passed

danieldk deleted the feature/awq-marlin-repack branch July 23, 2024 11:08
