Add support for repacking AWQ weights for GPTQ-Marlin #2278

Merged: 2 commits, Jul 23, 2024

danieldk (Member) commented Jul 22, 2024

What does this PR do?

So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now enabling AWQ using Marlin requires running TGI with --quantize gptq.
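
As a rough illustration of the zero-point distinction described above, the following Python sketch contrasts a symmetric scheme, where the zero point is fixed at the midpoint of the 4-bit range, with an asymmetric scheme (zero_point=True), where each group of weights stores its own zero point. The function names are hypothetical, and this is plain Python for exposition, not the GPTQ-Marlin kernel or TGI code. The fixed midpoint of 8 is an assumption about the usual 4-bit symmetric convention.

```python
# Illustrative sketch only: plain-Python 4-bit (de)quantization of one weight group.
# None of these functions exist in TGI or Marlin; the names are hypothetical.

def dequantize_symmetric(q, scale, bits=4):
    """Symmetric scheme: the zero point is implied (assumed midpoint of the range)."""
    zero = 1 << (bits - 1)  # 8 for 4-bit values
    return [(v - zero) * scale for v in q]


def dequantize_asymmetric(q, scale, zero):
    """Asymmetric scheme: the zero point is stored alongside the scale."""
    return [(v - zero) * scale for v in q]


def quantize_asymmetric(weights, bits=4):
    """Quantize one group to unsigned integers with a per-group zero point."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant groups
    zero = min(qmax, max(0, round(-lo / scale)))
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


if __name__ == "__main__":
    group = [0.0, 0.5, 1.0, 1.5]
    q, scale, zero = quantize_asymmetric(group)
    print(q, scale, zero)
    print(dequantize_asymmetric(q, scale, zero))
```

For a group whose weights are all non-negative, the asymmetric scheme can use all 16 levels, whereas a fixed midpoint leaves the lower half of the range unused. Kernels that assume the fixed midpoint cannot read such weights without also handling the stored zero points.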

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@danieldk danieldk marked this pull request as ready for review July 23, 2024 07:04
This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.

Narsil (Collaborator) left a comment

LGTM.

Shouldn't we make that the default for AWQ models?
Also, shouldn't we add an integration test that runs this variant on the AWQ tests, to make sure the results are the same?

danieldk (Member, Author) commented Jul 23, 2024

> Shouldn't we make that the default for AWQ models?
> Also, shouldn't we add an integration test that runs this variant on the AWQ tests, to make sure the results are the same?

Starting with the second commit in this PR, it's the default. As a result, the existing AWQ test uses the GPTQ-Marlin kernel, and the snapshots match. (The first commit had a separate test, but I removed it, since it's now covered by the existing test.)

@danieldk danieldk merged commit 9935720 into main Jul 23, 2024
9 checks passed
@danieldk danieldk deleted the feature/awq-marlin-repack branch July 23, 2024 11:08

paulcx commented Jul 24, 2024

Regardless, the results using GPTQ-Marlin are not reproducible even with seed control. How do we use the original kernel though? @danieldk @Narsil

ErikKaum pushed a commit that referenced this pull request Jul 25, 2024
* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
has recently added support for AWQ repacking and AWQ asymmetric
quantization (zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.
ErikKaum pushed a commit that referenced this pull request Jul 26, 2024

Narsil (Collaborator) commented Jul 29, 2024

> the results using GPTQ-Marlin are not reproducible even with seed control

What results? Can you provide a bit more context about the issue, or maybe create an issue for it?

paulcx commented Jul 30, 2024

> > the results using GPTQ-Marlin are not reproducible even with seed control
>
> What results? Can you provide a bit more context about the issue, or maybe create an issue for it?

I made two API (/generate) requests with temperature=null and seed=42, yet received distinct outputs. Can I disable the Marlin kernel via the launch command? Btw, when I switch to an older image and do the same thing, there is no such issue.

Update: just confirmed that this issue is gone in the latest image: sha-0b95693

Update on 15/8: I found a slightly different response (controlled by seed and temperature 0) with sha-0b95693. Is there any workaround I can try to use the original kernel instead of the Marlin kernel? @Narsil
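
A check like the one described in this comment could be scripted roughly as follows. This is a minimal sketch, assuming a TGI server reachable at a base URL of your choosing (the `http://localhost:8080` in the usage note is a placeholder) and using the `/generate` endpoint with the `seed` and `max_new_tokens` parameters. The helper names `build_payload`, `tgi_generate`, and `distinct_outputs` are hypothetical and are not part of TGI.

```python
import json
import urllib.request


def build_payload(prompt, seed=42, max_new_tokens=32):
    """Build a /generate request body with a fixed seed and no temperature."""
    return {
        "inputs": prompt,
        "parameters": {
            "seed": seed,
            "temperature": None,  # serialized as JSON null, as in the report
            "max_new_tokens": max_new_tokens,
        },
    }


def tgi_generate(base_url, payload, timeout=60):
    """POST the payload to <base_url>/generate and return the generated text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["generated_text"]


def distinct_outputs(generate, payload, runs=2):
    """Call generate(payload) several times and collect the distinct results."""
    return {generate(payload) for _ in range(runs)}
```

With a running server, `distinct_outputs(lambda p: tgi_generate("http://localhost:8080", p), build_payload("Hello"))` returns a set with a single element when the outputs are identical across runs. More than one element would match the behaviour reported here.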

yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Sep 26, 2024