
GPTQ updates #2235

Open · wants to merge 1 commit into main
Conversation

HDCharles
Contributor

@HDCharles HDCharles commented May 21, 2025

Summary:

  1. reorganized GPTQ
    a) got rid of old GPTQ and renamed GPTQ_MT to GPTQ
    b) moved new GPTQ to prototype
    c) moved quantized linear modules in GPTQ.py to linear_quant_modules.py
  2. removed dependence on lm_eval for input_recorder
    a) created new input recorder that doesn't depend on lm_eval
    b) made lm_eval input recorder depend on new generic input_recorder
    c) made TransformerEvalWrapper the base class and made LMEvalInputRecorder inherit from it (rather than vice versa like before)
    d) updated APIs generally to work with the new input recorder (inputs have to be passed in as though they were being passed into the model, so input_recorder(*args) rather than input_recorder(args))
  3. reorganized GPTQ tests
    a) moved tests from test_quant_api.py to test_gptq.py
    b) added a new test that can be run in CI and doesn't depend on
    lm_eval/llama weights
    c) removed all the 8da4w tests that we never got working (is this fine?)
    d) got rid of test_gptq_mt.py (consolidated into test_gptq.py where relevant)
  4. added new documentation for lm_eval
    a) new readme and eval benchmarks for GPTQ
    b) comments in GPTQ.py
  5. GPTQ improvements
    a) tested compilation of the hessian calculation and parts of faster quant;
    generally they were slower or buggy. A speedup is possible but it was inconsistent, so it was removed.
    b) reimplemented faster quant while trying to compile it (improved speed by 2-5%, and the code is clearer)
    c) moved helper functions out of the class. They're largely generic and
    this is less cluttered. May need to revisit how generic they are if new GPTQQuantizers are made.
    d) made the duplication checking and copying faster where possible
    (previously MultiTensor.unpad had checks, but by the point it's called we've already done equality checks, so they aren't needed in unpad)
    e) fixed some bugs due to this not being in CI and things changing for
    int4wo tensor subclass.
  6. BC
    a) got rid of Int8DynActInt4WeightGPTQQuantizer since it was unused. Can re-add if desired.
    b) for other imports, maintained BC; previous imports from quantization/GPTQ.py now go through quantization/GPTQ/__init__.py
    c) InputRecorder -> LMEvalInputRecorder but left BC import in as an option.
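The calling-convention change in (2d) can be sketched with a minimal recorder. The class below is purely illustrative (a sketch of the convention, not the actual torchao API): inputs are passed positionally, exactly as they would be passed to the model itself.

```python
# Hypothetical minimal sketch of the generic input-recorder calling
# convention; the class and method names here are assumptions, not
# the real torchao implementation.
class InputRecorder:
    def __init__(self):
        self.recorded_inputs = []

    def __call__(self, *args, **kwargs):
        # Inputs arrive exactly as they would be passed to the model,
        # so positional args are unpacked: recorder(*args), not recorder(args).
        self.recorded_inputs.append((args, kwargs))

    def get_recorded_inputs(self):
        return self.recorded_inputs


recorder = InputRecorder()
# record one calibration call, mirroring model(input_ids, attention_mask)
recorder("input_ids", "attention_mask")
print(len(recorder.get_recorded_inputs()))  # 1
```

Passing `recorder(args)` instead would record a single tuple argument rather than the model's argument list, which is the mismatch the API update fixes.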
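For context on (5a): the hessian calculation in GPTQ is a running accumulation of H = 2·E[x·xᵀ] over calibration activations for each layer. The sketch below shows the textbook running-average update, under the assumption of a simple per-batch loop; it is not the PR's implementation.

```python
import torch

# Running GPTQ-style Hessian accumulation: H approximates
# (2 / n) * sum_i x_i x_i^T over all calibration samples seen so far.
# Illustrative only; names and structure are assumptions.
def update_hessian(H, x, n_seen):
    # x: (batch, in_features) calibration activations for one layer
    batch = x.shape[0]
    H *= n_seen / (n_seen + batch)            # rescale the old average
    x = x * (2.0 / (n_seen + batch)) ** 0.5   # weight the new samples
    H += x.t() @ x                            # add the new outer products
    return H, n_seen + batch

in_features = 4
H = torch.zeros(in_features, in_features)
n = 0
for _ in range(3):
    xb = torch.randn(2, in_features)          # one calibration batch
    H, n = update_hessian(H, xb, n)
print(H.shape, n)  # torch.Size([4, 4]) 6
```

The per-batch rescale-then-add structure is what makes the accumulation a candidate for compilation; per the PR, compiling it turned out to be slower or buggy in practice.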

Test Plan:

  1. `python test_gptq.py`

note 1: the normally skipped test test_gptq_quantizer_int4_weight_only was also run.
note 2: we now have a CI-ready test in test_gptq.py using the generic input recorder.

  2. I verified that all activations match between old GPTQ (non-MT) and current
    GPTQ. This is shown by the test_gptq_quantizer_int4_weight_only pass mentioned above, and was also verified by comparing debug outputs and printing activation values for the first 3 multi-tensors.

  3. eval benchmarks:

```shell
export CHECKPOINT_PATH=../../../checkpoints # path to checkpoints folder

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-64
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-gptq-64 --calibration_limit 10

export MODEL_REPO=meta-llama/Meta-Llama-3-8B
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-64
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-gptq-64 --calibration_limit 10
```

See README.md for results; they show GPTQ is working.
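The activation-matching check described in the test plan can be sketched with forward hooks plus `torch.allclose`. The helper below is an assumption for illustration, not the PR's debug code; it compares two models that share weights, standing in for the old and current GPTQ paths.

```python
import torch

# Capture every submodule's output during one forward pass.
def capture_activations(model, x):
    acts = []
    hooks = [m.register_forward_hook(lambda mod, inp, out: acts.append(out))
             for m in model.children()]
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return acts

torch.manual_seed(0)
model_a = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
model_b = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
model_b.load_state_dict(model_a.state_dict())  # identical weights

x = torch.randn(2, 4)
acts_a = capture_activations(model_a, x)
acts_b = capture_activations(model_b, x)
print(all(torch.allclose(a, b) for a, b in zip(acts_a, acts_b)))  # True
```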

Reviewers:

Subscribers:

Tasks:

Tags:


pytorch-bot bot commented May 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2235


❌ 4 New Failures

As of commit 4b44d67 with merge base 446f07d:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 21, 2025
@HDCharles HDCharles added the topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) label May 21, 2025
@HDCharles HDCharles force-pushed the 098_gptq branch 10 times, most recently from 913bf3c to 956e16c Compare May 23, 2025 05:35