
Enable Int4WeightOnlyGPTQQuantizer on Intel GPU. #2200


Open
xiaowangintel wants to merge 6 commits into main

Conversation


@xiaowangintel commented May 12, 2025

Following the request in pytorch/pytorch#153019, this PR enables int4wo-GPTQ for Intel GPU in pytorch/ao, now that RTN support is ready. The current int4wo-GPTQ implementation supports both ZeroPointDomain.FLOAT and ZeroPointDomain.INT.

How to run int4wo-GPTQ with ZeroPointDomain.INT:

from pathlib import Path

import torch

from torchao._models._eval import InputRecorder, TransformerEvalWrapper
from torchao._models.llama.model import Transformer, prepare_inputs_for_model
from torchao._models.llama.tokenizer import get_tokenizer
from torchao.quantization.GPTQ import Int4WeightOnlyGPTQQuantizer
from torchao.quantization.quant_primitives import ZeroPointDomain

precision = torch.bfloat16
device = "xpu"

# Load the fp checkpoint on CPU; the model is moved to the XPU after quantization.
checkpoint_path = Path("../gpt-fast/checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth")
model = Transformer.from_name(checkpoint_path.parent.name)
checkpoint = torch.load(str(checkpoint_path), mmap=True, weights_only=True)
model.load_state_dict(checkpoint, assign=True)
model = model.to(dtype=precision, device="cpu")
model.eval()

tokenizer_path = checkpoint_path.parent / "tokenizer.model"
assert tokenizer_path.is_file(), tokenizer_path
tokenizer = get_tokenizer(
    tokenizer_path,
    "Llama-2-7b-chat-hf",
)
# GPTQ hyperparameters: blocksize for the block-wise weight updates,
# percdamp for Hessian dampening, groupsize for the quantization groups.
blocksize = 128
percdamp = 0.01
groupsize = 128
calibration_tasks = ["wikitext"]
calibration_limit = 1
calibration_seq_length = 100
input_prep_func = prepare_inputs_for_model
pad_calibration_inputs = False

# Record calibration inputs (on CPU) by running the calibration tasks.
inputs = (
    InputRecorder(
        tokenizer,
        calibration_seq_length,
        input_prep_func,
        pad_calibration_inputs,
        model.config.vocab_size,
        device="cpu",
    )
    .record_inputs(
        calibration_tasks,
        calibration_limit,
    )
    .get_inputs()
)

quantizer = Int4WeightOnlyGPTQQuantizer(
    blocksize,
    percdamp,
    groupsize,
    device=torch.device(device),
    zero_point_domain=ZeroPointDomain.INT,
)
model.setup_caches(max_batch_size=1, max_seq_length=calibration_seq_length)
model = quantizer.quantize(model, inputs).to(device)
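After quantization, the model can be sanity-checked with torchao's TransformerEvalWrapper (imported above). A minimal sketch following the pattern of torchao's GPTQ tests, where the argument order (model, tokenizer, max_seq_length, input_prep_func, device) is an assumption:

# Evaluate the quantized model on the same wikitext task used for calibration.
TransformerEvalWrapper(
    model,
    tokenizer,
    model.config.block_size,
    prepare_inputs_for_model,
    device,
).run_eval(
    ["wikitext"],  # tasks
    1,  # limit
)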

How to run int4wo-GPTQ with ZeroPointDomain.FLOAT:

quantizer = Int4WeightOnlyGPTQQuantizer(
    blocksize,
    percdamp,
    groupsize,
    device=torch.device(device),
    # zero_point_domain defaults to ZeroPointDomain.FLOAT when omitted
)
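The calibration inputs, cache setup, and quantize call are the same as in the ZeroPointDomain.INT example above:

model.setup_caches(max_batch_size=1, max_seq_length=calibration_seq_length)
model = quantizer.quantize(model, inputs).to(device)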


@pytorch-bot commented May 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2200

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on May 12, 2025
@@ -561,6 +582,30 @@ def linear_forward_int4(
    return c


def linear_forward_int4_zero_domain(

Suggested change:
-def linear_forward_int4_zero_domain(
+def linear_forward_int4_zero_point_domain_int(

@xiaowangintel (Author)

Done.
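For context, the new function presumably mirrors the existing linear_forward_int4, but passes scales and integer-domain zero points as separate tensors rather than a fused scales_and_zeros tensor. A minimal sketch, assuming an XPU op torch.ops.aten._weight_int4pack_mm_with_scales_and_zeros with the signature (input, packed_weight, group_size, scales, zeros); this is an illustration, not the PR's actual code:

def linear_forward_int4_zero_point_domain_int(
    x, weight_int4pack, scales, zeros, out_features, groupsize, precision
):
    # Flatten leading batch dimensions for the 2-D int4 matmul.
    origin_x_size = x.size()
    x = x.reshape(-1, origin_x_size[-1])
    # Assumed XPU kernel taking scales and integer zero points separately
    # (the CPU/CUDA _weight_int4pack_mm takes a fused scales_and_zeros tensor).
    c = torch.ops.aten._weight_int4pack_mm_with_scales_and_zeros(
        x.to(precision),
        weight_int4pack,
        groupsize,
        scales,
        zeros,
    ).to(dtype=x.dtype)
    # Restore the original batch dimensions.
    return c.reshape(origin_x_size[:-1] + (out_features,))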

@@ -364,6 +369,13 @@ def get_groupwise_affine_qparams(
    ).reshape(w.shape[0], -1)


def align_tinygemm_scales_and_zeros(scales, zeros):

This is only used once; suggest removing this function.

@xiaowangintel (Author)

Done.

@@ -436,6 +448,9 @@ def groupwise_affine_quantize_tensor_from_qparams(
        not (check_xpu_version(int_data.device))
    ):
        int_data = (int_data[::, ::2] << 4 | int_data[::, 1::2]).to(torch.uint8)
    if check_xpu_version(int_data.device):
        int_data = (int_data[::, 1::2] << 4 | int_data[::, ::2]).to(torch.uint8)

This function is also used by the RTN path; this change may break RTN. Please double-check.

@xiaowangintel (Author)

Aligned all usages.
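For reference, the two branches above differ only in which neighbor lands in the high nibble when two int4 values are packed into one uint8. A standalone illustration (not part of the PR):

import torch

w = torch.tensor([[1, 2, 3, 4]], dtype=torch.uint8)  # int4 values held in uint8

# Default (CPU/CUDA) layout: even-indexed elements go to the high nibble.
packed_default = w[:, ::2] << 4 | w[:, 1::2]  # [[0x12, 0x34]]

# XPU layout: odd-indexed elements go to the high nibble instead.
packed_xpu = w[:, 1::2] << 4 | w[:, ::2]  # [[0x21, 0x43]]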

@@ -438,6 +438,7 @@ quantizer = Int4WeightOnlyGPTQQuantizer(
    percdamp,
    groupsize,
)

@liangan1

@jerryzh168 Can you help review this PR?

@liangan1

@EikanWang

@xiaowangintel changed the title [WIP]Enable Int4WeightOnlyGPTQQuantizer on Intel GPU. → Enable Int4WeightOnlyGPTQQuantizer on Intel GPU. May 20, 2025