
Enable Int4WeightOnlyGPTQQuantizer on Intel GPU. #2200


Open
xiaowangintel wants to merge 6 commits into main

Conversation


@xiaowangintel commented May 12, 2025

Following the request in pytorch/pytorch#153019, this PR enables int4wo-GPTQ for Intel GPU in pytorch/ao, now that RTN support is ready. The current int4wo-GPTQ implementation supports both ZeroPointDomain.FLOAT and ZeroPointDomain.INT.

How to run int4wo-GPTQ with ZeroPointDomain.INT:

from pathlib import Path

import torch

from torchao._models._eval import InputRecorder, TransformerEvalWrapper
from torchao._models.llama.model import Transformer, prepare_inputs_for_model
from torchao._models.llama.tokenizer import get_tokenizer
from torchao.quantization.GPTQ import Int4WeightOnlyGPTQQuantizer
from torchao.quantization.quant_primitives import ZeroPointDomain

precision = torch.bfloat16
device = "xpu"

# Load the fp checkpoint on CPU; the model is moved to the XPU after quantization.
checkpoint_path = Path("../gpt-fast/checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth")
model = Transformer.from_name(checkpoint_path.parent.name)
checkpoint = torch.load(str(checkpoint_path), mmap=True, weights_only=True)
model.load_state_dict(checkpoint, assign=True)
model = model.to(dtype=precision, device="cpu")
model.eval()

tokenizer_path = checkpoint_path.parent / "tokenizer.model"
assert tokenizer_path.is_file(), tokenizer_path
tokenizer = get_tokenizer(
    tokenizer_path,
    "Llama-2-7b-chat-hf",
)
# GPTQ hyperparameters: blocksize for the block-wise weight updates,
# percdamp for Hessian dampening, groupsize for the quantization groups.
blocksize = 128
percdamp = 0.01
groupsize = 128
calibration_tasks = ["wikitext"]
calibration_limit = 1
calibration_seq_length = 100
input_prep_func = prepare_inputs_for_model
pad_calibration_inputs = False

# Record calibration inputs (on CPU) by running the calibration tasks.
inputs = (
    InputRecorder(
        tokenizer,
        calibration_seq_length,
        input_prep_func,
        pad_calibration_inputs,
        model.config.vocab_size,
        device="cpu",
    )
    .record_inputs(
        calibration_tasks,
        calibration_limit,
    )
    .get_inputs()
)

quantizer = Int4WeightOnlyGPTQQuantizer(
    blocksize,
    percdamp,
    groupsize,
    device=torch.device(device),
    zero_point_domain=ZeroPointDomain.INT,
)
model.setup_caches(max_batch_size=1, max_seq_length=calibration_seq_length)
model = quantizer.quantize(model, inputs).to(device)
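After quantization, the model can be sanity-checked with torchao's TransformerEvalWrapper (imported above). A minimal sketch following the pattern of torchao's GPTQ tests, where the argument order (model, tokenizer, max_seq_length, input_prep_func, device) is an assumption:

# Evaluate the quantized model on the same wikitext task used for calibration.
TransformerEvalWrapper(
    model,
    tokenizer,
    model.config.block_size,
    prepare_inputs_for_model,
    device,
).run_eval(
    ["wikitext"],  # tasks
    1,  # limit
)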

How to run int4wo-GPTQ with ZeroPointDomain.FLOAT:

quantizer = Int4WeightOnlyGPTQQuantizer(
    blocksize,
    percdamp,
    groupsize,
    device=torch.device(device),
    # zero_point_domain defaults to ZeroPointDomain.FLOAT when omitted
)
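The calibration inputs, cache setup, and quantize call are the same as in the ZeroPointDomain.INT example above:

model.setup_caches(max_batch_size=1, max_seq_length=calibration_seq_length)
model = quantizer.quantize(model, inputs).to(device)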


@pytorch-bot commented May 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2200

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on May 12, 2025
@@ -561,6 +582,30 @@ def linear_forward_int4(
    return c


def linear_forward_int4_zero_domain(

Suggested change:
-def linear_forward_int4_zero_domain(
+def linear_forward_int4_zero_point_domain_int(

@xiaowangintel (Author)

Done.
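For context, the new function presumably mirrors the existing linear_forward_int4, but passes scales and integer-domain zero points as separate tensors rather than a fused scales_and_zeros tensor. A minimal sketch, assuming an XPU op torch.ops.aten._weight_int4pack_mm_with_scales_and_zeros with the signature (input, packed_weight, group_size, scales, zeros); this is an illustration, not the PR's actual code:

def linear_forward_int4_zero_point_domain_int(
    x, weight_int4pack, scales, zeros, out_features, groupsize, precision
):
    # Flatten leading batch dimensions for the 2-D int4 matmul.
    origin_x_size = x.size()
    x = x.reshape(-1, origin_x_size[-1])
    # Assumed XPU kernel taking scales and integer zero points separately
    # (the CPU/CUDA _weight_int4pack_mm takes a fused scales_and_zeros tensor).
    c = torch.ops.aten._weight_int4pack_mm_with_scales_and_zeros(
        x.to(precision),
        weight_int4pack,
        groupsize,
        scales,
        zeros,
    ).to(dtype=x.dtype)
    # Restore the original batch dimensions.
    return c.reshape(origin_x_size[:-1] + (out_features,))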

@@ -364,6 +369,13 @@ def get_groupwise_affine_qparams(
    ).reshape(w.shape[0], -1)


def align_tinygemm_scales_and_zeros(scales, zeros):

This is only used once; suggest removing this function.

@xiaowangintel (Author)

Done.

@@ -436,6 +448,9 @@ def groupwise_affine_quantize_tensor_from_qparams(
        not (check_xpu_version(int_data.device))
    ):
        int_data = (int_data[::, ::2] << 4 | int_data[::, 1::2]).to(torch.uint8)
    if check_xpu_version(int_data.device):
        int_data = (int_data[::, 1::2] << 4 | int_data[::, ::2]).to(torch.uint8)

This function is also used by the RTN path; this change may break RTN. Please double-check.

@xiaowangintel (Author)

Aligned all usages.
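For reference, the two branches above differ only in which neighbor lands in the high nibble when two int4 values are packed into one uint8. A standalone illustration (not part of the PR):

import torch

w = torch.tensor([[1, 2, 3, 4]], dtype=torch.uint8)  # int4 values held in uint8

# Default (CPU/CUDA) layout: even-indexed elements go to the high nibble.
packed_default = w[:, ::2] << 4 | w[:, 1::2]  # [[0x12, 0x34]]

# XPU layout: odd-indexed elements go to the high nibble instead.
packed_xpu = w[:, 1::2] << 4 | w[:, ::2]  # [[0x21, 0x43]]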

@@ -438,6 +438,7 @@ quantizer = Int4WeightOnlyGPTQQuantizer(
    percdamp,
    groupsize,
)

@liangan1

@jerryzh168 Can you help review this PR?

@liangan1

@EikanWang

@xiaowangintel changed the title [WIP]Enable Int4WeightOnlyGPTQQuantizer on Intel GPU. → Enable Int4WeightOnlyGPTQQuantizer on Intel GPU. May 20, 2025