
Conversation

metascroy (Contributor):

The previous version of IntxUnpackedToInt8Tensor had a reshape operator after export. Although this does not change any numerics, it does interfere with XNNPACK's ability to recognize and lower the dynamic quantization pattern.

This PR removes the reshape operator, and adjusts the unit tests to look for 0 reshapes.
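
A minimal sketch of what such a test check might look like (the helper below is an illustrative assumption, not the PR's actual test code; exported graphs may also lower reshape to other view ops):

# Hypothetical sketch: count aten.reshape nodes in an exported program's graph
import torch

def count_reshape_ops(exported_program) -> int:
    return sum(
        1
        for node in exported_program.graph.nodes
        if node.op == "call_function"
        and node.target == torch.ops.aten.reshape.default
    )

A test could then assert count_reshape_ops(ep) == 0 on the exported model.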

pytorch-bot (bot) commented Sep 4, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2941

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit aa67cb3 with merge base 9d01b43:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Sep 4, 2025
metascroy requested a review from jerryzh168 September 4, 2025 21:20
assert isinstance(weight_tensor, IntxUnpackedToInt8Tensor)

# Apply dynamic activation quant
if weight_tensor.apply_int8_act_asym_per_token_quant:
metascroy (Contributor, Author):

@jerryzh168 do you think we should make apply_int8_act_asym_per_token_quant an enum instead of a boolean? It would make it easier to extend in the future.

jerryzh168 (Contributor):

yeah sure, if you anticipate more cases then it makes sense for it to be an enum, I think

metascroy (Contributor, Author):

I'll add it, just in case, for future-proofing.
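
For reference, a minimal sketch of the boolean-to-enum change (the class name ActivationQuantization appears later in this PR's diff; the member names below are assumptions):

import enum

class ActivationQuantization(enum.Enum):
    # Hypothetical members; the actual PR may name these differently
    NONE = "none"
    INT8_ASYM_PER_TOKEN = "int8_asym_per_token"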

jerryzh168 (Contributor):

also, apply_int8_act_asym_per_token_quant should be apply_int8_act_asym_per_token_quant_dequant, I think

)


def _fake_apply_int8_act_asym_per_token_quant(hp_tensor):
jerryzh168 (Contributor):

actually, maybe a closer name for this is quantize_and_dequantize, to avoid confusion with fake_quant, which does not quantize to the real int values in quantize_affine
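
A minimal sketch of the distinction being drawn, assuming plain asymmetric int8 quantization (illustrative only, not torchao's implementation):

import torch

# Hypothetical illustration: round to real int8 values, then map back to float.
# "Fake quant" in quantize_affine, by contrast, simulates this rounding without
# materializing integer values.
def quantize_and_dequantize(x, scale, zero_point):
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127).to(torch.int8)
    return (q.to(x.dtype) - zero_point) * scale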

jerryzh168 (Contributor) left a review comment:

looks good!

target_dtype=torch.int8,
mapping_type=MappingType.ASYMMETRIC,
).dequantize()
input_tensor = _fake_apply_int8_act_asym_per_token_quant(input_tensor)
jerryzh168 (Contributor):

why is this changed to a function call btw?

metascroy (Contributor, Author):

choose_qparams returns a rank-0 tensor instead of a rank-1 tensor when there is only one element, and this fails the assert in the constructor of IntxUnpackedToInt8Tensor (scale.shape == n_blocks).
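
A quick illustration of the rank mismatch (values are hypothetical):

import torch

scale = torch.tensor(0.1)                    # rank 0, as returned for a single group
n_blocks = (1,)
assert scale.shape != torch.Size(n_blocks)   # the constructor's shape check would fail here
scale = scale.reshape(n_blocks)              # rank 1, which the constructor expects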

jerryzh168 (Contributor):

why do you need this assert? if you pass block_size around to the quantize/dequantize ops, this will be handled correctly

metascroy (Contributor, Author):

If a user constructs the tensor using from_hp, it's not a concern.

But the user can also construct the tensor directly by supplying qdata, scale, zero_point, and block_size separately, and in that case I think it makes sense to assert.
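
A minimal sketch of the check being discussed (the helper name and exact formulation are assumptions):

import torch

# Hypothetical sketch of the constructor-side validation
def _check_scale_shape(qdata, scale, block_size):
    n_blocks = tuple(s // b for s, b in zip(qdata.shape, block_size))
    assert scale.shape == torch.Size(n_blocks), (
        f"expected scale shape {n_blocks}, got {tuple(scale.shape)}"
    )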

_FLOAT_TYPES: List[torch.dtype] = [torch.float16, torch.bfloat16, torch.float32]


class ActivationQuantization(enum.Enum):
jerryzh168 (Contributor) commented Sep 4, 2025:

you can make ActivationQuantization inherit from (str, Enum) to avoid checking for str, btw

Example:

# can switch to StrEnum (https://docs.python.org/3/library/enum.html#enum.StrEnum)
# after python 3.10 is end of life (https://devguide.python.org/versions/)
class KernelPreference(str, Enum):
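
With the (str, Enum) mixin, members compare equal to plain strings, so callers can pass either form (illustrative example, not from the PR):

import enum

class Color(str, enum.Enum):
    RED = "red"

assert Color.RED == "red"   # str.__eq__ applies, so no isinstance check is needed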

from torchao.quantization.quant_primitives import _DTYPE_TO_BIT_WIDTH
from torchao.quantization.quantize_.workflows.intx.intx_unpacked_to_int8_tensor import (
IntxUnpackedToInt8Tensor,
ActivationQuantization,
jerryzh168 (Contributor):

the naming feels a bit too general; how about:

IntxUnpackedToInt8ActivationQuantization

we can move this to the intx folder and rename it to IntxActivationQuantization if this enum can be reused by other intx tensors

jerryzh168 added the "topic: bug fix" label Sep 5, 2025
@metascroy metascroy merged commit 2c18dcb into main Sep 5, 2025
18 of 19 checks passed
