Fix xnnpack export #2941
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2941
✅ No failures as of commit aa67cb3 with merge base 9d01b43. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
assert isinstance(weight_tensor, IntxUnpackedToInt8Tensor)

# Apply dynamic activation quant
if weight_tensor.apply_int8_act_asym_per_token_quant:
@jerryzh168 do you think we should make apply_int8_act_asym_per_token_quant an enum instead of a boolean? It would make it easier to extend in the future.
yeah sure, if you anticipate more cases then it makes sense for it to be an enum I think
I'll add it just in case, for future-proofing.
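For context, a minimal sketch of the enum-vs-boolean idea discussed above; the member names and attribute name are illustrative, not the PR's final API.

```python
from enum import Enum, auto

# Hypothetical enum replacing the boolean flag; member names are illustrative.
class ActivationQuantization(Enum):
    NONE = auto()
    INT8_ASYM_PER_TOKEN = auto()

# The boolean check
#     if weight_tensor.apply_int8_act_asym_per_token_quant: ...
# would then become something like
#     if weight_tensor.activation_quantization == ActivationQuantization.INT8_ASYM_PER_TOKEN: ...
# which leaves room for additional activation quantization schemes later.
```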
also apply_int8_act_asym_per_token_quant should be apply_int8_act_asym_per_token_quant_dequant I think
)


def _fake_apply_int8_act_asym_per_token_quant(hp_tensor):
actually maybe a closer name for this is quantize_and_dequantize, to avoid confusion with fake_quant, which does not quantize to the real int values in quantize_affine
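A self-contained sketch in plain PyTorch (not torchao's quant_primitives, and not the PR's helper) of what quantize-and-dequantize means here: values are rounded to real int8 codes and then mapped back to floating point, per token.

```python
import torch

def quantize_dequantize_int8_asym_per_token(hp_tensor: torch.Tensor) -> torch.Tensor:
    """Asymmetric per-token int8 quantize followed by dequantize (illustrative only)."""
    qmin, qmax = -128, 127
    # per-token (last-dim) min/max
    min_val = hp_tensor.amin(dim=-1, keepdim=True)
    max_val = hp_tensor.amax(dim=-1, keepdim=True)
    scale = (max_val - min_val).clamp(min=1e-6) / (qmax - qmin)
    zero_point = (qmin - min_val / scale).round().clamp(qmin, qmax)
    # quantize to real int8 values ...
    q = (hp_tensor / scale + zero_point).round().clamp(qmin, qmax).to(torch.int8)
    # ... then dequantize back to the original dtype
    return ((q.to(hp_tensor.dtype) - zero_point) * scale).to(hp_tensor.dtype)

x = torch.randn(4, 16)
x_qdq = quantize_dequantize_int8_asym_per_token(x)
```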
test/quantization/quantize_/workflows/intx/test_intx_unpacked_to_int8_tensor.py
looks good!
    target_dtype=torch.int8,
    mapping_type=MappingType.ASYMMETRIC,
).dequantize()
input_tensor = _fake_apply_int8_act_asym_per_token_quant(input_tensor)
why is this changed to a function call btw?
choose_qparams returns a rank-0 tensor instead of a rank-1 tensor when there is only one element, and this fails the assert in the constructor of IntxUnpackedToInt8Tensor (scale.shape == n_blocks).
why do you need this assert? if you pass block_size around to the quantize/dequantize ops, this will be handled correctly
So if a user constructs the tensor using from_hp, it's not a concern.
But the user can also construct the tensor directly by supplying qdata, scale, zero_point, and block_size separately, and in that case I think it makes sense to assert.
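To illustrate the shape mismatch described above (a hypothetical example, not the PR's code): reducing a single-token input to one scale value can produce a rank-0 tensor, which a strict scale.shape == n_blocks check would reject.

```python
import torch

x = torch.randn(1, 16)

scale_rank0 = x.abs().max() / 127.0          # torch.Size([])  -- rank 0
scale_rank1 = x.abs().amax(dim=-1) / 127.0   # torch.Size([1]) -- rank 1

expected_n_blocks = torch.Size([1])
assert scale_rank1.shape == expected_n_blocks
assert scale_rank0.shape != expected_n_blocks  # this is the mismatch

# A helper that owns the quantize/dequantize step can reshape the scale
# to the expected block layout before any such check:
scale_fixed = scale_rank0.reshape(expected_n_blocks)
assert scale_fixed.shape == expected_n_blocks
```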
_FLOAT_TYPES: List[torch.dtype] = [torch.float16, torch.bfloat16, torch.float32]


class ActivationQuantization(enum.Enum):
you can use ActivationQuantization(str, Enum) to avoid checking for str btw. Example:
ao/torchao/quantization/quantize_/common/kernel_preference.py, lines 12 to 14 in b34c103:

# can switch to StrEnum (https://docs.python.org/3/library/enum.html#enum.StrEnum)
# after python 3.10 is end of life (https://devguide.python.org/versions/)
class KernelPreference(str, Enum):
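Following that example, a minimal sketch of the suggestion; the member names here are hypothetical, not necessarily the ones used in the PR.

```python
from enum import Enum

class ActivationQuantization(str, Enum):
    NONE = "none"
    INT8_ASYM_PER_TOKEN = "int8_asym_per_token"

# Because each member is also a str, plain string comparison works without
# an explicit isinstance check:
assert ActivationQuantization.INT8_ASYM_PER_TOKEN == "int8_asym_per_token"
```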
from torchao.quantization.quant_primitives import _DTYPE_TO_BIT_WIDTH
from torchao.quantization.quantize_.workflows.intx.intx_unpacked_to_int8_tensor import (
    IntxUnpackedToInt8Tensor,
    ActivationQuantization,
the naming is a bit too general I feel, how about IntxUnpackedToInt8ActivationQuantization? We can move this to the intx folder and rename it to IntxActivationQuantization if this enum can be reused by other intx tensors.
The previous version of IntxUnpackedToInt8Tensor had a reshape operator after export. Although this does not change any numerics, it does interfere with XNNPACK's ability to recognize and lower the dynamic quantization pattern.
This PR removes the reshape operator and adjusts the unit tests to check for 0 reshapes in the exported graph.
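As a rough sketch of how such a check can be written against torch.export (the module, the targeted op, and the expected count below are illustrative; the PR's actual tests target IntxUnpackedToInt8Tensor):

```python
import torch
from torch.export import export

class TinyLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 4)

    def forward(self, x):
        return self.linear(x)

ep = export(TinyLinear(), (torch.randn(2, 8),))

# Count aten reshape nodes in the exported graph; a test like the PR's
# would assert none remain, so XNNPACK can recognize and lower the
# dynamic quantization pattern.
reshape_count = sum(
    1
    for node in ep.graph.nodes
    if node.op == "call_function" and node.target == torch.ops.aten.reshape.default
)
assert reshape_count == 0
```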