
[RFC] More general affine quantization primitives #160

@jerryzh168

The PR is here; please feel free to comment on the PR directly: #159

Context

Currently there are many q/dq functions in torchao and PyTorch; they mainly differ along the following dimensions (a short illustration follows the list):

  • dtype/bitwidth + quant_min/quant_max: e.g. torch.uint8 with quant_min=0 and quant_max=255
  • symmetric/asymmetric quantization
  • granularity: per_tensor, per_channel, per_channel_group
  • dtype for scales and zero_points
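
For illustration, two of the eager-mode ops in core PyTorch already differ along the granularity and qparam-dtype axes alone: per_tensor takes scalar qparams, while per_channel takes tensor qparams plus an explicit axis.

```python
import torch

x = torch.randn(4, 8)

# Per-tensor: one scalar scale/zero_point for the whole tensor.
q_pt = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)

# Per-channel: tensor scales/zero_points plus an explicit axis.
scales = torch.full((4,), 0.1)
zero_points = torch.zeros(4, dtype=torch.int64)
q_pc = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.quint8)
```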

Ideally, I think we should unify them. It might complicate the operator patterns used by backends like XNNPACK, but the code sharing and the simpler representation it brings will be beneficial in the long term.

We defined three functions: choose_qparams_affine_per_block, quantize_affine_per_block, and dequantize_affine_per_block; please check out their docstrings in the PR for the definitions.
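
To make the shape conventions concrete, here is a minimal sketch of what the quantize/dequantize pair could look like, assuming block_size has one entry per input dimension and every block of that shape shares one scale/zero_point. The authoritative definitions are the docstrings in the PR; everything below is illustrative:

```python
import torch

def quantize_affine_per_block(input, block_size, scale, zero_point,
                              output_dtype, quant_min, quant_max):
    # Reshape so every block becomes an explicit pair of dims:
    # (n_blocks_0, block_0, n_blocks_1, block_1, ...).
    shape = []
    for dim, blk in zip(input.shape, block_size):
        assert dim % blk == 0, "each dim must be divisible by its block size"
        shape += [dim // blk, blk]
    blocks = input.reshape(shape)
    # Reshape qparams so they broadcast over the intra-block dims.
    qparam_shape = [s if i % 2 == 0 else 1 for i, s in enumerate(shape)]
    scale = scale.reshape(qparam_shape)
    zero_point = zero_point.reshape(qparam_shape)
    q = torch.clamp(torch.round(blocks / scale) + zero_point,
                    quant_min, quant_max)
    return q.reshape(input.shape).to(output_dtype)

def dequantize_affine_per_block(input, block_size, scale, zero_point,
                                output_dtype):
    shape = []
    for dim, blk in zip(input.shape, block_size):
        shape += [dim // blk, blk]
    qparam_shape = [s if i % 2 == 0 else 1 for i, s in enumerate(shape)]
    scale = scale.reshape(qparam_shape)
    zero_point = zero_point.reshape(qparam_shape)
    dq = (input.reshape(shape) - zero_point) * scale
    return dq.reshape(input.shape).to(output_dtype)
```

With this convention the existing granularities fall out as special cases: per_tensor is block_size == input.shape, per_channel along dim 0 of a 2-D weight is (1, input.shape[1]), and per_channel_group with group_size 32 on the last axis is (1, 32).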

Some Questions

  • for input and scale/zero_point, what do we do when they have different dtypes, e.g. when the input is fp16 but scales and zero_points are fp32? Do we always convert to fp32 and then do the computation?
  • Concerns about using torch.Tensor for per_tensor quantization instead of plain float/int numbers? It may run slower; are there any concerns about perf?
  • Other ways to choose qparams apart from symmetric and asymmetric?
  • clamping to quant_min/quant_max: should we include this in the quantize op or leave it out?
  • I'm also thinking about the API for end users; I think we could provide a util function to get the block size, e.g. get_block_size(input, {"quant_type": "per_channel_group", "group_size": 32, "axis": -1}) (a sketch follows this list)
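
A possible strawman for that util, using the dict-style spec from the example above (the spec format and the per_tensor/per_channel branches are my assumptions, not a settled API):

```python
def get_block_size(input, spec):
    # Hypothetical helper: map a user-facing granularity spec to the
    # block_size tuple consumed by the per-block primitives.
    quant_type = spec["quant_type"]
    if quant_type == "per_tensor":
        return tuple(input.shape)  # one block covers the whole tensor
    if quant_type == "per_channel":
        axis = spec["axis"] % input.dim()
        # every slice along `axis` gets its own qparams
        return tuple(1 if i == axis else s for i, s in enumerate(input.shape))
    if quant_type == "per_channel_group":
        axis = spec["axis"] % input.dim()
        # groups of `group_size` elements along `axis` share qparams
        return tuple(spec["group_size"] if i == axis else 1
                     for i in range(input.dim()))
    raise ValueError(f"unknown quant_type: {quant_type}")

# e.g. for input of shape (4, 64):
#   get_block_size(input, {"quant_type": "per_channel_group",
#                          "group_size": 32, "axis": -1})  # -> (1, 32)
```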
