
[QNN] [RFC] QNN Dialect -- Prequantize Models #3591

Closed
anijain2305 opened this issue Jul 19, 2019 · 25 comments

@anijain2305
Contributor

We are proposing a new dialect named QNN that introduces quantized versions of existing Relay operators. The goal is to support models that have been pre-quantized in the framework.

Some important notes about the QNN dialect:

  • QNN operators are lowered to existing Relay operators to ensure that we can reuse the Relay infrastructure.
  • Code resides in a new directory. Python files are in python/relay/qnn and C++ files are in src/relay/qnn.
  • QNN, like any other dialect, introduces new Relay passes. These passes only deal with QNN ops (like lowering QNN ops to existing Relay ops). For any generic optimization, we rely on existing Relay passes.

We can use this thread to discuss various open questions, such as:

  1. Code organization, namespaces, API discussion.
  2. QNN operator lowering - Infrastructure, correct sequence of Relay operations etc.
  3. Ways to efficiently add new operators with minimal engineering efforts.
  4. Requirements (if any) of new generic Relay passes to achieve good performance.
  5. Any new bugs that arise as we start testing integer computations more thoroughly.

The idea of QNN dialect was a result of discussion at Issue #2351. Thanks @tqchen @FrozenGene @jackwish @jnorwood @shoubhik for the discussions.

First few PRs for the QNN dialect

@tqchen
Member

tqchen commented Jul 20, 2019

also cc @ajtulloch @ZihengJiang @vinx13 @eqy @jroesch. @anijain2305 can you please list the API proposals and the reference APIs (in TFLite etc.)? Then we can try to get everyone's thoughts on these specific API designs.

@anijain2305
Contributor Author

anijain2305 commented Jul 20, 2019

Let's start with just Requantize to keep it focused. I would suggest adding a +1 if you like the API, to show agreement.

QNN proposal

def requantize(data,
               input_scale,
               input_zero_point,
               output_scale,
               output_zero_point,
               rounding="AWAY_FROM_ZERO",
               out_dtype="int8"):
    r"""Requantized operator.
     The requantize operator converts one quantized tensor representation to
    another quantized tensor representation. For the output tensor, we are
    provided with output scale and zero point. The computation is as follows
     Q_output = zp_output +  (scale_input)/(scale_ouptut) * (Q_input - zp_input)
     Parameters
    ----------
    data : tvm.relay.Expr
        The input data to the operator.
     input_scale: float
           The quantization scale for the input tensor.
     input_zero_point: int
           The zero point of the input tensor.
     output_scale: float
           The quantization scale for the output tensor.
     output_zero_point: int
           The zero point of the output tensor.
     rounding : string, optional
        Defines the rounding direction when the value is midway between two
        representable values.
     out_dtype : str, optional
        Specifies the output data type for mixed precision conv2d.
     Returns
    -------
    result : tvm.relay.Expr
        The computed result.
    """

TF Requantize

Arguments:

scope: A Scope object
input_min: The float value that the minimum quantized input value represents.
input_max: The float value that the maximum quantized input value represents.
requested_output_min: The float value that the minimum quantized output value represents.
requested_output_max: The float value that the maximum quantized output value represents.
out_type: The type of the output. Should be a lower bit depth than Tinput.
  • The min/max in TF are represented by scale and zero_point in QNN.
  • The rounding attribute in the QNN proposal gives a choice between standard rounding (away_from_zero) and round_towards_zero, serving as a performance-accuracy knob.

@tqchen tqchen changed the title [QNN] [RFC] QNN Dialect - Supporting pre-quantized models in TVM [QNN] [RFC] QNN Dialect -- Prequantize Models Jul 21, 2019
@anijain2305
Contributor Author

anijain2305 commented Jul 23, 2019

@FrozenGene @tqchen @u99127 Can you please approve the above API so that we can move on to the next discussion? I have so many things to discuss :)

@FrozenGene
Member

@anijain2305 Let me look at it this afternoon or evening.

@tqchen
Member

tqchen commented Jul 23, 2019

@anijain2305 To increase the throughput, can you also list the other APIs, one per post? We can use lazy consensus: wait for a week for people's feedback, then summarize.

@jnorwood

There are a couple of things in this gemmlowp quantization example that you should perhaps consider supporting in your requantize API:

  1. The ability to specify that the range should be extended to include the value zero.
  2. The ability to specify that the range should be nudged so that exact zero maps to an integer value.
    https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

@anijain2305
Contributor Author

@jnorwood Thanks for the comment. Both are good points. I will keep those abilities outside the scope of the requantize op, though.
Another function (not necessarily a Relay operator) can take the min/max and a config option (like nudging so that zero is exactly representable) and generate the scale and zero point accordingly. As of now, I am thinking of these as utility functions.
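
For illustration, a minimal sketch of such a utility, following the gemmlowp example linked above (the function name and the uint8 range are my own assumptions):

def choose_quantization_params(rmin, rmax, qmin=0, qmax=255):
    # Extend the range so that real 0.0 is always representable.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)
    # Assumes rmax > rmin; a real implementation would guard against a
    # degenerate (all-zero) range.
    scale = (rmax - rmin) / (qmax - qmin)
    # Nudge the zero point to the nearest integer in [qmin, qmax] so that
    # real 0.0 maps to an exact quantized value.
    initial_zp = qmin - rmin / scale
    zero_point = int(round(min(max(initial_zp, qmin), qmax)))
    return scale, zero_point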

@anijain2305
Contributor Author

anijain2305 commented Jul 23, 2019

QNN Conv2D operator

Tensorflow

tf.nn.quantized_conv2d(
    input,
    filter,
    min_input,
    max_input,
    min_filter,
    max_filter,
    strides,
    padding,
    out_type=tf.dtypes.qint32,
    dilations=[1, 1, 1, 1],
    name=None
)

MxNet

mxnet.symbol.contrib.quantized_conv(
    data=None,
    weight=None,
    bias=None,
    min_data=None,
    max_data=None,
    min_weight=None,
    max_weight=None,
    min_bias=None,
    max_bias=None,
    kernel=_Null,
    stride=_Null,
    dilate=_Null,
    pad=_Null,
    num_filter=_Null,
    num_group=_Null,
    workspace=_Null,
    no_bias=_Null,
    cudnn_tune=_Null,
    cudnn_off=_Null,
    layout=_Null,
    name=None,
    attr=None,
    out=None, **kwargs)

TFLite

inline void Conv(const ConvParams& params,
                 const RuntimeShape& input_shape, const uint8* input_data,
                 const RuntimeShape& filter_shape, const uint8* filter_data,
                 const RuntimeShape& bias_shape, const int32* bias_data,
                 const RuntimeShape& output_shape, uint8* output_data,
                 const RuntimeShape& im2col_shape, uint8* im2col_data,
                 void* cpu_backend_context) {


Here, ConvParams is defined as:

struct ConvParams {
  PaddingType padding_type;
  PaddingValues padding_values;
  // TODO(starka): This was just "stride", so check that width+height is OK.
  int16 stride_width;
  int16 stride_height;
  int16 dilation_width_factor;
  int16 dilation_height_factor;
  // uint8 inference params.
  // TODO(b/65838351): Use smaller types if appropriate.
  int32 input_offset;
  int32 weights_offset;
  int32 output_offset;
  int32 output_multiplier;
  int output_shift;
  // uint8, etc, activation params.
  int32 quantized_activation_min;
  int32 quantized_activation_max;
  // float activation params.
  float float_activation_min;
  float float_activation_max;
};

QNNPACK

enum qnnp_status qnnp_create_convolution2d_nhwc_q8(
    uint32_t input_padding_top,
    uint32_t input_padding_right,
    uint32_t input_padding_bottom,
    uint32_t input_padding_left,
    uint32_t kernel_height,
    uint32_t kernel_width,
    uint32_t subsampling_height,
    uint32_t subsampling_width,
    uint32_t dilation_height,
    uint32_t dilation_width,
    uint32_t groups,
    size_t group_input_channels,
    size_t group_output_channels,
    uint8_t input_zero_point,
    float input_scale,
    uint8_t kernel_zero_point,
    float kernel_scale,
    const uint8_t* kernel,
    const int32_t* bias,
    uint8_t output_zero_point,
    float output_scale,
    uint8_t output_min,
    uint8_t output_max,
    uint32_t flags,
    qnnp_operator_t* convolution_out)

QNN Proposal

def conv2d(data,
           weight,
           input_zero_point,
           kernel_zero_point,
           strides=(1, 1),
           padding=(0, 0),
           dilation=(1, 1),
           groups=1,
           channels=None,
           kernel_size=None,
           data_layout="NCHW",
           kernel_layout="OIHW",
           out_layout="",
           out_dtype="int32"):
    r"""Quantized 2D convolution.

    This operator convolves quantized weight with quantized data. The scale of
    the output quantized tensor is the product of the weight_scale and
    input_scale of the input quantized tensors. The zero point of the output
    quantized tensor is 0. By default, the dtype of output is int32. Please also
    refer to Requantize operator to understand how to scale back the int32
    output to (u)int8.


    Parameters
    ----------
    data : tvm.relay.Expr
        The input data to the operator.

    weight : tvm.relay.Expr
        The weight expressions.

    input_zero_point: int
           The zero point of the data distribution.

    kernel_zero_point: int
           The zero point of the quantized_kernel distribution.

    strides : tuple of int, optional
        The strides of convolution.

    padding : tuple of int, optional
        The padding of convolution on both sides of inputs before convolution.

    dilation : tuple of int, optional
        Specifies the dilation rate to be used for dilated convolution.

    groups : int, optional
        Number of groups for grouped convolution.

    channels : int, optional
        Number of output channels of this convolution.

    kernel_size : tuple of int, optional
        The spatial dimensions of the convolution kernel.

    data_layout : str, optional
        Layout of the input.

    kernel_layout : str, optional
        Layout of the weight.

    out_layout : str, optional
        Layout of the output, by default, out_layout is the same as data_layout

    out_dtype : str, optional
        Specifies the output data type for mixed precision conv2d.

    Returns
    -------
    result : tvm.relay.Expr
        The computed result.
    """

Key points

  • The API is very close to the TF API.
  • Others have bias in the API to perform fused computation. Instead of complicating the conv2d API, our proposal is to add a bias_add after qnn.conv2d; Relay fusion will fuse the computation.
  • TFLite and QNNPACK also have output_scale and output_zero_point. This is again because they fuse conv + bias + requantize into one op. We will have these 3 ops separately, connected by the framework parser depending on the framework graph, and rely on Relay fusion.
  • TFLite also has output activation min/max because of the fused ReLU. We can easily call a clip operator to do that.

An example of the conversion from a TFLite graph to a Relay graph looks like this:

[Figure: a TFLite quantized conv2d decomposed into Relay/QNN operators]
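
Since the diagram is an image, here is a rough textual sketch of the same decomposition (pseudocode; the attribute and variable names are illustrative, not the final API):

# A fused TFLite quantized conv2d (conv + bias + requantize + clamp)
# becomes a small Relay subgraph:
conv = qnn.conv2d(data, weight,
                  input_zero_point=data_zero_point,
                  kernel_zero_point=weight_zero_point,
                  out_dtype="int32")
conv_bias = nn.bias_add(conv, bias)            # bias is an int32 tensor
requant = qnn.requantize(conv_bias,
                         input_scale=data_scale * weight_scale,
                         input_zero_point=0,
                         output_scale=output_scale,
                         output_zero_point=output_zero_point,
                         out_dtype="uint8")
out = clip(requant, a_min=quantized_activation_min,
           a_max=quantized_activation_max)     # fused ReLU becomes a clip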

@FrozenGene
Member

@anijain2305 Could we also list the APIs of TFLite and QNNPACK? I think both should be considered, because we will parse TFLite models and QNNPACK is a good accelerated quantization library.

@jnorwood

The reference integer implementation of TFLite conv2d has optional bias parameters, which are pointers to 32-bit signed values; a null pointer can be used to skip the operation.

Is it the intention to always separate out the bias-add operation in the TVM APIs?

One other comment... I've seen an implementation of quantized conv2d where the 32-bit signed bias values were preloaded into the accumulator, rather than clearing the accumulator prior to the MAC loop. I don't recall a discussion of it. I'm wondering if this is a common enhancement to reduce the overflow/underflow possibility of the signed 32-bit accumulations when a bias is expected?

@anijain2305
Contributor Author

@FrozenGene Updated the Conv2D API. Also added a diagram explaining how to go from TFLite to Relay operators.

@anijain2305
Contributor Author

anijain2305 commented Jul 23, 2019

@jnorwood Yes, the bias is kept outside as a separate operator, but it can be fused with qnn.conv2d.

Regarding the accumulation point: if we perform fusion and add the bias in int32 in the accumulator at the end, is it any different from preloading the accumulator? We need to ensure that the op is fused, i.e., the bias addition happens in the same accumulator where the conv2d has just finished.

@jnorwood

Regarding the accumulation point: if we perform fusion and add the bias in int32 in the accumulator at the end, is it any different from preloading the accumulator?

When preloading a negative bias, the positive accumulation range of a signed 32-bit accumulator is extended before it overflows, for example. Maybe the result of a post-hoc bias_add is the same for most implementations, but signed integer overflow behavior is undefined in the C standards, so the order of the bias_add operations might matter.

I saw the bias preload used in some paper. I'll check my notes and see if I can find it.

@FrozenGene
Member

FrozenGene commented Jul 24, 2019

@anijain2305 Could we list TFLite's ConvParams? I think it makes it clearer what TFLite's convolution computation needs.

For the diagram, I think maybe we could avoid the intermediate clip, because we have one clip inside the requantize operator. If we accept an output_min / output_max, then we just do one clip in requantize, with no intermediate clip.

@anijain2305
Contributor Author

anijain2305 commented Jul 24, 2019

@FrozenGene Thanks for the suggestion to add ConvParams. Just updated the original post.

@FrozenGene I don't think requantize should take output_min and output_max. Requantize can be used before/after any operator, where a ReLU might not be applicable at all. Instead, I would suggest having two clip operators and then relying on Relay passes to optimize the graph - in this case, converting two back-to-back clips into a single clip.
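
As a concrete illustration of the folding (my own example, not from the proposal): clip(clip(x, l1, h1), l2, h2) is equivalent to clip(x, max(l1, l2), min(h1, h2)), so the saturation clip coming out of requantize and the fused-ReLU clip collapse into a single operator once such a pass runs.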

We had similar discussion here - #2351 (comment)

@zhenhuaw-me
Contributor

I was looking into PRs #3531 and #3512, and noticed that the PRs are going to support 32-bit quantization. I'd like to move to this RFC thread to loop in more people on my concerns.

Before going further, let me clarify: Requantize obviously needs to support int32 input, which is the accumulated intermediate result of the multiplications in conv2d when the tensor type is int8. The question is whether Quantize/Dequantize need int32 output/input, which would mean a 32-bit (or even 16-bit) integer quantization approach.

First, and this is trivial: if we can use 32 bits, why not 32-bit floating point rather than integer? Yes, there are devices supporting only integers, but the industry has shown that 8 bits is enough for accuracy, and is even moving to 4 bits. So maybe we don't need to be so generic. Less is more :)

Second, I suspect it's impossible to support 32 bits due to arithmetic limitations.

  1. Let's start with 8 bits. With an 8-bit quantization approach, the tensor value range is (-2^7, 2^7]. Multiplying two int8 values gives a range of (-2^14, 2^14]. As the accumulation could run over thousands of products, it's not safe to accumulate them in int16; we need int32. That is why we need int32 in int8 quantization.
  2. Similarly, take 16 bits. The original value range is (-2^15, 2^15], and the products fall in (-2^30, 2^30]. When accumulating, we should use a 64-bit integer. Hmmm, int64 is still available, though it sometimes needs to be emulated on very low-end devices. In this case, Requantize would need to support 64-bit input if we are going to enable 16-bit support.
  3. Now, 32 bits. (-2^31, 2^31] multiplies out to (-2^62, 2^62], which must be held in 64-bit registers. A 128-bit integer would be required to accumulate the int64 products, unless there are only a few of them.

That's the general argument. Strong assumptions could be introduced to handle the issues raised above, but I can hardly see the benefit. Maybe 8-bit quantization is enough for now :)
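
A quick NumPy illustration of point 1 (my own example; the sizes and values are hypothetical, chosen only to show the wrap-around):

import numpy as np

# One output element of a conv2d reduces over thousands of int8 products.
a = np.full(4096, 127, dtype=np.int8)
b = np.full(4096, -128, dtype=np.int8)

# Each int8 * int8 product fits in int16, but their sum does not.
acc16 = np.sum(a.astype(np.int16) * b.astype(np.int16), dtype=np.int16)
acc32 = np.sum(a.astype(np.int32) * b.astype(np.int32), dtype=np.int32)

print(acc16)  # wrapped around, silently wrong
print(acc32)  # -66584576, the true dot product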

@anijain2305
Contributor Author

Thanks @jackwish
This is a very good analysis. Everything makes sense. I vote for restricting Quantize and Dequantize to (u)int8 for now.

If, in the future, we see (u)int16, we can tackle it then. int32 is highly unlikely (why not just go to FP32, as you say).

@tqchen
Member

tqchen commented Jul 25, 2019

@ajtulloch @antinucleon @ZihengJiang @jroesch, it would be great if you could also review this week and sign off on the APIs.

@jnorwood

On the discussion of the supported intermediate formats...

The Intel AVX-512 VNNI/DL Boost operations are SIMD operations that support 8-bit quantized inference.

Their intrinsic descriptions show that the int8 multiplies go to intermediate 16-bit results before they are accumulated (in the FMA) into int32 registers.

Would it be worth considering how TVM can detect sequences that can be substituted with these DL Boost/VNNI SIMD intrinsics? If TVM supported operations with 16-bit accumulators, then you could conceivably specify sequences of operations that exactly match the intrinsic descriptions. Perhaps that would make the intrinsic's pattern easier to detect.

From the publicity, I think these DL Boost/VNNI AVX-512 SIMD operations are considered important for AI inference support on the new Xeon processors.

I'm providing the expanded intrinsic description for one of the ops as an example

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#avx512techs=AVX512_VNNI&expand=2202
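
For reference, here is my reading of what one 32-bit lane of the linked _mm512_dpbusd_epi32 intrinsic computes, as a scalar Python sketch (an approximation of the pseudocode on that page, not the authoritative semantics):

def vpdpbusd_lane(acc32, a_bytes, b_bytes):
    # a_bytes: four unsigned 8-bit values (0..255)
    # b_bytes: four signed 8-bit values (-128..127)
    # Each product fits in a signed 16-bit intermediate; the four products
    # are then summed into the 32-bit accumulator without saturation.
    assert len(a_bytes) == len(b_bytes) == 4
    for u, s in zip(a_bytes, b_bytes):
        acc32 += u * s
    return acc32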

@anijain2305
Contributor Author

anijain2305 commented Jul 25, 2019

@jnorwood We are using intrinsics for Skylake (not VNNI), and there is already a PR to take advantage of VNNI intrinsics for Cascade Lake: #3388

@shoubhik
Contributor

shoubhik commented Jul 25, 2019

@jackwish, I want to get my understanding correct. When you say

I was looking into PRs #3531 and #3512, and noticed that the PRs are going to support 32-bit quantization.

are you talking about the inputs or outputs of the quantize/dequantize ops being int32? Because the current implementation for

  1. Quantize limits the input to float32 and the output to (u)int8
  2. Dequantize limits the input to (u)int8 and the output to float32

Or are you suggesting we should support a higher number of bits (>16) for these ops?

@zhenhuaw-me
Contributor

zhenhuaw-me commented Jul 26, 2019

@jackwish, I want to get my understanding correct. When you say

I was looking into PRs #3531 and #3512, and noticed that the PRs are going to support 32-bit quantization.

are you talking about the inputs or outputs of the quantize/dequantize ops being int32? Because the current implementation for

  1. Quantize limits the input to float32 and the output to (u)int8
  2. Dequantize limits the input to (u)int8 and the output to float32

Or are you suggesting we should support a higher number of bits (>16) for these ops?

@shoubhik I was saying we should limit to int8. I know your PR restricts to int8, while PR #3531 seems to be trying to enable int8/16/32. I moved here because I saw the two PRs share the same code but seem to be inconsistent in their quantization approach. Thanks for helping to clarify that.

@zhenhuaw-me
Contributor

Thanks @jackwish
This is a very good analysis. Everything makes sense. I vote for restricting Quantize and Dequantize to (u)int8 for now.

If, in the future, we see (u)int16, we can tackle it then. int32 is highly unlikely (why not just go to FP32, as you say).

I am very glad that it helped :)

@shoubhik
Contributor

@jackwish, I want to get my understanding correct. When you say

I was looking into PRs #3531 and #3512, and noticed that the PRs are going to support 32-bit quantization.

are you talking about the inputs or outputs of the quantize/dequantize ops being int32? Because the current implementation for

  1. Quantize limits the input to float32 and the output to (u)int8
  2. Dequantize limits the input to (u)int8 and the output to float32

Or are you suggesting we should support a higher number of bits (>16) for these ops?

@shoubhik I was saying we should limit to int8. I know your PR restricts to int8, while PR #3531 seems to be trying to enable int8/16/32. I moved here because I saw the two PRs share the same code but seem to be inconsistent in their quantization approach. Thanks for helping to clarify that.

Thanks for the explanation.

@tqchen
Member

tqchen commented Nov 27, 2019

close as QNN has become part of the repo
