[RFC][Quantization] Support quantized models from TensorflowLite #2351
Starting from a TFLite importer to Relay sounds great. cc @jroesch @ajtulloch @yzhliu
If you want to support transforming quantized models, be careful to transform ops like ...
Thanks for the reminder. However, I don't fully understand it. Do you mean I should be careful ...
Hi, I recently wrote some code to read in the tflite quantized examples and translate them to nnef output. Their operations are pretty similar to nnvm ops. I translated the two mobilenets and the four inception models. There's a cmake config that pulls down all the models and converts them. Please feel free to use whatever you want from it. I forked the NNEF Tools project, https://github.com/jnorwood and put the converter under contrib/converters/tflite_converters/tflite_to_nnef. I only added processing for the ops I needed, and I only did quantized data. tflite uses uint8 quantization, btw, with offsets for both weights and features. Biases are int32. NNEF passes quantization configuration in a separate file from the graph. Also, note that tflite uses nhwc everywhere.
@FrozenGene I am interested in contributing to this Issue. Is it possible to share the progress?
Hey @anijain2305, thanks for your interest. Currently I am working on #3141; after that, I will start it. BTW, our internal support is based on NNVM and we have completed it: we get the same results as TFLite and better performance than TFLite. However, I will have to spend some time translating it to Relay when making the PR. I have to say that I am busy this month with our product development, and the open-sourcing process in my company will follow. I will @ you when that PR is ready.
Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC. Other non-TVM-related links were used to understand quantization.

Covered frameworks for now - TFLite and MxNet
List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (other quantized_* operators will be along the same lines as quantized_conv2d).

Op quantize

```python
def quantize(data, scale, zero_point, out_dtype):
"""
Quantize takes the scale and zero_point attributes and quantizes the
FP32 input data to int8/uint8 tensor.
Parameters
-----------
data: FP32 tensor
The input tensor in FP32.
scale: FP32 scalar (An attribute of the op)
The float scalar to scale the int8 values back to FP32.
zero_point: Int32 zero point (An attribute of the op)
The zero point of the distribution.
out_dtype: String
The dtype of the output. Can only be int8/uint8
Returns
-------
quantized_data: int8/uint8 tensor
The quantized tensor.
""" Key points to discuss
Op quantized_conv2d

```python
def quantized_conv2d(quantized_data, quantized_kernel,
input_scale, input_zero_point,
kernel_scale, kernel_zero_point,
output_scale, output_zero_point,
out_dtype,
# All the old remaining ones from conv2d
strides=(1, 1),
padding=(0, 0),
dilation=(1, 1),
groups=1,
channels=None,
kernel_size=None,
data_layout="NCHW",
kernel_layout="OIHW",
out_layout=""):
"""
Quantized 2D convolution over quantized input and kernel tensors, producing a
quantized output tensor. The scale and zero_point calculations happen outside
the Relay graph, i.e., the framework parsers will have to compute the scale
and offset if only min and max are provided.
Parameters
-----------
quantized_data: int8/uint8 tensor
The quantized input tensor in int8/uint8.
quantized_kernel: int8/uint8 tensor
The quantized kernel tensor in int8/uint8.
input_scale: FP32 scalar (An attribute of the op)
The float scalar to scale the quantized_data int8 values back to FP32.
input_zero_point: Int32 zero point (An attribute of the op)
The zero point of the quantized_data distribution.
kernel_scale: FP32 scalar (An attribute of the op)
The float scalar to scale the quantized_kernel int8 values back to FP32.
kernel_zero_point: Int32 zero point (An attribute of the op)
The zero point of the quantized_kernel distribution.
output_scale: FP32 scalar (An attribute of the op)
The output scale is set during the quantization process using training/calibration.
The float scalar to scale the quantized_output int8 values back to FP32.
output_zero_point: Int32 zero point (An attribute of the op)
The output zero point is set during the quantization process using training/calibration.
The zero point of the quantized_output distribution.
out_dtype: String
The dtype of the quantized_output. Can only be int8/uint8.
The requantization from int32 to int8/uint8 is a part of the op compute.
..... Other attributes are the same as in conv2d.
Returns
-------
quantized_output: int8/uint8 tensor
The quantized tensor.
""" Key points to discuss further
Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

```python
def dequantize(quantized_data, scale, zero_point, out_dtype):
"""
Dequantize takes the scale and zero_point attributes and dequantizes the
int8/uint8 tensor to FP32 tensor.
Parameters
-----------
quantized_data: int8/uint8 quantized input tensor
The input tensor in int8/uint8.
scale: FP32 scalar (An attribute of the op)
The float scalar to scale the int8 values back to FP32.
zero_point: Int32 zero point (An attribute of the op)
The zero point of the distribution.
out_dtype: String
The dtype of the output. Can only be float32.
Returns
-------
data: FP32 tensor
The dequantized tensor.
""" |
Regarding output_min=0 / output_max=0: these will be used to restrict the output range, which can be calculated beforehand; see TFLite's ... From my experience, we needn't ...
Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though. Yes, Mobilenet has the q_add, but I suggest Inception V3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations. Also, the MobilenetV2 q_add inputs require rescale... but in both q_concat and q_add you can recalculate the prior op's downscale multipliers so you can eliminate the extra rescales. Also, depending on your allocation capabilities, you can get rid of all concats.
Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op is presumably supposed to output FP32, otherwise the semantics are confusing. The requantization can convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK. And maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are supposed to read these parameters and get things done.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to keep it out of conv. The reason we have these 2 extra min/maxes is the fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In the case of TFLite, when we see a fused conv, we can add one more clamp operator at the end of the sequence of ops.
Makes sense. For now, I was thinking of not worrying about depthwise conv, so I decided to take Inception V3 into account. Given that we are in the starting position, I don't have any strong inclination towards any particular network. My motive is to focus on getting the right infrastructure in place first and showcase it with one large network. The performance micro-optimizations can then be phased in.
Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d? This will be set to int32 for TFLite, Caffe2, and QNNPACK. But if some network needs accumulation in FP32, then it will support that as well.
Not sure about this. The good thing is that the conv2d Relay operator can be shared across FP32 and quantized tensor types. The bad thing is that the compute now depends on the quantized tensor type. This might require new Relay optimizations, preventing us from fully using the existing infrastructure.
No matter whether we have a fused activation function, we always need output_min / output_max, because we get a conv int32 result but we need a uint8 result, so we must restrict int32 to uint8. If we don't have a fused activation function (and when we have a quantized TFLite model, in many cases we don't), the output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we are better off putting these two into the conv arguments. We could then avoid producing another clamp; it is simply computed in conv2d's int32 -> uint8 requantize step, which is natural.
In the case where the activation is not fused, the values have to be clamped to 0/255, i.e., the uint8 range, which is basically the out_dtype. So we do not need any extra information for quantized_conv2d to go back to uint8/int8 other than out_dtype. Correct? Now, if the activation is fused, I agree that we will have two clamps: one inside quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay pass that replaces two back-to-back clamps with one clamp operator (see the sketch below). The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So it is necessary to come up with the right abstractions first; performance can then be achieved by writing Relay passes.
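As a small aside on the back-to-back clamps mentioned above: they collapse into a single clamp whose bounds are the intersection of the two ranges. A tiny NumPy check of that identity (my own illustration, not Relay pass code):

```python
import numpy as np

acc = np.arange(-10, 300, dtype=np.int32)          # stand-in for a requantized output
two_clamps = np.clip(np.clip(acc, 0, 255), 0, 6)   # conv2d clamp, then fused relu6 clamp
one_clamp = np.clip(acc, max(0, 0), min(255, 6))   # merged clamp: intersect the ranges
assert np.array_equal(two_clamps, one_clamp)
```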
Yes, I agree that when we don't have an activation, we don't need anything. However, another thing we should consider is how to integrate with other libraries, such as QNNPACK. QNNPACK also needs output min / max: https://github.com/pytorch/QNNPACK/blob/master/include/qnnpack.h#L62-L63
Here are some points to discuss:
Some of the discussions involve fusion, and that is something where TVM might be able to help. For example, in the current symmetric scheme, clip, relu6, and subsequent downcasting ops are automatically fused into the conv2d ops, while the conv2d op can simply output int32 (because the follow-up ops will get fused). I agree with @anijain2305 that we could try to get something minimal working, then start thinking about possible rewriting rules to get to some useful patterns if we decide that manual intervention is necessary. Ideally, we should have a generic schedule template that works for any fused pattern, just as in the current symmetric version, so we do not need all the different variants of fused conv2d ops. Also cc @vinx13 @ZihengJiang
I want to point out that the min and max values you mentioned are not related to the activation range in the original model; they are saturation values. In the case of mobilenet, for example, which has relu_6 used everywhere, I'm printing out the min and max activation values from the tflite mobilenet V2 below. The model uses uint8 downscale between layers, and uses the min and max value to clamp/saturate the values to 0..255 for all layers in that model. The thing it could be used for (but isn't here) is more or fewer quantization bits or signed int quantization... but tflite is using all-uint8 quantization for MobilenetV2. The amin and amax values below are tflite's output_activation_min and output_activation_max from their quantized reference ops for conv and dw_conv.
Similarly, for the tflite quantized Inception V3 model, all the output_activation_min, output_activation_max values are 0 and 255.
To explain a little further... during training they determine the range of input values, and they determine the downscale multiplier that will shrink the observed range to 0..255 (for the uint8 quantization). The FP downscale multiplier is converted to integer multiply and right-shift constants, which are the mpy and shft values in my log. At inference time, the downscaled accumulator (after applying the downscale) may be outside the uint8 quantization range, and so they clamp/saturate to that range. In these current models they are using uint8 quantization, so the range is 0..255, but it appears to me they are providing the min and max to support other numbers of quantization bits. I have seen support for several 4-bit GPU implementations recently, so maybe this is to support something like that.
Some comments on @anijain2305's reply :)
A network uses operators (or layers, or whatever we'd like to call them) regardless of the accumulation format. The format is part of a software system's mechanism. So, I guess we don't need a ...
I was suggesting extending the existing tensor rather than introducing a new tensor type. I assume this won't lead to new Relay optimizations :) EDIT: Btw, channel-wise quantization parameters are likely to be included in TensorFlow/TFLite, and also in the TVM stack as a roadmap item. In this way, it could be easier to manage tensor-described parameters.
Regarding @jnorwood's comments on the output min/max of conv2d: your observations about the values of output min/max are correct, but they are still activations. One thing I always try to convey is that the INT8 values in quantization are a representation of the original FP32 values. When we talk about ReLU6 activations, it means that in FP32 format the op outputs FP32 values in the range [0, 6]. For INT8 quantization, the INT8 data is a representation of an FP32 value, which means the output min/max (which is typically [0, 255] of INT8 type in the pre-provided quantized MobileNet) represents [0, 6] of FP32 type - the INT8 0/255 is actually FP32 0/6. Try the output scale (0.023528477177023888) with the activation min/max, and we get a value range like [0, 5.999761581420898] (from the output of the first conv of the pre-provided quantized MobileNet). Conclusions can easily be drawn once we have this in mind :)
I would suggest designing the infrastructure so that it supports both symmetric and asymmetric quantization. We can certainly start with symmetric to flush out the flow, while keeping in mind that we share as much infrastructure as possible between them.
I think this is required for both asymmetric and symmetric quantization. These ops will be rewritten to low-level instructions by a Relay pass. How about using ...
I am not sure yet. The only unknowns to me are the special rounding operations used in converting the floating-point multiplication into an integer multiplication for scaling the quantized conv matrix. But they might already be covered by the current low-level ops.
I was hoping to re-use the FForwardRewrite infrastructure to lower the ops. Do you anticipate more passes here?
All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.
If your rounding is the concept from my previous comment, maybe ...
tflite computes the output_multiplier and output_shift integer parameters from a double input in the call to QuantizeMultiplier. These are the integer downscale multiplier and right_shift divider parameters.
I'd suggest you print out their output_multiplier and output_shift values for comparison, since errors can start there. Their downscale operations are implemented in int64.
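For readers unfamiliar with this scheme, here is a rough Python sketch of the idea: decompose the real downscale factor into a 31-bit integer multiplier plus a shift, then requantize with an integer multiply, round, and shift. This is my own simplified illustration (plain round-half-up, no saturating high-multiply), not TFLite's exact code:

```python
import math

def quantize_multiplier(real_multiplier):
    """Approximate 0 < real_multiplier < 1 as int32_mult * 2**(shift - 31)."""
    significand, shift = math.frexp(real_multiplier)  # significand in [0.5, 1)
    int32_mult = int(round(significand * (1 << 31)))
    if int32_mult == (1 << 31):                       # handle rounding overflow
        int32_mult //= 2
        shift += 1
    return int32_mult, shift

def downscale(acc, int32_mult, shift, zero_point, qmin=0, qmax=255):
    """Scale an int32 accumulator to uint8 using an integer multiply and shift."""
    total_shift = 31 - shift
    rounded = (acc * int32_mult + (1 << (total_shift - 1))) >> total_shift
    return min(max(rounded + zero_point, qmin), qmax)
```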
Goal - Act as medium of discussion for pull request apache#2351. Features - New quantized conv2d op in Relay - Python API interface to instantiate the Relay op - Infer Type implemented - Lowering of quantized_conv op to low-level Relay ops. Discussion points - Does the namespace look correct? - Relay op is called 'relay.op.nn._quantize.quantized_conv2d' - Idea is that any op under the '_quantize' namespace will go through rewrite. - Should we reuse Conv2DRel and Conv2DAttrs? - Tried prototyping; found it hard to derive from the Conv2DAttrs struct. - Infer Type has a param field; this needs to come from the right datatype. Missing implementation - Lowering of quantized conv into conv+cast is incomplete. - Will work on it async; this is orthogonal to the discussion.
Requantize converts one quantized tensor representation to another quantized representation. The PR has the following implementation features - Requantize operator defined in the qnn namespace - relay.qnn.requantize - Lowering of requantize to existing Relay operators - Integer fixed-point implementation of requantize - Two rounding modes - FE_UPWARDS (round towards infinity) and FE_AWAY_FROM_ZERO (std::round behavior) - Floating-point implementation as well, that can act as a reference or can be used for devices when FP32 computation is not used - Unit test cases. Relevant Issue - apache#2351. Credit to TFLite and GemmLowp for providing reference implementations.
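As a side note on the two rounding modes named above: they differ only in how ties (values ending in .5) are broken on negative inputs. A small Python illustration under that interpretation (my own sketch, not the code from this PR):

```python
import math

def round_upward(x):
    """Ties toward +infinity: 2.5 -> 3, -2.5 -> -2."""
    return math.floor(x + 0.5)

def round_away_from_zero(x):
    """std::round behavior, ties away from zero: 2.5 -> 3, -2.5 -> -3."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

for v in (2.5, -2.5, 1.4, -1.6):
    print(v, round_upward(v), round_away_from_zero(v))
```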
The discussion in this thread has gotten quite long, and it seems we are converging. I recommend we close this thread and open a new RFC thread, "QNN Dialect", with the latest proposals of the APIs that @anijain2305 and @shoubhik are putting together (please also put in related APIs in TF/QNN for reference to back the decisions). This way we keep the community informed and we can move forward with these implementations. I hope we can get +1 representation from the different groups who are interested in this direction, in particular @jnorwood @ajtulloch @FrozenGene @yzhliu. Thanks everyone for the hard work.
@anijain2305 can you lead the proposal discussion?
I agree, we should move the proposal to a new thread.
@anijain2305 can you open the RFC thread? Sorry for being a bit formal in this case; we want to set an example for the first public dialect discussions.
@tqchen Thanks for the reminder. Just created one :)
Let us move to #3591
A quick question here, since I can't see this mentioned on #3591: is this network going to be quantized per-tensor as well as with the new per-channel quantization that is appearing in tflite 2.0? IIUC, tf1.13 has per-tensor quantization rather than per-channel quantization. I.e., more interestingly, can the Relay design support both? Regards
Good question. We have only supported TF1.13 quantization. TF2.0 has separate scales and was not considered in the previous discussion. Seems there is a gap here. cc @anijain2305
* [Relay] [Quantization] WIP - Common files for the quantization work. * [Relay] [Quantization] WIP - Prototyping requantize op. * Requantize operator implementation. Requantize converts one quantized tensor representation to another quantized representation. The PR has following implementation features - Requantize operator defined in qnn namespace - relay.qnn.requantize - Lowering of the requantize to existing Relay operators - Integer fixed point implementation of requantize - Two rounding modes - FE_UPWARDS (round towards infinity) and FE_AWAY_FROM_ZERO (std::round behavior) - Floating point implementation as well, that can act as reference or can be used for devices when FP32 computation is not used. - Unit test cases Relevant Issue - #2351 Credit to TFLite and GemmLowp to provide reference implementations. * Typo and lint fixes. * Doc fix. * Uncommenting the lint script (fixing mistake). * Modifying the unit tests. * Moving C++ files into src/relay/qnn * Moving python files to python/tvm/relay/qnn. Some minor fixes. * Moving the attrs.h inside the include directory. * Pushing files that I forgot earlier. Changing util location. * Incorporating comments. API change. Lint fixes. * Modifying the GetFixedPointMultiplierShift API as per comments. * Forgot the dialect change. * Changing rewrite to qnn_lower. * Renaming Quantize to Qnn for clarity. * Remove use_int_domain. * Incorporating review comments. * Adding API doc for QNN dialect. * Move the qnn_lower pass to transform namespace. * Moving from expr to module. Adding namespace in C++. * Minor sentence rewrites. Added qnn namespace. * Added the API doc. * Changing default out_dtype to int8. Adding a test with in/out_dtype as uint8. * Style fixes. Better error messages. * Adding documentation. * More documentation fixes. * Adding out dtype check for requantize. * Adding corner case for FP32 to fixed point conversion. * Adding extra line. * Documentation fix. * Adding static inline. * Incorporating jackwish comment. Removed idtype from requantize lowering. * Removing Quantize/Dequantize code. Restricting Requantize to (u)int8/int32. * Style fixes. * Fix the docs. * Move to Legalize API.
Let me first reference @ajtulloch's comment about the quantization workflow:
However, we have frameworks that can do steps 1 -> 5 well, like TensorFlow. For example, TensorFlow has quantization-aware training, which does step 2 and gets good accuracy in the end.
In industry development, one common scenario is that a company divides algorithm and engine/framework work into two different teams. The algorithm team just sends a model to the engine team to boost the performance. So if the algorithm team can use TensorFlow's quantization-aware training, they will know the accuracy before delivering the model to the engine team. The engine team is just responsible for boosting the performance.
I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM for the previous reason. This is not a replacement of #2116; it is just a supplement for TVM's quantization.
After initial investigation and effort, on the Mobilenet V1 model, INT8 can get a speedup of about 30% compared with FP32 on ARM CPU.
Support TFLite FP32 Relay frontend. PR: [TFLite] Support TFLite FP32 Relay frontend. #2365
Support TFLite INT8 Relay frontend
Extend the attribute of the convolution and related ops to support quantization
Auto-TVM on ARM CPU can work with INT8
Welcome any feedback.