Add dot product support for quantized convolution. #6445

Merged

FrozenGene merged 10 commits into apache:master from the conv2d_dot_prod branch on Oct 6, 2020

Conversation

giuseros
Contributor

High-level description of the submission

We added two new intrinsics in topi/arm_cpu/tensor_intrin.py (a reference sketch of their semantics follows the list):

  • mmla4x4: computes a matrix multiplication between a tile A(4,4) and a tile B(4,4)
  • mmla16x4: computes a matrix multiplication between a tile A(rows,4) and a tile B(4,16)
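
For illustration, here is a minimal NumPy reference of what the two intrinsics are meant to compute. This models only their semantics; the actual TIR intrinsics map the inner 4-element reductions onto the Armv8.2-A dot-product instructions (sdot/udot):

    import numpy as np

    def mmla4x4_ref(a_tile, b_tile):
        # A(4,4) x B(4,4) -> C(4,4): int8 inputs, int32 accumulation.
        return a_tile.astype("int32") @ b_tile.astype("int32")

    def mmla16x4_ref(a_rows, b_tile):
        # A(rows,4) x B(4,16) -> C(rows,16): int8 inputs, int32 accumulation.
        return a_rows.astype("int32") @ b_tile.astype("int32")

    A = np.random.randint(-128, 128, size=(4, 4), dtype="int8")
    B = np.random.randint(-128, 128, size=(4, 4), dtype="int8")
    assert mmla4x4_ref(A, B).dtype == np.int32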

We then use those intrinsics in two separate strategies. The strategies are added in topi/arm_cpu/conv2d_int8.py and the schedules are implemented in topi/arm_cpu/conv2d_gemm.py. In particular:

  • schedule_conv2d_gemm, when accelerated, packs matrix A, computes the GEMM, and unpacks the resulting matrix. This uses the mmla4x4 intrinsic.
  • schedule_conv2d_gemm_hybrid does no packing on A and C, which stay in native form. This uses the mmla16x4 intrinsic.

Please note that, due to the limitations of tensorize, we need to pad matrix A in both cases (when its dimensions are not multiples of the tiling shape); a minimal sketch of this padding is shown below.
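
For illustration, a minimal NumPy sketch of the padding requirement (the helper name is hypothetical, not the actual TOPI code): A is zero-padded so that both of its dimensions become multiples of the tile shape, which lets tensorize split the loops evenly:

    import numpy as np

    def pad_to_tile_multiple(a, tile_rows=4, tile_cols=4):
        # Zero-pad A(M, K) so M and K become multiples of the tile shape.
        rows, cols = a.shape
        pad_rows = (-rows) % tile_rows
        pad_cols = (-cols) % tile_cols
        return np.pad(a, ((0, pad_rows), (0, pad_cols)), constant_values=0)

    A = np.ones((10, 6), dtype="int8")   # 10 and 6 are not multiples of 4
    assert pad_to_tile_multiple(A).shape == (12, 8)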

RFC

This PR is based on the following RFC: https://discuss.tvm.apache.org/t/rfc-accelerate-quantized-convolution-through-dot-product/7873

Change-Id: Id0d818d84ffc458c6dad7983fd350a0f3d5db395

@giuseros force-pushed the conv2d_dot_prod branch 2 times, most recently from f7b86dc to 995c6ab on September 16, 2020 at 17:24
@ZihengJiang
Contributor

@anijain2305 @FrozenGene Would you please have a look at this?

@u99127
Contributor

u99127 commented Sep 17, 2020

Can you see why CI is failing, @giuseros?

@giuseros
Contributor Author

@u99127, I am on it. It is strange that the command I ran locally didn't catch this.

@giuseros
Contributor Author

Hmm, the last failure seems like a tolerance issue. Let me re-trigger the CI.

Contributor

@mbaret mbaret left a comment

Broadly this looks good, although I find some of the naming confusing, and combining the dot-product and non-dot-product strategies could become difficult to maintain as more intrinsics get added.

Also on testing, I think we want some tests which at least exercise the compilation route here and test against either some known TIR or assembly (until we get Arm CI).

Review threads (since resolved) were opened on: python/tvm/relay/op/strategy/arm_cpu.py, python/tvm/topi/arm_cpu/arm_utils.py, python/tvm/topi/arm_cpu/conv2d_alter_op.py, python/tvm/topi/arm_cpu/conv2d_gemm.py, python/tvm/topi/arm_cpu/conv2d_int8.py, and python/tvm/topi/arm_cpu/tensor_intrin.py.
@giuseros
Contributor Author

Hi @mbaret,
Thanks for the review!

I addressed the comments and added compilation tests to verify the compilation flow with dot-product.
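
For context, a minimal compile-only sketch of the kind of test meant here (the shapes and the target string are assumptions, not the actual test code): building an int8 NHWC conv2d for an AArch64 target with the dot-product feature enabled exercises the new strategy even without Arm hardware in CI:

    import tvm
    from tvm import relay

    data = relay.var("data", shape=(1, 56, 56, 64), dtype="int8")
    weight = relay.var("weight", shape=(3, 3, 64, 64), dtype="int8")
    conv = relay.nn.conv2d(
        data, weight,
        kernel_size=(3, 3), padding=(1, 1),
        data_layout="NHWC", kernel_layout="HWIO",
        out_dtype="int32",
    )
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

    # +v8.2a,+dotprod enables the sdot/udot instructions on AArch64.
    target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod"
    with tvm.transform.PassContext(opt_level=3):
        relay.build(mod, target=target)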

Member

@FrozenGene FrozenGene left a comment

LGTM.

Giuseppe Rossini added 8 commits September 29, 2020 14:50
Review threads (since resolved) were opened on python/tvm/topi/arm_cpu/tensor_intrin.py.
Contributor

@u99127 u99127 left a comment

LGTM, thanks.

Contributor

@mbaret mbaret left a comment

final few comments

Further review threads (since resolved) were opened on python/tvm/topi/arm_cpu/tensor_intrin.py.
Contributor

@mbaret mbaret left a comment

Thanks for your patience :) LGTM

@giuseros
Contributor Author

giuseros commented Oct 5, 2020

Hi @mbaret,
Thank you for the careful review!

@FrozenGene, @anijain2305, should we merge this in?

Member

@FrozenGene FrozenGene left a comment

last comment

@@ -445,7 +443,7 @@ def gemm_quantized(M, N, K, unroll, interleave, in_type, out_type):
     )

     c_buffer = tvm.tir.decl_buffer(
-        C.shape, dtype=out_type, name="c_buffer", offset_factor=1, strides=[te.var("sc"), 1]
+        C.shape, dtype="int32", name="c_buffer", offset_factor=1, strides=[te.var("sc"), 1]
Member

What is the reason for changing this?

Contributor Author

@giuseros giuseros Oct 6, 2020

Hi @FrozenGene, the problem is the following: in quantized conv2d we do the convolution and then a requantization (these are two different Relay operators). Conv2d goes from int8 to int32, and requantization goes from int32 to int8, so in theory this would work with out_type.

However, in some tests (which pre-date my changes and which I ran internally) I noticed that the (conv2d) out_type is set to int8 (or uint8). In that case the intrinsic still needs to produce an int32 value, and the cast to int8 (or uint8) needs to happen at a later stage.

This change basically says: no matter what out_type is, the intrinsic will produce an int32 result. If we want the output to be int8 (which would be wrong, but some tests do it to simplify the testing), the conversion needs to happen later.
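
For illustration, a minimal NumPy sketch of the point above (the scale and zero point are made-up values): the intrinsic always accumulates in int32, and the narrowing to int8/uint8 is a separate requantize step that runs later:

    import numpy as np

    def int8_matmul_accumulate(a, b):
        # What the intrinsic produces: int8 inputs, int32 accumulator.
        return a.astype("int32") @ b.astype("int32")

    def requantize_to_int8(acc, scale=0.05, zero_point=0):
        # Separate requantization (a distinct Relay operator in practice):
        # int32 -> int8 with rounding and saturation.
        return np.clip(np.round(acc * scale) + zero_point, -128, 127).astype("int8")

    A = np.random.randint(-128, 128, size=(4, 4), dtype="int8")
    B = np.random.randint(-128, 128, size=(4, 4), dtype="int8")
    out_int8 = requantize_to_int8(int8_matmul_accumulate(A, B))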

Member

Got it.

@FrozenGene FrozenGene merged commit 8369adc into apache:master Oct 6, 2020
@FrozenGene
Member

Thanks everyone. Merged now.

@giuseros
Contributor Author

giuseros commented Oct 6, 2020

Thanks, @FrozenGene!

@giuseros giuseros deleted the conv2d_dot_prod branch October 6, 2020 14:00
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Oct 13, 2020
* Add dot product support for quantized convolution.

We added two new intrinsics in: topi/arm_cpu/tensor_intrin.py, namely
- mmla4x4: compute a matrix multiplication between tile A(4,4) and tile
  B(4,4)
- mmla16x4: compute a matrix multiplication between tile A(rows,4) and tile
  B(4,16)
Then we used those intrinsics in two separate strategies. We added the
strategies in topi/arm_cpu/conv2d_int8.py and implemented the schedules
in topi/arm_cpu/conv2d_gemm.py. In particular:
- schedule_conv2d_gemm, when accelerated, packs matrix A, compute GEMM,
  and unpack the resulting matrix. This uses the mmla4x4 intrinsic
- schedule_conv2d_gemm_hybrid doesn't do any packing on A and C which
  are in native form.  This uses the mmla16x4 intrinsic

Please note that for the limitations of `tensorize` we need to pad
matrix A in both cases (when dimensions are not multiple of the tiling
shape)

Change-Id: Id0d818d84ffc458c6dad7983fd350a0f3d5db395

* Add back nhwc_spatial_pack strategy as default

Change-Id: I8b1826a7ae1d742956296e8d157da19955a4942c

* Fix linting through Black

Change-Id: Ic74ef5461a90bca9f4d4980a214137e384d5f923

* Fix python linting

Change-Id: I5fb8a2ae4467a87bd3470f6b3753c074f9b7cc78

* Addressing review comments

Change-Id: I284b1f2c121051e672f548d6c6ee2a3267854e31

* Fix black linting issues

Change-Id: I1813b0226b536aedee0dce9eeeba27aa2d95518b

* Fixing failing test and adding tests for dot-product compilation

Change-Id: Ic040722abd5538fccb85af4de922394c939e7000

* Fixing linting and review comments

Change-Id: If09e3baa514c85dc78d3c27c2ac2fa2e01773d89

* Fixing black linting and address comments

Change-Id: I857b28b6f9b23307d8c1eebc509de6ad2783c756

* Address review comments

Change-Id: I63d1a639d4a72abeb33148fd2868cd356ef84122
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Oct 14, 2020
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Oct 15, 2020
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Oct 16, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Oct 19, 2020