
[RFC] Add AVX512VNNI support for TVM #3388

Merged · 1 commit · Sep 13, 2019

Conversation

jianyuh
Contributor

@jianyuh jianyuh commented Jun 18, 2019

This PR adds the first AVX512 VNNI instruction support in TVM.

To compute the intrinsic kernels for uint8 * int8 -> int32 accumulation:

  • Originally, we need the following instructions:
vpmaddubsw zmm28, zmm31, zmm30
vpmaddwd zmm28, zmm29, zmm28
vpaddd zmm0, zmm28, zmm0
  • After this PR with AVX512 VNNI, we only need the following instruction:
vpdpbusd zmm0, zmm31, zmm30
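For readers unfamiliar with the instruction, here is a minimal pure-Python reference model of what a single 512-bit vpdpbusd computes (16 int32 lanes, four unsigned-by-signed byte products per lane, accumulated into the destination). This is a semantic sketch only; the function name is illustrative, and the real three-instruction pmaddubsw sequence additionally saturates its 16-bit intermediates, which vpdpbusd does not.

```python
def vpdpbusd_ref(acc, a_u8, b_s8):
    """Reference model of one 512-bit vpdpbusd.

    acc: 16 int32 accumulator lanes; a_u8: 64 unsigned bytes;
    b_s8: 64 signed bytes. Each lane i accumulates the sum of the
    4 adjacent products a_u8[4*i + j] * b_s8[4*i + j].
    """
    return [acc[i] + sum(a_u8[4 * i + j] * b_s8[4 * i + j] for j in range(4))
            for i in range(16)]

# Example: all-ones times all-twos, starting from a zeroed accumulator.
out = vpdpbusd_ref([0] * 16, [1] * 64, [2] * 64)
# each lane: 4 products of 1 * 2 -> 8
```

Note that the destination doubles as an accumulator input, which is exactly the property the discussion below is about.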

We benchmarked the current TVM with the benchmark routine in this PR on an Intel Cascade Lake machine (Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz).
The theoretical peak performance for this Intel Cascade Lake machine is ~280 Gops/s with Turbo off.

  • Before this PR:

Tensorization: running time: 15.545 ms, 138.14 Gops/s, efficiency: 0.49

  • After this PR:

Tensorization: running time: 10.443 ms, 205.64 Gops/s, efficiency: 0.73

As a reference, for our ongoing PR (pytorch/FBGEMM#111), FBGEMM can achieve ~252 Gops/s. The measured performance for MKL-DNN compiled with GCC is similar. As pointed out by @were in the comments below, the reason is that the current implementation in this PR does not fully utilize the accumulation in the vpdpbusd instruction.

@jianyuh
Contributor Author

jianyuh commented Jun 18, 2019

It appears that the CI test is not using LLVM 8.0 or a higher version, and thus does not support AVX512 VNNI instructions.

@jianyuh jianyuh marked this pull request as ready for review June 18, 2019 18:14
@jianyuh jianyuh changed the title Add AVX512VNNI support for TVM [RFC] Add AVX512VNNI support for TVM Jun 18, 2019
@tqchen
Member

tqchen commented Jun 18, 2019

Perhaps a good time to update the CI infra to keep up with LLVM mainline, see steps in https://docs.tvm.ai/contribute/pull_request.html#ci-environment

Contributor

@anijain2305 left a comment

Thanks for the contribution!

@tqchen tqchen added the "status: need update" label Jun 27, 2019
@tqchen
Member

tqchen commented Jun 27, 2019

Because we need the test case to work for all environments, please add a feature detection step to skip the test if VNNI is not yet available.

@tqchen
Member

tqchen commented Jul 8, 2019

@jianyuh can you add a pre-condition to the test cases so it skips the test when LLVM8 is not enabled? You can use https://github.com/dmlc/tvm/blob/9bfdc55c572e03f6cfac6994a9e75f8fd9252850/python/tvm/intrin.py#L193 to look up the intrinsic to see if it is available.

I will update the CI to add LLVM8 this week. However, we also need to test against older versions of LLVM.
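The suggested pre-condition can be sketched as follows. Hedged assumptions: the lookup helper is passed in as a parameter because the exact module path of TVM's LLVM intrinsic-id lookup varies by version; the convention (returning 0 for an unknown intrinsic) follows LLVM's `Intrinsic::getIntrinsicForGCCBuiltin`-style lookups.

```python
def vnni_available(lookup_intrinsic_id):
    """Return True if the LLVM build knows the AVX512-VNNI dot-product intrinsic.

    lookup_intrinsic_id maps an LLVM intrinsic name to its numeric id,
    returning 0 when the intrinsic is unknown (e.g. LLVM older than 8.0).
    The helper name used here is an assumption, not a confirmed TVM API.
    """
    return lookup_intrinsic_id("llvm.x86.avx512.vpdpbusd.512") != 0

# Sketch of use inside a test (tvm import elided):
# if not vnni_available(llvm_lookup_intrinsic_id):
#     print("Skipping test: this LLVM build lacks AVX512 VNNI")
#     return
```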

@jianyuh
Contributor Author

jianyuh commented Jul 9, 2019

@tqchen Will update this soon (sorry for being busy with some other things recently).

@jianyuh jianyuh force-pushed the avx512vnni branch 6 times, most recently from 53d1faa to 8e5f1a6 Compare July 15, 2019 16:19
@jianyuh
Contributor Author

jianyuh commented Jul 15, 2019

My PR was based on a previous version of TVM; I am not sure what has changed in TVM recently.

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/6/pipeline/
Not sure why "llvm.x86.avx512.pmaddubs.w.512" (an AVX512 instruction, not a VNNI instruction) is not recognized as an LLVM intrinsic.

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/7/pipeline
When I use the tensorize routine and pass in the "dot_16x1x16_int8_int8_int32" function (the original AVX512 implementation for uint8 x int8 multiplication), it reports the error "TVMError: Check failed: type_code_ == kNodeHandle (10 vs. 8) : expected NodeHandle but get FunctionHandle".

Any insights about what might be wrong here? Thanks in advance!

@anijain2305
Contributor

anijain2305 commented Jul 15, 2019

I will update the CI to add LLVM8 this week.

Hi @tqchen, is there any update on the LLVM8 front? We are also looking into this and have similar test issue.

@anijain2305
Contributor

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/6/pipeline/
Not sure why "llvm.x86.avx512.pmaddubs.w.512" (an AVX512 instruction, not a VNNI instruction) is not recognized as an LLVM intrinsic.

This is happening because the LLVM version in CI is 6.0, as Tianqi mentioned. You can skip the test by performing the intrinsic lookup as shown in the post above.

data = tvm.placeholder((num_int8_elements,), dtype='uint8', name='data')
kernel = tvm.placeholder((int32_lanes, num_int8_elements), dtype='int8', name='kernel')
k = tvm.reduce_axis((0, num_int8_elements), name='k')
C = tvm.compute((int32_lanes,),
                lambda i: tvm.sum(data[k].astype('int32') *
                                  kernel[i, k].astype('int32'),
                                  axis=k),
                name='C')  # reduction body completed for context; diff hunk was truncated here
Contributor

@were left a comment

To my limited knowledge of the semantics of both VNNI and TVM-generated code, there might be a better software-defined description of VNNI that lets you avoid the buffer reset.

Say VNNI does something like this:

for (int i = 0; i < 16; ++i) {
    int32_t sum = 0;  // products of uint8 * int8 are signed; accumulate in 32 bits
    for (int j = 0; j < 4; ++j)
        sum += a[i * 4 + j] * b[i * 4 + j];
    c[i] = c[i] + sum; // We do not want to set c[i] to zero; it is an accumulation
}

However, if we build C.op it will generate code like this:

for (int i = 0; i < 16; ++i) {
    c[i] = 0; // To emulate the semantics of VNNI, we definitely do not want this reset.
    for (int j = 0; j < 4; ++j)
        c[i] = c[i] + a[i * 4 + j] * b[i * 4 + j];
}

Therefore, I suggest making C only an intermediate result; we still need another tensor.
The whole code looks like this:

a = placeholder()
b = placeholder()
d = placeholder()
c = reduce(sum(a, b))
e = c + d
# binds e and d to the same buffer so that the results can be retained between invocations.
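The two-stage structure above can be modeled in pure Python (function names here are illustrative, not TVM API): the reduction stage produces a fresh intermediate c, so its implicit zero-initialization is harmless, and the separate elementwise stage e = c + d performs the accumulation; binding e and d to one buffer turns that stage into the in-place update that vpdpbusd implements directly.

```python
def dot_stage(a, b, lanes=16, k=4):
    # c = reduce(sum(a * b)): a fresh intermediate, so resetting it
    # to zero (as TVM's generated code does) loses nothing.
    return [sum(a[i * k + j] * b[i * k + j] for j in range(k))
            for i in range(lanes)]

def accumulate_stage(c, d):
    # e = c + d; with e and d aliased to the same buffer this becomes
    # an in-place accumulation, matching vpdpbusd's c[i] += sum semantics.
    return [ci + di for ci, di in zip(c, d)]

acc = [5] * 16  # partial results retained from a previous invocation
acc = accumulate_stage(dot_stage([1] * 64, [1] * 64), acc)
# each lane: 5 + 4 = 9
```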

Contributor Author

@jianyuh Jul 25, 2019

@were: Thanks for pointing it out. I am not familiar with the TVM-side code either; let me think about this. Background: we have recently incorporated VNNI into FBGEMM (https://github.com/pytorch/fbgemm; the VNNI part will be published soon: pytorch/FBGEMM#111). We also want to check whether TVM can support VNNI, and to do some performance comparisons between TVM, MKL-DNN, and FBGEMM.

@tqchen
Member

tqchen commented Jul 25, 2019

@jianyuh can you look into the CI error?

@were
Contributor

were commented Jul 25, 2019

I am not sure if tensorize is a good way to support VNNI:

  1. VNNI is not true tensorization, even though a reduction dimension is introduced; it still operates on 1-D inputs. Due to the design of the tensorize interface, you need to give the declared intrin the shapes of the offloaded tensors, but essentially they are 1-D.

  2. Another thing I am worried about is imperfect tiling. tensorize cuts out the whole loop body without being aware of what it replaced, so it is hard to extend this to the imperfect-tiling case.

@jianyuh
Contributor Author

jianyuh commented Jul 25, 2019

@tqchen: Will take a look soon. Let me know if this PR becomes a blocker for other things.

The current failure is shown as the following:

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/7/pipeline
When I use the tensorize routine and pass in the "dot_16x1x16_int8_int8_int32" function (the original AVX512 implementation for uint8 x int8 multiplication), it reports the error "TVMError: Check failed: type_code_ == kNodeHandle (10 vs. 8) : expected NodeHandle but get FunctionHandle".

Sorry, I am not familiar with the TVM code base. Do you have any insights about what NodeHandle/FunctionHandle are here?

@tqchen
Member

tqchen commented Jul 25, 2019

It could be due to wrong argument types being passed to the tensor intrinsic: the corresponding function requires a NodeRef subtype but instead gets a PackedFunc.

@jianyuh
Contributor Author

jianyuh commented Jul 31, 2019

I can get correct results locally with an older version of TVM. I have updated the summary of this PR to report the performance results.

However, I had the same issue as #3598 for the OSS compilation error (http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/11/pipeline).

TVMError: Check failed: is_one(e.region[i]->extent): Tensorize tensor_intrin: Input dimension mismatch with tensor intrin expected shape=[16, 4], given region=[range(min=((j.outer*16)/16), ext=(((((j.outer*16) + 15)/16) + 1) - j.outer)), range(min=(((((k.outer.outer*4) + k.outer.inner)*4)/4)*16), ext=((((((((k.outer.outer*16) + (k.outer.inner*4)) + 3)/4)*16) + 16) - (k.outer.inner*16)) - (k.outer.outer*64))), range(min=0, ext=4)]

Any temporary workaround for that? cc @tqchen , @anijain2305 .

@jianyuh
Contributor Author

jianyuh commented Jul 31, 2019

I am not sure if tensorize is a good way to support VNNI:

  1. VNNI is not true tensorization, even though a reduction dimension is introduced; it still operates on 1-D inputs. Due to the design of the tensorize interface, you need to give the declared intrin the shapes of the offloaded tensors, but essentially they are 1-D.
  2. Another thing I am worried about is imperfect tiling. tensorize cuts out the whole loop body without being aware of what it replaced, so it is hard to extend this to the imperfect-tiling case.

@were: You are right. I have reported the current performance of this implementation in the PR summary. I am not sure how to overcome this limitation of TVM. cc @tqchen @anijain2305

@jianyuh jianyuh force-pushed the avx512vnni branch 2 times, most recently from 459c07b to 1152721 Compare August 1, 2019 07:37
@jianyuh
Contributor Author

jianyuh commented Aug 1, 2019

Similar to @anijain2305's PR (#3516), we currently disable the AVX512 VNNI test in this PR.

Posted the question on tensorize failure in https://discuss.tvm.ai/t/workaround-for-tensorize-failure/3577. @anijain2305 posted the same issue in #3598.

@jianyuh
Contributor Author

jianyuh commented Aug 1, 2019

@FrozenGene @tqchen @anijain2305 @llyfacebook @were Ping for review.

t_sch[t_fc].unroll(a_koi)
t_sch[t_fc].tensorize(a_yi, pc)

# print(tvm.lower(t_sch, [X, packedW, t_fc], simple_mode=True))
Member

@FrozenGene Aug 1, 2019

remove this and other unnecessary comments

Contributor Author

Fixed.

Member

@FrozenGene left a comment

LGTM.

@FrozenGene
Member

If we have time, we could investigate why we could not achieve 252 Gops/s or more. Only 73% hardware efficiency means there is still a lot of room to dig into.

@tqchen
Member

tqchen commented Aug 1, 2019

@jianyuh please act on the review comments @were please https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly

@jianyuh
Contributor Author

jianyuh commented Aug 5, 2019

If we have time, we could investigate why we could not achieve 252 Gops/s or more. Only 73% hardware efficiency means there is still a lot of room to dig into.

252 Gops/s is a reasonable number, as it is ~90% hardware efficiency; currently FBGEMM and MKL-DNN can reach it. For the current PR, the reason is that we did not fully utilize the accumulation in the vpdpbusd instruction, so we get 205.6 Gops/s (73% efficiency).
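The efficiency figures in this thread follow directly from the ~280 Gops/s theoretical peak quoted in the PR summary (Turbo off):

```python
peak = 280.0              # theoretical peak in Gops/s, from this PR's summary
this_pr = 205.64 / peak   # this PR's tensorized kernel
fbgemm = 252.0 / peak     # FBGEMM / MKL-DNN reference
print(round(this_pr, 2))  # -> 0.73, the 73% efficiency quoted above
print(round(fbgemm, 2))   # -> 0.9, i.e. ~90% efficiency
```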

@jianyuh
Contributor Author

jianyuh commented Aug 5, 2019

@jianyuh please act on the review comments @were please https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly

I addressed the comments by @FrozenGene. For @were's comment, I gave it a try but somehow got lower performance. I think it might be related to various pieces of TVM code, so it may take more effort to address; I will take a look when I have time, but it might be slow. Maybe it is better to ship this PR first and optimize the performance in a follow-up PR.

@tqchen tqchen added the "status: accepted" label and removed the "status: need update" label Sep 13, 2019
@tqchen tqchen merged commit bb82e09 into apache:master Sep 13, 2019
@tqchen
Member

tqchen commented Sep 13, 2019

Thanks @jianyuh @were @FrozenGene, this PR is now merged.

wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 16, 2019
@anijain2305
Contributor

Hi @jianyuh, I am getting the following error when I try to run my benchmark:

LLVM ERROR: Cannot select: 0x23809ef0: v16i32 = X86ISD::VPDPBUSD 0x210a09a8, 0x210a02c0, 0x19eb81b0
  0x210a09a8: v16i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
  0x210a02c0: v16i32 = X86ISD::VBROADCAST 0x1da48690
    0x1da48690: i32,ch = load<(load 4 from %ir.scevgep54, !tbaa !553)> 0x21cfaeb8, 0x23879d70, undef:i64
      0x23879d70: i64 = add 0x19eb83b8, Constant:i64<-224>
        0x19eb83b8: i64 = add 0x1da47da0, 0x19eb8218
          0x1da47da0: i64,ch = CopyFromReg 0x21cfaeb8, Register:i64 %50
            0x23879b68: i64 = Register %50
          0x19eb8218: i64,ch = CopyFromReg 0x21cfaeb8, Register:i64 %52
            0x1da47a60: i64 = Register %52
        0x1da478c0: i64 = Constant<-224>
      0x2254ed88: i64 = undef
  0x19eb81b0: v16i32,ch = load<(load 64 from %ir.lsr.iv35, !tbaa !556)> 0x21cfaeb8, 0x2387a1e8, undef:i64
    0x2387a1e8: i64,ch = CopyFromReg 0x21cfaeb8, Register:i64 %51
      0x2254e6a0: i64 = Register %51
    0x2254ed88: i64 = undef
In function: __tvm_parallel_lambda.85

Was wondering if you ever saw this. Let me know; I will try to debug on my end.

@jianyuh
Contributor Author

jianyuh commented Oct 11, 2019

Hi @jianyuh, I am getting the following error when I try to run my benchmark:

LLVM ERROR: Cannot select: 0x23809ef0: v16i32 = X86ISD::VPDPBUSD ... (full error quoted above)

Was wondering if you ever saw this. Let me know. I will try to debug on my end.

I haven't seen such errors. Previously I had one issue (https://discuss.tvm.ai/t/workaround-for-tensorize-failure/3577), the same as your earlier observation (#3598). I rebased on top of a previous version of TVM (the version #3081 is based on, released in April 2019), and it worked fine.
