
[RFC] Add AVX512VNNI support for TVM #3388

Merged · 1 commit · Sep 13, 2019

Conversation

jianyuh
Contributor

@jianyuh jianyuh commented Jun 18, 2019

This PR adds the first AVX512 VNNI instruction support in TVM.

To compute the intrinsic kernels for uint8 * int8 -> int32 accumulation:

  • Originally, we need the following instructions:
vpmaddubsw zmm28, zmm31, zmm30
vpmaddwd zmm28, zmm29, zmm28
vpaddd zmm0, zmm28, zmm0
  • After this PR with AVX512 VNNI, we only need the following instruction:
vpdpbusd zmm0, zmm31, zmm30
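For readers unfamiliar with the instruction, here is a minimal pure-Python reference model of what a single 512-bit vpdpbusd computes (16 int32 lanes, four unsigned-by-signed byte products per lane, accumulated into the destination). This is a semantic sketch only; the function name is illustrative, and the real three-instruction pmaddubsw sequence additionally saturates its 16-bit intermediates, which vpdpbusd does not.

```python
def vpdpbusd_ref(acc, a_u8, b_s8):
    """Reference model of one 512-bit vpdpbusd.

    acc: 16 int32 accumulator lanes; a_u8: 64 unsigned bytes;
    b_s8: 64 signed bytes. Each lane i accumulates the sum of the
    4 adjacent products a_u8[4*i + j] * b_s8[4*i + j].
    """
    return [acc[i] + sum(a_u8[4 * i + j] * b_s8[4 * i + j] for j in range(4))
            for i in range(16)]

# Example: all-ones times all-twos, starting from a zeroed accumulator.
out = vpdpbusd_ref([0] * 16, [1] * 64, [2] * 64)
# each lane: 4 products of 1 * 2 -> 8
```

Note that the destination doubles as an accumulator input, which is exactly the property the discussion below is about.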

We benchmarked the current TVM with the benchmark routine in this PR on an Intel Cascade Lake machine (Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz).
The theoretical peak performance for this Intel Cascade Lake machine is ~280 Gops/s with Turbo off.

  • Before this PR:

Tensorization: running time: 15.545 ms, 138.14 Gops/s, efficiency: 0.49

  • After this PR:

Tensorization: running time: 10.443 ms, 205.64 Gops/s, efficiency: 0.73

As a reference, for our ongoing PR (pytorch/FBGEMM#111), FBGEMM can achieve ~252 Gops/s. The measured performance for MKL-DNN compiled with GCC is similar. As pointed out by @were in the comments below, the reason is that the current implementation in this PR does not fully utilize the accumulation in the vpdpbusd instruction.

@jianyuh
Contributor Author

jianyuh commented Jun 18, 2019

It appears that the CI test is not using LLVM 8.0 or a higher version, and thus does not support AVX512 VNNI instructions.

@jianyuh jianyuh marked this pull request as ready for review June 18, 2019 18:14
@jianyuh jianyuh changed the title Add AVX512VNNI support for TVM [RFC] Add AVX512VNNI support for TVM Jun 18, 2019
@tqchen
Member

tqchen commented Jun 18, 2019

Perhaps a good time to update the CI infra to keep up with LLVM mainline, see steps in https://docs.tvm.ai/contribute/pull_request.html#ci-environment

Contributor

@anijain2305 left a comment

Thanks for the contribution!

@tqchen tqchen added the "status: need update" label Jun 27, 2019
@tqchen
Member

tqchen commented Jun 27, 2019

Because we need the test case to work for all environments, please add a feature detection step to skip the test if VNNI is not yet available.

@tqchen
Member

tqchen commented Jul 8, 2019

@jianyuh can you add a pre-condition to the test cases so it skips the test when LLVM8 is not enabled? You can use https://github.com/dmlc/tvm/blob/9bfdc55c572e03f6cfac6994a9e75f8fd9252850/python/tvm/intrin.py#L193 to look up the intrinsic to see if it is available.

I will update the CI to add LLVM8 this week. However, we also need to test against older versions of LLVM.
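The suggested pre-condition can be sketched as follows. Hedged assumptions: the lookup helper is passed in as a parameter because the exact module path of TVM's LLVM intrinsic-id lookup varies by version; the convention (returning 0 for an unknown intrinsic) follows LLVM's `Intrinsic::getIntrinsicForGCCBuiltin`-style lookups.

```python
def vnni_available(lookup_intrinsic_id):
    """Return True if the LLVM build knows the AVX512-VNNI dot-product intrinsic.

    lookup_intrinsic_id maps an LLVM intrinsic name to its numeric id,
    returning 0 when the intrinsic is unknown (e.g. LLVM older than 8.0).
    The helper name used here is an assumption, not a confirmed TVM API.
    """
    return lookup_intrinsic_id("llvm.x86.avx512.vpdpbusd.512") != 0

# Sketch of use inside a test (tvm import elided):
# if not vnni_available(llvm_lookup_intrinsic_id):
#     print("Skipping test: this LLVM build lacks AVX512 VNNI")
#     return
```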

@jianyuh
Contributor Author

jianyuh commented Jul 9, 2019

@tqchen Will update this soon (sorry for being busy with some other things recently).

@jianyuh jianyuh force-pushed the avx512vnni branch 6 times, most recently from 53d1faa to 8e5f1a6 Compare July 15, 2019 16:19
@jianyuh
Contributor Author

jianyuh commented Jul 15, 2019

My PR was based on a previous version of TVM; I am not sure what has changed in TVM recently.

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/6/pipeline/
Not sure why "llvm.x86.avx512.pmaddubs.w.512" (an AVX512 instruction, not a VNNI instruction) is not recognized as an LLVM intrinsic.

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/7/pipeline
When I use the tensorize routine and pass in the "dot_16x1x16_int8_int8_int32" function (the original AVX512 implementation for uint8 x int8 multiplication), it reports the error "TVMError: Check failed: type_code_ == kNodeHandle (10 vs. 8) : expected NodeHandle but get FunctionHandle".

Any insights about what might be wrong here? Thanks in advance!

@anijain2305
Contributor

anijain2305 commented Jul 15, 2019

I will update the CI to add LLVM8 this week.

Hi @tqchen, is there any update on the LLVM8 front? We are also looking into this and have similar test issue.

@anijain2305
Contributor

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/6/pipeline/
Not sure why "llvm.x86.avx512.pmaddubs.w.512" (an AVX512 instruction, not a VNNI instruction) is not recognized as an LLVM intrinsic.

This is happening because the LLVM version in CI is 6.0, as Tianqi mentioned. You can skip the test by performing the intrinsic lookup as shown in the post above.

data = tvm.placeholder((num_int8_elements,), dtype='uint8', name='data')
kernel = tvm.placeholder((int32_lanes, num_int8_elements), dtype='int8', name='kernel')
k = tvm.reduce_axis((0, num_int8_elements), name='k')
C = tvm.compute((int32_lanes,),
                lambda i: tvm.sum(data[k].astype('int32') *
                                  kernel[i, k].astype('int32'),
                                  axis=k),
                name='C')  # reduction body completed for context; diff hunk was truncated here
Contributor

@were left a comment

To my limited knowledge of the semantics of both VNNI and TVM-generated code, there might be a better software-defined description of VNNI that lets you avoid the buffer reset.

Say VNNI does something like this:

for (int i = 0; i < 16; ++i) {
    int32_t sum = 0;  // products of uint8 * int8 are signed; accumulate in 32 bits
    for (int j = 0; j < 4; ++j)
        sum += a[i * 4 + j] * b[i * 4 + j];
    c[i] = c[i] + sum; // We do not want to set c[i] to zero; it is an accumulation
}

However, if we build C.op it will generate code like this:

for (int i = 0; i < 16; ++i) {
    c[i] = 0; // To emulate the semantics of VNNI, we definitely do not want this reset.
    for (int j = 0; j < 4; ++j)
        c[i] = c[i] + a[i * 4 + j] * b[i * 4 + j];
}

Therefore, I suggest making C only an intermediate result; we still need another tensor.
The whole code looks like this:

a = placeholder()
b = placeholder()
d = placeholder()
c = reduce(sum(a, b))
e = c + d
# binds e and d to the same buffer so that the results can be retained between invocations.
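The two-stage structure above can be modeled in pure Python (function names here are illustrative, not TVM API): the reduction stage produces a fresh intermediate c, so its implicit zero-initialization is harmless, and the separate elementwise stage e = c + d performs the accumulation; binding e and d to one buffer turns that stage into the in-place update that vpdpbusd implements directly.

```python
def dot_stage(a, b, lanes=16, k=4):
    # c = reduce(sum(a * b)): a fresh intermediate, so resetting it
    # to zero (as TVM's generated code does) loses nothing.
    return [sum(a[i * k + j] * b[i * k + j] for j in range(k))
            for i in range(lanes)]

def accumulate_stage(c, d):
    # e = c + d; with e and d aliased to the same buffer this becomes
    # an in-place accumulation, matching vpdpbusd's c[i] += sum semantics.
    return [ci + di for ci, di in zip(c, d)]

acc = [5] * 16  # partial results retained from a previous invocation
acc = accumulate_stage(dot_stage([1] * 64, [1] * 64), acc)
# each lane: 5 + 4 = 9
```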

Contributor Author

@jianyuh Jul 25, 2019

@were: Thanks for pointing it out. I am not familiar with the TVM-side code either; let me think about this. Background: we have recently incorporated VNNI into FBGEMM (https://github.com/pytorch/fbgemm; the VNNI part will be published soon: pytorch/FBGEMM#111). We also want to check whether TVM can support VNNI, and to do some performance comparisons between TVM, MKL-DNN, and FBGEMM.

@tqchen
Member

tqchen commented Jul 25, 2019

@jianyuh can you look into the CI error?

@were
Contributor

were commented Jul 25, 2019

I am not sure if tensorize is a good way to support VNNI:

  1. VNNI is not true tensorization, even though a reduction dimension is introduced; it still operates on 1-D inputs. Due to the design of the tensorize interface, you need to give the declared intrin the shapes of the offloaded tensors, but essentially they are 1-D.

  2. Another thing I am worried about is imperfect tiling. tensorize cuts out the whole loop body without being aware of what it replaced, so it is hard to extend this to the imperfect-tiling case.

@jianyuh
Contributor Author

jianyuh commented Jul 25, 2019

@tqchen: Will take a look soon. Let me know if this PR becomes a blocker for other things.

The current failure is shown as the following:

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/7/pipeline
When I use the tensorize routine and pass in the "dot_16x1x16_int8_int8_int32" function (the original AVX512 implementation for uint8 x int8 multiplication), it reports the error "TVMError: Check failed: type_code_ == kNodeHandle (10 vs. 8) : expected NodeHandle but get FunctionHandle".

Sorry, I am not familiar with the TVM code base. Do you have any insights about what NodeHandle/FunctionHandle are here?

@tqchen
Member

tqchen commented Jul 25, 2019

It could be due to wrong argument types being passed to the tensor intrinsic: the corresponding function requires a NodeRef subtype but instead gets a PackedFunc.

@jianyuh
Contributor Author

jianyuh commented Jul 31, 2019

I can get correct results locally with an older version of TVM. I have updated the summary of this PR to report the performance results.

However, I had the same issue as #3598 for the OSS compilation error (http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/11/pipeline).

TVMError: Check failed: is_one(e.region[i]->extent): Tensorize tensor_intrin: Input dimension mismatch with tensor intrin expected shape=[16, 4], given region=[range(min=((j.outer*16)/16), ext=(((((j.outer*16) + 15)/16) + 1) - j.outer)), range(min=(((((k.outer.outer*4) + k.outer.inner)*4)/4)*16), ext=((((((((k.outer.outer*16) + (k.outer.inner*4)) + 3)/4)*16) + 16) - (k.outer.inner*16)) - (k.outer.outer*64))), range(min=0, ext=4)]

Any temporary workaround for that? cc @tqchen , @anijain2305 .

@jianyuh
Contributor Author

jianyuh commented Jul 31, 2019

I am not sure if tensorize is a good way to support VNNI:

  1. VNNI is not true tensorization, even though a reduction dimension is introduced; it still operates on 1-D inputs. Due to the design of the tensorize interface, you need to give the declared intrin the shapes of the offloaded tensors, but essentially they are 1-D.
  2. Another thing I am worried about is imperfect tiling. tensorize cuts out the whole loop body without being aware of what it replaced, so it is hard to extend this to the imperfect-tiling case.

@were: You are right. I have reported the current performance of this implementation in the PR summary. I am not sure how to overcome this limitation of TVM. cc @tqchen @anijain2305

@jianyuh jianyuh force-pushed the avx512vnni branch 2 times, most recently from 459c07b to 1152721 Compare August 1, 2019 07:37
@jianyuh
Contributor Author

jianyuh commented Aug 1, 2019

Similar to @anijain2305's PR (#3516), we currently disable the AVX512 VNNI test in this PR.

Posted the question on tensorize failure in https://discuss.tvm.ai/t/workaround-for-tensorize-failure/3577. @anijain2305 posted the same issue in #3598.

@jianyuh
Contributor Author

jianyuh commented Aug 1, 2019

@FrozenGene @tqchen @anijain2305 @llyfacebook @were Ping for review.

t_sch[t_fc].unroll(a_koi)
t_sch[t_fc].tensorize(a_yi, pc)

# print(tvm.lower(t_sch, [X, packedW, t_fc], simple_mode=True))
Member

@FrozenGene Aug 1, 2019

remove this and other unnecessary comments

Contributor Author

Fixed.

Member

@FrozenGene left a comment

LGTM.

@FrozenGene
Member

If we have time, we could investigate why we could not achieve 252 Gops/s or more. Only 73% hardware efficiency means there is still a lot of room to dig into.

@tqchen
Member

tqchen commented Aug 1, 2019

@jianyuh please act on the review comments @were please https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly

@jianyuh
Contributor Author

jianyuh commented Aug 5, 2019

If we have time, we could investigate why we could not achieve 252 Gops/s or more. Only 73% hardware efficiency means there is still a lot of room to dig into.

252 Gops/s is a reasonable number, as it is ~90% hardware efficiency; currently FBGEMM and MKL-DNN can reach it. For the current PR, the reason is that we did not fully utilize the accumulation in the vpdpbusd instruction, so we get 205.6 Gops/s (73% efficiency).
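The efficiency figures in this thread follow directly from the ~280 Gops/s theoretical peak quoted in the PR summary (Turbo off):

```python
peak = 280.0              # theoretical peak in Gops/s, from this PR's summary
this_pr = 205.64 / peak   # this PR's tensorized kernel
fbgemm = 252.0 / peak     # FBGEMM / MKL-DNN reference
print(round(this_pr, 2))  # -> 0.73, the 73% efficiency quoted above
print(round(fbgemm, 2))   # -> 0.9, i.e. ~90% efficiency
```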

@jianyuh
Contributor Author

jianyuh commented Aug 5, 2019

@jianyuh please act on the review comments @were please https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly

I addressed the comments by @FrozenGene. For @were's comment, I gave it a try but somehow got lower performance. I think it might be related to various pieces of TVM code, so it may take more effort to address; I will take a look when I have time, but it might be slow. Maybe it is better to ship this PR first and optimize the performance in a follow-up PR.

@tqchen tqchen added the "status: accepted" label and removed the "status: need update" label Sep 13, 2019
@tqchen tqchen merged commit bb82e09 into apache:master Sep 13, 2019
@tqchen
Member

tqchen commented Sep 13, 2019

Thanks @jianyuh @were @FrozenGene, this PR is now merged.

wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 16, 2019
@anijain2305
Contributor

Hi @jianyuh, I am getting the following error when I try to run my benchmark:

LLVM ERROR: Cannot select: 0x23809ef0: v16i32 = X86ISD::VPDPBUSD 0x210a09a8, 0x210a02c0, 0x19eb81b0
  0x210a09a8: v16i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
    0x23809a10: i32 = Constant<0>
  0x210a02c0: v16i32 = X86ISD::VBROADCAST 0x1da48690
    0x1da48690: i32,ch = load<(load 4 from %ir.scevgep54, !tbaa !553)> 0x21cfaeb8, 0x23879d70, undef:i64
      0x23879d70: i64 = add 0x19eb83b8, Constant:i64<-224>
        0x19eb83b8: i64 = add 0x1da47da0, 0x19eb8218
          0x1da47da0: i64,ch = CopyFromReg 0x21cfaeb8, Register:i64 %50
            0x23879b68: i64 = Register %50
          0x19eb8218: i64,ch = CopyFromReg 0x21cfaeb8, Register:i64 %52
            0x1da47a60: i64 = Register %52
        0x1da478c0: i64 = Constant<-224>
      0x2254ed88: i64 = undef
  0x19eb81b0: v16i32,ch = load<(load 64 from %ir.lsr.iv35, !tbaa !556)> 0x21cfaeb8, 0x2387a1e8, undef:i64
    0x2387a1e8: i64,ch = CopyFromReg 0x21cfaeb8, Register:i64 %51
      0x2254e6a0: i64 = Register %51
    0x2254ed88: i64 = undef
In function: __tvm_parallel_lambda.85

Was wondering if you ever saw this. Let me know; I will try to debug on my end.

@jianyuh
Contributor Author

jianyuh commented Oct 11, 2019

Hi @jianyuh, I am getting the following error when I try to run my benchmark:

LLVM ERROR: Cannot select: 0x23809ef0: v16i32 = X86ISD::VPDPBUSD ... (full error quoted above)

Was wondering if you ever saw this. Let me know. I will try to debug on my end.

I haven't seen such errors. Previously I had one issue (https://discuss.tvm.ai/t/workaround-for-tensorize-failure/3577), the same as your earlier observation (#3598). I rebased on top of a previous version of TVM (the version #3081 is based on, released in April 2019), and it worked fine.
