-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Add AVX512VNNI support for TVM #3388
Conversation
It appears that the CI test is not using LLVM 8.0 or higher version, thus not supporting AVX512 VNNI instructions. |
Perhaps a good time to update the CI infra to keep up with LLVM mainline, see steps in https://docs.tvm.ai/contribute/pull_request.html#ci-environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
Because we need the test case to work for all environments, please add a feature detection step to skip the test if VNNI is not yet available. |
@jianyuh can you add a pre-condition to the test cases so it skips the test when LLVM8 is not enabled? You can use https://github.com/dmlc/tvm/blob/9bfdc55c572e03f6cfac6994a9e75f8fd9252850/python/tvm/intrin.py#L193 to look up the intrinsic to see if it is available. I will update the CI to add LLVM8 this week. However, we also need to test against the older versions of LLVM |
@tqchen Will update this soon (sorry for being busy with some other things recently). |
53d1faa
to
8e5f1a6
Compare
My PR was based on the previous version of TVM. Not sure what are the recent changes for TVM.
Any insights about what might be wrong here? Thanks for advance! |
Hi @tqchen, is there any update on the LLVM8 front? We are also looking into this and have similar test issue. |
This is happening because the LLVM version is 6.0 in CI as Tianqi mentioned. You can skip the test by perform the intrinsic lookup as shown in the above post. |
data = tvm.placeholder((num_int8_elements,), dtype='uint8', name='data') | ||
kernel = tvm.placeholder((int32_lanes, num_int8_elements), dtype='int8', name='kernel') | ||
k = tvm.reduce_axis((0, num_int8_elements), name='k') | ||
C = tvm.compute((int32_lanes,), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my shallow knowledge on the semantics of both VNNI and TVM generated code.
There might be better software-defined description of VNNI so that you can avoid buffer reset.
Say VNNI does something like this:
for (int i = 0; i < 16; ++i) {
uint32_t sum = 0;
for (int j = 0; j < 4; ++j)
sum += a[i * 4 + j] * b[i * 4 + j];
c[i] = c[i] + sum; // We do not want to set c[i] to zero, it is an accumulation
}
However, if we build C.op
it will generate code like this:
for (int i = 0; i< 16; ++i) {
c[i] = 0; // To emulate the semantics of VNNI, we definitely do not want this reset.
for (int j = 0; j < 4; ++j)
c[i] = c[i] + a[i * 4 + j] * b[i * 4 + j];
}
Therefore, I suggest C
is only an intermediate result, still we need another tensor:
The whole code looks like this:
a = placeholder()
b = placeholder()
d = placeholder()
c = reduce(sum(a, b))
e = c + d
# binds e and d to the same buffer so that the results can be retained between invocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@were : Thanks for pointing it out. I am not familiar with TVM part code either. Let me think about this. Background: we have recently incorporated VNNI into the FBGEMM (https://github.com/pytorch/fbgemm, VNNI part code will be publish soon: pytorch/FBGEMM#111). We also want to check if TVM can support VNNI and we want to do some performance comparisons between TVM, MKL-DNN, and FBGEMM.
@jianyuh can you look into the CI error? |
I am not sure if tensorize is a good way to suport VNNI:
|
@tqchen : Will take a look soon. Let me know if this PR becomes the blockers for other things. The current failure is shown as the following:
Sorry I am not familiar with TVM code base. Do you have any insights about what are NodeHandle/FunctionHandle here? |
It could due to wrong types of arguments being passed to the tensor intrinsic. The corresponding function requires a NodeRef subtype but instead get a PackedFunc |
ce91403
to
ed1cb64
Compare
I can run the correct result locally with an older version of TVM. Updated the summary part for this PR to report the performance results. However, I had the same issue as #3598 for the OSS compilation error (http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/11/pipeline).
Any temporary workaround for that? cc @tqchen , @anijain2305 . |
@were : You are right. I report the current performance of the implementation in this PR in the summary. Not sure how to overcome the limitation of TVM. cc @tqchen @anijain2305 |
459c07b
to
1152721
Compare
Similar to @anijain2305 's PR (#3516), currently we disable the AVX512 VNNI test in this PR. Posted the question on tensorize failure in https://discuss.tvm.ai/t/workaround-for-tensorize-failure/3577. @anijain2305 posted the same issue in #3598. |
@FrozenGene @tqchen @anijain2305 @llyfacebook @were Ping for review. |
t_sch[t_fc].unroll(a_koi) | ||
t_sch[t_fc].tensorize(a_yi, pc) | ||
|
||
# print(tvm.lower(t_sch, [X, packedW, t_fc], simple_mode=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this and other unnecessary comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
If we have time, we could investigate why we couldn't achieve 252GFlops even more. Only 73% hardware efficiency means we have much work could dive. |
@jianyuh please act on the review comments @were please https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly |
252 Gops/s is a reasonable number as this is ~90% hardware efficiency. Currently FBGEMM and MKL-DNN can reach this number. For the current PR, the reason is that we didn't fully utilize the accumulation in |
I addressed the comments by @FrozenGene . For @were 's comment, I took a try but somehow I got lower performance. I think it might be related to various TVM code pieces, so it might take more efforts to address. I will take a look when I have time, but it might be slow. Maybe it is better to first ship this PR and add another PR later for optimizing the performance. |
Thanks @jianyuh @were @FrozenGene this PR is now merged |
Hi @jianyuh I am getting following error when I try to run my benchmark. It gives following error,
Was wondering if you ever saw this. Let me know. I will try to debug on my end. |
I haven't seen such errors. Previously I had one issue (https://discuss.tvm.ai/t/workaround-for-tensorize-failure/3577) the same with your previous observation (#3598). I rebased on top of a previous version of TVM (the version #3081 is based, which is released in April 2019), and it worked fine. |
We add the first AVX512VNNI instruction support in TVM in this PR.
To compute the intrinsics kernels of uint8 * int8 -> accumulation int32,
We benchmark the current TVM with the benchmark routine on an Intel Cascade Lake machine (Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz ) in this PR:
The theoretical peak performance for this Intel Cascade Lake machine is ~280 Gops/s with Turbo off.
As a reference, for our current ongoing PR (pytorch/FBGEMM#111), FBGEMM can achieve ~252 Gops/s. The measured performance for MKL-DNN compiled in GCC is similar. As pointed out by @were in the comments below, the reason is that in the current implementation in this PR, we didn't fully utilize the accumulation in
vpdpbusd
instruction.