
Conversation

@EgorBo (Member) commented Apr 10, 2023

I noticed that my PR #84348 regressed the TE benchmarks:

[chart: TE benchmark results showing the RPS regression]

I assume that our hardware for these benchmarks matches the official TE hardware, which is an Intel Xeon Gold 5120 CPU - a Q3'17 Skylake-X part where, presumably, AVX-512 first appeared. Presumably, most AVX-512 instructions have high power licenses on that CPU and can lead to throttling under extensive use. I validated that my change fixes the RPS issue on that hardware, but I'm not sure it's the right fix - should we somehow check that the current CPU is modern enough to freely use AVX-512? As far as I know, LLVM switched to "prefer 256-bit vectors" because of the same issue - https://reviews.llvm.org/D67259 (although, if e.g. -mcpu=znver4 is set, it will use 512-bit vectors).

I still enabled the 129..256 byte range for unrolling because without AVX-512 we'd fall back to a memmove call. I validated that this PR fixes the RPS issue and throughput is back to normal (it even makes the code faster than the base in my local runs).
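
To make the gating concrete, here is a minimal C# sketch of the dispatch being described - not the JIT's actual implementation (the JIT emits these copies inline), and the helper shape and comments are mine; only the 128/256-byte thresholds come from this PR:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class CopySketch
{
    // Sizes up to 128 bytes keep the existing xmm/ymm unrolling, sizes in
    // 129..256 are unrolled with zmm only when AVX-512 is usable, and
    // anything larger goes to the memmove-backed path.
    public static void CopyBlock(ReadOnlySpan<byte> src, Span<byte> dst)
    {
        int len = src.Length;
        if (Vector512.IsHardwareAccelerated && len > 128 && len <= 256)
        {
            ref byte s = ref MemoryMarshal.GetReference(src);
            ref byte d = ref MemoryMarshal.GetReference(dst.Slice(0, len));

            // Copy whole 64-byte chunks with 512-bit loads/stores
            // (assumes src and dst do not overlap).
            nuint i = 0;
            for (; i + 64 <= (nuint)len; i += 64)
                Vector512.StoreUnsafe(Vector512.LoadUnsafe(ref s, i), ref d, i);

            // Cover the remainder with one overlapping 64-byte copy aligned
            // to the end of the buffer (always in range because len > 128).
            nuint last = (nuint)len - 64;
            Vector512.StoreUnsafe(Vector512.LoadUnsafe(ref s, last), ref d, last);
        }
        else
        {
            src.CopyTo(dst); // <= 128 bytes: xmm/ymm unrolling; > 256: memmove
        }
    }
}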

PS: the CoreCLR GC uses AVX-512 in VXSort to sort card tables, and I think I saw perf improvements when I disabled AVX-512 there too.

cc @dotnet/avx512-contrib

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 10, 2023
@ghost ghost assigned EgorBo Apr 10, 2023
@ghost commented Apr 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@EgorBo (Member, Author) commented Apr 10, 2023

Codegen example this PR changes (before: one 64-byte zmm copy plus a 16-byte xmm tail; after: two 32-byte ymm copies plus the same xmm tail):

bool Foo(Span<byte> src, Span<byte> dst) => src.Slice(0, 70).TryCopyTo(dst);
; Method Prog:Foo(System.Span`1[ubyte],System.Span`1[ubyte]):bool:this
G_M65161_IG01:
       sub      rsp, 40
       vzeroupper 
						;; size=7 bbWeight=1 PerfScore 1.25

G_M65161_IG02:
       cmp      dword ptr [rdx+08H], 70
       jb       SHORT G_M65161_IG06
       mov      rax, bword ptr [rdx]
       mov      rdx, bword ptr [r8]
       mov      ecx, dword ptr [r8+08H]
       xor      r8d, r8d
       cmp      ecx, 70
       jb       SHORT G_M65161_IG04
						;; size=24 bbWeight=1 PerfScore 11.50

G_M65161_IG03:
-      vmovdqu32 zmm0, zmmword ptr [rax]
-      vmovdqu  xmm1, xmmword ptr [rax+36H]
-      vmovdqu32 zmmword ptr [rdx], zmm0
-      vmovdqu  xmmword ptr [rdx+36H], xmm1
+      vmovdqu  ymm0, ymmword ptr [rax]
+      vmovdqu  ymm1, ymmword ptr [rax+20H]
+      vmovdqu  xmm2, xmmword ptr [rax+36H]
+      vmovdqu  ymmword ptr [rdx], ymm0
+      vmovdqu  ymmword ptr [rdx+20H], ymm1
+      vmovdqu  xmmword ptr [rdx+36H], xmm2
       mov      r8d, 1
						;; size=28 bbWeight=0.50 PerfScore 6.12

G_M65161_IG04:
       mov      eax, r8d
						;; size=3 bbWeight=1 PerfScore 0.25

G_M65161_IG05:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

G_M65161_IG06:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
						;; size=7 bbWeight=0 PerfScore 0.00
; Total bytes of code: 74

@BruceForstall (Contributor) left a comment


This seems good for now. I wonder if we'll have to be more fine-grained in our performance tuning, e.g., even tuning the JIT dynamically by CPU part instead of only by ABI, as we do now.

@BruceForstall (Contributor) commented

Maybe we should set our defaults based on what hardware we expect to see the most "in the wild"?

E.g., here is a list of Azure hardware: https://azure.microsoft.com/en-in/pricing/details/virtual-machines/series

@benaadams (Member) commented Apr 10, 2023

It's due to frequency scaling on early AVX-512 chips (the TE chips are from 2017). Cloudflare did a write-up on it: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Xeon Gold 5120 - Intel (2017): Turbo Frequency/Active Cores https://en.wikichip.org/wiki/intel/xeon_gold/5120

[table: turbo frequency vs. active cores from WikiChip, including the lower AVX-512 turbo bins]

The chips are 6 years old, though.

@EgorBo (Member, Author) commented Apr 10, 2023

Is there any way to check that a chip is modern enough? I assume just checking the processors we have in Azure is not enough - or do we really need to make an allow-list? 😐

@EgorBo (Member, Author) commented Apr 10, 2023

Btw, I think I saw improvements in TE when I also disabled AVX1/AVX2, which matches Ben's picture. But I attributed that to the fact that we mostly work with small inputs.

@tannergooding (Member) commented

Is there any way to check that a chip is modern enough?

Not outside of model/stepping/revision checks, which we've avoided doing so far. Throttling is a general concern, even for Vector256<T>; it's just "less" on newer chips, and we've passed the point where it's considered beneficial enough on average.

For Vector512<T>, we're probably still a ways out from being able to assume any 64-byte workload is good enough, so we'll probably need to maintain checks for the >128-256 byte range for the time being. We still get a lot of benefit from the 128-bit and 256-bit instructions available on AVX-512+ hardware, however. -- The throttling largely comes from using 512-bit registers, not from using AVX-512 instructions.
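
To illustrate that last point: AVX-512VL encodes the new instructions on 128-bit and 256-bit registers, so code can use AVX-512-only operations without ever touching zmm. A hedged sketch using the .NET 8 intrinsics surface (vpmaxsq, 64-bit element max, has no SSE/AVX2 equivalent):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Avx512VlSketch
{
    public static void Main()
    {
        if (!Avx512F.VL.IsSupported)
        {
            Console.WriteLine("AVX-512F+VL not available");
            return;
        }

        var a = Vector256.Create(1L, -5L, 30L, 7L);
        var b = Vector256.Create(2L, -9L, 10L, 8L);

        // An AVX-512 instruction (vpmaxsq) operating on 256-bit ymm
        // registers via AVX-512VL - no 512-bit power license involved.
        Vector256<long> max = Avx512F.VL.Max(a, b);
        Console.WriteLine(max); // <2, -5, 30, 8>
    }
}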

@BruceForstall (Contributor) commented

It would be nice if we could somehow identify the sections in the JIT where we make optimization decisions assuming throttling will occur - decisions that might change in the future, or that might not be a problem at all for newer chips.

It would also be nice if we could measure across a variety of hardware and make decisions based on actual performance, what hardware is most available (especially in Azure), etc.

@damageboy (Contributor) commented

@EgorBo

Is there any way to check that a chip is modern enough? I assume just checking processors we have in Azure is not enough or we really need to make an allow-list

@tannergooding

Not outside model/stepping/revision checks, which we've avoided doing so far. Throttling is a general concern, even for Vector256<T>, its just "less" on newer chips and we've passed the point where its considered beneficial enough on average.

Agree with @tannergooding - you can only reasonably do this by checking CPUID information.
From past experience, this issue basically went away once Ice Lake (Intel 10nm/7nm) was introduced, where the power envelope was tuned (i.e. much less restrictive to begin with) and some process improvements kicked in. The issue never existed on AMD to begin with.
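
For reference, the kind of model/stepping check being (deliberately) avoided would look roughly like this in C#, reading CPUID leaf 1 through the X86Base intrinsic; the decode follows Intel's documented family/model encoding, and the Ice Lake-SP model in the comment is illustrative rather than a proposed allow-list:

using System;
using System.Runtime.Intrinsics.X86;

static class CpuIdSketch
{
    public static void Main()
    {
        if (!X86Base.IsSupported) return;

        (int eax, _, _, _) = X86Base.CpuId(1, 0);

        int stepping = eax & 0xF;
        int family   = (eax >> 8) & 0xF;
        int model    = (eax >> 4) & 0xF;
        if (family == 0xF)
            family += (eax >> 20) & 0xFF;               // extended family
        if (family == 0x6 || family == 0xF)
            model |= ((eax >> 16) & 0xF) << 4;          // extended model

        Console.WriteLine($"family=0x{family:X} model=0x{model:X} stepping={stepping}");

        // An allow-list would compare (family, model) against known-good
        // parts - e.g. Ice Lake-SP is family 0x6, model 0x6A - which is
        // exactly the per-part table the JIT has avoided maintaining.
    }
}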

https://github.com/travisdowns/avx-turbo does a wonderful job at measuring the scaling effect to the extent that it exists.

While the quoted results have not been updated in a while, various PRs/issues (including my own) report results for modern processors.

Just to be a little more helpful: the way I parse these results/gists is to compare the measured frequency of scalar_iadd for N executing cores against any of the heavyweight AVX-512 operations, such as horizontal permutes and FMA. For example, for Tiger Lake with a single core:

  • scalar_iadd runs at 4684 MHz, while
  • avx512_fma_t (fused multiply-add for double precision) runs at 4505 MHz, and
  • avx512_vpermd_t (horizontal 512-bit free-form shuffles) runs at 4505 MHz

Granted, 4505 MHz < 4684 MHz (a drop of roughly 3.8%), but this is the worst it gets: on the earliest processors where Intel fixed their game, for the worst-offending instructions power-wise. To me this looks like a non-issue from a specific generation onward.

@EgorBo (Member, Author) commented Apr 11, 2023

Thanks for the input, everyone! Since this PR unblocks the TE regression, I'm going to merge it, and since we have a plan to enable AVX-512 in other places, I guess we'll have to come up with some universal heuristic or something like that.

Presumably the issue is still there, but this PR narrows the range where we use AVX-512 ([64..256] -> [129..256] bytes), so there's less chance of hitting it on a hot path.

@EgorBo EgorBo merged commit 965a1be into dotnet:main Apr 12, 2023
@EgorBo EgorBo deleted the limit-memmove-for-large-data branch April 12, 2023 13:52
@ghost ghost locked as resolved and limited conversation to collaborators May 12, 2023