
Conversation

@EgorBo (Member) commented Apr 10, 2023

I noticed that my PR #84348 regressed the TE benchmarks:

[chart: TE benchmark results showing the RPS regression]

I assume that our hardware for these benchmarks matches the official TE hardware, which is an Intel Xeon Gold 5120 CPU - a Q3'17 Skylake-X part where, presumably, AVX-512 first appeared. Presumably, most AVX-512 instructions have high power licenses on that CPU and can lead to throttling under extensive use. I validated that my change fixes the RPS issue on that hardware, but I'm not sure it's the right fix - should we somehow check that the current CPU is modern enough to freely use AVX-512? As far as I know, LLVM switched to "prefer 256-bit vectors" because of the same issue - https://reviews.llvm.org/D67259 (although, if e.g. -mcpu=znver4 is set, it will use 512-bit vectors).

I still enabled the 129..256 byte range for unrolling because without AVX-512 we'd fall back to a memmove call. I validated that this PR fixes the RPS issue and throughput is back to normal (it even makes the code faster than the base in my local runs).
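
To make the gating concrete, here is a minimal C# sketch of the dispatch being described - not the JIT's actual implementation (the JIT emits these copies inline), and the helper shape and comments are mine; only the 128/256-byte thresholds come from this PR:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class CopySketch
{
    // Sizes up to 128 bytes keep the existing xmm/ymm unrolling, sizes in
    // 129..256 are unrolled with zmm only when AVX-512 is usable, and
    // anything larger goes to the memmove-backed path.
    public static void CopyBlock(ReadOnlySpan<byte> src, Span<byte> dst)
    {
        int len = src.Length;
        if (Vector512.IsHardwareAccelerated && len > 128 && len <= 256)
        {
            ref byte s = ref MemoryMarshal.GetReference(src);
            ref byte d = ref MemoryMarshal.GetReference(dst.Slice(0, len));

            // Copy whole 64-byte chunks with 512-bit loads/stores
            // (assumes src and dst do not overlap).
            nuint i = 0;
            for (; i + 64 <= (nuint)len; i += 64)
                Vector512.StoreUnsafe(Vector512.LoadUnsafe(ref s, i), ref d, i);

            // Cover the remainder with one overlapping 64-byte copy aligned
            // to the end of the buffer (always in range because len > 128).
            nuint last = (nuint)len - 64;
            Vector512.StoreUnsafe(Vector512.LoadUnsafe(ref s, last), ref d, last);
        }
        else
        {
            src.CopyTo(dst); // <= 128 bytes: xmm/ymm unrolling; > 256: memmove
        }
    }
}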

PS: the CoreCLR GC uses AVX-512 in VXSort to sort card tables, and I think I saw perf improvements when I disabled AVX-512 there too.

cc @dotnet/avx512-contrib

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 10, 2023
@ghost ghost assigned EgorBo Apr 10, 2023
@ghost commented Apr 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@EgorBo (Member, Author) commented Apr 10, 2023

Codegen example this PR changes (before: one 64-byte zmm copy plus a 16-byte xmm tail; after: two 32-byte ymm copies plus the same xmm tail):

bool Foo(Span<byte> src, Span<byte> dst) => src.Slice(0, 70).TryCopyTo(dst);
; Method Prog:Foo(System.Span`1[ubyte],System.Span`1[ubyte]):bool:this
G_M65161_IG01:
       sub      rsp, 40
       vzeroupper 
						;; size=7 bbWeight=1 PerfScore 1.25

G_M65161_IG02:
       cmp      dword ptr [rdx+08H], 70
       jb       SHORT G_M65161_IG06
       mov      rax, bword ptr [rdx]
       mov      rdx, bword ptr [r8]
       mov      ecx, dword ptr [r8+08H]
       xor      r8d, r8d
       cmp      ecx, 70
       jb       SHORT G_M65161_IG04
						;; size=24 bbWeight=1 PerfScore 11.50

G_M65161_IG03:
-      vmovdqu32 zmm0, zmmword ptr [rax]
-      vmovdqu  xmm1, xmmword ptr [rax+36H]
-      vmovdqu32 zmmword ptr [rdx], zmm0
-      vmovdqu  xmmword ptr [rdx+36H], xmm1
+      vmovdqu  ymm0, ymmword ptr [rax]
+      vmovdqu  ymm1, ymmword ptr [rax+20H]
+      vmovdqu  xmm2, xmmword ptr [rax+36H]
+      vmovdqu  ymmword ptr [rdx], ymm0
+      vmovdqu  ymmword ptr [rdx+20H], ymm1
+      vmovdqu  xmmword ptr [rdx+36H], xmm2
       mov      r8d, 1
						;; size=28 bbWeight=0.50 PerfScore 6.12

G_M65161_IG04:
       mov      eax, r8d
						;; size=3 bbWeight=1 PerfScore 0.25

G_M65161_IG05:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

G_M65161_IG06:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
						;; size=7 bbWeight=0 PerfScore 0.00
; Total bytes of code: 74

@BruceForstall (Contributor) left a comment


This seems good for now. I wonder if we'll have to be more fine-grained in our performance tuning, e.g., even tuning the JIT dynamically by CPU part instead of only by ABI, as we do now.

@BruceForstall (Contributor) commented

Maybe we should set our defaults based on what hardware we expect to see the most "in the wild"?

E.g., here is a list of Azure hardware: https://azure.microsoft.com/en-in/pricing/details/virtual-machines/series

@benaadams (Member) commented Apr 10, 2023

It's due to frequency scaling on early AVX-512 chips (the TE chips are from 2017). Cloudflare did a write-up on it: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Xeon Gold 5120 - Intel (2017): Turbo Frequency/Active Cores https://en.wikichip.org/wiki/intel/xeon_gold/5120

[table: turbo frequency vs. active cores from WikiChip, including the lower AVX-512 turbo bins]

The chips are 6 years old, though.

@EgorBo (Member, Author) commented Apr 10, 2023

Is there any way to check that a chip is modern enough? I assume just checking the processors we have in Azure is not enough - or do we really need to make an allow-list? 😐

@EgorBo (Member, Author) commented Apr 10, 2023

Btw, I think I saw improvements in TE when I also disabled AVX1/AVX2, which matches Ben's picture. But I attributed that to the fact that we mostly work with small inputs.

@tannergooding (Member) commented

Is there any way to check that a chip is modern enough?

Not outside of model/stepping/revision checks, which we've avoided doing so far. Throttling is a general concern, even for Vector256<T>; it's just "less" on newer chips, and we've passed the point where it's considered beneficial enough on average.

For Vector512<T>, we're probably still a ways out from being able to assume any 64-byte workload is good enough, so we'll probably need to maintain checks for the >128-256 byte range for the time being. We still get a lot of benefit from the 128-bit and 256-bit instructions available on AVX-512+ hardware, however. -- The throttling largely comes from using 512-bit registers, not from using AVX-512 instructions.
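
To illustrate that last point: AVX-512VL encodes the new instructions on 128-bit and 256-bit registers, so code can use AVX-512-only operations without ever touching zmm. A hedged sketch using the .NET 8 intrinsics surface (vpmaxsq, 64-bit element max, has no SSE/AVX2 equivalent):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Avx512VlSketch
{
    public static void Main()
    {
        if (!Avx512F.VL.IsSupported)
        {
            Console.WriteLine("AVX-512F+VL not available");
            return;
        }

        var a = Vector256.Create(1L, -5L, 30L, 7L);
        var b = Vector256.Create(2L, -9L, 10L, 8L);

        // An AVX-512 instruction (vpmaxsq) operating on 256-bit ymm
        // registers via AVX-512VL - no 512-bit power license involved.
        Vector256<long> max = Avx512F.VL.Max(a, b);
        Console.WriteLine(max); // <2, -5, 30, 8>
    }
}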

@BruceForstall (Contributor) commented

It would be nice if we could somehow identify the sections in the JIT where we make optimization decisions assuming throttling will occur - decisions that might change in the future, or that might not be a problem at all for newer chips.

It would also be nice if we could measure across a variety of hardware and make decisions based on actual performance, what hardware is most available (especially in Azure), etc.

@damageboy (Contributor) commented

@EgorBo

Is there any way to check that a chip is modern enough? I assume just checking processors we have in Azure is not enough or we really need to make an allow-list

@tannergooding

Not outside model/stepping/revision checks, which we've avoided doing so far. Throttling is a general concern, even for Vector256<T>, its just "less" on newer chips and we've passed the point where its considered beneficial enough on average.

Agree with @tannergooding - you can only reasonably do this by checking CPUID information.
From past experience, this issue basically went away once Ice Lake (Intel 10nm/7nm) was introduced, where the power envelope was tuned (i.e. much less restrictive to begin with) and some process improvements kicked in. The issue never existed on AMD to begin with.
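
For reference, the kind of model/stepping check being (deliberately) avoided would look roughly like this in C#, reading CPUID leaf 1 through the X86Base intrinsic; the decode follows Intel's documented family/model encoding, and the Ice Lake-SP model in the comment is illustrative rather than a proposed allow-list:

using System;
using System.Runtime.Intrinsics.X86;

static class CpuIdSketch
{
    public static void Main()
    {
        if (!X86Base.IsSupported) return;

        (int eax, _, _, _) = X86Base.CpuId(1, 0);

        int stepping = eax & 0xF;
        int family   = (eax >> 8) & 0xF;
        int model    = (eax >> 4) & 0xF;
        if (family == 0xF)
            family += (eax >> 20) & 0xFF;               // extended family
        if (family == 0x6 || family == 0xF)
            model |= ((eax >> 16) & 0xF) << 4;          // extended model

        Console.WriteLine($"family=0x{family:X} model=0x{model:X} stepping={stepping}");

        // An allow-list would compare (family, model) against known-good
        // parts - e.g. Ice Lake-SP is family 0x6, model 0x6A - which is
        // exactly the per-part table the JIT has avoided maintaining.
    }
}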

https://github.com/travisdowns/avx-turbo does a wonderful job at measuring the scaling effect to the extent that it exists.

While the quoted results have not been updated in a while, various PRs/issues (including my own) report results for modern processors.

Just to be a little more helpful: the way I parse these results/gists is to compare the measured frequency of scalar_iadd for N executing cores against any of the heavyweight AVX-512 operations, such as horizontal permutes and FMA. For example, for Tiger Lake with a single core:

  • scalar_iadd runs at 4684 MHz, while
  • avx512_fma_t (fused multiply-add for double precision) runs at 4505 MHz, and
  • avx512_vpermd_t (horizontal 512-bit free-form shuffles) runs at 4505 MHz

Granted, 4505 MHz < 4684 MHz (a drop of roughly 3.8%), but this is the worst it gets: on the earliest processors where Intel fixed their game, for the worst-offending instructions power-wise. To me this looks like a non-issue from a specific generation onward.

@EgorBo (Member, Author) commented Apr 11, 2023

Thanks for the input, everyone! Since this PR unblocks the TE regression, I'm going to merge it, and since we have a plan to enable AVX-512 in other places, I guess we'll have to come up with some universal heuristic or something like that.

Presumably the issue is still there, but this PR narrows the range where we use AVX-512 ([64..256] -> [129..256] bytes), so there's less chance of hitting it on a hot path.

@EgorBo EgorBo merged commit 965a1be into dotnet:main Apr 12, 2023
@EgorBo EgorBo deleted the limit-memmove-for-large-data branch April 12, 2023 13:52
@ghost ghost locked as resolved and limited conversation to collaborators May 12, 2023