-
Notifications
You must be signed in to change notification settings - Fork 5k
[Arm64] Use SIMD register to zero init frame #46609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Arm64] Use SIMD register to zero init frame #46609
Conversation
@TamarChristinaArm I was surprised to find that on Windows I almost don't see any performance improvement from the change. However, you can notice that the distribution becomes unimodal as opposed to baseline where you can see two trends. Would you say that the results look reasonable? Do you have any other suggestions what else I can try to improve here other than using I figured I don't need to employ all the optimizations as in https://github.com/ARM-software/optimized-routines/blob/0f4ae0c5b561de25acb10130fd5e473ec038f89d/string/aarch64/memset.S since the region size is multiple of 4 (actually, I think it is multiple of 8 but I need to double check this) while for memset it is important to handle all possible sizes. |
Correct me if I'm wrong, but the SQ2 is a Cortex-A76 based big.LITTLE core right? In which case the bimodal behavior of the core is most likely because it didn't enter streaming mode[1] for all the cases. The larger stores of the
No, the only other thing you already have on your roadmap, which is the loop alignment. 16b should be enough for most cases. |
Yeah if on average the |
I don't know how to tell what microarchitecture it's based on. Is there any Windows tool that can do this? The Wikipedia article says it is 4+4 cores that suggests it is big.LITTLE.
Thanks for the link - I am working on supporting |
Updated the first post with new results. Experimentally found that it becomes profitable to use @TamarChristinaArm I have a question about using this instruction on big.LITTLE with different cache line sizes for big and little cores. Is it guaranteed for Would you suggest to use this instruction more broadly - for example, in other places where memory needs to be zeroed (e.g. GC heap)? If so, I can open an issue to track this work. |
Nice speedup! Clearly a gain on both platforms, good work :)
Hmm no not that I'm aware of.. You could get the parts number but unless specific support for it has been added to a public compiler we wouldn't be able to tell what it means. [1] https://www.cpu-monkey.com/en/cpu-qualcomm_snapdragon_microsoft_sq2-1850
Hmm do you actually see a performance degradation with < 256 bytes? If not using ZVA always when it's available and aligned region is >= 64 would simplify your logic a bit wouldn't it?
So there are a few parts to this question. It's theoretically not enforced anywhere that the little and big cores have the same cache line sizes. Practically speaking however, almost all systems have this. However those that don't can't ever use ZVA because you don't know when the kernel migrates you. You could be migrated after you read the register but before you clear it. It's a fairly dangerous setup when the cores don't match. To deal with this, in the Linux kernel
So on Linux/Android the answer is yes [2] (No code in link, just cover letter so should be safe to read). On Windows I can only assume yes as I don't know the behavior of the kernel here. But if it doesn't do something like the above then that would be a bug. [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-September/453859.html
Yeah, I think it would be beneficial to use anywhere you need a |
Yes. First, I tried to unconditionally allow using I am not sure I understand how I could simplify the logic. As far as I know, JIT can't tell whether an initialized memory region is aligned to 64 bytes boundaries. The best we can do is to assume that the frame pointer is aligned to 16 bytes boundary and extend this assumption to the region in case when As for R2R scenarios - if we don't know whether |
Thank you for the reply, Tamar! If that is a case, then I guess |
@dotnet/jit-contrib This is ready for review, PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I wonder if the loop versions, especially the larger loop versions, should be activated for smaller sizes under a JIT stress mode.
We're not going to see the DC ZVA instruction except for native hosted JIT scenarios (or maybe SPMI playback of arm64 native PMI/test run collections). Should there be a way to generate this for altjit scenarios?
InstructionSet_Rdm_Arm64=17, | ||
InstructionSet_Sha1_Arm64=18, | ||
InstructionSet_Sha256_Arm64=19, | ||
InstructionSet_Dczva=12, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to change the JIT/EE version GUID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Yeah, You can squeeze it to 160 by using overlapping stores (which is what AoR gets away with) but I have been told that older cores do not like this very much. So I think your approach is correct (and seems to match the threshold where the glibc ifunc with ZVA enabled is used) as I think you guys are tuning for as broad range of cores as possible.
No, I think the confusion I had was my initial look I had misread the remainder check. It does look as optimal as possible which wouldn't inadvertently slow down older cores.
Yeah, that bit would then be set. |
@janvorli maybe you could double-check the VM side changes. And, fyi about the zeroing algorithm change. |
…tFrame in src/coreclr/jit/codegencommon.cpp
…e in src/coreclr/jit/codegencommon.cpp
…mmon.cpp src/coreclr/jit/target.h
…nSet_Dczva flag in src/coreclr/vm/arm64/asmhelpers.S src/coreclr/vm/arm64/asmhelpers.asm src/coreclr/vm/codeman.cpp
…rc/coreclr/tools/Common/Internal/Runtime/ReadyToRunInstructionSetHelper.cs src/coreclr/tools/Common/JitInterface/CorInfoInstructionSet.cs src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txt
…/jit/emitfmtsarm64.h src/coreclr/jit/instrsarm64.h
…lr/jit/emitarm64.cpp
…cpp src/coreclr/jit/target.h
98452f6
to
2352f16
Compare
Had to rebase on top of master in order to update jiteeversionguid.h |
cc @jeffschwMSFT @richlander @jkotas who might be interested in this work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor feedback and a question.
// 3) if simdRegZeroed is true then 128-bit wide zeroSimdReg contains zeroes. | ||
|
||
const int bytesUseZeroingLoop = 192; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use #define
for the numeric constants like 192
, 256
, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could, but why that would be more preferred than using const int
? I doubt that having these numbers outside of the algorithm implementation (e.g. in target.h) would be more useful
// add x9, fp, #(untrLclLo-32) | ||
// mov x10, #(bytesToWrite-64) | ||
// | ||
// loop: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you verify if aligning the loop gives any benefits as @TamarChristinaArm pointed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did two experiments with aligning the head of a loop on 16-bytes and 32-bytes boundaries - neither demonstrated measurable performance differences, so I decided not to implement the alignment.
Nice work, @echesakovMSFT. We should be looking at microbenchmark perf early next week. Will make sure to post back here if we see anything there from your change. A couple of things for follow up:
|
@AndyAyersMS No, Architecturally there's nothing that makes any guarantee that |
@AndyAyersMS Yes, I was planning to look into other places where this instruction can be used and give us some performance improvements. In addition, we can improve |
Below are the results of running benchmarks as in #32538
DC ZVA
instruction is prohibitedDC ZVA
instruction is permitted for memory regions greater than or equal to 256 bytes