[RISC-V] Optimize loading 64 bit constant with new algorithm implementation and using emitDataConst
#113250
RISC-V Release-CLR-VF2: 9465 / 9541 (99.20%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-CLR-QEMU: 9465 / 9541 (99.20%)
Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz
RISC-V Release-FX-QEMU: 630930 / 658679 (95.79%)
Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz
RISC-V Release-FX-VF2: 436825 / 464641 (94.01%) |
With -O2, it gets more interesting https://godbolt.org/z/7PojfWjba. |
eb1c702 is being scheduled for building and testing. Release-build FAILED (buildinfo.json) |
emitDataConst: eb1c702 to 3b52234
RISC-V Release-CLR-VF2: 9464 / 9541 (99.19%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-CLR-QEMU: 9464 / 9541 (99.19%)
Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz
RISC-V Release-FX-VF2: 429741 / 457637 (93.90%)
RISC-V Release-FX-QEMU: 659033 / 696638 (94.60%)
Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz |
Hi @tomeksowi, sorry for bothering you over the weekend. I was wondering, is the RISC-V test run paused on weekends, or is my test simply hanging? It has been almost 4 hours since it was scheduled for build + test, yet no test result has shown up so far 😅 |
It often takes several hours, especially with the previous builds in a queue. Also, I think this is more in @sirntar's turf. |
Alright, thanks for the info! 👍 |
There was a maintenance power-off in SRPOL office over the weekend, maybe sth didn't restart properly. I'm away so don't have access, we'll check on Monday. |
#113250 (comment) is updated. I noticed that it sometimes takes a while (a few hours to days), but eventually the comment gets updated. |
One of the failures could be relevant:
|
Yup, it caught an edge case I didn't handle properly, working on it now 👍 |
Out of curiosity, is it something like the sign bit not propagated correctly because you omit one of the instructions? (I didn't look at the code, so just wildly guessing.) |
Yes, sometimes I add 1 to the lui operand. I had already made sure the original operand has the expected sign bit to be extended, but for one particular operand value (0x7FFFFFFF) the sign bit changes, which causes unintended sign extension. I've figured out a fix, but I realized something else. clang cleverly use |
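To make the edge case above concrete, here is a minimal sketch that models the RV64 semantics of `lui`, `addi` and `addiw` in plain C++. It is not the PR's code: the helper names (`signExtend`, `split32`, and the instruction models) are hypothetical and only illustrate why adding 1 to the `lui` operand flips the sign bit for 0x7FFFFFFF.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of x to 64 bits (used here with 12 and 32).
inline int64_t signExtend(uint64_t x, unsigned bits)
{
    const uint64_t m = 1ULL << (bits - 1);
    x &= (m << 1) - 1;
    return static_cast<int64_t>((x ^ m) - m);
}

// Simplified models of the RV64 instructions (assumption: no other state matters).
inline int64_t lui(uint32_t imm20)
{
    return signExtend(static_cast<uint64_t>(imm20 & 0xFFFFF) << 12, 32);
}

inline int64_t addi(int64_t rs, int32_t imm12)
{
    return static_cast<int64_t>(static_cast<uint64_t>(rs) + static_cast<uint64_t>(static_cast<int64_t>(imm12)));
}

inline int64_t addiw(int64_t rs, int32_t imm12)
{
    // addiw adds in 64 bits, then sign-extends the low 32 bits of the result.
    return signExtend(static_cast<uint64_t>(rs) + static_cast<uint64_t>(static_cast<int64_t>(imm12)), 32);
}

struct Split32
{
    uint32_t hi20; // operand for lui
    int32_t  lo12; // operand for addi/addiw (sign-extended 12-bit value)
};

// Split a 32-bit value so that (hi20 << 12) + lo12 == value modulo 2^32.
// When lo12 is negative, hi20 ends up one larger than value >> 12.
inline Split32 split32(uint32_t value)
{
    const int32_t  lo12 = static_cast<int32_t>(signExtend(value, 12));
    const uint32_t hi20 = ((value - static_cast<uint32_t>(lo12)) >> 12) & 0xFFFFF;
    return {hi20, lo12};
}
```

For 0x7FFFFFFF the split gives `hi20 = 0x80000` and `lo12 = -1`, so `lui` produces a negative 64-bit value. In this model `addiw` still recovers 0x000000007FFFFFFF, while a plain `addi` would leave the upper 32 bits set.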
I don't know if you're going to cover sequences with a temporary register in this PR, but it could detect that the immediate bits can be split into addable halves, for example:
    lui temp, 0xABCDA
    addi temp, temp, 0xBCD
    slli dest, temp, 32
    add dest, dest, temp
If so, and you have microbenchmarks at hand, it would be worth checking whether having > 5 instructions in the general case of the above would still be faster than loading, due to a more parallelizable workload, e.g. for 0x12345'678'98765'432:
    lui temp, 0x12345
    lui dest, 0x98765
    addi temp, temp, 0x678
    addi dest, dest, 0x432
    slli temp, temp, 32
    add dest, dest, temp
EDIT: GCC does it so probably it is faster. BTW, in the asm examples I forgot to incorporate |
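The "addable halves" idea can be sketched as follows. This is an illustrative model only, not code from the PR or from GCC: all function names are hypothetical, and it assumes each half is a sign-extended 32-bit value of the kind a `lui` + `addiw` pair can produce.

```cpp
#include <cassert>
#include <cstdint>

// Models `slli dest, temp, 32` followed by `add dest, dest, temp` on RV64.
inline uint64_t composeFromHalf(int64_t temp)
{
    const uint64_t t = static_cast<uint64_t>(temp);
    return (t << 32) + t; // wraps modulo 2^64, like the hardware add
}

// Hypothetical check: can `imm` be built from a single sign-extended
// 32-bit value added to itself shifted left by 32?
inline bool isAddableHalves(uint64_t imm)
{
    const int64_t temp = static_cast<int32_t>(static_cast<uint32_t>(imm));
    return composeFromHalf(temp) == imm;
}

struct Halves
{
    int64_t temp; // upper half, to be shifted left by 32
    int64_t dest; // lower half, sign-extended from 32 bits
};

// General case with two independent halves. When the lower half is negative
// as a 32-bit value, the upper half must be one larger than imm >> 32 to
// cancel the sign extension.
inline Halves splitIndependentHalves(uint64_t imm)
{
    const int64_t  dest  = static_cast<int32_t>(static_cast<uint32_t>(imm));
    const uint64_t upper = (imm - static_cast<uint64_t>(dest)) >> 32;
    const int64_t  temp  = static_cast<int32_t>(static_cast<uint32_t>(upper));
    return {temp, dest};
}

// Models `slli temp, temp, 32` followed by `add dest, dest, temp`.
inline uint64_t composeIndependent(Halves h)
{
    return (static_cast<uint64_t>(h.temp) << 32) + static_cast<uint64_t>(h.dest);
}
```

In this model, 0x1234567898765432 splits into an upper half of 0x12345679 rather than 0x12345678, because the lower half 0x98765432 is negative once sign-extended.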
@tomeksowi nice suggestion! But I think I'll leave that to another PR. We can add that path later to replace cases where it would generate |
RISC-V Release-CLR-VF2: 9465 / 9541 (99.20%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-FX-VF2: 429341 / 467750 (91.79%)
RISC-V Release-FX-QEMU: 660039 / 687745 (95.97%)
Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz |
Fine with me. |
RISC-V Release-CLR-VF2: 9465 / 9541 (99.20%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-CLR-QEMU: 9465 / 9541 (99.20%)
Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz
RISC-V Release-FX-VF2: 687315 / 712141 (96.51%)
RISC-V Release-FX-QEMU: 630219 / 658569 (95.70%)
Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz |
…heck, apply coding conventions
RISC-V Release-FX-VF2: 0 / 258 (0.00%)
RISC-V Release-CLR-VF2: 9468 / 9544 (99.20%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-CLR-QEMU: 9468 / 9544 (99.20%)
Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz
RISC-V Release-FX-QEMU: 0 / 258 (0.00%)
Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz |
RISC-V Release-CLR-VF2: 9524 / 9544 (99.79%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-CLR-QEMU: 9524 / 9544 (99.79%)
Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz
RISC-V Release-FX-VF2: 436721 / 465185 (93.88%) |
RISC-V Release-CLR-VF2: 9524 / 9544 (99.79%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-CLR-QEMU: 9524 / 9544 (99.79%)
Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz
RISC-V Release-FX-QEMU: 641743 / 665283 (96.46%)
Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz
RISC-V Release-FX-VF2: 435375 / 470663 (92.50%) |
Diffs are based on 12,734 contexts (10,221 MinOpts, 2,513 FullOpts).
Overall (-882,648 bytes)
MinOpts (-684,524 bytes)
FullOpts (-198,124 bytes)
Example diffs
test.mch
-48 (-38.71%) : 198.dasm - System.ConsolePal:InvalidateCachedCursorPosition() (Tier0)
@@ -19,36 +19,27 @@ G_M58234_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
;; size=16 bbWeight=1 PerfScore 9.00
G_M58234_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
addi a0, zero, 0xD1FFAB1E
- lui a1, 0xD1FFAB1E
- addiw a1, a1, 0xD1FFAB1E
- slli a1, a1, 11
- addi a1, a1, 0xD1FFAB1E
- slli a1, a1, 5
- addi a1, a1, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a1, 0xD1FFAB1E(t6)
sw a0, 0xD1FFAB1E(a1)
- lui a0, 0xD1FFAB1E
- addiw a0, a0, 0xD1FFAB1E
- slli a0, a0, 11
- addi a0, a0, 0xD1FFAB1E
- slli a0, a0, 5
- addi a0, a0, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a0, 0xD1FFAB1E(t6)
lw a0, 0xD1FFAB1E(a0)
addiw a0, a0, 0xD1FFAB1E
- lui a1, 0xD1FFAB1E
- addiw a1, a1, 0xD1FFAB1E
- slli a1, a1, 11
- addi a1, a1, 0xD1FFAB1E
- slli a1, a1, 5
- addi a1, a1, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a1, 0xD1FFAB1E(t6)
sw a0, 0xD1FFAB1E(a1)
- ;; size=92 bbWeight=1 PerfScore 20.00
+ ;; size=44 bbWeight=1 PerfScore 17.00
G_M58234_IG03: ; bbWeight=1, epilog, nogc, extend
ld ra, 8(sp)
ld fp, 0(sp)
addi sp, sp, 16
ret ;; size=16 bbWeight=1 PerfScore 7.50
+RWD00 dq 00007E19B463B30Ch
+RWD08 dq 00007E19B463B308h
-; Total bytes of code 124, prolog size 16, PerfScore 36.50, instruction count 31, allocated bytes for code 124 (MethodHash=d4221c85) for method System.ConsolePal:InvalidateCachedCursorPosition() (Tier0)
+
+; Total bytes of code 76, prolog size 16, PerfScore 33.50, instruction count 16, allocated bytes for code 76 (MethodHash=d4221c85) for method System.ConsolePal:InvalidateCachedCursorPosition() (Tier0)
; ============================================================
Unwind Info:
@@ -59,7 +50,7 @@ Unwind Info:
E bit : 0
X bit : 0
Vers : 0
- Function Length : 31 (0x0001f) Actual length = 124 (0x00007c)
+ Function Length : 19 (0x00013) Actual length = 76 (0x00004c)
---- Epilog scopes ----
---- Scope 0
Epilog Start Offset : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-48 (-36.36%) : 667.dasm - System.ConsolePal:InvalidateTerminalSettings() (FullOpts)
@@ -23,38 +23,30 @@ G_M52800_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
;; size=16 bbWeight=1 PerfScore 9.00
G_M52800_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
addi a0, fp, -16
- lui a1, 0xD1FFAB1E
- addiw a1, a1, 0xD1FFAB1E
- slli a1, a1, 11
- addi a1, a1, 0xD1FFAB1E
- slli a1, a1, 5
- addi a1, a1, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a1, 0xD1FFAB1E(t6)
jalr a1 // CORINFO_HELP_JIT_REVERSE_PINVOKE_ENTER
addi a0, zero, 0xD1FFAB1E
- lui a1, 0xD1FFAB1E
- addiw a1, a1, 0xD1FFAB1E
- slli a1, a1, 11
- addi a1, a1, 0xD1FFAB1E
- slli a1, a1, 5
- addi a1, a1, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a1, 0xD1FFAB1E(t6)
fence 3, 3
sw a0, 0xD1FFAB1E(a1)
addi a0, fp, -16
- lui a1, 0xD1FFAB1E
- addiw a1, a1, 0xD1FFAB1E
- slli a1, a1, 11
- addi a1, a1, 0xD1FFAB1E
- slli a1, a1, 5
- addi a1, a1, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a1, 0xD1FFAB1E(t6)
jalr a1 // CORINFO_HELP_JIT_REVERSE_PINVOKE_EXIT
- ;; size=100 bbWeight=1 PerfScore 25.50
+ ;; size=52 bbWeight=1 PerfScore 22.50
G_M52800_IG03: ; bbWeight=1, epilog, nogc, extend
ld ra, 24(sp)
ld fp, 16(sp)
addi sp, sp, 32
ret ;; size=16 bbWeight=1 PerfScore 7.50
+RWD00 dq 00007E1A33AB9A74h
+RWD08 dq 00007E19B463B31Ch
+RWD16 dq 00007E1A33AB9BCCh
-; Total bytes of code 132, prolog size 16, PerfScore 42.00, instruction count 33, allocated bytes for code 132 (MethodHash=bc7731bf) for method System.ConsolePal:InvalidateTerminalSettings() (FullOpts)
+
+; Total bytes of code 84, prolog size 16, PerfScore 39.00, instruction count 18, allocated bytes for code 84 (MethodHash=bc7731bf) for method System.ConsolePal:InvalidateTerminalSettings() (FullOpts)
; ============================================================
Unwind Info:
@@ -65,7 +57,7 @@ Unwind Info:
E bit : 0
X bit : 0
Vers : 0
- Function Length : 33 (0x00021) Actual length = 132 (0x000084)
+ Function Length : 21 (0x00015) Actual length = 84 (0x000054)
---- Epilog scopes ----
---- Scope 0
Epilog Start Offset : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-32 (-34.78%) : 11911.dasm - Microsoft.CodeAnalysis.SyntaxNode+ChildSyntaxListEnumeratorStack+<>c:<.cctor>b__12_0():Microsoft.CodeAnalysis.ChildSyntaxList+Enumerator[]:this (Tier0)
@@ -20,29 +20,24 @@ G_M43111_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
sd a0, -8(fp)
;; size=20 bbWeight=1 PerfScore 13.00
G_M43111_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
- lui a0, 0xD1FFAB1E
- addiw a0, a0, 0xD1FFAB1E
- slli a0, a0, 11
- addi a0, a0, 0xD1FFAB1E
- slli a0, a0, 5
- addi a0, a0, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a0, 0xD1FFAB1E(t6)
addi a1, zero, 0xD1FFAB1E
- lui a2, 0xD1FFAB1E
- addiw a2, a2, 0xD1FFAB1E
- slli a2, a2, 11
- addi a2, a2, 0xD1FFAB1E
- slli a2, a2, 5
- addi a2, a2, 0xD1FFAB1E
+ auipc t6, 0xD1FFAB1E
+ ld a2, 0xD1FFAB1E(t6)
jalr a2 // CORINFO_HELP_NEWARR_1_VC
; gcrRegs +[a0]
- ;; size=56 bbWeight=1 PerfScore 9.50
+ ;; size=24 bbWeight=1 PerfScore 7.50
G_M43111_IG03: ; bbWeight=1, epilog, nogc, extend
ld ra, 24(sp)
ld fp, 16(sp)
addi sp, sp, 32
ret ;; size=16 bbWeight=1 PerfScore 7.50
+RWD00 dq 0000768AA9A1BFB8h
+RWD08 dq 0000768B238B2044h
-; Total bytes of code 92, prolog size 16, PerfScore 30.00, instruction count 23, allocated bytes for code 92 (MethodHash=94ff5798) for method Microsoft.CodeAnalysis.SyntaxNode+ChildSyntaxListEnumeratorStack+<>c:<.cctor>b__12_0():Microsoft.CodeAnalysis.ChildSyntaxList+Enumerator[]:this (Tier0)
+
+; Total bytes of code 60, prolog size 16, PerfScore 28.00, instruction count 13, allocated bytes for code 60 (MethodHash=94ff5798) for method Microsoft.CodeAnalysis.SyntaxNode+ChildSyntaxListEnumeratorStack+<>c:<.cctor>b__12_0():Microsoft.CodeAnalysis.ChildSyntaxList+Enumerator[]:this (Tier0)
; ============================================================
Unwind Info:
@@ -53,7 +48,7 @@ Unwind Info:
E bit : 0
X bit : 0
Vers : 0
- Function Length : 23 (0x00017) Actual length = 92 (0x00005c)
+ Function Length : 15 (0x0000f) Actual length = 60 (0x00003c)
---- Epilog scopes ----
---- Scope 0
Epilog Start Offset : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
+0 (0.00%) : 12720.dasm - System.Linq.Enumerable+EnumerableSorter`1[System.__Canon]:Sort(System.__Canon[],int):int[]:this (Tier0)
@@ -33,9 +33,9 @@ G_M50207_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
lw a2, -20(fp)
lui a3, 0xD1FFAB1E
addiw a3, a3, 0xD1FFAB1E
- slli a3, a3, 11
+ slli a3, a3, 12
addi a3, a3, 0xD1FFAB1E
- slli a3, a3, 5
+ slli a3, a3, 4
ld a3, 0xD1FFAB1E(a3)
jalr a3 // <unknown method>
; gcrRegs -[a1]
+0 (0.00%) : 12704.dasm - Microsoft.CodeAnalysis.CSharp.VariablesDeclaredWalker:Free():this (Tier0)
@@ -24,9 +24,9 @@ G_M15256_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
; gcrRegs +[a0]
lui a1, 0xD1FFAB1E
addiw a1, a1, 0xD1FFAB1E
- slli a1, a1, 11
+ slli a1, a1, 14
addi a1, a1, 0xD1FFAB1E
- slli a1, a1, 5
+ slli a1, a1, 2
ld a1, 0xD1FFAB1E(a1)
jalr a1 // <unknown method>
; gcrRegs -[a0]
+0 (0.00%) : 12640.dasm - Microsoft.CodeAnalysis.DiagnosticBag:Add(Microsoft.CodeAnalysis.Diagnostic):this (Tier0)
@@ -28,9 +28,9 @@ G_M13912_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
; gcrRegs +[a0]
lui a1, 0xD1FFAB1E
addiw a1, a1, 0xD1FFAB1E
- slli a1, a1, 11
+ slli a1, a1, 14
addi a1, a1, 0xD1FFAB1E
- slli a1, a1, 5
+ slli a1, a1, 2
ld a1, 0xD1FFAB1E(a1)
jalr a1 // <unknown method>
sd a0, -24(fp)
@@ -39,9 +39,9 @@ G_M13912_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
; gcrRegs +[a1]
lui a2, 0xD1FFAB1E
addiw a2, a2, 0xD1FFAB1E
- slli a2, a2, 11
+ slli a2, a2, 14
addi a2, a2, 0xD1FFAB1E
- slli a2, a2, 5
+ slli a2, a2, 2
ld a2, 0xD1FFAB1E(a2)
lw zero, 0xD1FFAB1E(a0)
jalr a2 // <unknown method>
Report generated after merging fuad1502@544cf0c to the local branch & diffing with that commit. |
RISC-V Release-CLR-VF2: 9527 / 9547 (99.79%)
Release-CLR-VF2.md, Release-CLR-VF2.xml, testclr_output.tar.gz
RISC-V Release-CLR-QEMU: 9527 / 9547 (99.79%)
Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz
RISC-V Release-FX-VF2: 627539 / 665359 (94.32%)
RISC-V Release-FX-QEMU: 622000 / 654961 (94.97%)
Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz |
LGTM
/* The following algorithm works based on the following equation:
 * `imm = high32 + offset1` OR `imm = high32 - offset2`
 *
 * high32 will be loaded with `lui + addiw`, while offset
 * will be loaded with `slli + addi` in 11-bits chunks
 *
 * First, determine at which position to partition imm into high32 and offset,
 * so that it yields the least instruction.
 * Where high32 = imm[y:x] and imm[63:y] are all zeroes or all ones.
 *
 * From the above equation, the value of offset1 & offset2 are:
 * -> offset1 = imm[x-1:0]
 * -> offset2 = ~(imm[x-1:0] - 1)
 * The smaller offset should yield the least instruction. (is this correct?) */
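The two equations in the comment above can be checked with a small sketch. It only illustrates the arithmetic of the partition at bit position `x`. It does not reproduce the PR's instruction selection or instruction counting, and the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Partition
{
    uint64_t upperForAdd; // imm with the low x bits cleared; imm = upperForAdd + offset1
    uint64_t offset1;     // imm[x-1:0]
    uint64_t upperForSub; // imm rounded up to a multiple of 2^x; imm = upperForSub - offset2
    uint64_t offset2;     // ~(imm[x-1:0] - 1), kept to x bits
};

// Partition imm at bit x (assumes 0 < x < 64). All arithmetic wraps modulo 2^64.
inline Partition partitionAt(uint64_t imm, unsigned x)
{
    const uint64_t lowMask = (1ULL << x) - 1;

    Partition p;
    p.offset1     = imm & lowMask;
    p.offset2     = ~(p.offset1 - 1) & lowMask; // two's complement negation within x bits
    p.upperForAdd = imm - p.offset1;
    p.upperForSub = imm + p.offset2;
    return p;
}
```

For example, partitioning 0x12345FFF at `x = 12` gives `offset1 = 0xFFF` and `offset2 = 0x001`, so in this case the subtracting form has the smaller offset.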
This is not the preferred style of comments: https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-jit-coding-conventions.md#711-comment-style
Feel free to include as part of a follow-up to avoid rerunning CI.
Alright, thank you, I’ll create a follow-up PR and make sure to review the coding conventions 👍
/ba-g Azurelinux 3 timeouts |
Notes:
|
@BruceForstall Thank you for the notes.
runtime/src/coreclr/jit/emit.cpp Lines 6804 to 6806 in 4631ece
Therefore, we don’t need to generate relocations and simply use PC relative instructions (
However, referring to the following:
runtime/docs/design/coreclr/jit/hot-cold-splitting.md Lines 82 to 86 in 4631ece
If we’re in the cold region, the data section (located in hot code region) might be arbitrarily far away, whereas
Then I realized that ARM64 actually generates either
To address the particular problem when the code is supposed to be relocatable & we’re in a cold region, I would need answers to the following:
I’m still reading the codebase to get the answers, but if you have any information that you can share about this, or you already know some of the answers, please do let me know, I would really appreciate it 😄 And sorry if opening this PR with my currently minimal knowledge of the .NET JIT is causing more trouble than it helps, I’ll try to learn more! |
I was under the impression that you are generating relocations when you said that you addressed the issue above: #113250 (comment)
Hmm, it's very possible the AOT compilers never move this data around. If the other backends are also not recording relocations it does not seem like a problem. |
Some more clarity on top of what @fuad1502 wrote above comes from here: runtime/src/coreclr/jit/ee_il_dll.cpp Lines 1143 to 1162 in 1587221
So essentially, for these backends we allocate no data section at all, we just allocate a larger hot code section. |
@fuad1502 Thanks for the analysis. You are correct, for arm64/loongarch64/riscv64, where the read-only data is appended to the hot code section, if you load it via pc-relative addressing no relocations are required. As for your questions:
|
@BruceForstall Thanks for the answers! So in conclusion, for the second note in your original comment, I only need to address the particular case where I currently load from an absolute address when loading a constant from the cold section, despite the relocation requirement. But since it seems that we can safely use ±2GB as the maximum distance between cold code and constant data in hot code, I’ll create a follow-up PR that uses PC-relative addressing for loading constants regardless, and generates relocs when loading from the cold section. I’ll make sure to add an assertion to check that the distance assumption holds. Does this sound about right? |
That all sounds right. You could also have an assert that there is no hot/cold splitting at all (which I presume there isn't yet), if you want to defer this until later. |
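As an illustration of the distance assumption discussed above, the following sketch checks whether a PC-relative displacement is reachable by an `auipc` + `ld` pair and splits it into the two immediates. It is not JIT code: `fitsAuipcPair`, `splitPcRel` and `PcRelParts` are hypothetical names, and the range follows from `auipc` adding a sign-extended 20-bit immediate shifted left by 12 and `ld` adding a signed 12-bit offset.

```cpp
#include <cassert>
#include <cstdint>

struct PcRelParts
{
    int32_t hi20; // auipc immediate, valid range [-2^19, 2^19 - 1]
    int32_t lo12; // load offset, valid range [-2048, 2047]
};

// True when delta (target address minus the address of the auipc) is
// reachable by auipc + a load with a 12-bit signed offset on RV64.
inline bool fitsAuipcPair(int64_t delta)
{
    // auipc contributes a multiple of 4096 in [-2^31, 2^31 - 4096];
    // the load adds an offset in [-2048, 2047].
    return delta >= -(1LL << 31) - 2048 && delta <= (1LL << 31) - 4096 + 2047;
}

// Split delta so that hi20 * 4096 + lo12 == delta.
// The result is only encodable when fitsAuipcPair(delta) is true.
inline PcRelParts splitPcRel(int64_t delta)
{
    const int32_t lo12 = static_cast<int32_t>(((delta & 0xFFF) ^ 0x800) - 0x800); // sign-extend low 12 bits
    const int32_t hi20 = static_cast<int32_t>((delta - lo12) / 4096);             // exact: delta - lo12 is a multiple of 4096
    return {hi20, lo12};
}
```

An assertion along the lines of `assert(fitsAuipcPair(delta))` before emitting the pair is one way the "distance assumption" could be checked. Where such a check would live in the emitter is left open here.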
In this PR, a new algorithm was implemented to reduce the number of instructions generated for loading constants to registers. Additionally, when the number of instructions still exceeds 5, the load is instead optimized using emitDataConst.
See how clang loads 64-bit constants in RISC-V with godbolt.
With the following C# function:
Before patch:
After patch:
Note: @tomeksowi pointed out that there is one additional optimization that GCC is able to do but clang cannot (see the godbolt link above), which uses a temporary register to exploit instruction-level parallelism. However, I won't be covering that optimization in this PR.
Part of #84834, cc @dotnet/samsung