[LoongArch64] Add atomic instruction implementation for LoongArch64#129683
[LoongArch64] Add atomic instruction implementation for LoongArch64#129683lawn123 wants to merge 10 commits into
Conversation
|
Hi @shushanhf @LuckyXu-HF @jakobbotsch @jkotas could you please review this PR? Thanks. |
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| public static int Exchange(ref int location1, int value) | ||
| { | ||
| #if TARGET_X86 || TARGET_AMD64 || TARGET_ARM64 || TARGET_RISCV64 |
There was a problem hiding this comment.
This (and all other cases) should explicitly list supported architectures. Other new architectures (for example ppc64le) are not expanding the intrinsics by default.
There was a problem hiding this comment.
This (and all other cases) should explicitly list supported architectures. Other new architectures (for example ppc64le) are not expanding the intrinsics by default.
This is for architectures where we want to always use the intrinsic impl and not fallback to runtime impl.
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableRiscV64Zbs, W("EnableRiscV64Zbs"), 1, "Allows RiscV64 Zbs hardware intrinsics to be disabled") | ||
| #elif defined(TARGET_LOONGARCH64) | ||
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_BH, W("EnableLoongArch64LAM_BH"), 0, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") | ||
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_CAS, W("EnableLoongArch64LAM_CAS"), 0, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") |
There was a problem hiding this comment.
The comment here should be LAMCAS, right? Not LAM_BH+ a second time?
There was a problem hiding this comment.
Yes, I will make the changes later.
| BasicBlock* labelRetry = genCreateTempLabel(); | ||
| BasicBlock* fail = genCreateTempLabel(); | ||
| unsigned size = 0; | ||
| emit->emitIns_R_R_I(INS_addi_w, dataSize, tmpReg, REG_R0, -4); |
There was a problem hiding this comment.
Are those explicit small type emulation fallbacks noticeably better than the managed one when it comes to performance?
There was a problem hiding this comment.
I conducted a performance comparison test on CompareExchange(ref byte location1, byte value, byte compare) using performance. Artifacts_Atomic is based on simulation fallbacks, while Artifacts_noAtomic is built on a managed implementation. CompareExchange_byte is when the comparison results are the same, CompareExchange_byte_nomatch is different.The results are as follows:
| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|----------------------------- |----------- |------------------------------------------------------------------------------------ |---------:|---------:|---------:|---------:|---------:|---------:|------:|----------:|------------:|
| CompareExchange_byte | Job-AZJEDS | /artifacts_Atomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 22.46 ns | 0.005 ns | 0.004 ns | 22.45 ns | 22.45 ns | 22.46 ns | 1.00 | - | NA |
| CompareExchange_byte | Job-NEIHQV | /artifacts_noAtomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 27.43 ns | 0.004 ns | 0.004 ns | 27.43 ns | 27.42 ns | 27.44 ns | 1.22 | - | NA |
| | | | | | | | | | | | |
| CompareExchange_byte_nomatch | Job-AZJEDS | /artifacts_Atomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 22.82 ns | 0.005 ns | 0.004 ns | 22.82 ns | 22.82 ns | 22.83 ns | 1.00 | - | NA |
| CompareExchange_byte_nomatch | Job-NEIHQV | /artifacts_noAtomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 28.16 ns | 0.022 ns | 0.020 ns | 28.16 ns | 28.14 ns | 28.20 ns | 1.23 | - | NA |
The corresponding machine instruction sequence is as follows:
managed implementation machine instruction sequence
3ae8e0: addi.d $sp, $sp, -80
3ae8e4: st.d $fp, $sp, 0
3ae8e8: st.d $ra, $sp, 8
3ae8ec: st.d $s0, $sp, 16
3ae8f0: st.d $s1, $sp, 24
3ae8f4: st.d $s2, $sp, 32
3ae8f8: st.d $s3, $sp, 40
3ae8fc: st.d $s4, $sp, 48
3ae900: st.d $s5, $sp, 56
3ae904: st.d $s6, $sp, 64
3ae908: st.d $s7, $sp, 72
3ae90c: ori $fp, $sp, 0
3ae910: ori $t8, $a0, 0
A0 is live
3ae914: andi $t8, $t8, 3
3ae918: sub.d $s0, $a0, $t8
3ae91c: slli.w $a0, $t8, 3
3ae920: andi $s1, $a0, 31
3ae924: addi.w $a0, $zero, 255
3ae928: sll.w $a0, $a0, $s1
A0 is dead
3ae92c: nor $s2, $a0, $zero
A0 is live
3ae930: bstrpick.d $a0, $a1, 7, 0
S0 is dead
3ae934: sll.w $s3, $a0, $s1
3ae938: bstrpick.d $a2, $a2, 7, 0
3ae93c: sll.w $s4, $a2, $s1
3ae940: ld.w $s5, $s0, 0
3ae944: and $s6, $s5, $s2
3ae948: slli.w $s6, $s6, 0
3ae94c: or $a2, $s6, $s4
3ae950: slli.w $a2, $a2, 0
3ae954: or $a1, $s6, $s3
3ae958: slli.w $a1, $a1, 0
3ae95c: b 0x3ae97c
3ae960: and $s6, $s7, $s2
3ae964: slli.w $s6, $s6, 0
3ae968: or $a2, $s6, $s4
3ae96c: slli.w $a2, $a2, 0
3ae970: or $a1, $s6, $s3
3ae974: slli.w $a1, $a1, 0
3ae978: slli.w $s5, $s7, 0
3ae97c: ori $a0, $s0, 0
3ae980: pcalau12i $t8, 3336
3ae984: addi.d $t8, $t8, 1128
3ae988: ld.d $a3, $t8, 0
3ae98c: jirl $ra, $a3, 0 // int System.Threading.Interlocked.CompareExchange32(ref int, int, int) (METHOD_ENTRY_DEF_TOKEN)
3ae990: slli.w $s7, $a0, 0
3ae994: slli.w $ra, $s5, 0
3ae998: slli.w $r21, $s7, 0
3ae99c: bne $r21, $ra, 0x3ae960
3ae9a0: srl.w $a0, $a0, $s1
3ae9a4: bstrpick.d $a0, $a0, 7, 0
3ae9a8: ld.d $s7, $sp, 72
3ae9ac: ld.d $s6, $sp, 64
3ae9b0: ld.d $s5, $sp, 56
3ae9b4: ld.d $s4, $sp, 48
3ae9b8: ld.d $s3, $sp, 40
3ae9bc: ld.d $s2, $sp, 32
3ae9c0: ld.d $s1, $sp, 24
3ae9c4: ld.d $s0, $sp, 16
3ae9c8: ld.d $ra, $sp, 8
3ae9cc: ld.d $fp, $sp, 0
3ae9d0: addi.d $sp, $sp, 80
3ae9d4: ret
simulation fallbacks machine instruction sequence
3c7270: addi.d $sp, $sp, -16
3c7274: st.d $fp, $sp, 0
3c7278: st.d $ra, $sp, 8
3c727c: ori $fp, $sp, 0
3c7280: bstrpick.d $a1, $a1, 7, 0
3c7284: bstrpick.d $a2, $a2, 7, 0
3c7288: addi.w $a7, $zero, -4
3c728c: and $a6, $a0, $a7
3c7290: andi $a7, $a0, 3
3c7294: slli.d $a7, $a7, 3
3c7298: addi.w $a3, $zero, 255
3c729c: sll.w $a4, $a3, $a7
3c72a0: nor $a5, $zero, $a4
3c72a4: bstrpick.w $a3, $a2, 7, 0
3c72a8: sll.w $a2, $a3, $a7
3c72ac: bstrpick.w $a3, $a1, 7, 0
3c72b0: sll.w $a7, $a3, $a7
3c72b4: dbar 0
3c72b8: ll.w $a3, $a6, 0
3c72bc: and $t0, $a3, $a4
3c72c0: bne $t0, $a2, 0x3c72d4
3c72c4: and $a3, $a3, $a5
3c72c8: or $a3, $a3, $a7
3c72cc: sc.w $a3, $a6, 0
3c72d0: beq $a3, $zero, 0x3c72b8
3c72d4: dbar 0
3c72d8: andi $a7, $a0, 3
3c72dc: slli.d $a7, $a7, 3
3c72e0: srl.w $t0, $t0, $a7
3c72e4: ext.w.b $t0, $t0
3c72e8: bstrpick.d $t0, $t0, 7, 0
3c72ec: slli.w $a0, $t0, 0
3c72f0: ld.d $ra, $sp, 8
3c72f4: ld.d $fp, $sp, 0
3c72f8: addi.d $sp, $sp, 16
3c72fc: ret
There was a problem hiding this comment.
3ae98c: jirl $ra, $a3, 0 // int System.Threading.Interlocked.CompareExchange32(ref int, int, int) (METHOD_ENTRY_DEF_TOKEN)
It seems that the CompareExchange isn't expanded properly in the managed fallback, you should probably fix that and benchmark again.
There was a problem hiding this comment.
It seems that the CompareExchange isn't expanded properly in the managed fallback, you should probably fix that and benchmark again.
For LA, atomic instructions related to CompareExchange are not supported on v1.0, so only internal CompareExchange32 can be called here.
117 public static int CompareExchange(ref int location1, int value, int comparand)
118 {
119 #if TARGET_X86 || TARGET_AMD64 || TARGET_ARM64 || TARGET_RISCV64
120 return CompareExchange(ref location1, value, comparand); // Must expand intrinsic
121 #else
122 if (Unsafe.IsNullRef(ref location1))
123 ThrowHelper.ThrowNullReferenceException();
124 return CompareExchange32(ref location1, value, comparand);
125 #endif
126 }
There was a problem hiding this comment.
It seems that the CompareExchange isn't expanded properly in the managed fallback, you should probably fix that and benchmark again.
For LA, atomic instructions related to CompareExchange are not supported on v1.0, so only internal CompareExchange32 can be called here.
117 public static int CompareExchange(ref int location1, int value, int comparand) 118 { 119 #if TARGET_X86 || TARGET_AMD64 || TARGET_ARM64 || TARGET_RISCV64 120 return CompareExchange(ref location1, value, comparand); // Must expand intrinsic 121 #else 122 if (Unsafe.IsNullRef(ref location1)) 123 ThrowHelper.ThrowNullReferenceException(); 124 return CompareExchange32(ref location1, value, comparand); 125 #endif 126 }
@jkotas Does the VM not require lock free cmpxchg for correctness?
There was a problem hiding this comment.
For LA, atomic instructions related to CompareExchange are not supported on v1.0, so only internal CompareExchange32 can be called here.
So why is CompareExchange32 not expanded using ll.w/sc.w in your managed implementation machine instruction sequence?
Does the VM not require lock free cmpxchg for correctness?
Yes. I do not see any problems. LA 1.0 supports 32-bit cmpxchg using ll.w/sc.w.
There was a problem hiding this comment.
So why is CompareExchange32 not expanded using ll.w/sc.w in your managed implementation machine instruction sequence?
This is an issue I encountered during testing, where I completely removed the code from the PR for comparison.Should maintain the LA support added in CoreCLR.cs
There was a problem hiding this comment.
Right. I expect that the software fallback for byte is going to be about as good as the manual expansion in the JIT once you do that.
There was a problem hiding this comment.
Yes, after adding LA support in CoreCLR.cs, the test results were almost consistent, and the test results are as follows:
| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|----------------------------- |----------- |------------------------------------------------------------------------------------ |---------:|---------:|---------:|---------:|---------:|---------:|------:|----------:|------------:|
| CompareExchange_byte | Job-AZJEDS | /artifacts_Atomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 22.44 ns | 0.003 ns | 0.002 ns | 22.44 ns | 22.44 ns | 22.45 ns | 1.00 | - | NA |
| CompareExchange_byte | Job-NEIHQV | /artifacts_noAtomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 22.44 ns | 0.002 ns | 0.002 ns | 22.44 ns | 22.44 ns | 22.45 ns | 1.00 | - | NA |
| | | | | | | | | | | | |
| CompareExchange_byte_nomatch | Job-AZJEDS | /artifacts_Atomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 22.82 ns | 0.002 ns | 0.001 ns | 22.82 ns | 22.82 ns | 22.82 ns | 1.00 | - | NA |
| CompareExchange_byte_nomatch | Job-NEIHQV | /artifacts_noAtomic/tests/coreclr/linux.loongarch64.Release/Tests/Core_Root/corerun | 22.83 ns | 0.014 ns | 0.011 ns | 22.82 ns | 22.82 ns | 22.86 ns | 1.00 | - | NA |
managed implementation machine instruction sequence
3ad900: addi.d $sp, $sp, -16
3ad904: st.d $fp, $sp, 0
3ad908: st.d $ra, $sp, 8
3ad90c: ori $fp, $sp, 0
3ad910: ori $a3, $a0, 0
3ad914: andi $a3, $a3, 3
3ad918: sub.d $a0, $a0, $a3
3ad91c: slli.w $a3, $a3, 3
3ad920: andi $a3, $a3, 31
3ad924: addi.w $a4, $zero, 255
3ad928: sll.w $a4, $a4, $a3
3ad92c: nor $a4, $a4, $zero
3ad930: bstrpick.d $a1, $a1, 7, 0
3ad934: sll.w $a1, $a1, $a3
3ad938: bstrpick.d $a2, $a2, 7, 0
3ad93c: sll.w $a2, $a2, $a3
3ad940: ld.w $a5, $a0, 0
3ad944: b 0x3ad94c
3ad948: slli.w $a5, $a6, 0
3ad94c: and $a6, $a5, $a4
3ad950: slli.w $a6, $a6, 0
3ad954: or $a7, $a6, $a1
3ad958: slli.w $a7, $a7, 0
3ad95c: or $a6, $a6, $a2
3ad960: slli.w $a6, $a6, 0
3ad964: slli.w $a6, $a6, 0
3ad968: dbar 0
3ad96c: ll.w $t1, $a0, 0
3ad970: bne $t1, $a6, 0x3ad980
3ad974: ori $t0, $a7, 0
3ad978: sc.w $t0, $a0, 0
3ad97c: beq $t0, $zero, 0x3ad96c
3ad980: dbar 0
3ad984: slli.w $a6, $t1, 0
3ad988: slli.w $ra, $a5, 0
3ad98c: slli.w $r21, $a6, 0
3ad990: bne $r21, $ra, 0x3ad948
3ad994: srl.w $a0, $t1, $a3
3ad998: bstrpick.d $a0, $a0, 7, 0
3ad99c: ld.d $ra, $sp, 8
3ad9a0: ld.d $fp, $sp, 0
3ad9a4: addi.d $sp, $sp, 16
3ad9a8: ret
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_BH, W("EnableLoongArch64LAM_BH"), 0, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") | ||
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_CAS, W("EnableLoongArch64LAM_CAS"), 0, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") |
There was a problem hiding this comment.
Why we aren't turning it on by default?
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_BH, W("EnableLoongArch64LAM_BH"), 0, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") | |
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_CAS, W("EnableLoongArch64LAM_CAS"), 0, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") | |
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_BH, W("EnableLoongArch64LAM_BH"), 1, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") | |
| RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableLoongArch64LAM_CAS, W("EnableLoongArch64LAM_CAS"), 1, "Allows LoongArch64 LAM_BH+ hardware intrinsics to be disabled") |
There was a problem hiding this comment.
Why we aren't turning it on by default?
There may be a situation on LA where the system CPU supports LAM, but the virtual machine simulated on that machine does not support LAM.
There was a problem hiding this comment.
Are those situations more common, making the parity favor that direction? i.e., would you rather have the regular user with "desired system state" know about DOTNET_EnableLoongArch64LAM_CAS=1, or would you rather make it disablable so people running "unusual" configurations can tweak the environment (like turning off TieredCompilation to work around situations, which we sometimes do)?
There was a problem hiding this comment.
There may be a situation on LA where the system CPU supports LAM, but the virtual machine simulated on that machine does not support LAM.
If the virtual machine does not support LAM, LAM should be reported as unsupported by minipal_getcpufeatures.
There was a problem hiding this comment.
LAM should be reported as unsupported by minipal_getcpufeatures
The implementation of LA in the minipal_getcpufeatures is read through the cpucfg instruction, not a system function, so I am not sure if it can be correctly distinguished on the virtual machine because it has not been verified. To ensure correctness, the default switch is turned off.
Are those situations more common
This is not common, we have not currently used such an environment. But I can't be sure if there are other customers using it.
There was a problem hiding this comment.
If you cannot reliably detect whether CAS is supported by the machine you are running on, the opportunistic light-up for native aot does not make sense and it should be reverted. The opportunistic light-up for native aot makes sense only if this switch is on by default for regular CoreCLR.
There was a problem hiding this comment.
I just asked my colleague about the virtual machine situation, which is also a hypothetical situation, so here we still default to turning on the switch.
| /// <summary>Provides access to the LoongArch64 Atomic hardware instructions via intrinsics.</summary> | ||
| [Intrinsic] | ||
| [CLSCompliant(false)] | ||
| public abstract class LAM |
There was a problem hiding this comment.
These helper types should be internal. I do not think we need to expose them as public APIs.
There was a problem hiding this comment.
Yes, I have already made the modifications.
| StoreL3NonTemporal = 13 | ||
| }; | ||
| } | ||
| namespace System.Runtime.Intrinsics.LoongArch |
There was a problem hiding this comment.
Changes in this file should be reverted.
There was a problem hiding this comment.
I meant this PR should not be changing this file at all.
|
|
||
| result = NI_IsSupported_Type; | ||
| } | ||
| #if defined(TARGET_LOONGARCH64) |
There was a problem hiding this comment.
I think this should be done by enabling FEATURE_HW_INTRINSICS for loongarch64 instead of this one-off check.
There was a problem hiding this comment.
I think this should be done by enabling
FEATURE_HW_INTRINSICSfor loongarch64 instead of this one-off check.
I remember there is a strong coupling between FEATURE-HW-INTRINSICS and FEATURE_SIMD. Opening FEATURE-HW-INTRINSICS separately may introduce more problems. This is just a solution. After supporting SIMD on LA, we will modify it to directly use FEATURE-HW-INRINSICS.
| #ifdef TARGET_LOONGARCH64 | ||
| InstructionSet_LAM_BH=1, | ||
| InstructionSet_LAM_CAS=2, | ||
| // TODO-LA-SIMD-other: add other ISA-features. |
There was a problem hiding this comment.
This file should be exactly what the auto-generator produced, without any manual changes.
| /// <para>AMSWAP[_DB].H rd, rk, rj</para> | ||
| /// <para>This is the newly added atomic instruction on ISA1.1</para> | ||
| /// </summary> | ||
| internal static ushort Exchange(ref ushort location1, ushort value) => Exchange(ref location1, value); |
There was a problem hiding this comment.
How does Exchange work for int and long that it is missing here?
There was a problem hiding this comment.
For int and long types Exchange supports both V1.0 and V1.1.
7b904f6 to
e4cdfee
Compare
Change-Id: I3b7757ecfeff9aba28407508b196d55b4e21b69d
Change-Id: Ib826fc3ffb9057835357b68a51e20da66f9cebd5
Change-Id: Id3be930d58440bbc39b307a211a9fcb444f1e3ca
…on and opportunistics light-up for NAOT Change-Id: I419341c499d0c4e955a60e5f4ad6fdeef74f9277
…rmatting issues Change-Id: I1b3d1010ac3642624ec1f66eb26dc11220460d7a
…efault and revert System.Runtime.Intrinsics.cs changes. Change-Id: If9f10ec7908aaf706b1d2ad9f2e34db073543d36
Change-Id: I2c1e8d43106616da47e231670e557f508330e3d8
…n.sh Change-Id: I3240c5d2413950981b4f1446d837d9c0de4e9a65
Change-Id: I51fe7faac05749bc38344e2068bb2efccc1ce99a
This reverts commit 622f190.
Add atomic instruction for LoongArch64. Supports atomic instructions on ISA1.0 and ISA1.1
Implementation of #122745.