Description
Hey JIT compiler friends, 😊
Vector3 is convenient and popular for manipulating e.g. positions/speeds, and is used for both storage and calculations. The problem is that loading/storing it requires several instructions to handle the elements separately (unlike a Vector4).
One commonly used technique is to load a Vector4 and downcast it to a Vector3; this operation should effectively zero out the .w component of the SIMD register. Usually, the load is done on safe data boundaries, where you know that reading w from memory is OK (even if it contains garbage) and safe (e.g. no page fault).
The problem is that while upcasting from a Vector3 to a Vector4 has a proper intrinsic, downcasting doesn't and always generates stack spilling. I haven't found a way to work around this, so I usually have to replace every Vector3 with Vector4, which is not ideal.
For example, the following code (on sharplab.io):
private static void TestVector4ToVector3v1(Span<Vector4> span, out Vector3 output) {
    Vector3 result = Vector3.Zero;
    for (int i = 0; i < span.Length; i++) {
        result += span[i].AsVector128().AsVector3();
    }
    output = result;
}
will generate the following code:
C.TestVector4ToVector3v1(System.Span`1<System.Numerics.Vector4>, System.Numerics.Vector3 ByRef)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: mov rax, [rcx]
L000a: mov ecx, [rcx+8]
L000d: vxorps xmm0, xmm0, xmm0
L0012: xor r8d, r8d
L0015: test ecx, ecx
L0017: jle short L004e
L0019: nop [rax]
L0020: mov r9d, r8d
L0023: shl r9, 4
L0027: vmovupd xmm1, [rax+r9]
<<<<<<<<<<<<<< stack spilling and reload - begin
L002d: vmovapd [rsp], xmm1
L0032: vmovss xmm1, [rsp+8]
L0038: vmovsd xmm2, [rsp]
<<<<<<<<<<<<<< stack spilling and reload - end
L003d: vshufps xmm2, xmm2, xmm1, 0x44
L0042: vaddps xmm0, xmm0, xmm2
L0046: inc r8d
L0049: cmp r8d, ecx
L004c: jl short L0020
L004e: vmovsd [rdx], xmm0
L0052: vpshufd xmm1, xmm0, 2
L0057: vmovss [rdx+8], xmm1
L005c: add rsp, 0x18
L0060: ret
Trying to work around it with the following does not help either: it generates code similar to the code above (see the sharplab link above). This surprised me even more, as the Vector3(float, float, float) constructor is marked as an intrinsic...
private static void TestVector4ToVector3v2(Span<Vector4> span, out Vector3 output) {
    Vector3 result = Vector3.Zero;
    for (int i = 0; i < span.Length; i++) {
        var v4 = span[i];
        var v3 = new Vector3(v4.X, v4.Y, v4.Z);
        result += v3;
    }
    output = result;
}
Instead, a downcast should be able to generate code similar to what we get with an upcast, by directly setting the .w component to 0.0f, like this:
L0027: vxorps xmm0, xmm0, xmm0
L002b: vinsertps xmm0, xmm1, xmm0, 0x30
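In API terms, the requested behavior amounts to clearing lane 3 of the register. As a rough sketch of that semantics only, the existing Vector128 helpers can express it, but the result is a Vector128<float> rather than a Vector3, so this does not by itself avoid the conversion discussed in this issue. ZeroW is a hypothetical helper name, not an existing API, and no claim is made here about the machine code the JIT emits for it.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical helper (not part of the BCL): reinterprets a Vector4 as a
// 128-bit register and overwrites lane 3 (W) with 0.0f, leaving X, Y, Z intact.
static Vector128<float> ZeroW(Vector4 value)
{
    return value.AsVector128().WithElement(3, 0.0f);
}

Vector128<float> cleared = ZeroW(new Vector4(1f, 2f, 3f, 4f));
Console.WriteLine(cleared);
```

The point of the sketch is that the downcast only needs "keep X, Y, Z, force W to zero", which is a register-to-register operation and should not require a round trip through the stack.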
Maybe I have missed something in the API that provides such a conversion, but I failed to find it... 🤔
Would it be possible to optimize this conversion as proposed here?
Thanks!
(Edit: consequently, this applies to any downcast, e.g. from Vector4 to Vector2 as well, or from Vector3 to Vector2)
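For reference, the "replace Vector3 with Vector4" workaround mentioned above looks roughly like the sketch below. The accumulation stays in Vector4 inside the loop and the narrowing to Vector3 happens only once at the end, so any cost of the conversion is paid outside the hot path. SumXyz is a hypothetical name chosen for illustration. This is only safe for operations where garbage in W cannot leak into X, Y, Z (element-wise add is fine; a dot product or length would not be).

```csharp
using System;
using System.Numerics;

// Hypothetical helper: sums the X, Y, Z components of a Vector4 buffer.
// W is accumulated too, but it is discarded by the final narrowing.
static Vector3 SumXyz(ReadOnlySpan<Vector4> span)
{
    Vector4 sum = Vector4.Zero;
    for (int i = 0; i < span.Length; i++)
    {
        sum += span[i]; // stays a full 4-lane add, no downcast in the loop
    }
    // Single Vector4 -> Vector3 conversion, outside the loop.
    return new Vector3(sum.X, sum.Y, sum.Z);
}

Vector4[] data = { new Vector4(1f, 2f, 3f, 99f), new Vector4(4f, 5f, 6f, -7f) };
Vector3 total = SumXyz(data);
Console.WriteLine(total);
```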