
Cannot convert efficiently a System.Numerics.Vector4 to a Vector3 #86220

Closed
@xoofx

Hey JIT compiler friends, 😊

Vector3 is convenient and popular for manipulating things like position/speed, and is used both for storage and for calculations. The problem is that loading/storing a Vector3 requires several instructions, because the elements are loaded/stored separately (unlike for a Vector4).

One commonly used technique is to load a Vector4 and downcast it to a Vector3; this operation should effectively zero out the .w component of the SIMD register. Usually, the load is made on safe data boundaries where you know that reading w from memory is OK (even if it contains garbage) and is safe (e.g. no page fault).

The problem is that while upcasting from a Vector3 to a Vector4 has a proper intrinsic, downcasting doesn't and always generates stack spilling. I haven't found a way to work around this, so I usually have to replace every Vector3 with a Vector4, which is not ideal.

For example, the following code (on sharplab.io):

    private static void TestVector4ToVector3v1(Span<Vector4> span, out Vector3 output) {
        Vector3 result = Vector3.Zero;
        for(int i = 0; i < span.Length; i++) {
            result += span[i].AsVector128().AsVector3();
        }

        output = result;
    }

will generate the following code:

C.TestVector4ToVector3v1(System.Span`1<System.Numerics.Vector4>, System.Numerics.Vector3 ByRef)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: mov rax, [rcx]
    L000a: mov ecx, [rcx+8]
    L000d: vxorps xmm0, xmm0, xmm0
    L0012: xor r8d, r8d
    L0015: test ecx, ecx
    L0017: jle short L004e
    L0019: nop [rax]
    L0020: mov r9d, r8d
    L0023: shl r9, 4
    L0027: vmovupd xmm1, [rax+r9]
    <<<<<<<<<<<<<< stack spilling and reload - begin
    L002d: vmovapd [rsp], xmm1
    L0032: vmovss xmm1, [rsp+8]
    L0038: vmovsd xmm2, [rsp]
    <<<<<<<<<<<<<< stack spilling and reload - end
    L003d: vshufps xmm2, xmm2, xmm1, 0x44
    L0042: vaddps xmm0, xmm0, xmm2
    L0046: inc r8d
    L0049: cmp r8d, ecx
    L004c: jl short L0020
    L004e: vmovsd [rdx], xmm0
    L0052: vpshufd xmm1, xmm0, 2
    L0057: vmovss [rdx+8], xmm1
    L005c: add rsp, 0x18
    L0060: ret

Trying to work around it with the following doesn't help either: it generates code similar to the code above (see the sharplab link above). This one surprised me more, as the Vector3(float, float, float) constructor is marked as an intrinsic...

    private static void TestVector4ToVector3v2(Span<Vector4> span, out Vector3 output) {
        Vector3 result = Vector3.Zero;
        for(int i = 0; i < span.Length; i++) {
            var v4 = span[i];
            var v3 = new Vector3(v4.X, v4.Y, v4.Z);
            result += v3;
        }

        output = result;
    }    

Instead, a downcast should be able to generate code similar to what we get with an upcast, by directly setting the .w component to 0.0f, like this:

    L0027: vxorps xmm0, xmm0, xmm0
    L002b: vinsertps xmm0, xmm1, xmm0, 0x30

Maybe I have missed something in the API that provides such a conversion, but I failed to find it... 🤔

Would it be possible to optimize this conversion as proposed here?

Thanks!

(Edit: consequently, this applies to any downcast, e.g. from Vector4 to Vector2 as well, or from Vector3 to Vector2)
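
For the summation example specifically, one possible stopgap is to keep the accumulator as a Vector128<float> and narrow to Vector3 only once, after the loop, so that the costly downcast is paid a single time rather than per element. This is only a sketch and not the fix requested here: the names `Vector4SumSketch` and `SumXyz` are made up for illustration, it assumes .NET 7 or later for the `+` operator on `Vector128<float>`, and the resulting codegen has not been verified.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
public static class Vector4SumSketch
{
    // Sums the X, Y and Z components of every Vector4 in the span.
    // The W lane is accumulated as well but is dropped by the final narrowing.
    public static Vector3 SumXyz(ReadOnlySpan<Vector4> span)
    {
        Vector128<float> acc = Vector128<float>.Zero;
        for (int i = 0; i < span.Length; i++)
        {
            // Reinterpret the Vector4 as a Vector128<float>; all four lanes are added.
            acc += span[i].AsVector128();
        }

        // Single Vector128 -> Vector3 downcast, outside the hot loop.
        return acc.AsVector3();
    }
}
```

This only helps when the intermediate values can stay 4-wide; code that needs a real Vector3 inside the loop still hits the spill shown above.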

    Labels

    area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
    tenet-performance: Performance related issue
