
[RyuJIT] lack of escape analysis makes high GC overhead in SoA SIMD programs #10760

Open
@fiigii

Description


According to the VTune characterization in dotnet/coreclr#18839 (comment), SoA SIMD programs show higher GC overhead than AoS and scalar programs because of temporary object allocations.

SoA SIMD programs use VectorPacket256 as the primitive data type (note that VectorPacket256 is a reference type, i.e., a class):

class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;

    public VectorPacket256(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
    {
        Xs = xs;
        Ys = ys;
        Zs = zs;
    }
}

Each VectorPacket256 operation is immutable: it returns a new VectorPacket256 as its result.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VectorPacket256 operator -(VectorPacket256 left, VectorPacket256 right)
{
    return new VectorPacket256(Subtract(left.Xs, right.Xs), Subtract(left.Ys, right.Ys), Subtract(left.Zs, right.Zs));
}

These semantics cause many temporary object allocations. For example, there are two VectorPacket256 operations in the code segment below:

    private ColorPacket256 GetNaturalColor(Vector256<int> things, VectorPacket256 pos, VectorPacket256 norms, VectorPacket256 rds, Scene scene)
    {
        var colors = ColorPacket256Helper.DefaultColor;
        for (int i = 0; i < scene.Lights.Length; i++)
        {
            var lights = scene.Lights[i];
            var zero = SetZeroVector256<float>();
            var colorPacket = lights.Colors;
            VectorPacket256 ldis = lights.Positions - pos;   // VectorPacket256 operation
            VectorPacket256 livec = ldis.Normalize();        // VectorPacket256 operation
            var neatIsectDis = TestRay(new RayPacket256(pos, livec), scene);

RyuJIT compiles these two lines to:

vextractf128 xmm7, ymm6, 0x1		
call CORINFO_HELP_NEWSFAST  ;;; allocate the object	
vinsertf128 ymm6, ymm6, xmm7, 0x1
		
mov rcx, qword ptr [rsp+0x58]		
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]	
vmovupd ymm1, ymmword ptr [rcx+0x28]	
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]		
vmovupd ymm2, ymmword ptr [rcx+0x48]	
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]	

vmovupd ymmword ptr [rax+0x8], ymm0	
vmovupd ymmword ptr [rax+0x28], ymm1	
vmovupd ymmword ptr [rax+0x48], ymm2	;;; Assigning the Subtract results to the new object

vmovupd ymm0, ymmword ptr [rax+0x8]		
vmulps ymm0, ymm0, ymmword ptr [rax+0x8]	
vmovupd ymm1, ymmword ptr [rax+0x28]	
vmulps ymm1, ymm1, ymmword ptr [rax+0x28]		
vmovupd ymm2, ymmword ptr [rax+0x48]	
mov qword ptr [rsp+0x50], rax		
vmulps ymm2, ymm2, ymmword ptr [rax+0x48]		
vaddps ymm0, ymm0, ymm1	
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0

However, the two commented operations (the allocation and the stores into the new object) are unnecessary; the ideal codegen could be:

;;; No memory allocation for the intermediate object
mov rcx, qword ptr [rsp+0x58]		
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]	
vmovupd ymm1, ymmword ptr [rcx+0x28]	
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]		
vmovupd ymm2, ymmword ptr [rcx+0x48]	
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]	
vmulps ymm0, ymm0, ymm0		
vmulps ymm1, ymm1, ymm1	
vmulps ymm2, ymm2, ymm2	
vaddps ymm0, ymm0, ymm1	
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0
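For context, the mul/add/sqrt tail in both listings comes from the ldis.Normalize() call. A hypothetical sketch of that method, consistent with the disassembly above (the actual implementation in the benchmark may differ), shows why it allocates: the result is yet another heap VectorPacket256.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class VectorPacket256
{
    public Vector256<float> Xs, Ys, Zs;

    public VectorPacket256(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
        => (Xs, Ys, Zs) = (xs, ys, zs);

    // Per-lane length = sqrt(Xs^2 + Ys^2 + Zs^2); this is the
    // vmulps/vaddps/vsqrtps sequence in the disassembly.
    public VectorPacket256 Normalize()
    {
        Vector256<float> length = Avx.Sqrt(
            Avx.Add(Avx.Add(Avx.Multiply(Xs, Xs), Avx.Multiply(Ys, Ys)),
                    Avx.Multiply(Zs, Zs)));
        // The normalized result is a new heap object: one more candidate
        // for escape analysis to eliminate.
        return new VectorPacket256(Avx.Divide(Xs, length),
                                   Avx.Divide(Ys, length),
                                   Avx.Divide(Zs, length));
    }
}
```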

So introducing escape analysis (https://github.com/dotnet/coreclr/issues/1784) and unwrapping (scalar-replacing) the local VectorPacket256 objects would significantly reduce the GC overhead of SIMD programs.
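Until then, the allocations can be avoided at the source level by hand-scalarizing the SoA components, which is roughly what escape analysis plus promotion would do automatically. A sketch under that assumption (the helper name is hypothetical, not from the benchmark):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ScalarizedPackets
{
    // Hand-scalarized (a - b) followed by per-lane length: all nine
    // Vector256<float> values stay in registers, and no intermediate
    // VectorPacket256 object is allocated.
    public static Vector256<float> LengthOfDifference(
        Vector256<float> ax, Vector256<float> ay, Vector256<float> az,
        Vector256<float> bx, Vector256<float> by, Vector256<float> bz)
    {
        var dx = Avx.Subtract(ax, bx);   // the vsubps triple
        var dy = Avx.Subtract(ay, by);
        var dz = Avx.Subtract(az, bz);
        return Avx.Sqrt(                 // the vmulps/vaddps/vsqrtps tail
            Avx.Add(Avx.Add(Avx.Multiply(dx, dx), Avx.Multiply(dy, dy)),
                    Avx.Multiply(dz, dz)));
    }
}
```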

Additionally, the current struct promotion does not work with VectorPacket256 either, so changing VectorPacket256 from a class to a struct generates many memory copies and performs even worse.
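For illustration, the struct variant that hits this limitation might look like the sketch below (the type name is hypothetical; the field layout mirrors the class above):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// A 96-byte struct of three Vector256<float> fields. Because current
// struct promotion does not handle a struct this large, each by-value
// argument and each return below becomes a full copy through memory
// instead of staying in ymm registers.
struct VectorPacketStruct256
{
    public Vector256<float> Xs, Ys, Zs;

    public VectorPacketStruct256(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
        => (Xs, Ys, Zs) = (xs, ys, zs);

    public static VectorPacketStruct256 operator -(VectorPacketStruct256 left, VectorPacketStruct256 right)
        => new VectorPacketStruct256(Avx.Subtract(left.Xs, right.Xs),
                                     Avx.Subtract(left.Ys, right.Ys),
                                     Avx.Subtract(left.Zs, right.Zs));
}
```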

category:cq
theme:vector-codegen
skill-level:expert
cost:large
impact:medium

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (product code improvement that does not require public API changes/additions), optimization
