Description
According to the VTune characterization in dotnet/coreclr#18839 (comment), SoA SIMD programs have higher GC overhead than AoS and scalar programs because of temporary object allocation.
SoA SIMD programs use `VectorPacket256` as the primitive data type (note that `VectorPacket256` is a reference type, i.e., a class):
```csharp
class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;
}
```
Each `VectorPacket256` operation is immutable and returns a new `VectorPacket256` as its result:
```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VectorPacket256 operator -(VectorPacket256 left, VectorPacket256 right)
{
    return new VectorPacket256(Subtract(left.Xs, right.Xs), Subtract(left.Ys, right.Ys), Subtract(left.Zs, right.Zs));
}
```
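The per-operation allocation can be observed directly. The sketch below is not from the issue; it assumes the three-argument constructor implied by the operator above and uses a simple loop rather than the ray tracer, counting allocated bytes with `GC.GetAllocatedBytesForCurrentThread`:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class VectorPacket256
{
    public Vector256<float> Xs, Ys, Zs;

    public VectorPacket256(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
    {
        Xs = xs; Ys = ys; Zs = zs;
    }

    public static VectorPacket256 operator -(VectorPacket256 left, VectorPacket256 right)
        => new VectorPacket256(Avx.Subtract(left.Xs, right.Xs),
                               Avx.Subtract(left.Ys, right.Ys),
                               Avx.Subtract(left.Zs, right.Zs));
}

class AllocationDemo
{
    static void Main()
    {
        if (!Avx.IsSupported) return; // sketch requires AVX hardware

        var a = new VectorPacket256(Vector256<float>.Zero, Vector256<float>.Zero, Vector256<float>.Zero);
        var b = new VectorPacket256(Vector256<float>.Zero, Vector256<float>.Zero, Vector256<float>.Zero);

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < 1_000_000; i++)
        {
            var tmp = a - b; // each iteration allocates a new VectorPacket256
        }
        long after = GC.GetAllocatedBytesForCurrentThread();

        // Roughly 1M * (object header + method table pointer + 3 x 32-byte fields).
        Console.WriteLine($"Allocated: {after - before} bytes");
    }
}
```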
These semantics cause a lot of temporary object allocations. For example, there are two `VectorPacket256` operations in the code segment below:
```csharp
private ColorPacket256 GetNaturalColor(Vector256<int> things, VectorPacket256 pos, VectorPacket256 norms, VectorPacket256 rds, Scene scene)
{
    var colors = ColorPacket256Helper.DefaultColor;
    for (int i = 0; i < scene.Lights.Length; i++)
    {
        var lights = scene.Lights[i];
        var zero = SetZeroVector256<float>();
        var colorPacket = lights.Colors;
        VectorPacket256 ldis = lights.Positions - pos; // VectorPacket256 operation
        VectorPacket256 livec = ldis.Normalize();      // VectorPacket256 operation
        var neatIsectDis = TestRay(new RayPacket256(pos, livec), scene);
        // ...
```
These two lines are compiled by RyuJIT to:
```asm
vextractf128 xmm7, ymm6, 0x1
call CORINFO_HELP_NEWSFAST               ;;; allocate the object
vinsertf128 ymm6, ymm6, xmm7, 0x1
mov rcx, qword ptr [rsp+0x58]
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]
vmovupd ymm1, ymmword ptr [rcx+0x28]
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]
vmovupd ymm2, ymmword ptr [rcx+0x48]
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]
vmovupd ymmword ptr [rax+0x8], ymm0
vmovupd ymmword ptr [rax+0x28], ymm1
vmovupd ymmword ptr [rax+0x48], ymm2     ;;; Assigning the Subtract results to the new object
vmovupd ymm0, ymmword ptr [rax+0x8]
vmulps ymm0, ymm0, ymmword ptr [rax+0x8]
vmovupd ymm1, ymmword ptr [rax+0x28]
vmulps ymm1, ymm1, ymmword ptr [rax+0x28]
vmovupd ymm2, ymmword ptr [rax+0x48]
mov qword ptr [rsp+0x50], rax
vmulps ymm2, ymm2, ymmword ptr [rax+0x48]
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0
```
However, the allocation and the stores marked by the two comments are unnecessary; the ideal codegen could be:
```asm
;;; No memory allocation for the intermediate object
mov rcx, qword ptr [rsp+0x58]
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]
vmovupd ymm1, ymmword ptr [rcx+0x28]
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]
vmovupd ymm2, ymmword ptr [rcx+0x48]
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]
vmulps ymm0, ymm0, ymm0
vmulps ymm1, ymm1, ymm1
vmulps ymm2, ymm2, ymm2
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0
```
So introducing escape analysis (https://github.com/dotnet/coreclr/issues/1784) and unwrapping the local `VectorPacket256` objects into their fields would significantly reduce the GC overhead of SIMD programs.
Additionally, the current struct promotion does not work with `VectorPacket256`, so changing `VectorPacket256` from a class to a struct would generate many memory copies and result in worse performance.
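For illustration, a struct variant might look like the sketch below (an assumption, not code from the repo). Even with `in` parameters to avoid caller-side copies, a 96-byte struct that the JIT cannot promote into registers still round-trips its fields through memory:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical struct version of VectorPacket256 (not the actual benchmark code).
readonly struct VectorPacket256Struct
{
    public readonly Vector256<float> Xs, Ys, Zs;

    public VectorPacket256Struct(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
    {
        Xs = xs; Ys = ys; Zs = zs;
    }

    // 'in' passes by readonly reference, avoiding a 96-byte copy at each call
    // site; but without struct promotion for a struct of this size, the three
    // Vector256 fields are still spilled to and reloaded from the stack.
    public static VectorPacket256Struct Subtract(in VectorPacket256Struct left, in VectorPacket256Struct right)
        => new VectorPacket256Struct(
            Avx.Subtract(left.Xs, right.Xs),
            Avx.Subtract(left.Ys, right.Ys),
            Avx.Subtract(left.Zs, right.Zs));
}
```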
category:cq
theme:vector-codegen
skill-level:expert
cost:large
impact:medium