Skip to content

Conversation

@snickolls-arm
Copy link
Contributor

@snickolls-arm snickolls-arm commented Oct 27, 2025

This branch introduces a new type (TYP_SIMDSV) to the JIT for supporting scalable vectors, registers whose size is determined by hardware at runtime but remains constant for the duration of a process. For ARM64, this means we have vectors sized in powers of 2 from 128 bits up to 2048 bits depending on hardware implementation, with an instruction available to query this size for compiler use. I've also adjusted the implementation of TYP_MASK to scale with the vector length on ARM64 in a similar manner.

This builds on and borrows much of Kunal's work in: #115948.

This PR focuses on enabling scalable type awareness as a foundation for future vector length agnostic code generation. The new type is allowing the JIT to emit SVE register moves, loads and stores for Vector<T> etc. but doesn't change the implementation of the Vector<T> API surface. It still emits NEON for arithmetic operations, logical operations, floating point operations and so on.

The compiler resolves the sizes for TYP_SIMDSV and TYP_MASK early in initialization and updates the type system data with this information. For now, the vector length is still fixed at 128 bits in order to preserve the functionality, as there is a mixed state of NEON and SVE instructions. Once full SVE codegen is implemented for these types, it will be possible to let their sizes scale with the underlying hardware vector length.

With this type being passed around, we can begin implementing vector length agnostic code in subsequent work, as we can now test for a TYP_SIMDSV and distinguish it from a fixed size TYP_SIMD16.

Testing

Importing Vector<T> as the new type is gated behind DOTNET_JitUseScalableVectorT, meaning TYP_SIMDSV will not appear in compilation unless that variable is set. Likewise, the VM will not report use of the HFA type CORINFO_HFA_ELEM_VECTORT unless this variable is set. With the variable set, testing is validating the implementation of the new type. Without the variable set, testing is validating that TYP_SIMD16 behavior remains stable under these changes.

SuperPMI method contexts are currently out of sync with the updated JIT-EE interface, which causes mismatches in return values between the JIT and EE. I don't expect this to stop until DOTNET_JitUseScalableVectorT is removed and standardized, so this feature will need to be tested with some specially generated MCH files while being developed.

Future

We will be able to remove DOTNET_JitUseScalableVectorT once Vector<T> is working in AOT compilation. This will require all phases to be aware that TYP_SIMDSV is dynamically sized and make the choice if the pass can run or not depending on this.

For a transitional approach, we could allow specifying a fixed target vector length for AOT compilation while broader vector-length agnostic support is implemented. In some cases we might be able to take advantage of knowing the vector size at compilation time, so having both approaches available might be advantageous for JIT mode.

Code Example

static void SimdAdd(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> c)
{
    int len = Math.Min(Math.Min(a.Length, b.Length), c.Length);
    int i = 0;

    if (Vector.IsHardwareAccelerated)
    {
        int width = Vector<float>.Count;
        for (; i <= len - width; i += width)
        {
            var va = new Vector<float>(a.Slice(i, width));
            var vb = new Vector<float>(b.Slice(i, width));
            (va + vb).CopyTo(c.Slice(i, width));
        }
    }

   // Scalar tail (or full path if no HW acceleration)
   for (; i < len; i++)
       c[i] = a[i] + b[i];
}

With DOTNET_JitUseScalableVectorT=1, the main vector loop body compiles to:

G_M23455_IG03:        ; offs=0x000020, size=0x0044, bbWeight=4, PerfScore 100.00, gcrefRegs=0000 {}, byrefRegs=0015 {x0 x2 x4}, BB03 [0002], BB20 [0018], BB32 [0033], BB44 [0048], byref, isz

IN000b: 000020      mov     w8, w7
IN000c: 000024      add     x9, x8, #4
IN000d: 000028      cmp     x9, w1, UXTW
IN000e: 00002C      bhi     G_M23455_IG10
IN000f: 000030      lsl     x8, x8, #2
IN0010: 000034      add     x10, x0, x8
IN0011: 000038      ldr     z24, [x10]                                 ;; was ldr q24, [x10]
IN0012: 00003C      cmp     x9, w3, UXTW
IN0013: 000040      bhi     G_M23455_IG10
IN0014: 000044      add     x10, x2, x8
IN0015: 000048      ldr     z25, [x10]                                 ;; was ldr q25, [x10]
IN0016: 00004C      fadd    v24.4s, v24.4s, v25.4s
IN0017: 000050      cmp     x9, w5, UXTW
IN0018: 000054      bhi     G_M23455_IG10
IN0019: 000058      add     x8, x4, x8
IN001a: 00005C      str     z24, [x8]                                  ;; was str q24, [x8]
IN001b: 000060      add     w7, w7, #4

G_M23455_IG04:        ; offs=0x000064, size=0x000C, bbWeight=8, PerfScore 16.00, gcrefRegs=0000 {}, byrefRegs=0015 {x0 x2 x4}, loop=IG03, BB04 [0003], byref, isz

IN001c: 000064      sub     w8, w6, #4
IN001d: 000068      cmp     w8, w7
IN001e: 00006C      bge     G_M23455_IG03

Contributing towards #120599

Current implementation derives a size based on the identified SIMD type,
and then uses the size to derive the node type. It should instead
directly derive the node type from the identified SIMD type, because
some SIMD types will not have a statically known size, and this size may
conflict with other SIMD types.
The function that determines the size of a local variable needs
to have access to compiler state at runtime to handle variable types
with sizes that depend on some runtime value, for example Vector<T>
when backed by ARM64 scalable vectors.
The function that determines the size of an indirection needs to
have access to compiler state at runtime to handle variable types
with sizes that depend on some runtime value, for example Vector<T>
when backed by ARM64 scalable vectors.
Create a new type designed to support Vector<T> with size evaluated at
runtime. Adds a new HFA type to the VM to support passing Vector<T> as
a scalable vector register on ARM64. Both types are experimental and
locked behind the DOTNET_JitUseScalableVectorT configuration option.

This first patch implements SVE codegen for Vector<T>, mainly for managing
Vector<T> as a data structure that can be placed in a Z register.  When
DOTNET_JitUseScalableVectorT=1, the JIT will move the type around using
SVE instructions operating on Z registers. It does not yet unlock longer
vector lengths or implement operations from the Vector<T> API surface using
SVE. This API still generates NEON code, which is functionally equivalent
so long as Vector<T>.Count is limited to 128-bits.

When DOTNET_JitUseScalableVectorT=0 the code generated for Vector<T> should
have zero functional difference but may have some cosmetic differences as some
refactoring has been done on general SIMD codegen to support the new type.
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 27, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 27, 2025
@snickolls-arm
Copy link
Contributor Author

@dotnet/arm64-contrib @a74nh @tannergooding

This is the outcome of the investigation I've been doing into supporting scalable vector types in the JIT. I've worked through a few problems and test failures but I still consider this early experimentation, feedback or suggestions on direction are appreciated.

@jkotas jkotas added the arm-sve Work related to arm64 SVE/SVE2 support label Oct 27, 2025
The length of VectorT is now a constant that is set on VM initialization,
and the DAC needs to be able to replicate this state without access to the
original SVE VL value.
return result;
}

unsigned getSizeOfType(var_types type);
Copy link
Member

@jakobbotsch jakobbotsch Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of churn without good reason. In the eventual future, it won't be possible to get the size of TYP_SIMDSV as a constant anyway, and we will want to assert that TYP_SIMDSV is never passed to genTypeSize.

If the goal is to have a transitional period where TYP_SIMDSV has constant (but unknown) size then can we instead make genTypeSize get the size from a non-constant table and set the size of TYP_SIMDSV in that table from jitStartup (probably Compiler::compStartup)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the eventual future, it won't be possible to get the size of TYP_SIMDSV as a constant anyway

This is not certain yet in my mind, do we want to be able to query and make use of the vector length when we can (JIT mode)? We could use this information to save emitting rdvl, and directly embed the constant in codegen instead.

can we instead make genTypeSize get the size from a non-constant table and set the size of TYP_SIMDSV in that table from jitStartup (probably Compiler::compStartup)?

Yes I think this would reduce a lot of churn, but only if we answer 'no' to the above decision.

Otherwise, the vector length would need to be obtained from the EE via ICorJitInfo, because the compiler can't execute rdvl to obtain the length in all compilation modes. I didn't think this could be done from a static context because the info handle belongs to the Compiler instance, but correct me if there's a good way of doing this (I could use JitTls::GetCompiler?).

Copy link
Member

@jakobbotsch jakobbotsch Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not certain yet in my mind, do we want to be able to query and make use of the vector length when we can (JIT mode)? We could use this information to save emitting rdvl, and directly embed the constant in codegen instead.

I think we should prefer to translate to the TYP_SIMD16, TYP_SIMD32 etc. types if we want to do something special for these cases (what Kunal was doing). Otherwise this introduces yet another combination of things to test for...

Otherwise, the vector length would need to be obtained from the EE via ICorJitInfo, because the compiler can't execute rdvl to obtain the length in all compilation modes. I didn't think this could be done from a static context because the info handle belongs to the Compiler instance, but correct me if there's a good way of doing this (I could use JitTls::GetCompiler?).

I think for the transitional period it would be fine to get it via ICorJitHost::getIntConfigValue (i.e. through JitConfig) or by introducing a new method on ICorJitHost. ICorJitHost is passed to jitStartup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the transitional period it would be fine to get it via ICorJitHost::getIntConfigValue (i.e. through JitConfig) or by introducing a new method on ICorJitHost. ICorJitHost is passed to jitStartup.

Or ask the EE for Vector<T> size?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me too. I guess we already have Compiler::getVectorTByteLength, we would just put its value into genTypeSizes table as soon as possible (and validate on subsequent JITs that the value is consistent)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should prefer to translate to the TYP_SIMD16, TYP_SIMD32 etc. types if we want to do something special for these cases (what Kunal was doing). Otherwise this introduces yet another combination of things to test for...

👍. I would prefer we keep things generally the same between NAOT and JIT scenarios.

A whole lot of stuff gets simpler if we only have to consider that TYP_SIMDSV means "unknown size" and TYP_SIMD16 means 16-bytes. It also reduces risk of bugs or other breaks due to mishandling of TYP_SIMDSV.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NAOT and JIT scenarios.

Nit: I expect we will use the unknown size for R2R too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you all for the comments.

I'll split out the refactoring related to getSizeOfType and revert it, and implement the genTypeSizes table changes instead.

I can leave implementing the actual query to the EE for the length until later, but I'll refer back here when doing so.

DEF_TP(SIMD32 ,"simd32" , TYP_SIMD32, 32,32, 32, 8,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED, RBM_FLT_CALLEE_TRASH, VTF_S|VTF_VEC)
DEF_TP(SIMD64 ,"simd64" , TYP_SIMD64, 64,64, 64, 16,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED, RBM_FLT_CALLEE_TRASH, VTF_S|VTF_VEC)
#elif defined(TARGET_ARM64)
DEF_TP(SIMDSV ,"simdsv" , TYP_SIMDSV, 0,EAS,EAS, 0,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED, RBM_FLT_CALLEE_TRASH, VTF_S|VTF_VEC)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd have a preference towards just TYP_SIMD as the name.

That then fits with the mapping of Vector<T> not having a suffix but Vector64/128/256/512 having one and also doesn't tie the concept down to Arm64 specific terminology -- other CPU architectures have similar concepts of vectors or matrices whose size aren't statically known, but are rather hardware dependent. So I imagine this functionality will be repurposed and shared over time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, should be easily changed at any point with search and replace. I've had in mind that this type may be repurposed as 'the best possible vector implementation for the architecture'.

#define BRS EA_BYREF
#define EPS EA_PTRSIZE
#ifdef TARGET_ARM64
#define EAS EA_SCALABLE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I wonder if SCALABLE is the right term here.

It would seem like the existing EA_UNKNOWN might be sufficient and would trigger relevant asserts in places that aren't correctly equipped to handle it. Alternatively, a different term might be a better fit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I used this change to reveal the areas in the Emitter that needed to support SVE instructions as well as NEON instructions for the type. I agree EA_UNKNOWN would be better for making this apply more widely.

I'll need to find a way to remap EA_UNKNOWN back to EA_SCALABLE for ARM64 as EA_SCALABLE is the main marker we're using to choose SVE instructions. There should be plenty of information available to do this.


DEFINE_DACVAR(DWORD, dac__g_gcNotificationFlags, g_gcNotificationFlags)
DEFINE_DACVAR(DWORD, dac__g_vectorTByteLength, ::g_vectorTByteLength)
DEFINE_DACVAR(BOOL, dac__g_vectorTIsScalable, ::g_vectorTIsScalable)
Copy link
Member

@jkotas jkotas Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not enough for the DAC to know about Vector<T> size? (It knows that already.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite an unfamiliar area of the code for me, am I understanding correctly that the VM is going to serialize the Vector<T> MethodTable for the DAC rather than rely on the DAC to reconstruct it from type information? Therefore we don't need this extra variable.

The boolean flag is interesting because it encodes the value of the environment variable. Would the DAC have access to this as well? The setting is only relevant when sizeof(TYP_SIMDSV) == sizeof(TYP_SIMD16), we need to manually choose whether to emit NEON or SVE because the sizes are the same. But maybe there's no need for it unless the debugger wants to generate code itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The boolean flag is interesting because it encodes the value of the environment variable. Would the DAC have access to this as well?

Why does the DAC need to know about that?

The debugger/DAC related changes should start with explanation of the debugger feature. Also, it would be best to submit any debugger related changes separately from JIT changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the DAC need to know about that?

I thought I needed to make some of the member variables to EEJitManager that I added compatible with the DAC to fix some compilation errors, but I'll look again. I'm not intending to introduce a feature new here.

Is CoreLibBinder::GetClass(CLASS__VECTORT) the correct way to obtain the MethodTable for Vector<T> in this situation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix some compilation errors

You should be able to fix the compilation errors by ifdefing out the code using DACCESS_COMPILE.

Is CoreLibBinder::GetClass(CLASS__VECTORT) the correct way to obtain the MethodTable for Vector

Yes, but I do not think you need that for this PR.

Copy link
Contributor Author

@snickolls-arm snickolls-arm Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up having to get the size from the MethodTable, because GetSizeOfVectorT was not DAC compatible and I used it in calling convention code, which seems to be required in the DAC. So now I set the size once when constructing the MethodTable and use this as the source of truth everywhere else.

CORINFO_HFA_ELEM_DOUBLE,
CORINFO_HFA_ELEM_VECTOR64,
CORINFO_HFA_ELEM_VECTOR128,
CORINFO_HFA_ELEM_VECTORT,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noted down to revisit this on the issue, as technically a Vector<T> would be a 'Pure Scalable Type' not a HFA/HVA. It looks very similar in principal but there may be some subtle differences.

genTypeSize is now sufficient for this purpose
#endif

if (numInstanceFieldBytes != 16)
if (vectorTSize > 0 && vectorTSize != 16)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (vectorTSize > 0 && vectorTSize != 16)
if (vectorTSize > 0)

Nit: vectorTSize != 16 check should not be needed.

@jkotas
Copy link
Member

jkotas commented Nov 7, 2025

@snickolls-arm I would recommend submitting more smaller PRs. Smaller PRs are easier review and make things faster to move forward. IMHO, this PR can be split into at least 3 PRs:

  • Add TYP_SIMDSV in the JIT
  • Add JitUseScalableVectorT config and respect it in the type loader (methodtablebuilder.cpp)
  • Recognize HFAs (PSTs) composed from scalable vectors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants