
Arm64/Sve: Some optimizations around loop scenario #101885


Merged
14 commits merged into dotnet:main from optimize-loop on May 6, 2024

Conversation

@kunalspathak (Member) commented on May 5, 2024:

  • Enable some of the loop optimization paths around TYP_MASK for Arm64
  • In Morph, eliminate ConvertVectorToMask(ConvertMaskToVector(...)) and ConvertMaskToVector(ConvertVectorToMask(...)) round trips (see the schematic sketch below)
  • Enable support for enregistration of TYP_MASK
  • Start supporting INS_sve_mov for predicate registers
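
At the IR level, the Morph change folds the two round-trip shapes. A schematic sketch (comments only; the actual code appears in the review diff further down):

// Schematic of the two folds added to fgOptimizeHWIntrinsic (see the diff below):
//
//   ConvertMaskToVector(ConvertVectorToMask(trueMaskAll, vec))  => vec
//   ConvertVectorToMask(trueMaskAll, ConvertMaskToVector(mask)) => mask
//
// Both folds require the inner and outer nodes to agree on the simd base type
// size, since masks for differently-sized lanes are not interchangeable.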

With these optimizations, the generated code matches what clang produces.

public static unsafe int sum_sve(ref int* srcBytes, int length)
{
    Vector<int> total = new Vector<int>(0);
    int* src = srcBytes;
    int elems = (int)Sve.Count32BitElements();

    for (int i = 0; i < length + elems; i += elems)
    {
        // Cannot hoist the mask into a local yet, because we do not optimize the
        // cases where a TYP_MASK is stored to a lclvar and reused:
        // Vector<int> pred = Sve.CreateWhileLessThanMask32Bit(i, length);

        Vector<int> vec = Sve.LoadVector((Vector<int>)Sve.CreateWhileLessThanMask32Bit(i, length), src);

        total = Sve.ConditionalSelect((Vector<int>)Sve.CreateWhileLessThanMask32Bit(i, length), Sve.Add(total, vec), total);

        src += elems;
    }

    return (int)Sve.AddAcross(total).ToScalar();
}

[image: generated Arm64 SVE disassembly compared against clang's output]

@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on May 5, 2024
@kunalspathak (Member Author): cc @dotnet/arm64-contrib

Comment on lines +10682 to +10707
case NI_Sve_ConvertMaskToVector:
{
    GenTree* op1 = node->Op(1);

    if (!op1->OperIsHWIntrinsic(NI_Sve_ConvertVectorToMask))
    {
        break;
    }

    unsigned            simdBaseTypeSize = genTypeSize(node->GetSimdBaseType());
    GenTreeHWIntrinsic* cvtOp1           = op1->AsHWIntrinsic();

    if (genTypeSize(cvtOp1->GetSimdBaseType()) != simdBaseTypeSize)
    {
        // We need the operand to be the same kind of mask; otherwise
        // the bitwise operation can differ in how it performs.
        break;
    }

    GenTree* vectorNode = cvtOp1->Op(2); // Op(1) of the conversion is its embedded true mask

    DEBUG_DESTROY_NODE(op1, node);
    INDEBUG(vectorNode->gtDebugFlags |= GTF_DEBUG_NODE_MORPHED);

    return vectorNode;
}
Member:
The general logic here, except for the intrinsicId, is identical to the xarch path (and likely always will be, since it's purely an internal helper).

Is it worth sharing most of the code between them?
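
For illustration, a minimal sketch of what such sharing might look like: a cross-platform helper parameterized on the intrinsic id. The helper name and shape are invented here, not taken from the PR:

// Hypothetical shared helper: folds ConvertMaskToVector(ConvertVectorToMask(vec)) => vec.
// Each platform passes its own id for the inner conversion,
// e.g. NI_Sve_ConvertVectorToMask on Arm64.
GenTree* Compiler::fgOptimizeMaskToVectorRoundTrip(GenTreeHWIntrinsic* node, NamedIntrinsic cvtVectorToMaskId)
{
    GenTree* op1 = node->Op(1);

    if (!op1->OperIsHWIntrinsic(cvtVectorToMaskId))
    {
        return nullptr; // no fold; the caller keeps the original node
    }

    GenTreeHWIntrinsic* cvtOp1 = op1->AsHWIntrinsic();

    if (genTypeSize(cvtOp1->GetSimdBaseType()) != genTypeSize(node->GetSimdBaseType()))
    {
        // Different mask kinds; the round trip is not a no-op.
        return nullptr;
    }

    // On Arm64 the converted vector is the last operand (op1 holds the embedded
    // true mask); on xarch it is the only operand.
    GenTree* vectorNode = cvtOp1->Op(cvtOp1->GetOperandCount());

    DEBUG_DESTROY_NODE(op1, node);
    INDEBUG(vectorNode->gtDebugFlags |= GTF_DEBUG_NODE_MORPHED);

    return vectorNode;
}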

GenTree* op1 = node->Op(1);
GenTree* op2 = node->Op(2);

if (!op1->OperIsHWIntrinsic(NI_Sve_CreateTrueMaskAll) &&
Member:

Not sure I understand why this one needs to be handled?

CreateTrueMaskAll produces a TYP_MASK and ConvertVectorToMask requires a TYP_SIMD input, so encountering it as the op1 of ConvertVectorToMask would be representative of malformed IR.

@kunalspathak (Member Author):

Let me get rid of CreateTrueMaskAll in a follow-up PR, and then I can combine the logic with the x64 path.

}

unsigned simdBaseTypeSize = genTypeSize(node->GetSimdBaseType());
GenTreeHWIntrinsic* cvtOp2 = op2->AsHWIntrinsic();
Member:

Why is this op2?

I would have expected that ConvertVectorToMask only needs a single input, much as it does on xarch.

@kunalspathak (Member Author) commented on May 6, 2024:

Yes. Currently we create the conversion as ConvertVectorToMask(AllTrue, op), and I want to do the work of moving it to lowering in a separate PR. Hence we read op2 instead of op1.

GenTree* Compiler::gtNewSimdConvertVectorToMaskNode(var_types   type,
                                                    GenTree*    node,
                                                    CorInfoType simdBaseJitType,
                                                    unsigned    simdSize)
{
    assert(varTypeIsSIMD(node));

    // ConvertVectorToMask uses cmpne, which requires an embedded mask.
    GenTree* trueMask = gtNewSimdAllTrueMaskNode(simdBaseJitType, simdSize);
    return gtNewSimdHWIntrinsicNode(TYP_MASK, trueMask, node, NI_Sve_ConvertVectorToMask, simdBaseJitType, simdSize);
}
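
Reading the quoted fragments together, the ConvertVectorToMask case then folds ConvertVectorToMask(trueMaskAll, ConvertMaskToVector(mask)) => mask. A reconstructed sketch of the whole case follows (hedged; the merged code may differ in detail):

case NI_Sve_ConvertVectorToMask:
{
    GenTree* op1 = node->Op(1); // embedded all-true mask
    GenTree* op2 = node->Op(2); // the vector being converted

    if (!op1->OperIsHWIntrinsic(NI_Sve_CreateTrueMaskAll) ||
        !op2->OperIsHWIntrinsic(NI_Sve_ConvertMaskToVector))
    {
        break;
    }

    unsigned            simdBaseTypeSize = genTypeSize(node->GetSimdBaseType());
    GenTreeHWIntrinsic* cvtOp2           = op2->AsHWIntrinsic();

    if (genTypeSize(cvtOp2->GetSimdBaseType()) != simdBaseTypeSize)
    {
        // Same mask-kind requirement as the ConvertMaskToVector case above.
        break;
    }

    GenTree* maskNode = cvtOp2->Op(1);

    DEBUG_DESTROY_NODE(op1, op2, node);
    INDEBUG(maskNode->gtDebugFlags |= GTF_DEBUG_NODE_MORPHED);

    return maskNode;
}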

@@ -10678,6 +10678,63 @@ GenTree* Compiler::fgOptimizeHWIntrinsic(GenTreeHWIntrinsic* node)
            INDEBUG(node->gtDebugFlags |= GTF_DEBUG_NODE_MORPHED);
            return node;
        }
#if defined(TARGET_ARM64)
Member:

Is there an issue tracking the mirroring of the other mask-related handling that morph does for xarch?

In particular, the recognition of And(MaskToVector(x), MaskToVector(y)) and converting it to AndMask(x, y) (as well as similar transforms for other trivial operations where the operation is identical).
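
A sketch of the transform being described, modeled on the pattern in this PR. This is hedged: NI_Sve_AndMask is a placeholder name used purely for illustration, and the xarch version keys off its own intrinsic ids:

// Recognize And(ConvertMaskToVector(x), ConvertMaskToVector(y)) and rewrite it as
// ConvertMaskToVector(AndMask(x, y)), so the AND happens in predicate registers
// and only one mask-to-vector conversion remains.
GenTree* op1 = node->Op(1);
GenTree* op2 = node->Op(2);

if (op1->OperIsHWIntrinsic(NI_Sve_ConvertMaskToVector) && op2->OperIsHWIntrinsic(NI_Sve_ConvertMaskToVector))
{
    CorInfoType simdBaseJitType = node->GetSimdBaseJitType();
    unsigned    simdSize        = node->GetSimdSize();

    GenTree* mask1 = op1->AsHWIntrinsic()->Op(1);
    GenTree* mask2 = op2->AsHWIntrinsic()->Op(1);

    // NI_Sve_AndMask is hypothetical; the real id would be whatever predicate
    // AND intrinsic Arm64 exposes when this transform is mirrored.
    GenTree* andMask = gtNewSimdHWIntrinsicNode(TYP_MASK, mask1, mask2, NI_Sve_AndMask, simdBaseJitType, simdSize);
    GenTree* result  = gtNewSimdHWIntrinsicNode(node->TypeGet(), andMask, NI_Sve_ConvertMaskToVector,
                                                simdBaseJitType, simdSize);

    DEBUG_DESTROY_NODE(op1, op2, node);
    return result;
}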

@kunalspathak (Member Author):

/ba-g the failure seems to be a timeout on osx/x64

@kunalspathak merged commit dc42750 into dotnet:main on May 6, 2024
102 of 107 checks passed
@kunalspathak deleted the optimize-loop branch on May 6, 2024 at 22:21
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024
* Add the missing else

* Max, MaxAcross, MaxNumber, MaxNumberAcross, Min, MinAcross, MinNumber, MinNumberAcross

* Map APIs to instruction

* Add test cases

* Remove the space

* fix the test case

* Add handling of delay free

* fix some errors

* wip: morph optimization

* Track TYP_MASK for arm64

* Enable mov predicate registers

* jit format
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
@github-actions bot locked and limited conversation to collaborators on Jun 6, 2024
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), arm-sve (Work related to arm64 SVE/SVE2 support)