Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SelectionDAG] Add a new ISD Node for vector saturating truncation #85903

Closed
sun-jacobi opened this issue Mar 20, 2024 · 24 comments · Fixed by #99418
Closed

[SelectionDAG] Add a new ISD Node for vector saturating truncation #85903

sun-jacobi opened this issue Mar 20, 2024 · 24 comments · Fixed by #99418

Comments

@sun-jacobi
Copy link
Member

sun-jacobi commented Mar 20, 2024

Based on the #73424. There were several patches and issues on the vector saturating truncation operation on different platform.

For instance, #68466 for x86 and #75145 for RISC-V vector extension. I think it is better for us to do a target-independent combine for this operation, which could be expressed as a sequence of LLVM IR like:

define void @trunc_sat_i8i16(ptr %x, ptr %y) {
  %1 = load <8 x i16>, ptr %x, align 16
  %2 = tail call <8 x i16> @llvm.smax.v8i16(<8 x i16> %1, <8 x i16> <i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128>)
  %3 = tail call <8 x i16> @llvm.smin.v8i16(<8 x i16> %2, <8 x i16> <i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127>)
  %4 = trunc <8 x i16> %3 to <8 x i8>
  store <8 x i8> %4, ptr %y, align 8
  ret void
}

Especially, I think this combine could be done at the SelectionDAG, like the other saturating operation such as ISD::SADDSAT. This could remove complex TableGen pattern matching and refactor each platform's CodeGen path for this operation.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 20, 2024

@llvm/issue-subscribers-backend-x86

Author: Chia (sun-jacobi)

Based on the #73424. There were several patches and issues on the vector saturating truncation operation on different platform.

For instance, #68466 for x86, #75145 for RISC-V vector extension. I think it is better for us to do a target-independent combine for this operation, which could expressed as a sequence of LLVM IR like:

define void @<!-- -->trunc_sat_i8i16(ptr %x, ptr %y) {
  %1 = load &lt;8 x i16&gt;, ptr %x, align 16
  %2 = tail call &lt;8 x i16&gt; @<!-- -->llvm.smax.v8i16(&lt;8 x i16&gt; %1, &lt;8 x i16&gt; &lt;i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128&gt;)
  %3 = tail call &lt;8 x i16&gt; @<!-- -->llvm.smin.v8i16(&lt;8 x i16&gt; %2, &lt;8 x i16&gt; &lt;i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127&gt;)
  %4 = trunc &lt;8 x i16&gt; %3 to &lt;8 x i8&gt;
  store &lt;8 x i8&gt; %4, ptr %y, align 8
  ret void
}

Especially, I think this combine could be done at the SelectionDAG, like the other saturating operation such as ISD::SADDSAT. This could remove complex TableGen pattern matching for this operation.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 20, 2024

@llvm/issue-subscribers-backend-risc-v

Author: Chia (sun-jacobi)

Based on the #73424. There were several patches and issues on the vector saturating truncation operation on different platform.

For instance, #68466 for x86, #75145 for RISC-V vector extension. I think it is better for us to do a target-independent combine for this operation, which could expressed as a sequence of LLVM IR like:

define void @<!-- -->trunc_sat_i8i16(ptr %x, ptr %y) {
  %1 = load &lt;8 x i16&gt;, ptr %x, align 16
  %2 = tail call &lt;8 x i16&gt; @<!-- -->llvm.smax.v8i16(&lt;8 x i16&gt; %1, &lt;8 x i16&gt; &lt;i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128&gt;)
  %3 = tail call &lt;8 x i16&gt; @<!-- -->llvm.smin.v8i16(&lt;8 x i16&gt; %2, &lt;8 x i16&gt; &lt;i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127&gt;)
  %4 = trunc &lt;8 x i16&gt; %3 to &lt;8 x i8&gt;
  store &lt;8 x i8&gt; %4, ptr %y, align 8
  ret void
}

Especially, I think this combine could be done at the SelectionDAG, like the other saturating operation such as ISD::SADDSAT. This could remove complex TableGen pattern matching for this operation.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 20, 2024

@llvm/issue-subscribers-backend-aarch64

Author: Chia (sun-jacobi)

Based on the #73424. There were several patches and issues on the vector saturating truncation operation on different platform.

For instance, #68466 for x86, #75145 for RISC-V vector extension. I think it is better for us to do a target-independent combine for this operation, which could expressed as a sequence of LLVM IR like:

define void @<!-- -->trunc_sat_i8i16(ptr %x, ptr %y) {
  %1 = load &lt;8 x i16&gt;, ptr %x, align 16
  %2 = tail call &lt;8 x i16&gt; @<!-- -->llvm.smax.v8i16(&lt;8 x i16&gt; %1, &lt;8 x i16&gt; &lt;i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128&gt;)
  %3 = tail call &lt;8 x i16&gt; @<!-- -->llvm.smin.v8i16(&lt;8 x i16&gt; %2, &lt;8 x i16&gt; &lt;i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127&gt;)
  %4 = trunc &lt;8 x i16&gt; %3 to &lt;8 x i8&gt;
  store &lt;8 x i8&gt; %4, ptr %y, align 8
  ret void
}

Especially, I think this combine could be done at the SelectionDAG, like the other saturating operation such as ISD::SADDSAT. This could remove complex TableGen pattern matching for this operation.

@dtcxzyw
Copy link
Member

dtcxzyw commented Mar 20, 2024

cc @topperc @RKSimon @preames @asb

@RKSimon
Copy link
Collaborator

RKSimon commented Mar 20, 2024

I'm sure this has come up before as a phab review but I can't find it right now (@davemgreen can you remember?).

What would be awesome is if there was a way to handle the 2 SSE PACK instruction signed saturations:

/// PACKSS (normal signed sat)
/// (truncate (smin ((smax (x, signed_min_of_dest_type)),
///                  signed_max_of_dest_type)) to dest_type)
/// or:
/// (truncate (smax ((smin (x, signed_max_of_dest_type)),
///                  signed_min_of_dest_type)) to dest_type).
///
/// PACKUS (signed sat of unsigned range) - NOT the same as USAT
/// the smax/smin range is [0, unsigned_max_of_dest_type].

@sun-jacobi
Copy link
Member Author

@RKSimon Thank you for the information. Do you mean we also need to add a new ISD Node for signed sat of unsigned range?

@sun-jacobi sun-jacobi changed the title [SelectionDAG] Add new ISD Node for vector saturating truncation [SelectionDAG] Add a new ISD Node for vector saturating truncation Mar 20, 2024
@RKSimon
Copy link
Collaborator

RKSimon commented Mar 20, 2024

I don't know yet - I think what you were proposing was we'd have a ISD::TRUNCSSAT node that just used the signed smin/smax clamp limits of the destination type (and ISD::TRUNCUSAT that had a umax limit?).

What I'm wondering is how useful it would be to add extra args describing the limits - I just don't want to end up having a target node that has to match the generic nodes, especially if we end up with more complex folds for cases such as sat-trunc-store patterns.

@davemgreen
Copy link
Collaborator

For MVE this is spelled ARMISD::VQMOVNs/ARMISD::VQMOVNu, but as it uses the top/bottom instructions the new ISD might not be a good fit.

It would be better for AArch64/Neon (the SQXTN/UQXTN instructions) - there I think I added tablegen patterns to keep the nodes generic as long as possible, but so long as the new nodes handled known/demanded bits and whatnot then it should be a good replacement.

@ParkHanbum
Copy link
Contributor

Can a beginner take over this issue? if may, I'm very glad to take.

@topperc
Copy link
Collaborator

topperc commented Jul 4, 2024

The RISC-V code was further modified in #93596 #93728 #93756 #93752. There are no longer complex tablegen patterns, but there is a bunch of target specific DAG combine code.

@ParkHanbum
Copy link
Contributor

ParkHanbum commented Jul 4, 2024

@topperc Thanks for letting me know. If there are any architectures that have not been patched yet, I will refer to the link you provided.

@topperc
Copy link
Collaborator

topperc commented Jul 4, 2024

@topperc Thanks for letting us know. If there are any architectures that have not been patched yet, I will refer to the link you provided.

I think the RISC-V specific DAGCombine code could be removed if we had a generic saturation node. We'd still need custom lowering to split something like a vXi32->vXi8 saturating truncate into multiple vnclip instructions since we can only truncate vXi32->vXi16 or vXi16->vXi8 in a single instruction.

@sun-jacobi
Copy link
Member Author

sun-jacobi commented Jul 4, 2024

@topperc Thanks for letting me know. If there are any architectures that have not been patched yet, I will refer to the link you provided.

@ParkHanbum What I proposed is actually a generic saturation node (just as @topperc said). The proposed node is somewhat difficult to work with x86, but AArch64/Neon and RISCV-V would benefit.

If you are interested in this, feel free to try it :)

@RKSimon
Copy link
Collaborator

RKSimon commented Jul 4, 2024

Starting with adding ISD::TRUNCSSAT would make sense - cheers.

@ParkHanbum
Copy link
Contributor

I'm on duty for this. as @RKSimon guided, implemented (is it correctly?) truncate_ssat first.

Legalized selection DAG: %bb.0 'saturate2:'
SelectionDAG has 7 nodes:
  t0: ch,glue = EntryToken
      t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
    t13: v8i8 = truncate_ssat t2
  t11: ch,glue = CopyToReg t0, Register:v8i8 $d0, t13
  t12: ch = AArch64ISD::RET_GLUE t11, Register:v8i8 $d0, t11:1

ParkHanbum@f88709c#diff-d7065626b3d269e24241429ce037d51fc91d5ead5896d67fcc038aefc1111fd2

I should start implementing architecture-specific DAGToDAGISel now, right? 😄

@inclyc
Copy link
Member

inclyc commented Jul 8, 2024

Hi @ParkHanbum,

I should start implementing architecture-specific DAGToDAGISel now, right?

I think maybe modifications on llvm/lib/Target/*.td files would be sufficient. You can start looking there.

@ParkHanbum
Copy link
Contributor

Hi @inclyc thanks for advice. I'll find what can I do!

@RKSimon
Copy link
Collaborator

RKSimon commented Jul 8, 2024

@ParkHanbum You might want to use something like 2887f14 as reference about introducing a new ISD node, although in that case this was a conversion of an existing aarch64 opcode

@ParkHanbum
Copy link
Contributor

@RKSimon thanks!! it is very helpful!!

@ParkHanbum
Copy link
Contributor

ParkHanbum commented Jul 9, 2024

can I ask a question?
is we cannot support i64 type for this issue in aarch64 and x86?

because I found it does not supported in aarch64&x86 but supported in riscv.

in this testcase :

define <2 x i32> @vqmovni64_smaxmin(<2 x i64> %s0) {
entry:
  %c1 = icmp slt <2 x i64> %s0, <i64 2147483647, i64 2147483647>
  %s1 = select <2 x i1> %c1, <2 x i64> %s0, <2 x i64> <i64 2147483647, i64 2147483647>
  %c2 = icmp sgt <2 x i64> %s1, <i64 -2147483648, i64 -2147483648>
  %s2 = select <2 x i1> %c2, <2 x i64> %s1, <2 x i64> <i64 -2147483648, i64 -2147483648>
  %t = trunc <2 x i64> %s2 to <2 x i32>
  ret <2 x i32> %t
}

result of aarch64:

vqmovni64_smaxmin:                      // @vqmovni64_smaxmin
        mov     w8, #2147483647                 // =0x7fffffff
        dup     v1.2d, x8
        mov     x8, #-2147483648                // =0xffffffff80000000
        cmgt    v2.2d, v1.2d, v0.2d
        bif     v0.16b, v1.16b, v2.16b
        dup     v1.2d, x8
        cmgt    v2.2d, v0.2d, v1.2d
        bif     v0.16b, v1.16b, v2.16b
        xtn     v0.2s, v0.2d
        ret

result of riscv:

vqmovni64_smaxmin:                      # @vqmovni64_smaxmin
        vsetivli        zero, 2, e32, mf2, ta, ma
        vnclip.wi       v8, v8, 0
        ret

link : https://godbolt.org/z/7W6baE83c

and as I tested result, that trunc at @vqmovni64_smaxmin is saturated.


Optimized legalized selection DAG: %bb.0 'vqmovni64_smaxmin:entry'
SelectionDAG has 7 nodes:
  t0: ch,glue = EntryToken
      t2: v2i64,ch = CopyFromReg t0, Register:v2i64 %0
    t17: v2i32 = truncate_ssat t2
  t15: ch,glue = CopyToReg t0, Register:v2i32 $d0, t17
  t16: ch = AArch64ISD::RET_GLUE t15, Register:v2i32 $d0, t15:1


@RKSimon
Copy link
Collaborator

RKSimon commented Jul 9, 2024

I expect you will have to add legalizer support as well - not just handle known legal instructions.

Search for ISD::TRUNCATE in the llvm\lib\CodeGen\SelectionDAG\Legalize* files - its likely you will just have duplicate some of this code with a suitable smin/smax clamping before truncation.

@ParkHanbum
Copy link
Contributor

Thanks @RKSimon, the commits you shared earlier gave me what I have to do.

The reason I asked is that I wanted to make sure that the cases that currently appear to be unsupported are still unsupportable.

ParkHanbum added a commit to ParkHanbum/llvm-project that referenced this issue Jul 17, 2024
`truncate` is `saturated` if no additional conversion is required
between the target and return values. if the target is `saturated`
when attempting to crop from a `vector`, there is an opportunity
to optimize it.

previously, each architecture had an attemping optimization, so there
was redundant code.

this patch implements common logic by adding `ISD::TRUNCATE_[US]SAT`
to indicate saturated truncate.

Fixes llvm#85903
@ParkHanbum
Copy link
Contributor

I've implemented truncate_[us]at for aarch64, riscv. x86 has a complicated implementation and I think I need to do more work on it, can I send a PR with the implemented one first?

@RKSimon
Copy link
Collaborator

RKSimon commented Jul 17, 2024

Please raise a PR, keep it as draft if you want (but people will comment whatever :))

ParkHanbum added a commit to ParkHanbum/llvm-project that referenced this issue Jul 24, 2024
`truncate` is `saturated` if no additional conversion is required
between the target and return values. if the target is `saturated`
when attempting to crop from a `vector`, there is an opportunity
to optimize it.

previously, each architecture had an attemping optimization, so there
was redundant code.

this patch implements common logic by adding `ISD::TRUNCATE_[US]SAT`
to indicate saturated truncate.

Fixes llvm#85903
ParkHanbum added a commit to ParkHanbum/llvm-project that referenced this issue Jul 25, 2024
`truncate` is `saturated` if no additional conversion is required
between the target and return values. if the target is `saturated`
when attempting to crop from a `vector`, there is an opportunity
to optimize it.

previously, each architecture had an attemping optimization, so there
was redundant code.

this patch implements common logic by adding `ISD::TRUNCATE_[US]SAT`
to indicate saturated truncate.

Fixes llvm#85903
ParkHanbum added a commit to ParkHanbum/llvm-project that referenced this issue Jul 25, 2024
`truncate` is `saturated` if no additional conversion is required
between the target and return values. if the target is `saturated`
when attempting to crop from a `vector`, there is an opportunity
to optimize it.

previously, each architecture had an attemping optimization, so there
was redundant code.

this patch implements common logic by adding `ISD::TRUNCATE_[US]SAT`
to indicate saturated truncate.

Fixes llvm#85903
ParkHanbum added a commit to ParkHanbum/llvm-project that referenced this issue Aug 9, 2024
A truncate is considered saturated if no additional conversion is
required between the target and return values. If the target is
saturated when attempting to truncate from a vector, there is an
opportunity to optimize it.

Previously, each architecture had its own attempt at optimization,
leading to redundant code. This patch implements common logic by
introducing three new ISDs:

 `ISD::TRUNCATE_SSAT_S`: When the operand is a signed value and
 the range of values matches the range of signed values of the
 destination type.

 `ISD::TRUNCATE_SSAT_U`: When the operand is a signed value and
 the range of values matches the range of unsigned values of the
 destination type.

 `ISD::TRUNCATE_USAT_U`: When the operand is an unsigned value and
 the range of values matches the range of unsigned values of the
 destination type.

These ISDs indicate a saturated truncate.

Fixes llvm#85903
bwendling pushed a commit to bwendling/llvm-project that referenced this issue Aug 15, 2024
A truncate is considered saturated if no additional conversion is required between the target and return values. If the target is saturated when attempting to truncate from a vector, there is an opportunity to optimize it.

Previously, each architecture had its own attempt at optimization, leading to redundant code. This patch implements common logic by introducing three new ISDs:

`ISD::TRUNCATE_SSAT_S`: When the operand is a signed value and  the range of values matches the range of signed values of the  destination type.

`ISD::TRUNCATE_SSAT_U`: When the operand is a signed value and the range of values matches the range of unsigned values of the destination type.

`ISD::TRUNCATE_USAT_U`: When the operand is an unsigned value and the range of values matches the range of unsigned values of the destination type.

These ISDs indicate a saturated truncate.

Fixes llvm#85903
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants