[LoongArch] Fix fp_to_uint/fp_to_sint conversion errors for lasx #137129

Merged (1 commit, May 7, 2025)
28 changes: 14 additions & 14 deletions llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td
@@ -1815,24 +1815,24 @@ def : Pat<(v4f32 (uint_to_fp v4i64:$vj)),
// XVFTINTRZ_{W_S/L_D}
def : Pat<(v8i32 (fp_to_sint v8f32:$vj)), (XVFTINTRZ_W_S v8f32:$vj)>;
def : Pat<(v4i64 (fp_to_sint v4f64:$vj)), (XVFTINTRZ_L_D v4f64:$vj)>;
-def : Pat<(v4i64 (fp_to_sint v4f32:$vj)),
-          (VEXT2XV_D_W (SUBREG_TO_REG (i64 0), (VFTINTRZ_W_S v4f32:$vj),
-                        sub_128))>;
-def : Pat<(v4i32 (fp_to_sint (v4f64 LASX256:$vj))),
-          (EXTRACT_SUBREG (XVFTINTRZ_W_S (XVFCVT_S_D (XVPERMI_D v4f64:$vj, 238),
-                                          v4f64:$vj)),
-                          sub_128)>;
+def : Pat<(v4i64(fp_to_sint v4f32:$vj)), (VEXT2XV_D_W(SUBREG_TO_REG(i64 0),
+                                          (VFTINTRZ_W_S v4f32:$vj),
+                                          sub_128))>;
+def : Pat<(v4i32(fp_to_sint v4f64:$vj)),
+          (EXTRACT_SUBREG(XVPICKEV_W(XVPERMI_D(XVFTINTRZ_L_D v4f64:$vj), 238),
+                          (XVFTINTRZ_L_D v4f64:$vj)),
+                          sub_128)>;

// XVFTINTRZ_{W_SU/L_DU}
def : Pat<(v8i32 (fp_to_uint v8f32:$vj)), (XVFTINTRZ_WU_S v8f32:$vj)>;
def : Pat<(v4i64 (fp_to_uint v4f64:$vj)), (XVFTINTRZ_LU_D v4f64:$vj)>;
-def : Pat<(v4i64 (fp_to_uint v4f32:$vj)),
-          (VEXT2XV_DU_WU (SUBREG_TO_REG (i64 0), (VFTINTRZ_WU_S v4f32:$vj),
-                          sub_128))>;
-def : Pat<(v4i32 (fp_to_uint (v4f64 LASX256:$vj))),
-          (EXTRACT_SUBREG (XVFTINTRZ_W_S (XVFCVT_S_D (XVPERMI_D v4f64:$vj, 238),
-                                          v4f64:$vj)),
-                          sub_128)>;
+def : Pat<(v4i64(fp_to_uint v4f32:$vj)), (VEXT2XV_DU_WU(SUBREG_TO_REG(i64 0),
+                                          (VFTINTRZ_WU_S v4f32:$vj),
+                                          sub_128))>;
+def : Pat<(v4i32(fp_to_uint v4f64:$vj)),
+          (EXTRACT_SUBREG(XVPICKEV_W(XVPERMI_D(XVFTINTRZ_LU_D v4f64:$vj), 238),
+                          (XVFTINTRZ_LU_D v4f64:$vj)),
+                          sub_128)>;

// XVPERMI_Q
foreach vt = [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64] in
4 changes: 2 additions & 2 deletions llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll
@@ -31,9 +31,9 @@ define void @fptosi_v4f64_v4i32(ptr %res, ptr %in){
; CHECK-LABEL: fptosi_v4f64_v4i32:
; CHECK: # %bb.0:
; CHECK-NEXT: xvld $xr0, $a1, 0
+; CHECK-NEXT:    xvftintrz.l.d $xr0, $xr0
 ; CHECK-NEXT:    xvpermi.d $xr1, $xr0, 238
-; CHECK-NEXT:    xvfcvt.s.d $xr0, $xr1, $xr0
-; CHECK-NEXT:    xvftintrz.w.s $xr0, $xr0
+; CHECK-NEXT:    xvpickev.w $xr0, $xr1, $xr0
; CHECK-NEXT: vst $vr0, $a0, 0
@heiher (Member) commented on Apr 24, 2025:

The previous implementation appears to convert f64 to f32 before performing the integer conversion, which introduces a loss of precision at the f64 -> f32 step.

d: float64
s: float32
w: int32

*a1: d0, d1, d2, d3
xr0: d3, d2, d1, d0 // xvld $xr0, $a1, 0
xr1: d3, d2, d3, d2 // xvpermi.d $xr1, $xr0, 238
xr0: s3, s2, s3, s2, s3, s2, s1, s0 // xvfcvt.s.d $xr0, $xr1, $xr0
                                       ^
                                       +-- Loss of precision at this step
xr0: w3, w2, w3, w2, w3, w2, w1, w0 // xvftintrz.w.s $xr0, $xr0
vr0: w3, w2, w1, w0 // vst $vr0, $a0, 0
*a0: w0, w1, w2, w3

In the updated version, it seems the f64 values are first converted to i64 and then truncated to i32. This can produce incorrect results when the original f64 values exceed the range representable by (signed) i32.

Would it make sense to go with a direct f64 -> i32 conversion instead?

*a1: d0, d1, d2, d3
xr0: d3, d2, d1, d0 // xvld $xr0, $a1, 0
xr1: d3, d2, d3, d2 // xvpermi.d $xr1, $xr0, 238
xr0: w3, w2, w3, w2, w3, w2, w1, w0 // xvftintrz.w.d $xr0, $xr1, $xr0
vr0: w3, w2, w1, w0 // vst $vr0, $a0, 0
*a0: w0, w1, w2, w3
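The precision loss the trace points at can be reproduced with a scalar sketch in Python (`struct` round-trip models the f32 rounding of `xvfcvt.s.d`; the value 2**24 + 1 is just an illustrative pick):

```python
import struct

def to_f32(x):
    """Round an f64 to the nearest f32, modeling the xvfcvt.s.d step."""
    return struct.unpack("f", struct.pack("f", x))[0]

d = 16777217.0            # 2**24 + 1: exact in f64, not representable in f32
via_f32 = int(to_f32(d))  # old lowering: f64 -> f32 -> i32
direct = int(d)           # direct f64 -> i32 truncation
print(via_f32, direct)    # 16777216 16777217: the f32 detour drops the last bit
```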

@tangaac (Contributor, author) commented on Apr 25, 2025:

3A6000            Latency   Throughput
xvftintrz.w.d     5         2
xvftintrz.l.d     4         4
xvpickev.w        1         4

Which is faster: xvftintrz.w.d alone, or xvftintrz.l.d + xvpickev.w?
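Taking the quoted numbers at face value, a quick sketch (assuming the two instructions of the split sequence form a single dependency chain, and ignoring issue width) shows the critical-path latencies tie, so the decision rests on throughput:

```python
# 3A6000 numbers as quoted above (latency in cycles, throughput as listed).
lat = {"xvftintrz.w.d": 5, "xvftintrz.l.d": 4, "xvpickev.w": 1}

direct_chain = lat["xvftintrz.w.d"]                     # single instruction
split_chain = lat["xvftintrz.l.d"] + lat["xvpickev.w"]  # l.d result feeds pickev
print(direct_chain, split_chain)  # 5 5: equal latency along the dependency chain
```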

A Member replied:

Just to correct my earlier comment - it was inaccurate and has been deleted. :P

According to the LLVM IR fptosi spec:

If the value cannot fit in ty2, the result is a poison value.

So unless overflow/invalid recording is required, using xvftintrz.l.d + xvpickev.w is also valid, and this sequence may offer better throughput.
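A scalar sketch of why that holds: for every input already in i32 range, the direct conversion and the i64-then-truncate route agree, and out-of-range inputs yield poison anyway, so the two lowerings may differ there (Python model; function names are illustrative):

```python
def fptosi_direct(d):
    """Direct f64 -> i32; out-of-range inputs are poison per the fptosi spec."""
    return int(d)  # int() truncates toward zero, like the ftintrz instructions

def fptosi_via_i64(d):
    """f64 -> i64 (xvftintrz.l.d), then keep the low 32 bits (xvpickev.w)."""
    w = int(d) & 0xFFFFFFFF
    return w - (1 << 32) if w >= (1 << 31) else w  # reinterpret as signed i32

for d in (-2147483648.0, -1.9, 0.0, 1.5, 2147483647.0):  # all within i32 range
    assert fptosi_direct(d) == fptosi_via_i64(d)
```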

A Member replied:
Ref: https://reviews.llvm.org/D128900#3630052 LGTM.

Just a note: On the LoongArch target, -ftrapping-math and -fno-trapping-math currently produce identical LLVM IR. This implies that floating-point exception behavior is not yet supported, which is a separate issue worth addressing.

; CHECK-NEXT: ret
%v0 = load <4 x double>, ptr %in
4 changes: 2 additions & 2 deletions llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll
@@ -31,9 +31,9 @@ define void @fptoui_v4f64_v4i32(ptr %res, ptr %in){
; CHECK-LABEL: fptoui_v4f64_v4i32:
; CHECK: # %bb.0:
; CHECK-NEXT: xvld $xr0, $a1, 0
+; CHECK-NEXT:    xvftintrz.lu.d $xr0, $xr0
 ; CHECK-NEXT:    xvpermi.d $xr1, $xr0, 238
-; CHECK-NEXT:    xvfcvt.s.d $xr0, $xr1, $xr0
-; CHECK-NEXT:    xvftintrz.w.s $xr0, $xr0
+; CHECK-NEXT:    xvpickev.w $xr0, $xr1, $xr0
; CHECK-NEXT: vst $vr0, $a0, 0
@heiher (Member) commented on Apr 24, 2025:

LGTM.

d: float64
l: int64

*a1: d0, d1, d2, d3
xr0: d3, d2, d1, d0 // xvld $xr0, $a1, 0
xr0: l3, l2, l1, l0 // xvftintrz.lu.d $xr0, $xr0
xr0: {l3h,l3l}, {l2h,l2l}, {l1h,l1l}, {l0h,l0l}
xr1: l3, l2, l3, l2 // xvpermi.d $xr1, $xr0, 238
xr1: {l3h,l3l}, {l2h,l2l}, {l3h,l3l}, {l2h,l2l}
xr0: l3l, l2l, l3l, l2l, l3l, l2l, l1l, l0l // xvpickev.w $xr0, $xr1, $xr0
vr0: l3l, l2l, l1l, l0l // vst $vr0, $a0, 0
*a0: l0l, l1l, l2l, l3l
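The lane shuffle in that trace can be modeled in Python (lanes listed low-to-high on a little-endian layout; the 0xA0+k high halves are arbitrary markers, and the per-half pickev behavior is inferred from the trace above):

```python
def xvpickev_w(xj, xk):
    """Model xvpickev.w on 8-lane i32 vectors: within each 128-bit half, the
    low two result words are the even words of xk, the high two the even
    words of xj."""
    out = []
    for h in (0, 4):
        out += [xk[h], xk[h + 2], xj[h], xj[h + 2]]
    return out

def to_words(quads):
    """Split (high, low) i64 lanes into i32 lanes, low half first."""
    return [w for hi, lo in quads for w in (lo, hi)]

l = [(0xA0 + k, k) for k in range(4)]      # l0..l3: i64 results of xvftintrz.lu.d
xr0 = to_words(l)
xr1 = to_words([l[2], l[3], l[2], l[3]])   # xvpermi.d $xr1, $xr0, 238
res = xvpickev_w(xr1, xr0)                 # xvpickev.w $xr0, $xr1, $xr0
print(res[:4])  # [0, 1, 2, 3]: the low halves l0l..l3l, stored by vst
```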

; CHECK-NEXT: ret
%v0 = load <4 x double>, ptr %in