[AMDGPU] Fix SIFoldOperandsImpl::canUseImmWithOpSel() for VOP3 packed {B}F16 imms. #142142


Merged

merged 1 commit into llvm:main on May 30, 2025

Conversation

dfukalov
Collaborator

VOP3 instructions ignore opsel source modifiers, so a constant that contains two different {B}F16 imms cannot be encoded into an instruction with a src opsel.

E.g. without the fix the following instructions

s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>
v_cvt_scalef32_pk_fp8_f16 v0, s0, v2

lose the 2.0 imm and are folded into

v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0

Fixes SWDEV-531672

… {B}F16 imms.

VOP3 instructions ignore opsel source modifiers, so a constant that contains
two different {B}F16 imms cannot be encoded into an instruction with a src opsel.

E.g. without the fix the following instructions

  s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>
  v_cvt_scalef32_pk_fp8_f16 v0, s0, v2

lose the 2.0 imm and are folded into

  v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR addresses an issue with VOP3 packed instructions where distinct half-precision immediate values were improperly folded, leading to incorrect constant encodings.

  • Added multiple test cases for different conversion instructions to verify correct handling of immediate values.
  • Updated SIFoldOperandsImpl::canUseImmWithOpSel to reject immediates with differing high and low half-words.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk.gfx950.ll Added test functions to verify immediate encoding in various scenarios.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp Modified operand folding logic to ensure distinct half values in immediates are not erroneously folded.
Comments suppressed due to low confidence (1)

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp:380

  • Consider adding explicit parentheses around the shift operation for clarity (e.g., static_cast<uint16_t>(Fold.ImmToFold >> 16)) to ensure the intended operator precedence is clear to future maintainers.
static_cast<uint16_t>(Fold.ImmToFold) != static_cast<uint16_t>(Fold.ImmToFold >> 16)

@llvmbot
Member

llvmbot commented May 30, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Daniil Fukalov (dfukalov)

Changes

VOP3 instructions ignore opsel source modifiers, so a constant that contains two different {B}F16 imms cannot be encoded into an instruction with a src opsel.

E.g. without the fix the following instructions

s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>
v_cvt_scalef32_pk_fp8_f16 v0, s0, v2

lose the 2.0 imm and are folded into

v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0

Fixes SWDEV-531672


Full diff: https://github.com/llvm/llvm-project/pull/142142.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk.gfx950.ll (+152)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index cc18d6b4aba10..4f51febe8f92e 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -374,6 +374,12 @@ bool SIFoldOperandsImpl::canUseImmWithOpSel(FoldCandidate &Fold) const {
   case AMDGPU::OPERAND_REG_INLINE_C_V2FP16:
   case AMDGPU::OPERAND_REG_INLINE_C_V2BF16:
   case AMDGPU::OPERAND_REG_INLINE_C_V2INT16:
+    // VOP3 packed instructions ignore op_sel source modifiers, we cannot encode
+    // two different constants.
+    if ((TSFlags & SIInstrFlags::VOP3) && !(TSFlags & SIInstrFlags::VOP3P) &&
+        static_cast<uint16_t>(Fold.ImmToFold) !=
+            static_cast<uint16_t>(Fold.ImmToFold >> 16))
+      return false;
     break;
   }
 
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk.gfx950.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk.gfx950.ll
index 1ab27337632b6..4b113d80dd0e9 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk.gfx950.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk.gfx950.ll
@@ -601,6 +601,34 @@ define <2 x i16> @test_cvt_scalef32_pk_fp8_f16_word1(<2 x i16> %old, <2 x half>
   ret <2 x i16> %ret
 }
 
+define <2 x i16> @test_cvt_scalef32_pk_fp8_f16_imm1(<2 x i16> %old, float %scale) {
+; GCN-LABEL: test_cvt_scalef32_pk_fp8_f16_imm1:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_cvt_scalef32_pk_fp8_f16 v0, 4.0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.fp8.f16(<2 x i16> %old, <2 x half> <half 4.0, half 4.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
+define <2 x i16> @test_cvt_scalef32_pk_fp8_f16_imm2(<2 x i16> %old, float %scale) {
+; GFX950-SDAG-LABEL: test_cvt_scalef32_pk_fp8_f16_imm2:
+; GFX950-SDAG:       ; %bb.0:
+; GFX950-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-SDAG-NEXT:    s_mov_b32 s0, 0x40004400
+; GFX950-SDAG-NEXT:    v_cvt_scalef32_pk_fp8_f16 v0, s0, v1
+; GFX950-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX950-GISEL-LABEL: test_cvt_scalef32_pk_fp8_f16_imm2:
+; GFX950-GISEL:       ; %bb.0:
+; GFX950-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-GISEL-NEXT:    v_mov_b32_e32 v2, 0x40004400
+; GFX950-GISEL-NEXT:    v_cvt_scalef32_pk_fp8_f16 v0, v2, v1
+; GFX950-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.fp8.f16(<2 x i16> %old, <2 x half> <half 4.0, half 2.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
 define <2 x i16> @test_cvt_scalef32_pk_fp8_bf16_word0(<2 x i16> %old, <2 x bfloat> %src, float %scale) {
 ; GCN-LABEL: test_cvt_scalef32_pk_fp8_bf16_word0:
 ; GCN:       ; %bb.0:
@@ -621,6 +649,27 @@ define <2 x i16> @test_cvt_scalef32_pk_fp8_bf16_word1(<2 x i16> %old, <2 x bfloa
   ret <2 x i16> %ret
 }
 
+define <2 x i16> @test_cvt_scalef32_pk_fp8_bf16_imm1(<2 x i16> %old, float %scale) {
+; GCN-LABEL: test_cvt_scalef32_pk_fp8_bf16_imm1:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_cvt_scalef32_pk_fp8_bf16 v0, 4.0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.fp8.bf16(<2 x i16> %old, <2 x bfloat> <bfloat 4.0, bfloat 4.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
+define <2 x i16> @test_cvt_scalef32_pk_fp8_bf16_imm2(<2 x i16> %old, float %scale) {
+; GCN-LABEL: test_cvt_scalef32_pk_fp8_bf16_imm2:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    s_mov_b32 s0, 0x40004080
+; GCN-NEXT:    v_cvt_scalef32_pk_fp8_bf16 v0, s0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.fp8.bf16(<2 x i16> %old, <2 x bfloat> <bfloat 4.0, bfloat 2.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
 define <2 x i16> @test_cvt_scalef32_pk_bf8_f16_word0(<2 x i16> %old, <2 x half> %src, float %scale) {
 ; GCN-LABEL: test_cvt_scalef32_pk_bf8_f16_word0:
 ; GCN:       ; %bb.0:
@@ -641,6 +690,34 @@ define <2 x i16> @test_cvt_scalef32_pk_bf8_f16_word1(<2 x i16> %old, <2 x half>
   ret <2 x i16> %ret
 }
 
+define <2 x i16> @test_cvt_scalef32_pk_bf8_f16_imm1(<2 x i16> %old, float %scale) {
+; GCN-LABEL: test_cvt_scalef32_pk_bf8_f16_imm1:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_cvt_scalef32_pk_bf8_f16 v0, 4.0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.bf8.f16(<2 x i16> %old, <2 x half> <half 4.0, half 4.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
+define <2 x i16> @test_cvt_scalef32_pk_bf8_f16_imm2(<2 x i16> %old, float %scale) {
+; GFX950-SDAG-LABEL: test_cvt_scalef32_pk_bf8_f16_imm2:
+; GFX950-SDAG:       ; %bb.0:
+; GFX950-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-SDAG-NEXT:    s_mov_b32 s0, 0x40004400
+; GFX950-SDAG-NEXT:    v_cvt_scalef32_pk_bf8_f16 v0, s0, v1
+; GFX950-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX950-GISEL-LABEL: test_cvt_scalef32_pk_bf8_f16_imm2:
+; GFX950-GISEL:       ; %bb.0:
+; GFX950-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-GISEL-NEXT:    v_mov_b32_e32 v2, 0x40004400
+; GFX950-GISEL-NEXT:    v_cvt_scalef32_pk_bf8_f16 v0, v2, v1
+; GFX950-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.bf8.f16(<2 x i16> %old, <2 x half> <half 4.0, half 2.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
 define <2 x i16> @test_cvt_scalef32_pk_bf8_bf16_word0(<2 x i16> %old, <2 x bfloat> %src, float %scale) {
 ; GCN-LABEL: test_cvt_scalef32_pk_bf8_bf16_word0:
 ; GCN:       ; %bb.0:
@@ -661,6 +738,27 @@ define <2 x i16> @test_cvt_scalef32_pk_bf8_bf16_word1(<2 x i16> %old, <2 x bfloa
   ret <2 x i16> %ret
 }
 
+define <2 x i16> @test_cvt_scalef32_pk_bf8_bf16_imm1(<2 x i16> %old, float %scale) {
+; GCN-LABEL: test_cvt_scalef32_pk_bf8_bf16_imm1:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_cvt_scalef32_pk_bf8_bf16 v0, 4.0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.bf8.bf16(<2 x i16> %old, <2 x bfloat> <bfloat 4.0, bfloat 4.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
+define <2 x i16> @test_cvt_scalef32_pk_bf8_bf16_imm2(<2 x i16> %old, float %scale) {
+; GCN-LABEL: test_cvt_scalef32_pk_bf8_bf16_imm2:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    s_mov_b32 s0, 0x40004080
+; GCN-NEXT:    v_cvt_scalef32_pk_bf8_bf16 v0, s0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call <2 x i16> @llvm.amdgcn.cvt.scalef32.pk.bf8.bf16(<2 x i16> %old, <2 x bfloat> <bfloat 4.0, bfloat 2.0>, float %scale, i1 false)
+  ret <2 x i16> %ret
+}
+
 define <2 x float> @test_cvt_scale_f32_fp4_byte0(i32 %src, float %scale) {
 ; GCN-LABEL: test_cvt_scale_f32_fp4_byte0:
 ; GCN:       ; %bb.0:
@@ -1236,6 +1334,37 @@ define i32 @test_cvt_scalef32_fp4_f16_byte3(<2 x half> %src0, float %scale, i32
   ret i32 %ret
 }
 
+define i32 @test_cvt_scalef32_fp4_f16_imm1(float %scale, i32 %old) {
+; GCN-LABEL: test_cvt_scalef32_fp4_f16_imm1:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_cvt_scalef32_pk_fp4_f16 v1, 4.0, v0
+; GCN-NEXT:    v_mov_b32_e32 v0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call i32 @llvm.amdgcn.cvt.scalef32.pk.fp4.f16(i32 %old, <2 x half> <half 4.0, half 4.0>, float %scale, i32 0)
+  ret i32 %ret
+}
+
+define i32 @test_cvt_scalef32_fp4_f16_imm2(float %scale, i32 %old) {
+; GFX950-SDAG-LABEL: test_cvt_scalef32_fp4_f16_imm2:
+; GFX950-SDAG:       ; %bb.0:
+; GFX950-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-SDAG-NEXT:    s_mov_b32 s0, 0x40004400
+; GFX950-SDAG-NEXT:    v_cvt_scalef32_pk_fp4_f16 v1, s0, v0
+; GFX950-SDAG-NEXT:    v_mov_b32_e32 v0, v1
+; GFX950-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX950-GISEL-LABEL: test_cvt_scalef32_fp4_f16_imm2:
+; GFX950-GISEL:       ; %bb.0:
+; GFX950-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-GISEL-NEXT:    v_mov_b32_e32 v2, 0x40004400
+; GFX950-GISEL-NEXT:    v_cvt_scalef32_pk_fp4_f16 v1, v2, v0
+; GFX950-GISEL-NEXT:    v_mov_b32_e32 v0, v1
+; GFX950-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call i32 @llvm.amdgcn.cvt.scalef32.pk.fp4.f16(i32 %old, <2 x half> <half 4.0, half 2.0>, float %scale, i32 0)
+  ret i32 %ret
+}
+
 define i32 @test_cvt_scalef32_fp4_bf16_byte0(<2 x bfloat> %src0, float %scale, i32 %old) {
 ; GCN-LABEL: test_cvt_scalef32_fp4_bf16_byte0:
 ; GCN:       ; %bb.0:
@@ -1283,6 +1412,29 @@ define i32 @test_cvt_scalef32_fp4_bf16_byte3(<2 x bfloat> %src0, float %scale, i
   ret i32 %ret
 }
 
+define i32 @test_cvt_scalef32_fp4_bf16_imm1(float %scale, i32 %old) {
+; GCN-LABEL: test_cvt_scalef32_fp4_bf16_imm1:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_cvt_scalef32_pk_fp4_bf16 v1, 4.0, v0
+; GCN-NEXT:    v_mov_b32_e32 v0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call i32 @llvm.amdgcn.cvt.scalef32.pk.fp4.bf16(i32 %old, <2 x bfloat> <bfloat 4.0, bfloat 4.0>, float %scale, i32 0)
+  ret i32 %ret
+}
+
+define i32 @test_cvt_scalef32_fp4_bf16_imm2(float %scale, i32 %old) {
+; GCN-LABEL: test_cvt_scalef32_fp4_bf16_imm2:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    s_mov_b32 s0, 0x40004080
+; GCN-NEXT:    v_cvt_scalef32_pk_fp4_bf16 v1, s0, v0
+; GCN-NEXT:    v_mov_b32_e32 v0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  %ret = tail call i32 @llvm.amdgcn.cvt.scalef32.pk.fp4.bf16(i32 %old, <2 x bfloat> <bfloat 4.0, bfloat 2.0>, float %scale, i32 0)
+  ret i32 %ret
+}
+
 define amdgpu_ps void @test_scalef32_pk32_fp6_f32_vv_inreg_src(<16 x float> inreg %src, float %scale, ptr addrspace(1) %out) {
 ; GFX950-SDAG-LABEL: test_scalef32_pk32_fp6_f32_vv_inreg_src:
 ; GFX950-SDAG:       ; %bb.0:

@dfukalov dfukalov merged commit 5208f72 into llvm:main May 30, 2025
13 checks passed