[AMDGPU][True16][CodeGen] legalize 16bit and 32bit use-def chain for moveToVALU in si-fix-sgpr-lowering #138734
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Brox Chen (broxigarchen)

Changes

Patch is 54.80 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/138734.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index e6d54860df221..0cabf09ec7f21 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -7235,24 +7235,44 @@ bool SIInstrWorklist::isDeferred(MachineInstr *MI) {
return DeferredList.contains(MI);
}
-// 16bit SALU use sgpr32. If a 16bit SALU get lowered to VALU in true16 mode,
-// sgpr32 is replaced to vgpr32 which is illegal in t16 inst. Need to add
-// subreg access properly. This can be removed after we have sgpr16 in place
-void SIInstrInfo::legalizeOperandsVALUt16(MachineInstr &Inst,
+// legalize operand between 16bit and 32bit registers in v2s copy
+// lowering (change spgr to vgpr).
+// This is mainly caused by 16bit SALU and 16bit VALU using reg with different
+// size. Need to legalize the size of the operands during the vgpr lowering
+// chain. This can be removed after we have sgpr16 in place
+void SIInstrInfo::legalizeOperandsVALUt16(MachineInstr &MI,
MachineRegisterInfo &MRI) const {
- unsigned Opcode = Inst.getOpcode();
- if (!AMDGPU::isTrue16Inst(Opcode) || !ST.useRealTrue16Insts())
+ if (!ST.useRealTrue16Insts())
return;
- for (MachineOperand &Op : Inst.explicit_operands()) {
+ unsigned Opcode = MI.getOpcode();
+ MachineBasicBlock *MBB = MI.getParent();
+
+ // legalize operands and check for size mismatch
+ for (MachineOperand &Op : MI.explicit_operands()) {
unsigned OpIdx = Op.getOperandNo();
if (!OpIdx)
continue;
- if (Op.isReg() && RI.isVGPR(MRI, Op.getReg())) {
+ if (Op.isReg() && Op.getReg().isVirtual() && RI.isVGPR(MRI, Op.getReg())) {
unsigned RCID = get(Opcode).operands()[OpIdx].RegClass;
- const TargetRegisterClass *RC = RI.getRegClass(RCID);
- if (RI.getRegSizeInBits(*RC) == 16) {
+ const TargetRegisterClass *ExpectedRC = RI.getRegClass(RCID);
+ const TargetRegisterClass *RC = MRI.getRegClass(Op.getReg());
+ if (32 == RI.getRegSizeInBits(*RC) &&
+ 16 == RI.getRegSizeInBits(*ExpectedRC)) {
Op.setSubReg(AMDGPU::lo16);
+ } else if (16 == RI.getRegSizeInBits(*RC) &&
+ 32 == RI.getRegSizeInBits(*ExpectedRC)) {
+ const DebugLoc &DL = MI.getDebugLoc();
+ Register NewDstReg =
+ MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
+ Register Undef = MRI.createVirtualRegister(&AMDGPU::VGPR_16RegClass);
+ BuildMI(*MBB, MI, DL, get(AMDGPU::IMPLICIT_DEF), Undef);
+ BuildMI(*MBB, MI, DL, get(AMDGPU::REG_SEQUENCE), NewDstReg)
+ .addReg(Op.getReg())
+ .addImm(AMDGPU::lo16)
+ .addReg(Undef)
+ .addImm(AMDGPU::hi16);
+ Op.setReg(NewDstReg);
}
}
}
@@ -7793,8 +7813,19 @@ void SIInstrInfo::moveToVALUImpl(SIInstrWorklist &Worklist,
.add(Inst.getOperand(1))
.add(MachineOperand::CreateImm(AMDGPU::lo16));
Inst.eraseFromParent();
-
MRI.replaceRegWith(DstReg, NewDstReg);
+ // legalize useMI with mismatched size
+ for (MachineRegisterInfo::use_iterator I = MRI.use_begin(NewDstReg),
+ E = MRI.use_end();
+ I != E; ++I) {
+ MachineInstr &UseMI = *I->getParent();
+ unsigned UseMIOpcode = UseMI.getOpcode();
+ if (AMDGPU::isTrue16Inst(UseMIOpcode) &&
+ (16 ==
+ RI.getRegSizeInBits(*getOpRegClass(UseMI, I.getOperandNo())))) {
+ I->setSubReg(AMDGPU::lo16);
+ }
+ }
addUsersToMoveToVALUWorklist(NewDstReg, MRI, Worklist);
return;
}
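
The first hunk above rewrites mismatched operands in place: a 32-bit VGPR feeding a 16-bit operand is narrowed with a lo16 subregister access, while a 16-bit VGPR feeding a 32-bit operand is widened through a REG_SEQUENCE with an undef high half. A minimal MIR sketch of the two directions, mirroring the new tests in the .mir diff below (illustrative register names, not verbatim pass output):

    ; 32-bit def feeding a 16-bit t16 operand: fold to the low half.
    %r:vgpr_16 = V_TRUNC_F16_t16_e64 0, %v32.lo16, 0, 0, 0, implicit $mode, implicit $exec

    ; 16-bit def feeding a 32-bit operand: widen with an undef high half.
    %undef:vgpr_16 = IMPLICIT_DEF
    %wide:vgpr_32 = REG_SEQUENCE %r, %subreg.lo16, %undef, %subreg.hi16
    %x:vgpr_32 = V_XOR_B32_e64 %wide, %v32, implicit $exec
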
diff --git a/llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-f16-true16.mir b/llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-f16-true16.mir
index 6e24d9afa2bbc..518a28ebe6539 100644
--- a/llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-f16-true16.mir
+++ b/llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-f16-true16.mir
@@ -54,6 +54,72 @@ body: |
%4:vgpr_16 = V_CVT_F16_U16_t16_e64 0, %3:sreg_32, 0, 0, 0, implicit $mode, implicit $exec
...
+---
+name: salu16_usedby_salu32
+body: |
+ bb.0:
+ ; GCN-LABEL: name: salu16_usedby_salu32
+ ; GCN: [[DEF:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[V_TRUNC_F16_t16_e64_:%[0-9]+]]:vgpr_16 = V_TRUNC_F16_t16_e64 0, [[DEF]].lo16, 0, 0, 0, implicit $mode, implicit $exec
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_16 = IMPLICIT_DEF
+ ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vgpr_32 = REG_SEQUENCE [[V_TRUNC_F16_t16_e64_]], %subreg.lo16, [[DEF2]], %subreg.hi16
+ ; GCN-NEXT: [[V_XOR_B32_e64_:%[0-9]+]]:vgpr_32 = V_XOR_B32_e64 [[REG_SEQUENCE]], [[DEF]], implicit $exec
+ %0:vgpr_32 = IMPLICIT_DEF
+ %1:sreg_32 = COPY %0:vgpr_32
+ %2:sreg_32 = S_TRUNC_F16 %1:sreg_32, implicit $mode
+ %3:sreg_32 = S_XOR_B32 %2:sreg_32, %1:sreg_32, implicit-def $scc
+...
+
+---
+name: salu32_usedby_salu16
+body: |
+ bb.0:
+ ; GCN-LABEL: name: salu32_usedby_salu16
+ ; GCN: [[DEF:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[V_XOR_B32_e64_:%[0-9]+]]:vgpr_32 = V_XOR_B32_e64 [[DEF]], [[DEF]], implicit $exec
+ ; GCN-NEXT: [[V_TRUNC_F16_t16_e64_:%[0-9]+]]:vgpr_16 = V_TRUNC_F16_t16_e64 0, [[V_XOR_B32_e64_]].lo16, 0, 0, 0, implicit $mode, implicit $exec
+ %0:vgpr_32 = IMPLICIT_DEF
+ %1:sreg_32 = COPY %0:vgpr_32
+ %2:sreg_32 = S_XOR_B32 %1:sreg_32, %1:sreg_32, implicit-def $scc
+ %3:sreg_32 = S_TRUNC_F16 %2:sreg_32, implicit $mode
+...
+
+---
+name: sgpr16_to_spgr32
+body: |
+ bb.0:
+ ; GCN-LABEL: name: sgpr16_to_spgr32
+ ; GCN: [[DEF:%[0-9]+]]:vgpr_16 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:sgpr_lo16 = IMPLICIT_DEF
+ ; GCN-NEXT: [[SUBREG_TO_REG:%[0-9]+]]:vgpr_32 = SUBREG_TO_REG 0, [[DEF]], %subreg.lo16
+ ; GCN-NEXT: [[SUBREG_TO_REG1:%[0-9]+]]:vgpr_32 = SUBREG_TO_REG 0, [[DEF]], %subreg.lo16
+ ; GCN-NEXT: [[V_FMAC_F16_t16_e64_:%[0-9]+]]:vgpr_16 = V_FMAC_F16_t16_e64 0, [[SUBREG_TO_REG1]].lo16, 0, [[SUBREG_TO_REG1]].lo16, 0, [[SUBREG_TO_REG]].lo16, 0, 0, 0, implicit $mode, implicit $exec
+ %0:vgpr_16 = IMPLICIT_DEF
+ %1:sgpr_lo16 = COPY %0:vgpr_16
+ %2:sreg_32 = COPY %0:vgpr_16
+ %3:sreg_32 = COPY %1:sgpr_lo16
+ %4:sreg_32 = S_FMAC_F16 %3:sreg_32, %3:sreg_32, %2:sreg_32, implicit $mode
+...
+
+---
+name: sgpr32_to_spgr16
+body: |
+ bb.0:
+ ; GCN-LABEL: name: sgpr32_to_spgr16
+ ; GCN: [[DEF:%[0-9]+]]:vgpr_16 = IMPLICIT_DEF
+ ; GCN-NEXT: [[SUBREG_TO_REG:%[0-9]+]]:vgpr_32 = SUBREG_TO_REG 0, [[DEF]], %subreg.lo16
+ ; GCN-NEXT: [[COPY:%[0-9]+]]:vgpr_16 = COPY [[SUBREG_TO_REG]]
+ ; GCN-NEXT: [[SUBREG_TO_REG1:%[0-9]+]]:vgpr_32 = SUBREG_TO_REG 0, [[COPY]], %subreg.lo16
+ ; GCN-NEXT: [[V_FMAC_F16_t16_e64_:%[0-9]+]]:vgpr_16 = V_FMAC_F16_t16_e64 0, [[SUBREG_TO_REG1]].lo16, 0, [[SUBREG_TO_REG1]].lo16, 0, [[SUBREG_TO_REG]].lo16, 0, 0, 0, implicit $mode, implicit $exec
+ %0:vgpr_16 = IMPLICIT_DEF
+ %1:sreg_32 = COPY %0:vgpr_16
+ %2:sgpr_lo16 = COPY %1:sreg_32
+ %3:sreg_32 = COPY %2:sgpr_lo16
+ %4:sreg_32 = S_FMAC_F16 %3:sreg_32, %3:sreg_32, %1:sreg_32, implicit $mode
+...
+
---
name: vgpr16_to_spgr32
body: |
diff --git a/llvm/test/CodeGen/AMDGPU/frem.ll b/llvm/test/CodeGen/AMDGPU/frem.ll
index 125d009429cbf..872b3afb4fa70 100644
--- a/llvm/test/CodeGen/AMDGPU/frem.ll
+++ b/llvm/test/CodeGen/AMDGPU/frem.ll
@@ -6,7 +6,8 @@
; RUN: llc -amdgpu-scalarize-global-loads=false -enable-misched=0 -mtriple=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck --check-prefix=GFX10 %s
; RUN: llc -amdgpu-scalarize-global-loads=false -enable-misched=0 -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -verify-machineinstrs < %s | FileCheck --check-prefixes=GFX11,GFX11-TRUE16 %s
; RUN: llc -amdgpu-scalarize-global-loads=false -enable-misched=0 -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -verify-machineinstrs < %s | FileCheck --check-prefixes=GFX11,GFX11-FAKE16 %s
-; RUN: llc -amdgpu-scalarize-global-loads=false -enable-misched=0 -mtriple=amdgcn -mcpu=gfx1150 -verify-machineinstrs < %s | FileCheck --check-prefix=GFX1150 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -enable-misched=0 -mtriple=amdgcn -mcpu=gfx1150 -mattr=+real-true16 -verify-machineinstrs < %s | FileCheck --check-prefixes=GFX1150,GFX1150-TRUE16 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -enable-misched=0 -mtriple=amdgcn -mcpu=gfx1150 -mattr=-real-true16 -verify-machineinstrs < %s | FileCheck --check-prefixes=GFX1150,GFX1150-FAKE16 %s
define amdgpu_kernel void @frem_f16(ptr addrspace(1) %out, ptr addrspace(1) %in1,
; SI-LABEL: frem_f16:
@@ -255,42 +256,81 @@ define amdgpu_kernel void @frem_f16(ptr addrspace(1) %out, ptr addrspace(1) %in1
; GFX11-FAKE16-NEXT: global_store_b16 v0, v1, s[0:1]
; GFX11-FAKE16-NEXT: s_endpgm
;
-; GFX1150-LABEL: frem_f16:
-; GFX1150: ; %bb.0:
-; GFX1150-NEXT: s_clause 0x1
-; GFX1150-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX1150-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
-; GFX1150-NEXT: v_mov_b32_e32 v0, 0
-; GFX1150-NEXT: s_waitcnt lgkmcnt(0)
-; GFX1150-NEXT: s_clause 0x1
-; GFX1150-NEXT: global_load_u16 v1, v0, s[2:3]
-; GFX1150-NEXT: global_load_u16 v2, v0, s[4:5] offset:8
-; GFX1150-NEXT: s_waitcnt vmcnt(1)
-; GFX1150-NEXT: v_cvt_f32_f16_e32 v3, v1
-; GFX1150-NEXT: s_waitcnt vmcnt(0)
-; GFX1150-NEXT: v_cvt_f32_f16_e32 v4, v2
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(TRANS32_DEP_1)
-; GFX1150-NEXT: v_rcp_f32_e32 v4, v4
-; GFX1150-NEXT: v_mul_f32_e32 v3, v3, v4
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_fma_mix_f32 v5, -v2, v3, v1 op_sel_hi:[1,0,1]
-; GFX1150-NEXT: v_fmac_f32_e32 v3, v5, v4
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_fma_mix_f32 v5, -v2, v3, v1 op_sel_hi:[1,0,1]
-; GFX1150-NEXT: v_mul_f32_e32 v4, v5, v4
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_and_b32_e32 v4, 0xff800000, v4
-; GFX1150-NEXT: v_add_f32_e32 v3, v4, v3
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_cvt_f16_f32_e32 v3, v3
-; GFX1150-NEXT: v_div_fixup_f16 v3, v3, v2, v1
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_trunc_f16_e32 v3, v3
-; GFX1150-NEXT: v_xor_b32_e32 v3, 0x8000, v3
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1150-NEXT: v_fmac_f16_e32 v1, v3, v2
-; GFX1150-NEXT: global_store_b16 v0, v1, s[0:1]
-; GFX1150-NEXT: s_endpgm
+; GFX1150-TRUE16-LABEL: frem_f16:
+; GFX1150-TRUE16: ; %bb.0:
+; GFX1150-TRUE16-NEXT: s_clause 0x1
+; GFX1150-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX1150-TRUE16-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
+; GFX1150-TRUE16-NEXT: v_mov_b32_e32 v2, 0
+; GFX1150-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1150-TRUE16-NEXT: s_clause 0x1
+; GFX1150-TRUE16-NEXT: global_load_d16_b16 v0, v2, s[2:3]
+; GFX1150-TRUE16-NEXT: global_load_d16_b16 v1, v2, s[4:5] offset:8
+; GFX1150-TRUE16-NEXT: s_waitcnt vmcnt(1)
+; GFX1150-TRUE16-NEXT: v_cvt_f32_f16_e32 v3, v0.l
+; GFX1150-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX1150-TRUE16-NEXT: v_cvt_f32_f16_e32 v4, v1.l
+; GFX1150-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX1150-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(TRANS32_DEP_1)
+; GFX1150-TRUE16-NEXT: v_rcp_f32_e32 v4, v4
+; GFX1150-TRUE16-NEXT: v_mul_f32_e32 v3, v3, v4
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_fma_mix_f32 v7, -v5, v3, v6 op_sel_hi:[1,0,1]
+; GFX1150-TRUE16-NEXT: v_fmac_f32_e32 v3, v7, v4
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_fma_mix_f32 v5, -v5, v3, v6 op_sel_hi:[1,0,1]
+; GFX1150-TRUE16-NEXT: v_mul_f32_e32 v4, v5, v4
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_and_b32_e32 v4, 0xff800000, v4
+; GFX1150-TRUE16-NEXT: v_add_f32_e32 v3, v4, v3
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.h, v3
+; GFX1150-TRUE16-NEXT: v_div_fixup_f16 v0.h, v0.h, v1.l, v0.l
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_trunc_f16_e32 v3.l, v0.h
+; GFX1150-TRUE16-NEXT: v_xor_b32_e32 v3, 0x8000, v3
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_fmac_f16_e32 v0.l, v3.l, v1.l
+; GFX1150-TRUE16-NEXT: global_store_b16 v2, v0, s[0:1]
+; GFX1150-TRUE16-NEXT: s_endpgm
+;
+; GFX1150-FAKE16-LABEL: frem_f16:
+; GFX1150-FAKE16: ; %bb.0:
+; GFX1150-FAKE16-NEXT: s_clause 0x1
+; GFX1150-FAKE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX1150-FAKE16-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
+; GFX1150-FAKE16-NEXT: v_mov_b32_e32 v0, 0
+; GFX1150-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1150-FAKE16-NEXT: s_clause 0x1
+; GFX1150-FAKE16-NEXT: global_load_u16 v1, v0, s[2:3]
+; GFX1150-FAKE16-NEXT: global_load_u16 v2, v0, s[4:5] offset:8
+; GFX1150-FAKE16-NEXT: s_waitcnt vmcnt(1)
+; GFX1150-FAKE16-NEXT: v_cvt_f32_f16_e32 v3, v1
+; GFX1150-FAKE16-NEXT: s_waitcnt vmcnt(0)
+; GFX1150-FAKE16-NEXT: v_cvt_f32_f16_e32 v4, v2
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(TRANS32_DEP_1)
+; GFX1150-FAKE16-NEXT: v_rcp_f32_e32 v4, v4
+; GFX1150-FAKE16-NEXT: v_mul_f32_e32 v3, v3, v4
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_fma_mix_f32 v5, -v2, v3, v1 op_sel_hi:[1,0,1]
+; GFX1150-FAKE16-NEXT: v_fmac_f32_e32 v3, v5, v4
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_fma_mix_f32 v5, -v2, v3, v1 op_sel_hi:[1,0,1]
+; GFX1150-FAKE16-NEXT: v_mul_f32_e32 v4, v5, v4
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_and_b32_e32 v4, 0xff800000, v4
+; GFX1150-FAKE16-NEXT: v_add_f32_e32 v3, v4, v3
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_cvt_f16_f32_e32 v3, v3
+; GFX1150-FAKE16-NEXT: v_div_fixup_f16 v3, v3, v2, v1
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_trunc_f16_e32 v3, v3
+; GFX1150-FAKE16-NEXT: v_xor_b32_e32 v3, 0x8000, v3
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_fmac_f16_e32 v1, v3, v2
+; GFX1150-FAKE16-NEXT: global_store_b16 v0, v1, s[0:1]
+; GFX1150-FAKE16-NEXT: s_endpgm
ptr addrspace(1) %in2) #0 {
%gep2 = getelementptr half, ptr addrspace(1) %in2, i32 4
%r0 = load half, ptr addrspace(1) %in1, align 4
@@ -456,26 +496,47 @@ define amdgpu_kernel void @fast_frem_f16(ptr addrspace(1) %out, ptr addrspace(1)
; GFX11-FAKE16-NEXT: global_store_b16 v0, v1, s[0:1]
; GFX11-FAKE16-NEXT: s_endpgm
;
-; GFX1150-LABEL: fast_frem_f16:
-; GFX1150: ; %bb.0:
-; GFX1150-NEXT: s_clause 0x1
-; GFX1150-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX1150-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
-; GFX1150-NEXT: v_mov_b32_e32 v0, 0
-; GFX1150-NEXT: s_waitcnt lgkmcnt(0)
-; GFX1150-NEXT: s_clause 0x1
-; GFX1150-NEXT: global_load_u16 v1, v0, s[2:3]
-; GFX1150-NEXT: global_load_u16 v2, v0, s[4:5] offset:8
-; GFX1150-NEXT: s_waitcnt vmcnt(0)
-; GFX1150-NEXT: v_rcp_f16_e32 v3, v2
-; GFX1150-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_mul_f16_e32 v3, v1, v3
-; GFX1150-NEXT: v_trunc_f16_e32 v3, v3
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_xor_b32_e32 v3, 0x8000, v3
-; GFX1150-NEXT: v_fmac_f16_e32 v1, v3, v2
-; GFX1150-NEXT: global_store_b16 v0, v1, s[0:1]
-; GFX1150-NEXT: s_endpgm
+; GFX1150-TRUE16-LABEL: fast_frem_f16:
+; GFX1150-TRUE16: ; %bb.0:
+; GFX1150-TRUE16-NEXT: s_clause 0x1
+; GFX1150-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX1150-TRUE16-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
+; GFX1150-TRUE16-NEXT: v_mov_b32_e32 v2, 0
+; GFX1150-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1150-TRUE16-NEXT: s_clause 0x1
+; GFX1150-TRUE16-NEXT: global_load_d16_b16 v0, v2, s[2:3]
+; GFX1150-TRUE16-NEXT: global_load_d16_hi_b16 v0, v2, s[4:5] offset:8
+; GFX1150-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX1150-TRUE16-NEXT: v_rcp_f16_e32 v1.l, v0.h
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_mul_f16_e32 v1.l, v0.l, v1.l
+; GFX1150-TRUE16-NEXT: v_trunc_f16_e32 v1.l, v1.l
+; GFX1150-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-TRUE16-NEXT: v_xor_b32_e32 v1, 0x8000, v1
+; GFX1150-TRUE16-NEXT: v_fmac_f16_e32 v0.l, v1.l, v0.h
+; GFX1150-TRUE16-NEXT: global_store_b16 v2, v0, s[0:1]
+; GFX1150-TRUE16-NEXT: s_endpgm
+;
+; GFX1150-FAKE16-LABEL: fast_frem_f16:
+; GFX1150-FAKE16: ; %bb.0:
+; GFX1150-FAKE16-NEXT: s_clause 0x1
+; GFX1150-FAKE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX1150-FAKE16-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
+; GFX1150-FAKE16-NEXT: v_mov_b32_e32 v0, 0
+; GFX1150-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1150-FAKE16-NEXT: s_clause 0x1
+; GFX1150-FAKE16-NEXT: global_load_u16 v1, v0, s[2:3]
+; GFX1150-FAKE16-NEXT: global_load_u16 v2, v0, s[4:5] offset:8
+; GFX1150-FAKE16-NEXT: s_waitcnt vmcnt(0)
+; GFX1150-FAKE16-NEXT: v_rcp_f16_e32 v3, v2
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_mul_f16_e32 v3, v1, v3
+; GFX1150-FAKE16-NEXT: v_trunc_f16_e32 v3, v3
+; GFX1150-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1150-FAKE16-NEXT: v_xor_b32_e32 v3, 0x8000, v3
+; GFX1150-FAKE16-NEXT: v_fmac_f16_e32 v1, v3, v2
+; GFX1150-FAKE16-NEXT: global_store_b16 v0, v1, s[0:1]
+; GFX1150-FAKE16-NEXT: s_endpgm
ptr addrspace(1) %in2) #0 {
%gep2 = getelementptr half, ptr addrspace(1) %in2, i32 4
%r0 = load half, ptr addrspace(1) %in1, align 4
@@ -641,26 +702,47 @@ define amdgpu_kernel void @unsafe_frem_f16(ptr addrspace(1) %out, ptr addrspace(
; GFX11-FAKE16-NEXT: global_store_b16 v0, v1, s[0:1]
; GFX11-FAKE16-NEXT: s_endpgm
;
-; GFX1150-LABEL: unsafe_frem_f16:
-; GFX1150: ; %bb.0:
-; GFX1150-NEXT: s_clause 0x1
-; GFX1150-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX1150-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
-; GFX1150-NEXT: v_mov_b32_e32 v0, 0
-; GFX1150-NEXT: s_waitcnt lgkmcnt(0)
-; GFX1150-NEXT: s_clause 0x1
-; GFX1150-NEXT: global_load_u16 v1, v0, s[2:3]
-; GFX1150-NEXT: global_load_u16 v2, v0, s[4:5] offset:8
-; GFX1150-NEXT: s_waitcnt vmcnt(0)
-; GFX1150-NEXT: v_rcp_f16_e32 v3, v2
-; GFX1150-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_mul_f16_e32 v3, v1, v3
-; GFX1150-NEXT: v_trunc_f16_e32 v3, v3
-; GFX1150-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1150-NEXT: v_xor_b32_e32 v3, 0x8000, v3
-; GFX1150-NEXT: v_fmac_f16_e32 v1, v3, v2
-; GFX1150-NEXT: global_store_b16 v0, v1, s[0:1]
-; GFX1150-NEXT: s_endpgm
+; GFX1150-TRUE16-LABEL: unsafe_frem_f16:
+; GFX1150-TRUE16: ; %bb.0:
+; GFX1150-TRUE16-NEXT: s_clause 0x1
+; GFX1150-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX1150-TRUE16-NEXT: s_load_...
[truncated]
CI error is not related.
rebased
    // legalize useMI with mismatched size
    for (MachineRegisterInfo::use_iterator I = MRI.use_begin(NewDstReg),
                                           E = MRI.use_end();
         I != E; ++I) {
      MachineInstr &UseMI = *I->getParent();
      unsigned UseMIOpcode = UseMI.getOpcode();
      if (AMDGPU::isTrue16Inst(UseMIOpcode) &&
          (16 ==
           RI.getRegSizeInBits(*getOpRegClass(UseMI, I.getOperandNo())))) {
        I->setSubReg(AMDGPU::lo16);
      }
    }
I don't understand this. We shouldn't really need an extra use-list scan here, should we?
Two things:

1. We mainly have a problem with the replaceRegWith() call. If there is no size mismatch, we can simply replace the register with the equivalent VGPR class. But when sizes mismatch, we might replace a 16-bit register with a 32-bit one (or vice versa), so we need to traverse the use list and fix up the sizes accordingly.
2. We replace COPY-like instructions first and then process their users in sequence. For a multi-operand t16 instruction, a user can therefore be lowered before all of its operands have been processed, e.g.:

(1) %0:vgpr_16 = IMPLICIT_DEF
(2) %1:sgpr_lo16 = COPY %0:vgpr_16
(3) %2:sreg_32 = COPY %0:vgpr_16
(4) %3:sreg_32 = COPY %1:sgpr_lo16
(5) %4:sreg_32 = S_FMAC_F16 %3:sreg_32, %3:sreg_32, %2:sreg_32, implicit $mode
The lowering order is (3) -> (2) -> (5) -> (4), so lowering (4) hits the problem of feeding a VGPR32 into a t16 instruction. Thus we need to re-check the users when we lower (4).
There are multiple ways to fix this. What I am currently doing is adding a use-list check when we process a mismatched-size instruction or a t16 instruction in the moveToVALU process.
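
To make the failure mode concrete, here is a rough sketch of the intermediate state (register names reused from the example above; not verbatim pass output):

    ; after (5) is lowered but before (4), the t16 FMAC still reads %3:
    %4:vgpr_16 = V_FMAC_F16_t16_e64 0, %3, 0, %3, 0, %2.lo16, 0, 0, 0, implicit $mode, implicit $exec
    ; lowering (4) then turns %3 into a 32-bit VGPR; the extra use-list scan
    ; patches the 16-bit operands so that they read the low half:
    %4:vgpr_16 = V_FMAC_F16_t16_e64 0, %3.lo16, 0, %3.lo16, 0, %2.lo16, 0, 0, 0, implicit $mode, implicit $exec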
Instead of trying to hack up the uses, just insert whatever copies are necessary to make the types match. You shouldn't need to do any use analysis.
Hi Matt, I don't quite understand how the copies would help here. These mismatches are mainly caused by a 16-bit SALU result being used by a 32-bit SALU, or the reverse. As SGPRs they are fine, but once moved to the VALU they become a problem. Thus I think a use analysis is necessary here.
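
For instance, a minimal sketch of a 16-bit SALU feeding a 32-bit SALU (this is essentially the salu16_usedby_salu32 test from the diff above):

    %1:sreg_32 = COPY %0:vgpr_32
    %2:sreg_32 = S_TRUNC_F16 %1:sreg_32, implicit $mode               ; 16-bit SALU, sgpr32 dst
    %3:sreg_32 = S_XOR_B32 %2:sreg_32, %1:sreg_32, implicit-def $scc  ; 32-bit SALU user
    ; after moveToVALU, %2 becomes a vgpr_16 def (V_TRUNC_F16_t16) while
    ; V_XOR_B32 needs a vgpr_32 input, so the operand sizes must be legalized.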
We do the same use-list scan where we process general instructions. The additional one here is needed because this is an early-return routine that processes mismatched-size copies.
Is there a way to fix this with iteration order? If (5) is indeed processed by moveToVALUImpl before all of its operands, we need to do the use-list scan whenever we replace any register that could be used in a true16 instruction. But if we could guarantee that the operands are processed before the useMI, then I think the call to legalizeOperandsVALUt16 would fix up these cases.
Changing the iteration order would require rewriting the fix-sgpr-copies sequence, and it might take a while to address all the side effects. I think the use analysis is necessary, and I found that there is a cleaner solution: we can move the use analysis into addUsersToMoveToVALUWorklist to avoid the redundant scans and code.
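
A hedged C++ sketch of that direction, assuming the existing addUsersToMoveToVALUWorklist signature; this illustrates the idea rather than reproducing the exact code that landed:

    // While queueing the users of a register that was just switched to a VGPR,
    // also fix up any true16 user whose 16-bit operand slot now sees a 32-bit
    // register, by folding in a lo16 subregister access.
    void SIInstrInfo::addUsersToMoveToVALUWorklist(
        Register DstReg, MachineRegisterInfo &MRI,
        SIInstrWorklist &Worklist) const {
      for (MachineOperand &UseOp : MRI.use_operands(DstReg)) {
        MachineInstr &UseMI = *UseOp.getParent();
        if (AMDGPU::isTrue16Inst(UseMI.getOpcode()) &&
            RI.getRegSizeInBits(*getOpRegClass(UseMI, UseOp.getOperandNo())) ==
                16)
          UseOp.setSubReg(AMDGPU::lo16);
        Worklist.insert(&UseMI);
      }
    }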
✅ With the latest revision this PR passed the C/C++ code formatter.
Thanks, it looks a lot better! Besides my small new comments LGTM.
63ad0f3
to
30cbdd6
Compare
LLVM Buildbot has detected a new failure on a builder. Full details, including the relevant piece of the build log, are available at: https://lab.llvm.org/buildbot/#/builders/206/builds/1304
Two changes in this patch:

1. Legalize the 16-bit and 32-bit use-def chain for moveToVALU in the si-fix-sgpr-copies lowering (the SIInstrInfo.cpp changes above).
2. Turn on the frem test with true16 mode for gfx1150, which was failing before this patch. A few bitcast tests are also affected by this change, with some v_mov instructions replaced by dual movs.
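
For reference, the dual-mov replacement mentioned above looks roughly like this in gfx11 assembly (a hedged illustration, not output copied from the updated tests):

    ; before: two independent 32-bit moves
    v_mov_b32_e32 v0, v2
    v_mov_b32_e32 v1, v3
    ; after: one dual-issued VOPD instruction encodes both moves
    v_dual_mov_b32 v0, v2 :: v_dual_mov_b32 v1, v3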