[AArch64][ISel] Select constructive EXT_ZZZI pseudo instruction #152554
base: users/gbossu.vector.extract.movprfx.2
Conversation
The patch changes existing patterns to select the EXT_ZZZI pseudo instead of the EXT_ZZI destructive instruction for vector_splice. Given that registers aren't tied anymore, this gives the register allocator more freedom and a lot of MOVs get replaced with MOVPRFX. In some cases however, we could have just chosen the same input and output register, but regalloc preferred not to. This means we end up with some test cases now having more instructions: there is now a MOVPRFX while no MOV was previously needed.
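For illustration, the typical before/after codegen (taken from the extract_v32i8_halves test below) looks like:

    // Before: EXT_ZZI is destructive, the first source is tied to the
    // destination, so a full vector copy is needed while z0 is still live.
    mov z1.d, z0.d
    ext z1.b, z1.b, z0.b, #16

    // After: the EXT_ZZZI pseudo is untied; when regalloc picks a different
    // destination register, the expansion emits a movprfx instead of a mov.
    movprfx z1, z0
    ext z1.b, z1.b, z0.b, #16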
@llvm/pr-subscribers-backend-aarch64

Author: Gaëtan Bossu (gbossu)

Changes

The patch changes existing patterns to select the EXT_ZZZI pseudo instead of the EXT_ZZI destructive instruction for vector_splice. Given that registers aren't tied anymore, this gives the register allocator more freedom and a lot of MOVs get replaced with MOVPRFX. In some cases however, we could have just chosen the same input and output register, but regalloc preferred not to. This means we end up with some test cases now having more instructions: there is now a MOVPRFX while no MOV was previously needed.

Patch is 154.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152554.diff

21 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp b/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp
index cdf2822f3ed9d..b7d69b68af4ee 100644
--- a/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp
+++ b/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp
@@ -53,9 +53,6 @@ bool AArch64PostCoalescer::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(MF.getFunction()))
return false;
- AArch64FunctionInfo *FuncInfo = MF.getInfo<AArch64FunctionInfo>();
- if (!FuncInfo->hasStreamingModeChanges())
- return false;
MRI = &MF.getRegInfo();
LIS = &getAnalysis<LiveIntervalsWrapperPass>().getLIS();
@@ -86,6 +83,13 @@ bool AArch64PostCoalescer::runOnMachineFunction(MachineFunction &MF) {
Changed = true;
break;
}
+ case AArch64::EXT_ZZZI:
+ Register DstReg = MI.getOperand(0).getReg();
+ Register SrcReg1 = MI.getOperand(1).getReg();
+ if (SrcReg1 != DstReg) {
+ MRI->setRegAllocationHint(DstReg, 0, SrcReg1);
+ }
+ break;
}
}
}
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 85e647af6684c..a3ca0cb73cd43 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -2135,19 +2135,19 @@ let Predicates = [HasSVE_or_SME] in {
// Splice with lane bigger or equal to 0
foreach VT = [nxv16i8] in
def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_255 i32:$index)))),
- (EXT_ZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+ (EXT_ZZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
foreach VT = [nxv8i16, nxv8f16, nxv8bf16] in
def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_127 i32:$index)))),
- (EXT_ZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+ (EXT_ZZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
foreach VT = [nxv4i32, nxv4f16, nxv4f32, nxv4bf16] in
def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_63 i32:$index)))),
- (EXT_ZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+ (EXT_ZZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
foreach VT = [nxv2i64, nxv2f16, nxv2f32, nxv2f64, nxv2bf16] in
def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_31 i32:$index)))),
- (EXT_ZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+ (EXT_ZZZI ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
defm CMPHS_PPzZZ : sve_int_cmp_0<0b000, "cmphs", SETUGE, SETULE>;
defm CMPHI_PPzZZ : sve_int_cmp_0<0b001, "cmphi", SETUGT, SETULT>;
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
index 800f95d97af4c..7b438743487e1 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
@@ -50,7 +50,7 @@ define void @extract_v32i8_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
; CHECK-LABEL: extract_v32i8_halves:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: str q1, [x1]
; CHECK-NEXT: str q0, [x2]
@@ -68,7 +68,7 @@ define void @extract_v32i8_half_unaligned(ptr %in, ptr %out) #0 vscale_range(2,2
; CHECK-LABEL: extract_v32i8_half_unaligned:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #4
; CHECK-NEXT: str q0, [x1]
@@ -84,15 +84,16 @@ define void @extract_v32i8_quarters(ptr %in, ptr %out, ptr %out2, ptr %out3, ptr
; CHECK-LABEL: extract_v32i8_quarters:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
-; CHECK-NEXT: mov z2.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
+; CHECK-NEXT: movprfx z2, z0
; CHECK-NEXT: ext z2.b, z2.b, z0.b, #24
+; CHECK-NEXT: movprfx z3, z0
+; CHECK-NEXT: ext z3.b, z3.b, z0.b, #8
; CHECK-NEXT: str d1, [x1]
; CHECK-NEXT: str d2, [x2]
; CHECK-NEXT: str d0, [x3]
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #8
-; CHECK-NEXT: str d0, [x4]
+; CHECK-NEXT: str d3, [x4]
; CHECK-NEXT: ret
entry:
%b = load <32 x i8>, ptr %in
@@ -126,7 +127,7 @@ define void @extract_v64i8_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
; CHECK-NEXT: ptrue p0.b, vl32
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #32
; CHECK-NEXT: st1b { z1.b }, p0, [x1]
; CHECK-NEXT: st1b { z0.b }, p0, [x2]
@@ -207,7 +208,7 @@ define void @extract_v16i16_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range
; CHECK-LABEL: extract_v16i16_halves:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: str q1, [x1]
; CHECK-NEXT: str q0, [x2]
@@ -240,7 +241,7 @@ define void @extract_v32i16_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
; CHECK-NEXT: ptrue p0.h, vl16
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #32
; CHECK-NEXT: st1h { z1.h }, p0, [x1]
; CHECK-NEXT: st1h { z0.h }, p0, [x2]
@@ -322,7 +323,7 @@ define void @extract_v8i32_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
; CHECK-LABEL: extract_v8i32_halves:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: str q1, [x1]
; CHECK-NEXT: str q0, [x2]
@@ -355,7 +356,7 @@ define void @extract_v16i32_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
; CHECK-NEXT: ptrue p0.s, vl8
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #32
; CHECK-NEXT: st1w { z1.s }, p0, [x1]
; CHECK-NEXT: st1w { z0.s }, p0, [x2]
@@ -426,7 +427,7 @@ define void @extract_v4i64_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
; CHECK-LABEL: extract_v4i64_halves:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: str q1, [x1]
; CHECK-NEXT: str q0, [x2]
@@ -459,7 +460,7 @@ define void @extract_v8i64_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
; CHECK-NEXT: ptrue p0.d, vl4
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #32
; CHECK-NEXT: st1d { z1.d }, p0, [x1]
; CHECK-NEXT: st1d { z0.d }, p0, [x2]
@@ -553,7 +554,7 @@ define void @extract_v16half_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_rang
; CHECK-LABEL: extract_v16half_halves:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: str q1, [x1]
; CHECK-NEXT: str q0, [x2]
@@ -586,7 +587,7 @@ define void @extract_v32half_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_rang
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
; CHECK-NEXT: ptrue p0.h, vl16
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #32
; CHECK-NEXT: st1h { z1.h }, p0, [x1]
; CHECK-NEXT: st1h { z0.h }, p0, [x2]
@@ -668,7 +669,7 @@ define void @extract_v8float_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_rang
; CHECK-LABEL: extract_v8float_halves:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: str q1, [x1]
; CHECK-NEXT: str q0, [x2]
@@ -701,7 +702,7 @@ define void @extract_v16float_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_ran
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
; CHECK-NEXT: ptrue p0.s, vl8
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #32
; CHECK-NEXT: st1w { z1.s }, p0, [x1]
; CHECK-NEXT: st1w { z0.s }, p0, [x2]
@@ -772,7 +773,7 @@ define void @extract_v4double_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_ran
; CHECK-LABEL: extract_v4double_halves:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: str q1, [x1]
; CHECK-NEXT: str q0, [x2]
@@ -805,7 +806,7 @@ define void @extract_v8double_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_ran
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr z0, [x0]
; CHECK-NEXT: ptrue p0.d, vl4
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #32
; CHECK-NEXT: st1d { z1.d }, p0, [x1]
; CHECK-NEXT: st1d { z0.d }, p0, [x2]
@@ -908,7 +909,7 @@ define void @extract_subvector_legalization_v8i32() vscale_range(2,2) #0 {
; CHECK-NEXT: add x8, x8, :lo12:.LCPI59_0
; CHECK-NEXT: ptrue p1.d
; CHECK-NEXT: ldr z0, [x8]
-; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: movprfx z1, z0
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #16
; CHECK-NEXT: cmeq v0.4s, v0.4s, #0
; CHECK-NEXT: cmeq v1.4s, v1.4s, #0
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll
index af54b146c5b66..c8f6d98f5a63f 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll
@@ -150,13 +150,14 @@ define void @fcvtzu_v16f16_v16i32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: mov x8, #8 // =0x8
; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.s, vl8
-; VBITS_GE_256-NEXT: uunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: uunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT: fcvtzu z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT: uunpklo z1.s, z1.h
; VBITS_GE_256-NEXT: fcvtzu z0.s, p0/m, z0.h
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT: fcvtzu z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1, x8, lsl #2]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fcvtzu_v16f16_v16i32:
@@ -551,13 +552,14 @@ define void @fcvtzu_v8f32_v8i64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: mov x8, #4 // =0x4
; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.d, vl4
-; VBITS_GE_256-NEXT: uunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: uunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT: fcvtzu z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT: uunpklo z1.d, z1.s
; VBITS_GE_256-NEXT: fcvtzu z0.d, p0/m, z0.s
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT: fcvtzu z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1, x8, lsl #3]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fcvtzu_v8f32_v8i64:
@@ -1043,13 +1045,14 @@ define void @fcvtzs_v16f16_v16i32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: mov x8, #8 // =0x8
; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.s, vl8
-; VBITS_GE_256-NEXT: uunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: uunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT: fcvtzs z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT: uunpklo z1.s, z1.h
; VBITS_GE_256-NEXT: fcvtzs z0.s, p0/m, z0.h
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT: fcvtzs z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1, x8, lsl #2]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fcvtzs_v16f16_v16i32:
@@ -1444,13 +1447,14 @@ define void @fcvtzs_v8f32_v8i64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: mov x8, #4 // =0x4
; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.d, vl4
-; VBITS_GE_256-NEXT: uunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: uunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT: fcvtzs z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT: uunpklo z1.d, z1.s
; VBITS_GE_256-NEXT: fcvtzs z0.d, p0/m, z0.s
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT: fcvtzs z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1, x8, lsl #3]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fcvtzs_v8f32_v8i64:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll
index 4feb86305f8f6..d2fa65599b973 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll
@@ -77,11 +77,12 @@ define void @sext_v32i8_v32i16(ptr %in, ptr %out) #0 {
; VBITS_GE_256-NEXT: ld1b { z0.b }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.h, vl16
; VBITS_GE_256-NEXT: add z0.b, z0.b, z0.b
-; VBITS_GE_256-NEXT: sunpklo z1.h, z0.b
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: sunpklo z0.h, z0.b
-; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x1]
-; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x1, x8, lsl #1]
+; VBITS_GE_256-NEXT: sunpklo z1.h, z1.b
+; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x1]
+; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x1, x8, lsl #1]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: sext_v32i8_v32i16:
@@ -326,11 +327,12 @@ define void @sext_v16i16_v16i32(ptr %in, ptr %out) #0 {
; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.s, vl8
; VBITS_GE_256-NEXT: add z0.h, z0.h, z0.h
-; VBITS_GE_256-NEXT: sunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: sunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT: sunpklo z1.s, z1.h
+; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1, x8, lsl #2]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: sext_v16i16_v16i32:
@@ -490,11 +492,12 @@ define void @sext_v8i32_v8i64(ptr %in, ptr %out) #0 {
; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.d, vl4
; VBITS_GE_256-NEXT: add z0.s, z0.s, z0.s
-; VBITS_GE_256-NEXT: sunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: sunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT: sunpklo z1.d, z1.s
+; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1, x8, lsl #3]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: sext_v8i32_v8i64:
@@ -573,11 +576,12 @@ define void @zext_v32i8_v32i16(ptr %in, ptr %out) #0 {
; VBITS_GE_256-NEXT: ld1b { z0.b }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.h, vl16
; VBITS_GE_256-NEXT: add z0.b, z0.b, z0.b
-; VBITS_GE_256-NEXT: uunpklo z1.h, z0.b
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: uunpklo z0.h, z0.b
-; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x1]
-; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x1, x8, lsl #1]
+; VBITS_GE_256-NEXT: uunpklo z1.h, z1.b
+; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x1]
+; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x1, x8, lsl #1]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: zext_v32i8_v32i16:
@@ -822,11 +826,12 @@ define void @zext_v16i16_v16i32(ptr %in, ptr %out) #0 {
; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.s, vl8
; VBITS_GE_256-NEXT: add z0.h, z0.h, z0.h
-; VBITS_GE_256-NEXT: uunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: uunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT: uunpklo z1.s, z1.h
+; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1, x8, lsl #2]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: zext_v16i16_v16i32:
@@ -986,11 +991,12 @@ define void @zext_v8i32_v8i64(ptr %in, ptr %out) #0 {
; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ptrue p0.d, vl4
; VBITS_GE_256-NEXT: add z0.s, z0.s, z0.s
-; VBITS_GE_256-NEXT: uunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT: movprfx z1, z0
+; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
; VBITS_GE_256-NEXT: uunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT: uunpklo z1.d, z1.s
+; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1, x8, lsl #3]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: zext_v8i32_v8i64:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll
index 2d78945399176..27be84419d59e 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll
@@ -259,17 +259,17 @@ define void @srem_v256i8(ptr %a, ptr %b) vscale_range(16,0) #0 {
; CHECK-NEXT: sunpklo z2.s, z2.h
; CHECK-NEXT: sunpklo z3.s, z3.h
; CHECK-NEXT: sdivr z4.s, p1/m, z4.s, z5.s
-; CHECK-NEXT: mov z5.d, z0.d
+; CHECK-NEXT: movprfx z5, z0
; CHECK-NEXT: ext z5.b, z5.b, z0.b, #128
; CHECK-NEXT: sunpklo z5.h, z5.b
; CHECK-NEXT: sunpklo z7.s, z5.h
; CHECK-NEXT: ext z5.b, z5.b, z5.b, #128
-; CHECK-NEXT: sdivr z2.s, p1/m, z2.s, z3.s
-; CHECK-NEXT: mov z3.d, z1.d
; CHECK-NEXT: sunpklo z5.s, z5.h
+; CHECK-NEXT: sdivr z2.s, p1/m, z2.s, z3.s
+; CHECK-NEXT: movprfx z3, z1
; CHECK-NEXT: ext z3.b, z3.b, z1.b, #128
-; CHECK-NEXT: uzp1 z4.h, z4.h, z4.h
; CHECK-NEXT: sunpklo z3.h, z3.b
+; CHECK-NEXT: uzp1 z4.h, z4.h, z4.h
; CHECK-NEXT: sunpklo z6.s, z3.h
; CHECK-NEXT: ext z3.b, z3.b, z3.b, #128
; CHECK-NEXT: sunpklo z3.s, z3.h
@@ -420,11 +420,11 @@ define void @srem_v16i16(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1h { z1.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: sunpklo z2.s, z1.h
; VBITS_GE_256-NEXT: sunpklo z3.s, z0.h
-; VBITS_GE_256-NEXT: mov z4.d, z0.d
+; VBITS_GE_256-NEXT: movprfx z4, z0
; VBITS_GE_256-NEXT: ext z4...
[truncated]
if (SrcReg1 != DstReg) {
  MRI->setRegAllocationHint(DstReg, 0, SrcReg1);
}
break;
Note that this commit is really just a WIP to show we can slightly improve codegen with some hints. I'm not sure it should remain in that PR.
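To illustrate what the hint buys us, a minimal sketch (hypothetical register assignment, assuming the first source is dead after the ext so regalloc is free to reuse it):

    // Hint not honoured: dst != src1, so the EXT_ZZZI pseudo expands to
    // movprfx + destructive ext.
    movprfx z1, z0
    ext z1.b, z1.b, z2.b, #8

    // Hint honoured: dst == src1, so the pseudo should expand to just the
    // destructive ext, with no movprfx/mov at all.
    ext z0.b, z0.b, z2.b, #8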
; CHECK-NEXT: ext z1.b, z1.b, z0.b, #8
; CHECK-NEXT: and z1.d, z1.d, #0x1
; CHECK-NEXT: cmpne p0.d, p0/z, z1.d, #0
; CHECK-NEXT: mov z0.d, z1.d
This is one case where we get worse due to an extra MOV that could not be turned into a MOVPRFX. This is alleviated in the next commit using register hints.
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
; VBITS_GE_256-NEXT: fcvtzu z1.s, p0/m, z1.h
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1]
; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1, x8, lsl #2]
In that example, we do get one more instruction now (the movprfx), but I think the schedule is actually better because we eliminate one dependency between the ext and the second uunpklo. Now the two uunpklo instructions can execute in parallel.

This is the theme of the test updates in general: sometimes more instructions, but more freedom for the MachineScheduler.
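Concretely, for the fcvtzu_v16f16_v16i32 hunk above, the dependency chains roughly look like this (comments added for illustration):

    // Before: the second uunpklo consumed the result of the destructive ext.
    uunpklo z1.s, z0.h
    ext     z0.b, z0.b, z0.b, #16   // clobbers z0
    uunpklo z0.s, z0.h              // depends on the ext

    // After: each uunpklo reads a different register, so only one of them
    // depends on the (movprfx +) ext and they can issue in parallel.
    movprfx z1, z0
    ext     z1.b, z1.b, z0.b, #16
    uunpklo z0.s, z0.h              // independent of the ext
    uunpklo z1.s, z1.h              // depends on the ext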
This tries to ensure that the dst and first src register are mapped to the same physical register. This isn't always possible because the MachineScheduler has already moved instructions in a way that causes interference if both virtual registers get mapped to the same physical register. WIP because there is probably a better place to do this.
Force-pushed from 56ac99a to 4ad6acf.
The patch changes existing patterns to select the EXT_ZZZI pseudo
instead of the EXT_ZZI destructive instruction for vector_splice.
Given that registers aren't tied anymore, this gives the register
allocator more freedom and a lot of MOVs get replaced with MOVPRFX.
In some cases however, we could have just chosen the same input and
output register, but regalloc preferred not to. This means we end up
with some test cases now having more instructions: there is now a
MOVPRFX while no MOV was previously needed.
This is a chained PR: #152552 - #152553 - #152554