[X86] Allow znver3/4/5 targets to use double-shift instructions by default #132720
Conversation
@llvm/pr-subscribers-backend-x86 Author: Simon Pilgrim (RKSimon) Changes: While still not as fast as Intel targets, recent AMD CPUs (znver3 and later) are no longer as microcoded and bottlenecked as earlier AMD targets (double-shift instructions now have only ~2cy reciprocal throughput), which makes them an improvement over the 3*shift+not+or sequence we otherwise expand to as an alternative. Noticed while triaging #132601 Full diff: https://github.com/llvm/llvm-project/pull/132720.diff 2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 38761e1fd7eec..4909224c458ac 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -1570,7 +1570,7 @@ def ProcessorFeatures {
FeatureVPCLMULQDQ];
list<SubtargetFeature> ZN3AdditionalTuning = [TuningMacroFusion];
list<SubtargetFeature> ZN3Tuning =
- !listconcat(ZN2Tuning, ZN3AdditionalTuning);
+ !listremove(!listconcat(ZN2Tuning, ZN3AdditionalTuning), [TuningSlowSHLD]);
list<SubtargetFeature> ZN3Features =
!listconcat(ZN2Features, ZN3AdditionalFeatures);
diff --git a/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll b/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
index 58f6a66aeff79..c5e879c0135f4 100644
--- a/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
+++ b/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
@@ -12,12 +12,12 @@
; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver1 | FileCheck %s --check-prefixes=BMI
; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver2 | FileCheck %s --check-prefixes=BMI
; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver3 | FileCheck %s --check-prefixes=BMI
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver4 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver2 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver3 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver4 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver5 | FileCheck %s --check-prefixes=BMI2
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver4 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver2 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver3 | FileCheck %s --check-prefixes=BMI2-FAST
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver4 | FileCheck %s --check-prefixes=BMI2-FAST
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver5 | FileCheck %s --check-prefixes=BMI2-FAST
; Verify that for the X86_64 processors that are known to have poor latency
; double precision shift instructions we do not generate 'shld' or 'shrd'
@@ -53,15 +53,23 @@ define i64 @lshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
; BMI-NEXT: orq %rdi, %rax
; BMI-NEXT: retq
;
-; BMI2-LABEL: lshift:
-; BMI2: # %bb.0: # %entry
-; BMI2-NEXT: # kill: def $edx killed $edx def $rdx
-; BMI2-NEXT: shlxq %rdx, %rdi, %rcx
-; BMI2-NEXT: notb %dl
-; BMI2-NEXT: shrq %rsi
-; BMI2-NEXT: shrxq %rdx, %rsi, %rax
-; BMI2-NEXT: orq %rcx, %rax
-; BMI2-NEXT: retq
+; BMI2-SLOW-LABEL: lshift:
+; BMI2-SLOW: # %bb.0: # %entry
+; BMI2-SLOW-NEXT: # kill: def $edx killed $edx def $rdx
+; BMI2-SLOW-NEXT: shlxq %rdx, %rdi, %rcx
+; BMI2-SLOW-NEXT: notb %dl
+; BMI2-SLOW-NEXT: shrq %rsi
+; BMI2-SLOW-NEXT: shrxq %rdx, %rsi, %rax
+; BMI2-SLOW-NEXT: orq %rcx, %rax
+; BMI2-SLOW-NEXT: retq
+;
+; BMI2-FAST-LABEL: lshift:
+; BMI2-FAST: # %bb.0: # %entry
+; BMI2-FAST-NEXT: movl %edx, %ecx
+; BMI2-FAST-NEXT: movq %rdi, %rax
+; BMI2-FAST-NEXT: # kill: def $cl killed $cl killed $ecx
+; BMI2-FAST-NEXT: shldq %cl, %rsi, %rax
+; BMI2-FAST-NEXT: retq
entry:
%sh_prom = zext i32 %c to i64
%shl = shl i64 %a, %sh_prom
@@ -100,15 +108,23 @@ define i64 @rshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
; BMI-NEXT: orq %rdi, %rax
; BMI-NEXT: retq
;
-; BMI2-LABEL: rshift:
-; BMI2: # %bb.0: # %entry
-; BMI2-NEXT: # kill: def $edx killed $edx def $rdx
-; BMI2-NEXT: shrxq %rdx, %rdi, %rcx
-; BMI2-NEXT: notb %dl
-; BMI2-NEXT: addq %rsi, %rsi
-; BMI2-NEXT: shlxq %rdx, %rsi, %rax
-; BMI2-NEXT: orq %rcx, %rax
-; BMI2-NEXT: retq
+; BMI2-SLOW-LABEL: rshift:
+; BMI2-SLOW: # %bb.0: # %entry
+; BMI2-SLOW-NEXT: # kill: def $edx killed $edx def $rdx
+; BMI2-SLOW-NEXT: shrxq %rdx, %rdi, %rcx
+; BMI2-SLOW-NEXT: notb %dl
+; BMI2-SLOW-NEXT: addq %rsi, %rsi
+; BMI2-SLOW-NEXT: shlxq %rdx, %rsi, %rax
+; BMI2-SLOW-NEXT: orq %rcx, %rax
+; BMI2-SLOW-NEXT: retq
+;
+; BMI2-FAST-LABEL: rshift:
+; BMI2-FAST: # %bb.0: # %entry
+; BMI2-FAST-NEXT: movl %edx, %ecx
+; BMI2-FAST-NEXT: movq %rdi, %rax
+; BMI2-FAST-NEXT: # kill: def $cl killed $cl killed $ecx
+; BMI2-FAST-NEXT: shrdq %cl, %rsi, %rax
+; BMI2-FAST-NEXT: retq
entry:
%sh_prom = zext i32 %c to i64
%shr = lshr i64 %a, %sh_prom
While still not as fast as Intel targets, recent AMD CPUs (znver3 and later) are no longer as microcoded and bottlenecked as earlier AMD targets (double-shift instructions now have only ~2cy reciprocal throughput), which makes them an improvement over the 3*shift+not+or sequence we otherwise expand to as an alternative.
Noticed while triaging #132601
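For anyone wanting to reproduce the codegen difference locally, below is a minimal sketch of the double-shift pattern the updated test exercises. The function body is an assumption based on the usual shl/lshr/or idiom (the diff above only shows the first few lines of the test IR), and the llc invocations mirror the test's RUN lines:

; double-shift.ll -- illustrative standalone reduction (not the actual test file)
define i64 @lshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
entry:
  %sh_prom = zext i32 %c to i64
  %shl = shl i64 %a, %sh_prom       ; a << c
  %sub = sub nsw i32 64, %c
  %sh_prom1 = zext i32 %sub to i64
  %shr = lshr i64 %b, %sh_prom1     ; b >> (64 - c)
  %or = or i64 %shr, %shl           ; matched as a double shift
  ret i64 %or
}

; With a build containing this patch:
;   llc < double-shift.ll -mtriple=x86_64-- -mcpu=znver2   # still expands to shlxq/notb/shrq/shrxq/orq
;   llc < double-shift.ll -mtriple=x86_64-- -mcpu=znver3   # now emits a single shldq %cl, %rsi, %rax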