[X86] Allow znver3/4/5 targets to use double-shift instructions by default #132720
Conversation
@llvm/pr-subscribers-backend-x86 Author: Simon Pilgrim (RKSimon) Changes: While still not as fast as Intel targets, recent AMD CPUs (znver3 and later) are no longer as microcoded and bottlenecked as earlier AMD targets (double-shift instructions now have only ~2cy reciprocal throughput), which makes them an improvement over the 3*shift+not+or sequence we otherwise expand to as an alternative. Noticed while triaging #132601 Full diff: https://github.com/llvm/llvm-project/pull/132720.diff 2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 38761e1fd7eec..4909224c458ac 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -1570,7 +1570,7 @@ def ProcessorFeatures {
FeatureVPCLMULQDQ];
list<SubtargetFeature> ZN3AdditionalTuning = [TuningMacroFusion];
list<SubtargetFeature> ZN3Tuning =
- !listconcat(ZN2Tuning, ZN3AdditionalTuning);
+ !listremove(!listconcat(ZN2Tuning, ZN3AdditionalTuning), [TuningSlowSHLD]);
list<SubtargetFeature> ZN3Features =
!listconcat(ZN2Features, ZN3AdditionalFeatures);
diff --git a/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll b/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
index 58f6a66aeff79..c5e879c0135f4 100644
--- a/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
+++ b/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
@@ -12,12 +12,12 @@
; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver1 | FileCheck %s --check-prefixes=BMI
; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver2 | FileCheck %s --check-prefixes=BMI
; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver3 | FileCheck %s --check-prefixes=BMI
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver4 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver2 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver3 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver4 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver5 | FileCheck %s --check-prefixes=BMI2
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver4 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver2 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver3 | FileCheck %s --check-prefixes=BMI2-FAST
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver4 | FileCheck %s --check-prefixes=BMI2-FAST
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver5 | FileCheck %s --check-prefixes=BMI2-FAST
; Verify that for the X86_64 processors that are known to have poor latency
; double precision shift instructions we do not generate 'shld' or 'shrd'
@@ -53,15 +53,23 @@ define i64 @lshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
; BMI-NEXT: orq %rdi, %rax
; BMI-NEXT: retq
;
-; BMI2-LABEL: lshift:
-; BMI2: # %bb.0: # %entry
-; BMI2-NEXT: # kill: def $edx killed $edx def $rdx
-; BMI2-NEXT: shlxq %rdx, %rdi, %rcx
-; BMI2-NEXT: notb %dl
-; BMI2-NEXT: shrq %rsi
-; BMI2-NEXT: shrxq %rdx, %rsi, %rax
-; BMI2-NEXT: orq %rcx, %rax
-; BMI2-NEXT: retq
+; BMI2-SLOW-LABEL: lshift:
+; BMI2-SLOW: # %bb.0: # %entry
+; BMI2-SLOW-NEXT: # kill: def $edx killed $edx def $rdx
+; BMI2-SLOW-NEXT: shlxq %rdx, %rdi, %rcx
+; BMI2-SLOW-NEXT: notb %dl
+; BMI2-SLOW-NEXT: shrq %rsi
+; BMI2-SLOW-NEXT: shrxq %rdx, %rsi, %rax
+; BMI2-SLOW-NEXT: orq %rcx, %rax
+; BMI2-SLOW-NEXT: retq
+;
+; BMI2-FAST-LABEL: lshift:
+; BMI2-FAST: # %bb.0: # %entry
+; BMI2-FAST-NEXT: movl %edx, %ecx
+; BMI2-FAST-NEXT: movq %rdi, %rax
+; BMI2-FAST-NEXT: # kill: def $cl killed $cl killed $ecx
+; BMI2-FAST-NEXT: shldq %cl, %rsi, %rax
+; BMI2-FAST-NEXT: retq
entry:
%sh_prom = zext i32 %c to i64
%shl = shl i64 %a, %sh_prom
@@ -100,15 +108,23 @@ define i64 @rshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
; BMI-NEXT: orq %rdi, %rax
; BMI-NEXT: retq
;
-; BMI2-LABEL: rshift:
-; BMI2: # %bb.0: # %entry
-; BMI2-NEXT: # kill: def $edx killed $edx def $rdx
-; BMI2-NEXT: shrxq %rdx, %rdi, %rcx
-; BMI2-NEXT: notb %dl
-; BMI2-NEXT: addq %rsi, %rsi
-; BMI2-NEXT: shlxq %rdx, %rsi, %rax
-; BMI2-NEXT: orq %rcx, %rax
-; BMI2-NEXT: retq
+; BMI2-SLOW-LABEL: rshift:
+; BMI2-SLOW: # %bb.0: # %entry
+; BMI2-SLOW-NEXT: # kill: def $edx killed $edx def $rdx
+; BMI2-SLOW-NEXT: shrxq %rdx, %rdi, %rcx
+; BMI2-SLOW-NEXT: notb %dl
+; BMI2-SLOW-NEXT: addq %rsi, %rsi
+; BMI2-SLOW-NEXT: shlxq %rdx, %rsi, %rax
+; BMI2-SLOW-NEXT: orq %rcx, %rax
+; BMI2-SLOW-NEXT: retq
+;
+; BMI2-FAST-LABEL: rshift:
+; BMI2-FAST: # %bb.0: # %entry
+; BMI2-FAST-NEXT: movl %edx, %ecx
+; BMI2-FAST-NEXT: movq %rdi, %rax
+; BMI2-FAST-NEXT: # kill: def $cl killed $cl killed $ecx
+; BMI2-FAST-NEXT: shrdq %cl, %rsi, %rax
+; BMI2-FAST-NEXT: retq
entry:
%sh_prom = zext i32 %c to i64
%shr = lshr i64 %a, %sh_prom
While still not as fast as Intel targets, recent AMD CPUs (znver3 and later) are no longer as microcoded and bottlenecked as earlier AMD targets (double-shift instructions now have only ~2cy reciprocal throughput), which makes them an improvement over the 3*shift+not+or sequence we otherwise expand to as an alternative.
Noticed while triaging #132601
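For anyone wanting to reproduce the codegen difference locally, below is a minimal sketch of the double-shift pattern the updated test exercises. The function body is an assumption based on the usual shl/lshr/or idiom (the diff above only shows the first few lines of the test IR), and the llc invocations mirror the test's RUN lines:

; double-shift.ll -- illustrative standalone reduction (not the actual test file)
define i64 @lshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
entry:
  %sh_prom = zext i32 %c to i64
  %shl = shl i64 %a, %sh_prom       ; a << c
  %sub = sub nsw i32 64, %c
  %sh_prom1 = zext i32 %sub to i64
  %shr = lshr i64 %b, %sh_prom1     ; b >> (64 - c)
  %or = or i64 %shr, %shl           ; matched as a double shift
  ret i64 %or
}

; With a build containing this patch:
;   llc < double-shift.ll -mtriple=x86_64-- -mcpu=znver2   # still expands to shlxq/notb/shrq/shrxq/orq
;   llc < double-shift.ll -mtriple=x86_64-- -mcpu=znver3   # now emits a single shldq %cl, %rsi, %rax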