
[X86] Allow znver3/4/5 targets to use double-shift instructions by default #132720

Open
RKSimon wants to merge 3 commits into main
Conversation

RKSimon (Collaborator) commented Mar 24, 2025

While still not as fast as on Intel targets, the double-shift instructions on recent AMD CPUs (znver3 and later) are no longer as heavily microcoded and bottlenecked as on earlier AMD targets (now only ~2cy reciprocal throughput), which makes them an improvement over the 3*shift+not+or sequence we otherwise expand to as an alternative.

Noticed while triaging #132601
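
For reference, the change can be exercised in isolation with a small standalone IR file. This is an illustrative sketch, not part of the patch: it assumes the @llvm.fshl.i64 funnel-shift intrinsic takes the same SHLD-vs-expansion lowering path as the shift+or pattern in the updated test, and the file name and CHECK prefixes below are made up here, mirroring the updated RUN lines.

; double-shift-repro.ll (hypothetical file name, illustration only)
; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver2 | FileCheck %s --check-prefix=ZN2
; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver3 | FileCheck %s --check-prefix=ZN3

; znver2 still has TuningSlowSHLD, so the funnel shift should be expanded
; (the shlxq/notb/shrq/shrxq/orq sequence seen in the test diff) rather than
; emitted as a double shift:
; ZN2-NOT: shldq
; znver3 and later drop TuningSlowSHLD, so a single double shift is expected:
; ZN3: shldq
define i64 @fsh(i64 %a, i64 %b, i64 %c) nounwind {
  %r = call i64 @llvm.fshl.i64(i64 %a, i64 %b, i64 %c)
  ret i64 %r
}
declare i64 @llvm.fshl.i64(i64, i64, i64)

The CPU choices and expected mnemonics mirror the BMI2-SLOW/BMI2-FAST patterns in the updated test below.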

llvmbot (Member) commented Mar 24, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

While still not as fast as on Intel targets, the double-shift instructions on recent AMD CPUs (znver3 and later) are no longer as heavily microcoded and bottlenecked as on earlier AMD targets (now only ~2cy reciprocal throughput), which makes them an improvement over the 3*shift+not+or sequence we otherwise expand to as an alternative.

Noticed while triaging #132601


Full diff: https://github.com/llvm/llvm-project/pull/132720.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86.td (+1-1)
  • (modified) llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll (+40-24)
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 38761e1fd7eec..4909224c458ac 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -1570,7 +1570,7 @@ def ProcessorFeatures {
                                                   FeatureVPCLMULQDQ];
   list<SubtargetFeature> ZN3AdditionalTuning = [TuningMacroFusion];
   list<SubtargetFeature> ZN3Tuning =
-    !listconcat(ZN2Tuning, ZN3AdditionalTuning);
+    !listremove(!listconcat(ZN2Tuning, ZN3AdditionalTuning), [TuningSlowSHLD]);
   list<SubtargetFeature> ZN3Features =
     !listconcat(ZN2Features, ZN3AdditionalFeatures);
 
diff --git a/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll b/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
index 58f6a66aeff79..c5e879c0135f4 100644
--- a/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
+++ b/llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll
@@ -12,12 +12,12 @@
 ; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver1 | FileCheck %s --check-prefixes=BMI
 ; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver2 | FileCheck %s --check-prefixes=BMI
 ; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver3 | FileCheck %s --check-prefixes=BMI
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver4 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver2 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver3 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver4 | FileCheck %s --check-prefixes=BMI2
-; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver5 | FileCheck %s --check-prefixes=BMI2
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=bdver4 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver2 | FileCheck %s --check-prefixes=BMI2-SLOW
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver3 | FileCheck %s --check-prefixes=BMI2-FAST
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver4 | FileCheck %s --check-prefixes=BMI2-FAST
+; RUN: llc < %s -mtriple=x86_64-- -mcpu=znver5 | FileCheck %s --check-prefixes=BMI2-FAST
 
 ; Verify that for the X86_64 processors that are known to have poor latency
 ; double precision shift instructions we do not generate 'shld' or 'shrd'
@@ -53,15 +53,23 @@ define i64 @lshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
 ; BMI-NEXT:    orq %rdi, %rax
 ; BMI-NEXT:    retq
 ;
-; BMI2-LABEL: lshift:
-; BMI2:       # %bb.0: # %entry
-; BMI2-NEXT:    # kill: def $edx killed $edx def $rdx
-; BMI2-NEXT:    shlxq %rdx, %rdi, %rcx
-; BMI2-NEXT:    notb %dl
-; BMI2-NEXT:    shrq %rsi
-; BMI2-NEXT:    shrxq %rdx, %rsi, %rax
-; BMI2-NEXT:    orq %rcx, %rax
-; BMI2-NEXT:    retq
+; BMI2-SLOW-LABEL: lshift:
+; BMI2-SLOW:       # %bb.0: # %entry
+; BMI2-SLOW-NEXT:    # kill: def $edx killed $edx def $rdx
+; BMI2-SLOW-NEXT:    shlxq %rdx, %rdi, %rcx
+; BMI2-SLOW-NEXT:    notb %dl
+; BMI2-SLOW-NEXT:    shrq %rsi
+; BMI2-SLOW-NEXT:    shrxq %rdx, %rsi, %rax
+; BMI2-SLOW-NEXT:    orq %rcx, %rax
+; BMI2-SLOW-NEXT:    retq
+;
+; BMI2-FAST-LABEL: lshift:
+; BMI2-FAST:       # %bb.0: # %entry
+; BMI2-FAST-NEXT:    movl %edx, %ecx
+; BMI2-FAST-NEXT:    movq %rdi, %rax
+; BMI2-FAST-NEXT:    # kill: def $cl killed $cl killed $ecx
+; BMI2-FAST-NEXT:    shldq %cl, %rsi, %rax
+; BMI2-FAST-NEXT:    retq
 entry:
   %sh_prom = zext i32 %c to i64
   %shl = shl i64 %a, %sh_prom
@@ -100,15 +108,23 @@ define i64 @rshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
 ; BMI-NEXT:    orq %rdi, %rax
 ; BMI-NEXT:    retq
 ;
-; BMI2-LABEL: rshift:
-; BMI2:       # %bb.0: # %entry
-; BMI2-NEXT:    # kill: def $edx killed $edx def $rdx
-; BMI2-NEXT:    shrxq %rdx, %rdi, %rcx
-; BMI2-NEXT:    notb %dl
-; BMI2-NEXT:    addq %rsi, %rsi
-; BMI2-NEXT:    shlxq %rdx, %rsi, %rax
-; BMI2-NEXT:    orq %rcx, %rax
-; BMI2-NEXT:    retq
+; BMI2-SLOW-LABEL: rshift:
+; BMI2-SLOW:       # %bb.0: # %entry
+; BMI2-SLOW-NEXT:    # kill: def $edx killed $edx def $rdx
+; BMI2-SLOW-NEXT:    shrxq %rdx, %rdi, %rcx
+; BMI2-SLOW-NEXT:    notb %dl
+; BMI2-SLOW-NEXT:    addq %rsi, %rsi
+; BMI2-SLOW-NEXT:    shlxq %rdx, %rsi, %rax
+; BMI2-SLOW-NEXT:    orq %rcx, %rax
+; BMI2-SLOW-NEXT:    retq
+;
+; BMI2-FAST-LABEL: rshift:
+; BMI2-FAST:       # %bb.0: # %entry
+; BMI2-FAST-NEXT:    movl %edx, %ecx
+; BMI2-FAST-NEXT:    movq %rdi, %rax
+; BMI2-FAST-NEXT:    # kill: def $cl killed $cl killed $ecx
+; BMI2-FAST-NEXT:    shrdq %cl, %rsi, %rax
+; BMI2-FAST-NEXT:    retq
 entry:
   %sh_prom = zext i32 %c to i64
   %shr = lshr i64 %a, %sh_prom
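
For anyone who wants to double-check the new patterns locally, the updated RUN lines translate directly into standalone commands (assuming an llvm-project checkout as the working directory and a build of llc and FileCheck on PATH; adjust paths to your build as needed):

llc < llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll -mtriple=x86_64-- -mcpu=znver2 | FileCheck llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll --check-prefixes=BMI2-SLOW
llc < llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll -mtriple=x86_64-- -mcpu=znver3 | FileCheck llvm/test/CodeGen/X86/x86-64-double-shifts-var.ll --check-prefixes=BMI2-FAST

Running the whole test file through llvm-lit exercises all of the RUN lines, including the unchanged Intel and bdver targets, in one go.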
