[MCA] New option to report scheduling information: -scheduling-info #126703

Closed
jvillette38 wants to merge 3 commits

jvillette38
Contributor

This PR adds a new way to check and update scheduling information in LLVM. I have used it to update the scheduling information for the AArch64 Neoverse V1 microarchitecture (follow-up patches will depend on this pull request).

This pull request contains 2 commits:
A) llvm-mca -scheduling-info option
B) update_mca_test_checks.py new options: --check-sched-info and --update-sched-info.

A) llvm-mca -scheduling-info disables the default llvm-mca report (InstructionInfoView) and prints one line per instruction in the following format:
<uOps> | <Latency> | <Bypass Latency> | <Throughput> | <Resources> | <LLVM Opcode> | <Assembly input: instruction + comment>
Example from new llvm-mca test AArch64/Neoverse/V1-scheduling-info.s:
Input:
abs v25.2s, v25.2s // ABS <Vd>.<T>, <Vn>.<T> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV
Output:
1 | 2 | 2 | 4.00 | V1UnitSVE01, V1UnitV | ABSv2i32 | abs v25.2s, v25.2s // ABS <Vd>.<T>, <Vn>.<T> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV
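For readers who want to post-process this output, the line format above can be split back into its seven fields. The following is a minimal Python sketch and is not part of this PR: the `SchedInfo` type and the `parse_sched_info_line` helper are hypothetical names, and the sketch assumes that the four numeric columns always parse as numbers.

```python
from typing import List, NamedTuple


class SchedInfo(NamedTuple):
    """One line of `llvm-mca -scheduling-info` output (hypothetical helper type)."""
    uops: int
    latency: int
    bypass_latency: int
    throughput: float
    resources: List[str]
    opcode: str
    asm: str


def parse_sched_info_line(line: str) -> SchedInfo:
    # Split on the first six '|' only: the assembly text is the last field
    # and its comment may itself contain '|' (for example "<Wd|WSP>").
    fields = [field.strip() for field in line.split("|", 6)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    uops, latency, bypass, throughput, resources, opcode, asm = fields
    return SchedInfo(
        uops=int(uops),
        latency=int(latency),
        bypass_latency=int(bypass),
        throughput=float(throughput),
        resources=[r.strip() for r in resources.split(",") if r.strip()],
        opcode=opcode,
        asm=asm,
    )


# The example output line from the description above.
example = (
    r"1 | 2 | 2 | 4.00 | V1UnitSVE01, V1UnitV | ABSv2i32 | "
    r"abs v25.2s, v25.2s // ABS <Vd>.<T>, <Vn>.<T> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV"
)
info = parse_sched_info_line(example)
print(info.opcode, info.resources)
```

Limiting the split to six separators matters because instruction-format comments such as `ADD <Wd|WSP>, <Wn|WSP>, <Wm>` contain `|` themselves.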

So if the scheduling information for each instruction variant can be extracted from the microarchitecture documentation, a test can be written in this form and the llvm-mca -scheduling-info output checked against the values in the comments. If they differ, check the documentation, then either correct the comment or fix the LLVM target description so that the llvm-mca output changes.
The LLVM opcode is reported to make the corresponding change in the target description easier to locate.

B) update_mca_test_checks.py --check-sched-info compares the llvm-mca output with the information in the comments. If it finds differences, it reports them and exits with an error code. The developer can then fix the comments or the LLVM target description, or run update_mca_test_checks.py --update-sched-info to update the comments automatically and review the changes with git.
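The kind of comparison that --check-sched-info performs can be illustrated with a short Python sketch. This is not the code from update_mca_test_checks.py: the function name, the dictionary layout, the numeric tolerance on throughput, and the order-insensitive comparison of resources are all assumptions made for illustration.

```python
import math
from typing import Dict, List

FIELDS = ("uops", "latency", "bypass_latency", "throughput", "resources")


def sched_info_differences(reported: Dict, reference: Dict) -> List[str]:
    """Return one message per field where llvm-mca and the comment disagree."""
    diffs = []
    for name in FIELDS:
        got, want = reported[name], reference[name]
        if name == "throughput":
            # Assumption: "4.0" in a comment and "4.00" in the output are equal.
            same = math.isclose(float(got), float(want), abs_tol=0.005)
        elif name == "resources":
            # Assumption: the order of resource units is not significant.
            same = set(got) == set(want)
        else:
            same = got == want
        if not same:
            diffs.append(f"{name}: llvm-mca reports {got!r}, comment says {want!r}")
    return diffs


# Values taken from the "abs v25.2s, v25.2s" example in part A.
reported = {"uops": 1, "latency": 2, "bypass_latency": 2,
            "throughput": 4.00, "resources": ["V1UnitSVE01", "V1UnitV"]}
reference = {"uops": 1, "latency": 2, "bypass_latency": 2,
             "throughput": 4.0, "resources": ["V1UnitV"]}

for message in sched_info_differences(reported, reference):
    print(message)
```

A checking script built this way would exit with a non-zero status whenever the returned list is non-empty.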

Convention for the comments read by the new update_mca_test_checks.py options:

  • C or C++ style comments: '/* */' and '//'
  • Fields:
    <asm instruction> <// or /*> <instruction format> \\ <micro architecture reference> \\ <uOps> <Latency> <Bypass latency> <Throughput> <Resources separated with commas>
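The comment convention above can be read back with a few lines of Python. This is a hedged sketch rather than the parser used by update_mca_test_checks.py: `Reference` and `parse_reference_comment` are hypothetical names, and returning `None` for lines without usable values (such as "No scheduling info") is an assumption.

```python
import re
from typing import List, NamedTuple, Optional


class Reference(NamedTuple):
    instruction_format: str
    uarch_reference: str
    uops: int
    latency: int
    bypass_latency: int
    throughput: float
    resources: List[str]


def parse_reference_comment(asm_line: str) -> Optional[Reference]:
    # Locate the start of a '//' or '/* */' comment.
    start = re.search(r"//|/\*", asm_line)
    if start is None:
        return None
    comment = asm_line[start.end():]
    if start.group() == "/*":
        comment = comment.split("*/", 1)[0]
    # The three comment sections are separated by two backslashes.
    parts = [part.strip() for part in comment.split("\\\\")]
    if len(parts) != 3:
        return None
    instruction_format, uarch_reference, values = parts
    tokens = values.split(None, 4)
    if len(tokens) < 4:
        return None
    try:
        uops, latency, bypass = (int(token) for token in tokens[:3])
        throughput = float(tokens[3])
    except ValueError:
        return None
    resources = tokens[4].split(",") if len(tokens) == 5 else []
    return Reference(instruction_format, uarch_reference, uops, latency,
                     bypass, throughput, [r.strip() for r in resources if r.strip()])


line = r"abs D15, D11  /* ABS <V><d>, <V><n>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV */"
ref = parse_reference_comment(line)
print(ref)
```

Both comment styles go through the same path; only the `/* */` form needs its closing delimiter stripped before the fields are split.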

@mshockwave and @Rin18 may be interested.

@llvmbot
Member

llvmbot commented Feb 11, 2025

@llvm/pr-subscribers-tools-llvm-mca

@llvm/pr-subscribers-testing-tools

Author: Julien Villette (jvillette38)

Patch is 1.49 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126703.diff

10 Files Affected:

  • (modified) llvm/docs/CommandGuide/llvm-mca.rst (+14)
  • (modified) llvm/include/llvm/MC/MCSchedule.h (+4)
  • (modified) llvm/lib/MC/MCSchedule.cpp (+37)
  • (added) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s (+7588)
  • (modified) llvm/tools/llvm-mca/CMakeLists.txt (+1)
  • (modified) llvm/tools/llvm-mca/Views/InstructionInfoView.h (+1)
  • (added) llvm/tools/llvm-mca/Views/SchedulingInfoView.cpp (+210)
  • (added) llvm/tools/llvm-mca/Views/SchedulingInfoView.h (+96)
  • (modified) llvm/tools/llvm-mca/llvm-mca.cpp (+31-11)
  • (modified) llvm/utils/update_mca_test_checks.py (+168)
diff --git a/llvm/docs/CommandGuide/llvm-mca.rst b/llvm/docs/CommandGuide/llvm-mca.rst
index f610ea2f2168269..1c5275ce000b111 100644
--- a/llvm/docs/CommandGuide/llvm-mca.rst
+++ b/llvm/docs/CommandGuide/llvm-mca.rst
@@ -170,6 +170,20 @@ option specifies "``-``", then the output will also be sent to standard output.
   Enable extra scheduler statistics. This view collects and analyzes instruction
   issue events. This view is disabled by default.
 
+.. option:: -scheduling-info
+
+  Enable the scheduling info view. This view reports the scheduling information
+  defined in the LLVM target description in the form:
+  uOps | Latency | Bypass Latency | Throughput | Resources | LLVM Opcode |
+  assembly instruction and its comment (// or /* */), if any.
+  This makes it possible to compare the scheduling information with the
+  architecture documents and to fix the InstrRW entries for the reported opcode.
+  Reference scheduling information can be written in the same order in each
+  instruction comment so that reported and reference values are easy to compare.
+  Suggested comment layout:
+  // <architecture instruction form> \\ <scheduling documentation title> \\
+     <uOps> <Latency> <Bypass Latency> <Throughput> <Resources units>
+
 .. option:: -retire-stats
 
   Enable extra retire control unit statistics. This view is disabled by default.
diff --git a/llvm/include/llvm/MC/MCSchedule.h b/llvm/include/llvm/MC/MCSchedule.h
index fe731d086f70ae3..dcbc5369120a39b 100644
--- a/llvm/include/llvm/MC/MCSchedule.h
+++ b/llvm/include/llvm/MC/MCSchedule.h
@@ -402,6 +402,10 @@ struct MCSchedModel {
   static unsigned getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
                                            unsigned WriteResourceIdx = 0);
 
+  /// Returns the forwarding delay of the write with the maximum latency.
+  static unsigned getForwardingDelayCycles(const MCSubtargetInfo &STI,
+                                           const MCSchedClassDesc &SCDesc);
+
   /// Returns the default initialized model.
   static const MCSchedModel Default;
 };
diff --git a/llvm/lib/MC/MCSchedule.cpp b/llvm/lib/MC/MCSchedule.cpp
index ed243cecabb7638..4ef6acf78714fa7 100644
--- a/llvm/lib/MC/MCSchedule.cpp
+++ b/llvm/lib/MC/MCSchedule.cpp
@@ -174,3 +174,40 @@ MCSchedModel::getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
 
   return std::abs(DelayCycles);
 }
+
+unsigned
+MCSchedModel::getForwardingDelayCycles(const MCSubtargetInfo &STI,
+                                       const MCSchedClassDesc &SCDesc) {
+
+  ArrayRef<MCReadAdvanceEntry> Entries = STI.getReadAdvanceEntries(SCDesc);
+  if (Entries.empty())
+    return 0;
+
+  unsigned Latency = 0;
+  unsigned maxLatency = 0;
+  unsigned WriteResourceID = 0;
+  unsigned DefEnd = SCDesc.NumWriteLatencyEntries;
+
+  for (unsigned DefIdx = 0; DefIdx != DefEnd; ++DefIdx) {
+    // Lookup the definition's write latency in SubtargetInfo.
+    const MCWriteLatencyEntry *WLEntry =
+        STI.getWriteLatencyEntry(&SCDesc, DefIdx);
+    // Early exit if we found an invalid latency.
+    // Consider no bypass
+    if (WLEntry->Cycles < 0)
+      return 0;
+    maxLatency = std::max(Latency, static_cast<unsigned>(WLEntry->Cycles));
+    if (maxLatency > Latency) {
+      WriteResourceID = WLEntry->WriteResourceID;
+    }
+    Latency = maxLatency;
+  }
+
+  for (const MCReadAdvanceEntry &E : Entries) {
+    if (E.WriteResourceID == WriteResourceID) {
+      return E.Cycles;
+    }
+  }
+
+  llvm_unreachable("WriteResourceID not found in MCReadAdvanceEntry entries");
+}
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s
new file mode 100644
index 000000000000000..c421166f22ea45e
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s
@@ -0,0 +1,7588 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=aarch64 -mcpu=neoverse-v1 -scheduling-info < %s | FileCheck %s
+
+  .text
+  .file	        "V1-scheduling-info.s"
+  .globl	test
+  .p2align	4
+  .type	test,@function
+test:
+  .cfi_startproc
+  abs D15, D11  /* ABS <V><d>, <V><n>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV */
+  abs V25.2S, V25.2S  // ABS <Vd>.<T>, <Vn>.<T>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  abs Z26.B, P6/M, Z27.B  // ABS <Zd>.<T>, <Pg>/M, <Zn>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adc W13, W6, W4  // ADC <Wd>, <Wn>, <Wm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  adc X8, X12, X10  // ADC <Xd>, <Xn>, <Xm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  adcs W29, W7, W30  // ADCS <Wd>, <Wn>, <Wm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adcs X11, X3, X5  // ADCS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  add WSP, WSP, W10  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>  \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W2, UXTB   // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, <wextend>   \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W13, UXTH #4  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, <wextend> #<amount>  \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W13, LSL #4  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, LSL #<amount>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 2  2  2.00 V1UnitI
+  add X22, X2, X27  // ADD <Xd|SP>, <Xn|SP>, X<m>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X25, X9, W25, UXTB  // ADD <Xd|SP>, <Xn|SP>, <R><m>, <extend>  \\ ALU, basic  \\ 1 2  2  2.00 V1UnitI
+  add X4, X28, W3, UXTB #3  // ADD <Xd|SP>, <Xn|SP>, <R><m>, <extend> #<amount>  \\ ALU, extend and shift  \\ 1 2  2  2.0 V1UnitM
+  add X0, X28, X26, LSL #3  // ADD <Xd|SP>, <Xn|SP>, X<m>, LSL #<amount>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 1  1  4.0 V1UnitI
+  add WSP, WSP, #3765  // ADD <Wd|WSP>, <Wn|WSP>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add WSP, WSP, #3547, LSL #12  // ADD <Wd|WSP>, <Wn|WSP>, #<imm>, <shift>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X7, X30, #803  // ADD <Xd|SP>, <Xn|SP>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X7, X2, #319, LSL #12  // ADD <Xd|SP>, <Xn|SP>, #<imm>, <shift>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add Z13.D, Z13.D, #245  // ADD <Zdn>.<T>, <Zdn>.<T>, #<imm>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add Z16.D, Z16.D, #233, LSL #8  // ADD <Zdn>.<T>, <Zdn>.<T>, #<imm>, <shift>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add W3, W2, W21, LSL #3  // ADD <Wd>, <Wn>, <Wm>, LSL #<wamountl>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, no flagset   \\ 1 1  1  4.0 V1UnitI
+  add W6, W21, W17, LSL #15  // ADD <Wd>, <Wn>, <Wm>, LSL #<wamounth>  \\ Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.0 V1UnitM
+  add W28, W30, W19, ASR #30  // ADD <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.0 V1UnitM
+  add X8, X3, X28, LSL #3  // ADD <Xd>, <Xn>, <Xm>, LSL #<amountl>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 1  1  4.0 V1UnitI
+  add X12, X13, X0, LSL #44  // ADD <Xd>, <Xn>, <Xm>, LSL #<amounth>  \\ Arithmetic, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.0 V1UnitM
+  add X5, X20, X28, LSR #16  // ADD <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Arithmetic, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.0 V1UnitM
+  add D0, D23, D21  // ADD <V><d>, <V><n>, <V><m>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  add V19.4S, V24.4S, V15.4S  // ADD <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  add Z29.D, P5/M, Z29.D, Z29.D  // ADD <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add Z10.H, Z22.H, Z13.H  // ADD <Zd>.<T>, <Zn>.<T>, <Zm>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  addhn V26.4H, V5.4S, V9.4S  // ADDHN <Vd>.<Tb>, <Vn>.<Ta>, <Vm>.<Ta>  \\ ASIMD arith, complex  \\ 1 2  2  4.0 V1UnitV
+  addhn2 V1.16B, V19.8H, V6.8H  // ADDHN2 <Vd>.<Tb>, <Vn>.<Ta>, <Vm>.<Ta>  \\ ASIMD arith, complex  \\ 1 2  2  4.0 V1UnitV
+  addp D1, V14.2D  // ADDP <V><d>, <Vn>.<T>  \\ ASIMD arith, pair-wise  \\ 1 2  2  4.0 V1UnitV
+  addp V7.2S, V1.2S, V2.2S  // ADDP <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD arith, pair-wise  \\ 1 2  2  4.0 V1UnitV
+  addpl X27, X6, #-6  // ADDPL <Xd|SP>, <Xn|SP>, #<imm>  \\ Predicate counting scalar  \\ 1 2  2  1.0 V1UnitM0
+  adds W17, WSP, W25  // ADDS <Wd>, <Wn|WSP>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds W6, WSP, W15, UXTH   // ADDS <Wd>, <Wn|WSP>, <Wm>, <wextend>   \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds W22, WSP, W30, UXTB #2  // ADDS <Wd>, <Wn|WSP>, <Wm>, <wextend> #<amount>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W12, WSP, W29, LSL #4  // ADDS <Wd>, <Wn|WSP>, <Wm>, LSL #<amount>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, flagset   \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds X14, X0, X10  // ADDS <Xd>, <Xn|SP>, X<m>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X13, X23, W8, UXTB  // ADDS <Xd>, <Xn|SP>, <R><m>, <extend>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X4, X26, W28, UXTB #1  // ADDS <Xd>, <Xn|SP>, <R><m>, <extend> #<amount>  \\ ALU, flagset, extend and shift  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds X10, X3, X29, LSL #2  // ADDS <Xd>, <Xn|SP>, X<m>, LSL #<amount>  \\ Arithmetic, flagset, LSL shift, shift <= 4  \\ 1 1   1   3.00 V1UnitI,V1UnitFlg
+  adds W23, WSP, #502  // ADDS <Wd>, <Wn|WSP>, #<imm>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W2, WSP, #2980, LSL #12  // ADDS <Wd>, <Wn|WSP>, #<imm>, <shift>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds X12, X4, #1345  // ADDS <Xd>, <Xn|SP>, #<imm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X25, X18, #3037, LSL #12  // ADDS <Xd>, <Xn|SP>, #<imm>, <shift>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds W12, W13, W26  // ADDS <Wd>, <Wn>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W0, W23, W20, LSL #0  // ADDS <Wd>, <Wn>, <Wm>, LSL #<wamountl>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, flagset   \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W13, W16, W12, LSL #28  // ADDS <Wd>, <Wn>, <Wm>, LSL #<wamounth>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds W20, W19, W16, ASR #0  // ADDS <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds X23, X12, X4  // ADDS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X0, X13, X4, LSL #2  // ADDS <Xd>, <Xn>, <Xm>, LSL #<amountl>  \\ Arithmetic, flagset, LSL shift, shift <= 4  \\ 1 1   1   3.00 V1UnitI,V1UnitFlg
+  adds X4, X7, X6, LSL #31  // ADDS <Xd>, <Xn>, <Xm>, LSL #<amounth>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds X9, X8, X9, ASR #41  // ADDS <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  addv B0, V28.8B  // ADDV B<d>, <Vn>.8B  \\ ASIMD arith, reduce, 8B/8H  \\ 2 4  4  2.00 V1UnitV13
+  addv B1, V26.16B  // ADDV B<d>, <Vn>.16B  \\ ASIMD arith, reduce, 16B  \\ 2 4  4  1.00 V1UnitV13[2]
+  addv H18, V13.4H  // ADDV H<d>, <Vn>.4H  \\ ASIMD arith, reduce, 4H/4S  \\ 1 2  2  2.0 V1UnitV13
+  addv H29, V17.8H  // ADDV H<d>, <Vn>.8H  \\ ASIMD arith, reduce, 8B/8H  \\ 2 4  4  2.00 V1UnitV13
+  addv S22, V18.4S  // ADDV S<d>, <Vn>.4S  \\ ASIMD arith, reduce, 4H/4S  \\ 1 2  2  2.0 V1UnitV13
+  addvl X1, X27, #-8  // ADDVL <Xd|SP>, <Xn|SP>, #<imm>  \\ Predicate counting scalar  \\ 1 2  2  1.0 V1UnitM0
+  adr X3, test  // ADR <Xd>, <label>  \\ Address generation  \\ 1 1  1  4.0 V1UnitI
+  adr Z26.D, [Z1.D, Z8.D]  // ADR <Zd>.<T>, [<Zn>.<T>, <Zm>.<T>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z22.S, [Z28.S, Z8.S, LSL #2]  // ADR <Zd>.<T>, [<Zn>.<T>, <Zm>.<T>, <mod> #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z11.D, [Z2.D, Z29.D, SXTW ]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, SXTW ]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z3.D, [Z9.D, Z9.D, SXTW #2]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, SXTW #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z6.D, [Z7.D, Z13.D, UXTW ]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, UXTW ]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z4.D, [Z24.D, Z22.D, UXTW #1]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, UXTW #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adrp X0, test  // ADRP <Xd>, <label>  \\ Address generation  \\ 1 1  1  4.0 V1UnitI
+  and WSP, W16, #0xe00  // AND <Wd|WSP>, <Wn>, #<imms>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  and X2, X22, #0x1e00  // AND <Xd|SP>, <Xn>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  and Z1.B, Z1.B, #0x70  // AND <Zdn>.B, <Zdn>.B, #<constb>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.H, Z7.H, #0x60  // AND <Zdn>.H, <Zdn>.H, #<consth>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.S, Z7.S, #0x2  // AND <Zdn>.S, <Zdn>.S, #<consts>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.D, Z7.D, #0x4  // AND <Zdn>.D, <Zdn>.D, #<constd>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and P5.B, P1/Z, P6.B, P4.B  // AND <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B  \\ Predicate logical  \\ 1 1  1  1.0 V1UnitM0
+  and W11, W14, W24  // AND <Wd>, <Wn>, <Wm>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and W2, W21, W22, LSR #25  // AND <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and X1, X20, X29  // AND <Xd>, <Xn>, <Xm>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and X8, X11, X22, ASR #56  // AND <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and V29.8B, V26.8B, V26.8B  // AND <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD logical  \\ 1 2  2  4.0 V1UnitV
+  and Z17.D, P6/M, Z17.D, Z12.D  // AND <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z9.D, Z5.D, Z17.D  // AND <Zd>.D, <Zn>.D, <Zm>.D  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  ands W14, W8, #0x70  // ANDS <Wd>, <Wn>, #<imms>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  ands X4, X10, #0x60  // ANDS <Xd>, <Xn>, #<immd>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  ands W29, W28, W12  // ANDS <Wd>, <Wn>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  ands W7, W13, W23, ASR #3  // ANDS <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Logical, shift by immed, flagset, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  ands X21, X9, X6  // ANDS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  ands X10, X27, X7, ASR #20  // ANDS <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Logical, shift, flagset  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  ands P5.B, P1/Z, P2.B, P7.B  // ANDS <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B  \\ Predicate logical, flag setting  \\ 2 2  2  0.50 V1UnitM0[2]
+  andv H7, P6, Z31.H  // ANDV <V><d>, <Pg>, <Zn>.<T>  \\ Reduction, logical   \\ 4 12  12  0.50 V1UnitV01[4]
+  asr W30, W14, #5  // ASR <Wd>, <Wn>, #<shifts>  \\ Move, shift by immed, no flagset  \\ 1 1  1  4.0 V1UnitI
+  asr X12, X21, #28  // ASR <Xd>, <Xn>, #<shiftd>  \\ Move, shift by immed, no flagset  \\ 1 1  1  4.0 V1UnitI
+  asr Z7.B, P5/M, Z7.B, #3  // ASR <Zdn>.B, <Pg>/M, <Zdn>.B, #<constb>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z6.H, P6/M, Z6.H, #5  // ASR <Zdn>.H, <Pg>/M, <Zdn>.H, #<consth>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z28.S, P0/M, Z28.S, #11  // ASR <Zdn>.S, <Pg>/M, <Zdn>.S, #<consts>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z26.D, P5/M, Z26.D, #24  // ASR <Zdn>.D, <Pg>/M, <Zdn>.D, #<constd>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z10.B, Z14.B, #3  // ASR <Zd>.B, <Zn>.B, #<constb>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z23.H, Z18.H, #6  // ASR <Zd>.H, <Zn>.H, #<consth>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z29.S, Z11.S, #6  // ASR <Zd>.S, <Zn>.S, #<consts>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z20.D, Z26.D, #29  // ASR <Zd>.D, <Zn>.D, #<constd>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr W3, W0, W20  // ASR <Wd>, <Wn>, <Wm>  \\ Move, shift by register, no flagset, unconditional  \\ 1 1  1  4.0 V1UnitI
+  asr X7, X5, X21  // ASR <Xd>, <Xn>, <Xm>  \\ Move, shift by register, no flagset, unconditional  \\ 1 1  1  4.0 V1UnitI
+  asr Z3.S, P0/M, Z3.S, Z10.S  // ASR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z9.S, P2/M, Z9.S, Z8.D  // ASR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.D  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z26.S, Z21.S, Z21.D  // ASR <Zd>.<T>, <Zn>.<T>, <Zm>.D  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asrd Z6.B, P4/M, Z6.B, #2  // ASRD <Zdn>.B, <Pg>/M, <Zdn>.B, #<constb>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z19.H, P3/M, Z19.H, #6  // ASRD <Zdn>.H, <Pg>/M, <Zdn>.H, #<consth>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z16.S, P3/M, Z16.S, #2  // ASRD <Zdn>.S, <Pg>/M, <Zdn>.S, #<consts>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z9.D, P6/M, Z9.D, #12  // ASRD <Zdn>.D, <Pg>/M, <Zdn>.D, #<constd>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrr Z0.B, P0/M, Z0.B, Z19.B  // ASRR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asrv W24, W28, W13  // ASRV <Wd>, <Wn>, <Wm>  \\ Variable shift  \\ 1 1  1  4.0 V1UnitI
+  asrv X3, X21, X24  // ASRV <Xd>, <Xn>, <Xm>  \\ Variable shift  \\ 1 1  1  4.0 V1UnitI
+  at s12e1r, X28  // AT <at_op>, <Xt>  \\ No description \\ No scheduling info
+  b test  // B <label>  \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.eq test // B.eq <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.none test // B.none <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ne test // B.ne <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.any test // B.any <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.cs test // B.cs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.hs test // B.hs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nlast test // B.nlast <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.cc test // B.cc <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.lo test // B.lo <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.last test // B.last <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.mi test // B.mi <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.first test // B.first <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.pl test // B.pl <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nfrst test // B.nfrst <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.vs test // B.vs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.vc test // B.vc <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.hi test // B.hi <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.pmore test // B.pmore <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ls test // B.ls <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.plast test // B.plast <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ge test // B.ge <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.tcont test // B.tcont <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.lt test // B.lt <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.tstop test // B.tstop <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.gt test // B.gt <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.le test // B.le <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.al test // B.al <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nv test // B.nv <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  bfcvt H6, S20  // BFCVT <Hd>, <Sn>  \\ Scalar convert, F32 to BF16  \\ 1 3  3  2.0 V1UnitV02
+  bfcvt Z16.H, P6/M,...
[truncated]
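The `getForwardingDelayCycles` overload added to MCSchedule.cpp in the patch above selects the write with the largest latency and returns the ReadAdvance cycles recorded for that write. A minimal Python model of that control flow follows. It uses plain tuples in place of `MCWriteLatencyEntry` and `MCReadAdvanceEntry`; the tuple layout and the function name are assumptions made for illustration and are not LLVM API.

```python
from typing import List, Tuple


def forwarding_delay_cycles(
    write_latencies: List[Tuple[int, int]],  # (cycles, write_resource_id)
    read_advances: List[Tuple[int, int]],    # (write_resource_id, cycles)
) -> int:
    # No ReadAdvance entries: the scheduling class has no bypass.
    if not read_advances:
        return 0

    latency = 0
    write_resource_id = 0
    for cycles, resource_id in write_latencies:
        # An invalid (negative) latency is treated as "no bypass".
        if cycles < 0:
            return 0
        # Remember the write that has the largest latency seen so far.
        if cycles > latency:
            latency = cycles
            write_resource_id = resource_id

    # Return the cycles of the first ReadAdvance entry for that write.
    for resource_id, cycles in read_advances:
        if resource_id == write_resource_id:
            return cycles

    # The C++ code reaches llvm_unreachable here; the model raises instead.
    raise LookupError("WriteResourceID not found in ReadAdvance entries")


print(forwarding_delay_cycles([(2, 7), (4, 9)], [(7, 1), (9, 2)]))
```

The model makes one property of the patch easy to see: a scheduling class whose ReadAdvance entries do not mention the maximum-latency write falls through to the final error path.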

@llvmbot
Member

llvmbot commented Feb 11, 2025

@llvm/pr-subscribers-llvm-binary-utilities

Author: Julien Villette (jvillette38)

diff --git a/llvm/lib/MC/MCSchedule.cpp b/llvm/lib/MC/MCSchedule.cpp
index ed243cecabb7638..4ef6acf78714fa7 100644
--- a/llvm/lib/MC/MCSchedule.cpp
+++ b/llvm/lib/MC/MCSchedule.cpp
@@ -174,3 +174,40 @@ MCSchedModel::getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
 
   return std::abs(DelayCycles);
 }
+
+unsigned
+MCSchedModel::getForwardingDelayCycles(const MCSubtargetInfo &STI,
+                                            const MCSchedClassDesc &SCDesc) {
+
+  ArrayRef<MCReadAdvanceEntry> Entries = STI.getReadAdvanceEntries(SCDesc);
+  if (Entries.empty())
+    return 0;
+
+  unsigned Latency = 0;
+  unsigned maxLatency = 0;
+  unsigned WriteResourceID = 0;
+  unsigned DefEnd = SCDesc.NumWriteLatencyEntries;
+
+  for (unsigned DefIdx = 0; DefIdx != DefEnd; ++DefIdx) {
+    // Lookup the definition's write latency in SubtargetInfo.
+    const MCWriteLatencyEntry *WLEntry =
+        STI.getWriteLatencyEntry(&SCDesc, DefIdx);
+    // Early exit if we found an invalid latency.
+    // Consider no bypass
+    if (WLEntry->Cycles < 0)
+      return 0;
+    maxLatency = std::max(Latency, static_cast<unsigned>(WLEntry->Cycles));
+    if (maxLatency > Latency) {
+      WriteResourceID = WLEntry->WriteResourceID;
+    }
+    Latency = maxLatency;
+  }
+
+  for (const MCReadAdvanceEntry &E : Entries) {
+    if (E.WriteResourceID == WriteResourceID) {
+      return E.Cycles;
+    }
+  }
+
+  llvm_unreachable("WriteResourceID not found in MCReadAdvanceEntry entries");
+}
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s
new file mode 100644
index 000000000000000..c421166f22ea45e
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s
@@ -0,0 +1,7588 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=aarch64 -mcpu=neoverse-v1 -scheduling-info < %s | FileCheck %s
+
+  .text
+  .file	        "V1-scheduling-info.s"
+  .globl	test
+  .p2align	4
+  .type	test,@function
+test:
+  .cfi_startproc
+  abs D15, D11  /* ABS <V><d>, <V><n>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV */
+  abs V25.2S, V25.2S  // ABS <Vd>.<T>, <Vn>.<T>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  abs Z26.B, P6/M, Z27.B  // ABS <Zd>.<T>, <Pg>/M, <Zn>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adc W13, W6, W4  // ADC <Wd>, <Wn>, <Wm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  adc X8, X12, X10  // ADC <Xd>, <Xn>, <Xm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  adcs W29, W7, W30  // ADCS <Wd>, <Wn>, <Wm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adcs X11, X3, X5  // ADCS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  add WSP, WSP, W10  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>  \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W2, UXTB   // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, <wextend>   \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W13, UXTH #4  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, <wextend> #<amount>  \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W13, LSL #4  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, LSL #<amount>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 2  2  2.00 V1UnitI
+  add X22, X2, X27  // ADD <Xd|SP>, <Xn|SP>, X<m>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X25, X9, W25, UXTB  // ADD <Xd|SP>, <Xn|SP>, <R><m>, <extend>  \\ ALU, basic  \\ 1 2  2  2.00 V1UnitI
+  add X4, X28, W3, UXTB #3  // ADD <Xd|SP>, <Xn|SP>, <R><m>, <extend> #<amount>  \\ ALU, extend and shift  \\ 1 2  2  2.0 V1UnitM
+  add X0, X28, X26, LSL #3  // ADD <Xd|SP>, <Xn|SP>, X<m>, LSL #<amount>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 1  1  4.0 V1UnitI
+  add WSP, WSP, #3765  // ADD <Wd|WSP>, <Wn|WSP>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add WSP, WSP, #3547, LSL #12  // ADD <Wd|WSP>, <Wn|WSP>, #<imm>, <shift>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X7, X30, #803  // ADD <Xd|SP>, <Xn|SP>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X7, X2, #319, LSL #12  // ADD <Xd|SP>, <Xn|SP>, #<imm>, <shift>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add Z13.D, Z13.D, #245  // ADD <Zdn>.<T>, <Zdn>.<T>, #<imm>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add Z16.D, Z16.D, #233, LSL #8  // ADD <Zdn>.<T>, <Zdn>.<T>, #<imm>, <shift>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add W3, W2, W21, LSL #3  // ADD <Wd>, <Wn>, <Wm>, LSL #<wamountl>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, no flagset   \\ 1 1  1  4.0 V1UnitI
+  add W6, W21, W17, LSL #15  // ADD <Wd>, <Wn>, <Wm>, LSL #<wamounth>  \\ Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.0 V1UnitM
+  add W28, W30, W19, ASR #30  // ADD <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.0 V1UnitM
+  add X8, X3, X28, LSL #3  // ADD <Xd>, <Xn>, <Xm>, LSL #<amountl>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 1  1  4.0 V1UnitI
+  add X12, X13, X0, LSL #44  // ADD <Xd>, <Xn>, <Xm>, LSL #<amounth>  \\ Arithmetic, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.0 V1UnitM
+  add X5, X20, X28, LSR #16  // ADD <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Arithmetic, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.0 V1UnitM
+  add D0, D23, D21  // ADD <V><d>, <V><n>, <V><m>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  add V19.4S, V24.4S, V15.4S  // ADD <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  add Z29.D, P5/M, Z29.D, Z29.D  // ADD <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add Z10.H, Z22.H, Z13.H  // ADD <Zd>.<T>, <Zn>.<T>, <Zm>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  addhn V26.4H, V5.4S, V9.4S  // ADDHN <Vd>.<Tb>, <Vn>.<Ta>, <Vm>.<Ta>  \\ ASIMD arith, complex  \\ 1 2  2  4.0 V1UnitV
+  addhn2 V1.16B, V19.8H, V6.8H  // ADDHN2 <Vd>.<Tb>, <Vn>.<Ta>, <Vm>.<Ta>  \\ ASIMD arith, complex  \\ 1 2  2  4.0 V1UnitV
+  addp D1, V14.2D  // ADDP <V><d>, <Vn>.<T>  \\ ASIMD arith, pair-wise  \\ 1 2  2  4.0 V1UnitV
+  addp V7.2S, V1.2S, V2.2S  // ADDP <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD arith, pair-wise  \\ 1 2  2  4.0 V1UnitV
+  addpl X27, X6, #-6  // ADDPL <Xd|SP>, <Xn|SP>, #<imm>  \\ Predicate counting scalar  \\ 1 2  2  1.0 V1UnitM0
+  adds W17, WSP, W25  // ADDS <Wd>, <Wn|WSP>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds W6, WSP, W15, UXTH   // ADDS <Wd>, <Wn|WSP>, <Wm>, <wextend>   \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds W22, WSP, W30, UXTB #2  // ADDS <Wd>, <Wn|WSP>, <Wm>, <wextend> #<amount>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W12, WSP, W29, LSL #4  // ADDS <Wd>, <Wn|WSP>, <Wm>, LSL #<amount>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, flagset   \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds X14, X0, X10  // ADDS <Xd>, <Xn|SP>, X<m>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X13, X23, W8, UXTB  // ADDS <Xd>, <Xn|SP>, <R><m>, <extend>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X4, X26, W28, UXTB #1  // ADDS <Xd>, <Xn|SP>, <R><m>, <extend> #<amount>  \\ ALU, flagset, extend and shift  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds X10, X3, X29, LSL #2  // ADDS <Xd>, <Xn|SP>, X<m>, LSL #<amount>  \\ Arithmetic, flagset, LSL shift, shift <= 4  \\ 1 1   1   3.00 V1UnitI,V1UnitFlg
+  adds W23, WSP, #502  // ADDS <Wd>, <Wn|WSP>, #<imm>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W2, WSP, #2980, LSL #12  // ADDS <Wd>, <Wn|WSP>, #<imm>, <shift>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds X12, X4, #1345  // ADDS <Xd>, <Xn|SP>, #<imm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X25, X18, #3037, LSL #12  // ADDS <Xd>, <Xn|SP>, #<imm>, <shift>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds W12, W13, W26  // ADDS <Wd>, <Wn>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W0, W23, W20, LSL #0  // ADDS <Wd>, <Wn>, <Wm>, LSL #<wamountl>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, flagset   \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W13, W16, W12, LSL #28  // ADDS <Wd>, <Wn>, <Wm>, LSL #<wamounth>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds W20, W19, W16, ASR #0  // ADDS <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds X23, X12, X4  // ADDS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X0, X13, X4, LSL #2  // ADDS <Xd>, <Xn>, <Xm>, LSL #<amountl>  \\ Arithmetic, flagset, LSL shift, shift <= 4  \\ 1 1   1   3.00 V1UnitI,V1UnitFlg
+  adds X4, X7, X6, LSL #31  // ADDS <Xd>, <Xn>, <Xm>, LSL #<amounth>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds X9, X8, X9, ASR #41  // ADDS <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  addv B0, V28.8B  // ADDV B<d>, <Vn>.8B  \\ ASIMD arith, reduce, 8B/8H  \\ 2 4  4  2.00 V1UnitV13
+  addv B1, V26.16B  // ADDV B<d>, <Vn>.16B  \\ ASIMD arith, reduce, 16B  \\ 2 4  4  1.00 V1UnitV13[2]
+  addv H18, V13.4H  // ADDV H<d>, <Vn>.4H  \\ ASIMD arith, reduce, 4H/4S  \\ 1 2  2  2.0 V1UnitV13
+  addv H29, V17.8H  // ADDV H<d>, <Vn>.8H  \\ ASIMD arith, reduce, 8B/8H  \\ 2 4  4  2.00 V1UnitV13
+  addv S22, V18.4S  // ADDV S<d>, <Vn>.4S  \\ ASIMD arith, reduce, 4H/4S  \\ 1 2  2  2.0 V1UnitV13
+  addvl X1, X27, #-8  // ADDVL <Xd|SP>, <Xn|SP>, #<imm>  \\ Predicate counting scalar  \\ 1 2  2  1.0 V1UnitM0
+  adr X3, test  // ADR <Xd>, <label>  \\ Address generation  \\ 1 1  1  4.0 V1UnitI
+  adr Z26.D, [Z1.D, Z8.D]  // ADR <Zd>.<T>, [<Zn>.<T>, <Zm>.<T>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z22.S, [Z28.S, Z8.S, LSL #2]  // ADR <Zd>.<T>, [<Zn>.<T>, <Zm>.<T>, <mod> #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z11.D, [Z2.D, Z29.D, SXTW ]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, SXTW ]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z3.D, [Z9.D, Z9.D, SXTW #2]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, SXTW #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z6.D, [Z7.D, Z13.D, UXTW ]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, UXTW ]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z4.D, [Z24.D, Z22.D, UXTW #1]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, UXTW #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adrp X0, test  // ADRP <Xd>, <label>  \\ Address generation  \\ 1 1  1  4.0 V1UnitI
+  and WSP, W16, #0xe00  // AND <Wd|WSP>, <Wn>, #<imms>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  and X2, X22, #0x1e00  // AND <Xd|SP>, <Xn>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  and Z1.B, Z1.B, #0x70  // AND <Zdn>.B, <Zdn>.B, #<constb>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.H, Z7.H, #0x60  // AND <Zdn>.H, <Zdn>.H, #<consth>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.S, Z7.S, #0x2  // AND <Zdn>.S, <Zdn>.S, #<consts>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.D, Z7.D, #0x4  // AND <Zdn>.D, <Zdn>.D, #<constd>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and P5.B, P1/Z, P6.B, P4.B  // AND <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B  \\ Predicate logical  \\ 1 1  1  1.0 V1UnitM0
+  and W11, W14, W24  // AND <Wd>, <Wn>, <Wm>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and W2, W21, W22, LSR #25  // AND <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and X1, X20, X29  // AND <Xd>, <Xn>, <Xm>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and X8, X11, X22, ASR #56  // AND <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and V29.8B, V26.8B, V26.8B  // AND <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD logical  \\ 1 2  2  4.0 V1UnitV
+  and Z17.D, P6/M, Z17.D, Z12.D  // AND <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z9.D, Z5.D, Z17.D  // AND <Zd>.D, <Zn>.D, <Zm>.D  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  ands W14, W8, #0x70  // ANDS <Wd>, <Wn>, #<imms>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  ands X4, X10, #0x60  // ANDS <Xd>, <Xn>, #<immd>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  ands W29, W28, W12  // ANDS <Wd>, <Wn>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  ands W7, W13, W23, ASR #3  // ANDS <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Logical, shift by immed, flagset, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  ands X21, X9, X6  // ANDS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  ands X10, X27, X7, ASR #20  // ANDS <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Logical, shift, flagset  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  ands P5.B, P1/Z, P2.B, P7.B  // ANDS <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B  \\ Predicate logical, flag setting  \\ 2 2  2  0.50 V1UnitM0[2]
+  andv H7, P6, Z31.H  // ANDV <V><d>, <Pg>, <Zn>.<T>  \\ Reduction, logical   \\ 4 12  12  0.50 V1UnitV01[4]
+  asr W30, W14, #5  // ASR <Wd>, <Wn>, #<shifts>  \\ Move, shift by immed, no flagset  \\ 1 1  1  4.0 V1UnitI
+  asr X12, X21, #28  // ASR <Xd>, <Xn>, #<shiftd>  \\ Move, shift by immed, no flagset  \\ 1 1  1  4.0 V1UnitI
+  asr Z7.B, P5/M, Z7.B, #3  // ASR <Zdn>.B, <Pg>/M, <Zdn>.B, #<constb>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z6.H, P6/M, Z6.H, #5  // ASR <Zdn>.H, <Pg>/M, <Zdn>.H, #<consth>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z28.S, P0/M, Z28.S, #11  // ASR <Zdn>.S, <Pg>/M, <Zdn>.S, #<consts>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z26.D, P5/M, Z26.D, #24  // ASR <Zdn>.D, <Pg>/M, <Zdn>.D, #<constd>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z10.B, Z14.B, #3  // ASR <Zd>.B, <Zn>.B, #<constb>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z23.H, Z18.H, #6  // ASR <Zd>.H, <Zn>.H, #<consth>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z29.S, Z11.S, #6  // ASR <Zd>.S, <Zn>.S, #<consts>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z20.D, Z26.D, #29  // ASR <Zd>.D, <Zn>.D, #<constd>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr W3, W0, W20  // ASR <Wd>, <Wn>, <Wm>  \\ Move, shift by register, no flagset, unconditional  \\ 1 1  1  4.0 V1UnitI
+  asr X7, X5, X21  // ASR <Xd>, <Xn>, <Xm>  \\ Move, shift by register, no flagset, unconditional  \\ 1 1  1  4.0 V1UnitI
+  asr Z3.S, P0/M, Z3.S, Z10.S  // ASR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z9.S, P2/M, Z9.S, Z8.D  // ASR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.D  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z26.S, Z21.S, Z21.D  // ASR <Zd>.<T>, <Zn>.<T>, <Zm>.D  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asrd Z6.B, P4/M, Z6.B, #2  // ASRD <Zdn>.B, <Pg>/M, <Zdn>.B, #<constb>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z19.H, P3/M, Z19.H, #6  // ASRD <Zdn>.H, <Pg>/M, <Zdn>.H, #<consth>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z16.S, P3/M, Z16.S, #2  // ASRD <Zdn>.S, <Pg>/M, <Zdn>.S, #<consts>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z9.D, P6/M, Z9.D, #12  // ASRD <Zdn>.D, <Pg>/M, <Zdn>.D, #<constd>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrr Z0.B, P0/M, Z0.B, Z19.B  // ASRR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asrv W24, W28, W13  // ASRV <Wd>, <Wn>, <Wm>  \\ Variable shift  \\ 1 1  1  4.0 V1UnitI
+  asrv X3, X21, X24  // ASRV <Xd>, <Xn>, <Xm>  \\ Variable shift  \\ 1 1  1  4.0 V1UnitI
+  at s12e1r, X28  // AT <at_op>, <Xt>  \\ No description \\ No scheduling info
+  b test  // B <label>  \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.eq test // B.eq <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.none test // B.none <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ne test // B.ne <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.any test // B.any <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.cs test // B.cs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.hs test // B.hs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nlast test // B.nlast <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.cc test // B.cc <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.lo test // B.lo <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.last test // B.last <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.mi test // B.mi <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.first test // B.first <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.pl test // B.pl <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nfrst test // B.nfrst <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.vs test // B.vs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.vc test // B.vc <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.hi test // B.hi <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.pmore test // B.pmore <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ls test // B.ls <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.plast test // B.plast <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ge test // B.ge <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.tcont test // B.tcont <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.lt test // B.lt <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.tstop test // B.tstop <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.gt test // B.gt <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.le test // B.le <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.al test // B.al <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nv test // B.nv <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  bfcvt H6, S20  // BFCVT <Hd>, <Sn>  \\ Scalar convert, F32 to BF16  \\ 1 3  3  2.0 V1UnitV02
+  bfcvt Z16.H, P6/M,...
[truncated]
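
The selection logic of the `getForwardingDelayCycles(STI, SCDesc)` overload in the diff above can be modelled outside LLVM for experimentation. The following is a minimal Python sketch of that control flow only. The `WriteLatency` and `ReadAdvance` classes and the function name are hypothetical stand-ins for `MCWriteLatencyEntry` and `MCReadAdvanceEntry`, not LLVM APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteLatency:
    """Hypothetical stand-in for MCWriteLatencyEntry."""
    cycles: int
    write_resource_id: int


@dataclass(frozen=True)
class ReadAdvance:
    """Hypothetical stand-in for MCReadAdvanceEntry."""
    write_resource_id: int
    cycles: int


def forwarding_delay_for_max_latency(write_latencies, read_advances):
    """Mirror the control flow of the new overload shown in the diff."""
    # No ReadAdvance entries: report no bypass.
    if not read_advances:
        return 0

    latency = 0
    write_resource_id = 0
    for entry in write_latencies:
        # An invalid (negative) latency is treated as "no bypass".
        if entry.cycles < 0:
            return 0
        # Remember the write resource of the first strictly larger latency.
        if entry.cycles > latency:
            latency = entry.cycles
            write_resource_id = entry.write_resource_id

    # Return the cycles of the first ReadAdvance entry for that resource.
    for entry in read_advances:
        if entry.write_resource_id == write_resource_id:
            return entry.cycles

    # The C++ code marks this path with llvm_unreachable.
    raise LookupError("write resource not found in ReadAdvance entries")
```

Note that the final lookup assumes a matching ReadAdvance entry always exists, exactly as the C++ version does.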

@llvmbot
Member

llvmbot commented Feb 11, 2025

@llvm/pr-subscribers-mc

Author: Julien Villette (jvillette38)

Changes

This is a new way to update scheduling information in LLVM. I have used it to update the scheduling information for the AArch64 Neoverse V1 microarchitecture (new patches will follow and will depend on this pull request).

This pull request contains 2 commits:
A) llvm-mca -scheduling-info option
B) update_mca_test_checks.py new options: --check-sched-info and --update-sched-info.

A) llvm-mca -scheduling-info disables the default llvm-mca report (InstructionInfoView) and outputs information in the following format:
<uOps> | <Latency> | <Bypass Latency> | <Throughput> | <Resources> | <LLVM Opcode> | <Assembly input: instruction + comment>
Example from the new llvm-mca test AArch64/Neoverse/V1-scheduling-info.s:
Input:
abs v25.2s, v25.2s // ABS <Vd>.<T>, <Vn>.<T> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV
Output:
1 | 2 | 2 | 4.00 | V1UnitSVE01, V1UnitV | ABSv2i32 | abs v25.2s, v25.2s // ABS <Vd>.<T>, <Vn>.<T> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV

So if we can extract the scheduling information for each instruction variant from the microarchitecture documentation, we can write a test in this form and check the llvm-mca -scheduling-info output for differences between the LLVM information and the information in the comments. If there are differences, check the documentation and either update the comment or fix LLVM so that the llvm-mca output changes.
The LLVM opcode is reported to make changes to the target description easier.
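
To make the output format above concrete, here is a minimal Python sketch that splits one reported line into its seven fields. It is an illustration only, not part of the patch: the `SchedInfo` class and `parse_sched_info_line` function are hypothetical names, and the sketch assumes the first six fields never contain a `|` character.

```python
from dataclasses import dataclass


@dataclass
class SchedInfo:
    uops: int
    latency: int
    bypass_latency: int
    throughput: float
    resources: list
    opcode: str
    asm: str


def parse_sched_info_line(line: str) -> SchedInfo:
    """Parse one '<uOps> | ... | <Assembly input>' line."""
    # Split on the first six separators only: the trailing assembly text
    # may itself contain '|' (for example '<Wd|WSP>' in a comment).
    fields = [field.strip() for field in line.split("|", 6)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    uops, latency, bypass, throughput, resources, opcode, asm = fields
    return SchedInfo(
        uops=int(uops),
        latency=int(latency),
        bypass_latency=int(bypass),
        throughput=float(throughput),
        resources=[r.strip() for r in resources.split(",") if r.strip()],
        opcode=opcode,
        asm=asm,
    )
```

Limiting the split to six separators matters because instruction forms such as `<Wd|WSP>` appear in the comments of the test file.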

B) update_mca_test_checks.py --check-sched-info compares the llvm-mca output against the information in the comments. If it finds differences, it reports them and exits with an error code. The developer can then fix the comments or the LLVM target description, or use update_mca_test_checks.py --update-sched-info to update the comments automatically and review the differences with git.

Convention for comments used by the new update_mca_test_checks.py options:

  • C or C++ style comment: '/* */' and '//'
  • Fields:
    <asm instruction> <// or /*> <instruction format> \\ <micro architecture reference> \\ <uOps> <Latency> <Bypass latency> <Throughput> <Resources separated with commas>
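
As an illustration of the comment convention above, the following Python sketch extracts the reference values from one assembly line. It is not the implementation in update_mca_test_checks.py: the function name `parse_reference_comment` and the returned dictionary keys are hypothetical, and lines whose last field is not five scheduling values (such as "No scheduling info") are simply reported as `None`.

```python
import re


def parse_reference_comment(asm_line):
    """Return the reference scheduling info from a comment, or None."""
    # Locate the start of a C++ ('//') or C ('/* */') style comment.
    match = re.search(r"//|/\*", asm_line)
    if match is None:
        return None
    comment = asm_line[match.end():]
    if match.group() == "/*":
        comment = comment.split("*/", 1)[0]

    # The three comment fields are separated by a double backslash.
    parts = [part.strip() for part in comment.split("\\\\")]
    if len(parts) != 3:
        return None
    form, title, values = parts

    # <uOps> <Latency> <Bypass latency> <Throughput> <Resources,...>
    tokens = values.split(None, 4)
    if len(tokens) != 5:
        return None
    try:
        uops, latency, bypass = (int(tok) for tok in tokens[:3])
        throughput = float(tokens[3])
    except ValueError:
        return None

    return {
        "form": form,
        "title": title,
        "uops": uops,
        "latency": latency,
        "bypass_latency": bypass,
        "throughput": throughput,
        "resources": [r.strip() for r in tokens[4].split(",") if r.strip()],
    }
```

The values returned here could then be compared field by field with the ones reported by llvm-mca -scheduling-info for the same instruction.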

@mshockwave and @Rin18 may be interested.


Patch is 1.49 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126703.diff

10 Files Affected:

  • (modified) llvm/docs/CommandGuide/llvm-mca.rst (+14)
  • (modified) llvm/include/llvm/MC/MCSchedule.h (+4)
  • (modified) llvm/lib/MC/MCSchedule.cpp (+37)
  • (added) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s (+7588)
  • (modified) llvm/tools/llvm-mca/CMakeLists.txt (+1)
  • (modified) llvm/tools/llvm-mca/Views/InstructionInfoView.h (+1)
  • (added) llvm/tools/llvm-mca/Views/SchedulingInfoView.cpp (+210)
  • (added) llvm/tools/llvm-mca/Views/SchedulingInfoView.h (+96)
  • (modified) llvm/tools/llvm-mca/llvm-mca.cpp (+31-11)
  • (modified) llvm/utils/update_mca_test_checks.py (+168)
diff --git a/llvm/docs/CommandGuide/llvm-mca.rst b/llvm/docs/CommandGuide/llvm-mca.rst
index f610ea2f2168269..1c5275ce000b111 100644
--- a/llvm/docs/CommandGuide/llvm-mca.rst
+++ b/llvm/docs/CommandGuide/llvm-mca.rst
@@ -170,6 +170,20 @@ option specifies "``-``", then the output will also be sent to standard output.
   Enable extra scheduler statistics. This view collects and analyzes instruction
   issue events. This view is disabled by default.
 
+.. option:: -scheduling-info
+
+  Enable scheduling info view. This view reports scheduling information defined
+  in LLVM target description in the form:
+  uOps | Latency | Bypass Latency | Throughput | LLVM OpcodeName | Resources
+  units | assembly instruction and its comment (// or /* */) if defined.
+  It allows to compare scheduling info with architecture documents and fix them
+  in target description by fixing InstrRW for the reported LLVM opcode.
+  Scheduling information can be defined in the same order in each instruction
+  comments to check easily reported and reference scheduling information.
+  Suggested information in comment:
+  // <architecture instruction form> \\ <scheduling documentation title> \\
+     <uOps>, <Latency>, <Bypass Latency>, <Throughput>, <Resources units>
+
 .. option:: -retire-stats
 
   Enable extra retire control unit statistics. This view is disabled by default.
diff --git a/llvm/include/llvm/MC/MCSchedule.h b/llvm/include/llvm/MC/MCSchedule.h
index fe731d086f70ae3..dcbc5369120a39b 100644
--- a/llvm/include/llvm/MC/MCSchedule.h
+++ b/llvm/include/llvm/MC/MCSchedule.h
@@ -402,6 +402,10 @@ struct MCSchedModel {
   static unsigned getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
                                            unsigned WriteResourceIdx = 0);
 
+  /// Returns the maximum forwarding delay for maximum write latency.
+  static unsigned getForwardingDelayCycles(const MCSubtargetInfo &STI,
+                                       const MCSchedClassDesc &SCDesc);
+
   /// Returns the default initialized model.
   static const MCSchedModel Default;
 };
diff --git a/llvm/lib/MC/MCSchedule.cpp b/llvm/lib/MC/MCSchedule.cpp
index ed243cecabb7638..4ef6acf78714fa7 100644
--- a/llvm/lib/MC/MCSchedule.cpp
+++ b/llvm/lib/MC/MCSchedule.cpp
@@ -174,3 +174,40 @@ MCSchedModel::getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
 
   return std::abs(DelayCycles);
 }
+
+unsigned
+MCSchedModel::getForwardingDelayCycles(const MCSubtargetInfo &STI,
+                                            const MCSchedClassDesc &SCDesc) {
+
+  ArrayRef<MCReadAdvanceEntry> Entries = STI.getReadAdvanceEntries(SCDesc);
+  if (Entries.empty())
+    return 0;
+
+  unsigned Latency = 0;
+  unsigned maxLatency = 0;
+  unsigned WriteResourceID = 0;
+  unsigned DefEnd = SCDesc.NumWriteLatencyEntries;
+
+  for (unsigned DefIdx = 0; DefIdx != DefEnd; ++DefIdx) {
+    // Lookup the definition's write latency in SubtargetInfo.
+    const MCWriteLatencyEntry *WLEntry =
+        STI.getWriteLatencyEntry(&SCDesc, DefIdx);
+    // Early exit if we found an invalid latency.
+    // Consider no bypass
+    if (WLEntry->Cycles < 0)
+      return 0;
+    maxLatency = std::max(Latency, static_cast<unsigned>(WLEntry->Cycles));
+    if (maxLatency > Latency) {
+      WriteResourceID = WLEntry->WriteResourceID;
+    }
+    Latency = maxLatency;
+  }
+
+  for (const MCReadAdvanceEntry &E : Entries) {
+    if (E.WriteResourceID == WriteResourceID) {
+      return E.Cycles;
+    }
+  }
+
+  llvm_unreachable("WriteResourceID not found in MCReadAdvanceEntry entries");
+}
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s
new file mode 100644
index 000000000000000..c421166f22ea45e
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s
@@ -0,0 +1,7588 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=aarch64 -mcpu=neoverse-v1 -scheduling-info < %s | FileCheck %s
+
+  .text
+  .file	        "V1-scheduling-info.s"
+  .globl	test
+  .p2align	4
+  .type	test,@function
+test:
+  .cfi_startproc
+  abs D15, D11  /* ABS <V><d>, <V><n>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV */
+  abs V25.2S, V25.2S  // ABS <Vd>.<T>, <Vn>.<T>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  abs Z26.B, P6/M, Z27.B  // ABS <Zd>.<T>, <Pg>/M, <Zn>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adc W13, W6, W4  // ADC <Wd>, <Wn>, <Wm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  adc X8, X12, X10  // ADC <Xd>, <Xn>, <Xm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  adcs W29, W7, W30  // ADCS <Wd>, <Wn>, <Wm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adcs X11, X3, X5  // ADCS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  add WSP, WSP, W10  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>  \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W2, UXTB   // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, <wextend>   \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W13, UXTH #4  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, <wextend> #<amount>  \\ ALU, basic, unconditional, no flagset  \\ 1 2  2  2.00 V1UnitI
+  add WSP, WSP, W13, LSL #4  // ADD <Wd|WSP>, <Wn|WSP>, <Wm>, LSL #<amount>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 2  2  2.00 V1UnitI
+  add X22, X2, X27  // ADD <Xd|SP>, <Xn|SP>, X<m>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X25, X9, W25, UXTB  // ADD <Xd|SP>, <Xn|SP>, <R><m>, <extend>  \\ ALU, basic  \\ 1 2  2  2.00 V1UnitI
+  add X4, X28, W3, UXTB #3  // ADD <Xd|SP>, <Xn|SP>, <R><m>, <extend> #<amount>  \\ ALU, extend and shift  \\ 1 2  2  2.0 V1UnitM
+  add X0, X28, X26, LSL #3  // ADD <Xd|SP>, <Xn|SP>, X<m>, LSL #<amount>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 1  1  4.0 V1UnitI
+  add WSP, WSP, #3765  // ADD <Wd|WSP>, <Wn|WSP>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add WSP, WSP, #3547, LSL #12  // ADD <Wd|WSP>, <Wn|WSP>, #<imm>, <shift>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X7, X30, #803  // ADD <Xd|SP>, <Xn|SP>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add X7, X2, #319, LSL #12  // ADD <Xd|SP>, <Xn|SP>, #<imm>, <shift>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  add Z13.D, Z13.D, #245  // ADD <Zdn>.<T>, <Zdn>.<T>, #<imm>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add Z16.D, Z16.D, #233, LSL #8  // ADD <Zdn>.<T>, <Zdn>.<T>, #<imm>, <shift>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add W3, W2, W21, LSL #3  // ADD <Wd>, <Wn>, <Wm>, LSL #<wamountl>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, no flagset   \\ 1 1  1  4.0 V1UnitI
+  add W6, W21, W17, LSL #15  // ADD <Wd>, <Wn>, <Wm>, LSL #<wamounth>  \\ Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.0 V1UnitM
+  add W28, W30, W19, ASR #30  // ADD <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.0 V1UnitM
+  add X8, X3, X28, LSL #3  // ADD <Xd>, <Xn>, <Xm>, LSL #<amountl>  \\ Arithmetic, LSL shift, shift <= 4  \\ 1 1  1  4.0 V1UnitI
+  add X12, X13, X0, LSL #44  // ADD <Xd>, <Xn>, <Xm>, LSL #<amounth>  \\ Arithmetic, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.0 V1UnitM
+  add X5, X20, X28, LSR #16  // ADD <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Arithmetic, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.0 V1UnitM
+  add D0, D23, D21  // ADD <V><d>, <V><n>, <V><m>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  add V19.4S, V24.4S, V15.4S  // ADD <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
+  add Z29.D, P5/M, Z29.D, Z29.D  // ADD <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  add Z10.H, Z22.H, Z13.H  // ADD <Zd>.<T>, <Zn>.<T>, <Zm>.<T>  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  addhn V26.4H, V5.4S, V9.4S  // ADDHN <Vd>.<Tb>, <Vn>.<Ta>, <Vm>.<Ta>  \\ ASIMD arith, complex  \\ 1 2  2  4.0 V1UnitV
+  addhn2 V1.16B, V19.8H, V6.8H  // ADDHN2 <Vd>.<Tb>, <Vn>.<Ta>, <Vm>.<Ta>  \\ ASIMD arith, complex  \\ 1 2  2  4.0 V1UnitV
+  addp D1, V14.2D  // ADDP <V><d>, <Vn>.<T>  \\ ASIMD arith, pair-wise  \\ 1 2  2  4.0 V1UnitV
+  addp V7.2S, V1.2S, V2.2S  // ADDP <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD arith, pair-wise  \\ 1 2  2  4.0 V1UnitV
+  addpl X27, X6, #-6  // ADDPL <Xd|SP>, <Xn|SP>, #<imm>  \\ Predicate counting scalar  \\ 1 2  2  1.0 V1UnitM0
+  adds W17, WSP, W25  // ADDS <Wd>, <Wn|WSP>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds W6, WSP, W15, UXTH   // ADDS <Wd>, <Wn|WSP>, <Wm>, <wextend>   \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds W22, WSP, W30, UXTB #2  // ADDS <Wd>, <Wn|WSP>, <Wm>, <wextend> #<amount>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W12, WSP, W29, LSL #4  // ADDS <Wd>, <Wn|WSP>, <Wm>, LSL #<amount>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, flagset   \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  adds X14, X0, X10  // ADDS <Xd>, <Xn|SP>, X<m>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X13, X23, W8, UXTB  // ADDS <Xd>, <Xn|SP>, <R><m>, <extend>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X4, X26, W28, UXTB #1  // ADDS <Xd>, <Xn|SP>, <R><m>, <extend> #<amount>  \\ ALU, flagset, extend and shift  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds X10, X3, X29, LSL #2  // ADDS <Xd>, <Xn|SP>, X<m>, LSL #<amount>  \\ Arithmetic, flagset, LSL shift, shift <= 4  \\ 1 1   1   3.00 V1UnitI,V1UnitFlg
+  adds W23, WSP, #502  // ADDS <Wd>, <Wn|WSP>, #<imm>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W2, WSP, #2980, LSL #12  // ADDS <Wd>, <Wn|WSP>, #<imm>, <shift>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds X12, X4, #1345  // ADDS <Xd>, <Xn|SP>, #<imm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X25, X18, #3037, LSL #12  // ADDS <Xd>, <Xn|SP>, #<imm>, <shift>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 1  1  3.00 V1UnitFlg, V1UnitI
+  adds W12, W13, W26  // ADDS <Wd>, <Wn>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W0, W23, W20, LSL #0  // ADDS <Wd>, <Wn>, <Wm>, LSL #<wamountl>  \\ Arithmetic, LSL shift by immed, shift <= 4, unconditional, flagset   \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds W13, W16, W12, LSL #28  // ADDS <Wd>, <Wn>, <Wm>, LSL #<wamounth>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds W20, W19, W16, ASR #0  // ADDS <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Arithmetic, flagset, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds X23, X12, X4  // ADDS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  adds X0, X13, X4, LSL #2  // ADDS <Xd>, <Xn>, <Xm>, LSL #<amountl>  \\ Arithmetic, flagset, LSL shift, shift <= 4  \\ 1 1   1   3.00 V1UnitI,V1UnitFlg
+  adds X4, X7, X6, LSL #31  // ADDS <Xd>, <Xn>, <Xm>, LSL #<amounth>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  adds X9, X8, X9, ASR #41  // ADDS <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Arithmetic, flagset, LSR/ASR/ROR shift or LSL shift > 4  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  addv B0, V28.8B  // ADDV B<d>, <Vn>.8B  \\ ASIMD arith, reduce, 8B/8H  \\ 2 4  4  2.00 V1UnitV13
+  addv B1, V26.16B  // ADDV B<d>, <Vn>.16B  \\ ASIMD arith, reduce, 16B  \\ 2 4  4  1.00 V1UnitV13[2]
+  addv H18, V13.4H  // ADDV H<d>, <Vn>.4H  \\ ASIMD arith, reduce, 4H/4S  \\ 1 2  2  2.0 V1UnitV13
+  addv H29, V17.8H  // ADDV H<d>, <Vn>.8H  \\ ASIMD arith, reduce, 8B/8H  \\ 2 4  4  2.00 V1UnitV13
+  addv S22, V18.4S  // ADDV S<d>, <Vn>.4S  \\ ASIMD arith, reduce, 4H/4S  \\ 1 2  2  2.0 V1UnitV13
+  addvl X1, X27, #-8  // ADDVL <Xd|SP>, <Xn|SP>, #<imm>  \\ Predicate counting scalar  \\ 1 2  2  1.0 V1UnitM0
+  adr X3, test  // ADR <Xd>, <label>  \\ Address generation  \\ 1 1  1  4.0 V1UnitI
+  adr Z26.D, [Z1.D, Z8.D]  // ADR <Zd>.<T>, [<Zn>.<T>, <Zm>.<T>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z22.S, [Z28.S, Z8.S, LSL #2]  // ADR <Zd>.<T>, [<Zn>.<T>, <Zm>.<T>, <mod> #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z11.D, [Z2.D, Z29.D, SXTW ]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, SXTW ]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z3.D, [Z9.D, Z9.D, SXTW #2]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, SXTW #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z6.D, [Z7.D, Z13.D, UXTW ]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, UXTW ]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adr Z4.D, [Z24.D, Z22.D, UXTW #1]  // ADR <Zd>.D, [<Zn>.D, <Zm>.D, UXTW #<amount>]  \\ Arithmetic, basic  \\ 1 2  2  2.0 V1UnitV01
+  adrp X0, test  // ADRP <Xd>, <label>  \\ Address generation  \\ 1 1  1  4.0 V1UnitI
+  and WSP, W16, #0xe00  // AND <Wd|WSP>, <Wn>, #<imms>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  and X2, X22, #0x1e00  // AND <Xd|SP>, <Xn>, #<imm>  \\ ALU, basic  \\ 1 1  1  4.0 V1UnitI
+  and Z1.B, Z1.B, #0x70  // AND <Zdn>.B, <Zdn>.B, #<constb>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.H, Z7.H, #0x60  // AND <Zdn>.H, <Zdn>.H, #<consth>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.S, Z7.S, #0x2  // AND <Zdn>.S, <Zdn>.S, #<consts>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z7.D, Z7.D, #0x4  // AND <Zdn>.D, <Zdn>.D, #<constd>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and P5.B, P1/Z, P6.B, P4.B  // AND <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B  \\ Predicate logical  \\ 1 1  1  1.0 V1UnitM0
+  and W11, W14, W24  // AND <Wd>, <Wn>, <Wm>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and W2, W21, W22, LSR #25  // AND <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and X1, X20, X29  // AND <Xd>, <Xn>, <Xm>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and X8, X11, X22, ASR #56  // AND <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Logical, shift, no flagset  \\ 1 1  1  4.0 V1UnitI
+  and V29.8B, V26.8B, V26.8B  // AND <Vd>.<T>, <Vn>.<T>, <Vm>.<T>  \\ ASIMD logical  \\ 1 2  2  4.0 V1UnitV
+  and Z17.D, P6/M, Z17.D, Z12.D  // AND <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  and Z9.D, Z5.D, Z17.D  // AND <Zd>.D, <Zn>.D, <Zm>.D  \\ Logical  \\ 1 2  2  2.0 V1UnitV01
+  ands W14, W8, #0x70  // ANDS <Wd>, <Wn>, #<imms>  \\ ALU, basic, unconditional, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  ands X4, X10, #0x60  // ANDS <Xd>, <Xn>, #<immd>  \\ ALU, basic, flagset  \\ 1 1  1  3.00 V1UnitI,V1UnitFlg
+  ands W29, W28, W12  // ANDS <Wd>, <Wn>, <Wm>  \\ ALU, basic, unconditional, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  ands W7, W13, W23, ASR #3  // ANDS <Wd>, <Wn>, <Wm>, <shift> #<wamount>  \\ Logical, shift by immed, flagset, unconditional  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  ands X21, X9, X6  // ANDS <Xd>, <Xn>, <Xm>  \\ ALU, basic, flagset  \\ 1 2  2  2.00 V1UnitI,V1UnitFlg
+  ands X10, X27, X7, ASR #20  // ANDS <Xd>, <Xn>, <Xm>, <shift> #<amount>  \\ Logical, shift, flagset  \\ 1 2  2  2.00 V1UnitM,V1UnitFlg
+  ands P5.B, P1/Z, P2.B, P7.B  // ANDS <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B  \\ Predicate logical, flag setting  \\ 2 2  2  0.50 V1UnitM0[2]
+  andv H7, P6, Z31.H  // ANDV <V><d>, <Pg>, <Zn>.<T>  \\ Reduction, logical   \\ 4 12  12  0.50 V1UnitV01[4]
+  asr W30, W14, #5  // ASR <Wd>, <Wn>, #<shifts>  \\ Move, shift by immed, no flagset  \\ 1 1  1  4.0 V1UnitI
+  asr X12, X21, #28  // ASR <Xd>, <Xn>, #<shiftd>  \\ Move, shift by immed, no flagset  \\ 1 1  1  4.0 V1UnitI
+  asr Z7.B, P5/M, Z7.B, #3  // ASR <Zdn>.B, <Pg>/M, <Zdn>.B, #<constb>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z6.H, P6/M, Z6.H, #5  // ASR <Zdn>.H, <Pg>/M, <Zdn>.H, #<consth>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z28.S, P0/M, Z28.S, #11  // ASR <Zdn>.S, <Pg>/M, <Zdn>.S, #<consts>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z26.D, P5/M, Z26.D, #24  // ASR <Zdn>.D, <Pg>/M, <Zdn>.D, #<constd>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z10.B, Z14.B, #3  // ASR <Zd>.B, <Zn>.B, #<constb>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z23.H, Z18.H, #6  // ASR <Zd>.H, <Zn>.H, #<consth>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z29.S, Z11.S, #6  // ASR <Zd>.S, <Zn>.S, #<consts>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z20.D, Z26.D, #29  // ASR <Zd>.D, <Zn>.D, #<constd>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr W3, W0, W20  // ASR <Wd>, <Wn>, <Wm>  \\ Move, shift by register, no flagset, unconditional  \\ 1 1  1  4.0 V1UnitI
+  asr X7, X5, X21  // ASR <Xd>, <Xn>, <Xm>  \\ Move, shift by register, no flagset, unconditional  \\ 1 1  1  4.0 V1UnitI
+  asr Z3.S, P0/M, Z3.S, Z10.S  // ASR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z9.S, P2/M, Z9.S, Z8.D  // ASR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.D  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asr Z26.S, Z21.S, Z21.D  // ASR <Zd>.<T>, <Zn>.<T>, <Zm>.D  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asrd Z6.B, P4/M, Z6.B, #2  // ASRD <Zdn>.B, <Pg>/M, <Zdn>.B, #<constb>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z19.H, P3/M, Z19.H, #6  // ASRD <Zdn>.H, <Pg>/M, <Zdn>.H, #<consth>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z16.S, P3/M, Z16.S, #2  // ASRD <Zdn>.S, <Pg>/M, <Zdn>.S, #<consts>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrd Z9.D, P6/M, Z9.D, #12  // ASRD <Zdn>.D, <Pg>/M, <Zdn>.D, #<constd>  \\ Arithmetic, shift right for divide  \\ 1 4  4  1.0 V1UnitV1
+  asrr Z0.B, P0/M, Z0.B, Z19.B  // ASRR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>  \\ Arithmetic, shift  \\ 1 2  2  1.0 V1UnitV1
+  asrv W24, W28, W13  // ASRV <Wd>, <Wn>, <Wm>  \\ Variable shift  \\ 1 1  1  4.0 V1UnitI
+  asrv X3, X21, X24  // ASRV <Xd>, <Xn>, <Xm>  \\ Variable shift  \\ 1 1  1  4.0 V1UnitI
+  at s12e1r, X28  // AT <at_op>, <Xt>  \\ No description \\ No scheduling info
+  b test  // B <label>  \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.eq test // B.eq <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.none test // B.none <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ne test // B.ne <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.any test // B.any <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.cs test // B.cs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.hs test // B.hs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nlast test // B.nlast <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.cc test // B.cc <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.lo test // B.lo <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.last test // B.last <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.mi test // B.mi <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.first test // B.first <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.pl test // B.pl <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nfrst test // B.nfrst <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.vs test // B.vs <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.vc test // B.vc <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.hi test // B.hi <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.pmore test // B.pmore <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ls test // B.ls <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.plast test // B.plast <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.ge test // B.ge <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.tcont test // B.tcont <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.lt test // B.lt <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.tstop test // B.tstop <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.gt test // B.gt <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.le test // B.le <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.al test // B.al <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  b.nv test // B.nv <label> \\ Branch, immed  \\ 1 1  1  2.0 V1UnitB
+  bfcvt H6, S20  // BFCVT <Hd>, <Sn>  \\ Scalar convert, F32 to BF16  \\ 1 3  3  2.0 V1UnitV02
+  bfcvt Z16.H, P6/M,...
[truncated]

github-actions bot commented Feb 11, 2025

✅ With the latest revision this PR passed the Python code formatter.

github-actions bot commented Feb 11, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff b7baf2ee8d302aab7cae787645ee7b7ec107e3ee 84e754240c004f01ff3756d07241e2181cda5693 --extensions cpp,h -- llvm/tools/llvm-mca/Views/SchedulingInfoView.cpp llvm/tools/llvm-mca/Views/SchedulingInfoView.h llvm/include/llvm/MC/MCSchedule.h llvm/lib/MC/MCSchedule.cpp llvm/tools/llvm-mca/Views/InstructionInfoView.h llvm/tools/llvm-mca/llvm-mca.cpp
View the diff from clang-format here.
diff --git a/llvm/tools/llvm-mca/Views/SchedulingInfoView.cpp b/llvm/tools/llvm-mca/Views/SchedulingInfoView.cpp
index f05f4e3114..1416c9d7f1 100644
--- a/llvm/tools/llvm-mca/Views/SchedulingInfoView.cpp
+++ b/llvm/tools/llvm-mca/Views/SchedulingInfoView.cpp
@@ -174,10 +174,11 @@ void SchedulingInfoView::collectData(
           SM.getProcResource(Index->ProcResourceIdx);
       if (Index->ReleaseAtCycle > 1) {
         // Output ReleaseAtCycle between [] if not 1 (default)
-	// This is to be able to evaluate throughput.
-	// See getReciprocalThroughput in MCSchedule.cpp
-	// TODO: report AcquireAtCycle to check this scheduling info.
-        TempStream << sep << format("%s[%d]", MCProc->Name, Index->ReleaseAtCycle);
+        // This is to be able to evaluate throughput.
+        // See getReciprocalThroughput in MCSchedule.cpp
+        // TODO: report AcquireAtCycle to check this scheduling info.
+        TempStream << sep
+                   << format("%s[%d]", MCProc->Name, Index->ReleaseAtCycle);
       } else {
         TempStream << sep << format("%s", MCProc->Name);
       }

comments to check easily reported and reference scheduling information.
Suggested information in comment:
// <architecture instruction form> \\ <scheduling documentation title> \\
<uOps>, <Latency>, <Bypass Latency>, <Throughput>, <Resources units>
Collaborator

still failing to build, see llvm/docs/README.txt for instructions on how to build locally. Please could you also run the formatters?

Contributor Author

It should be ok now regarding coding style checking.

Collaborator

@c-rhodes c-rhodes left a comment

thanks for the patch. IIUC, this adds a new -scheduling-info option, which is similar to -instruction-info. So whereas -instruction-info has:

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

this option has:

Scheduling Info:
[1]: #uOps
[2]: Latency
[3]: Bypass Latency
[4]: Throughput
[5]: Resources
[6]: LLVM OpcodeName
[7]: Instruction
[8]: Comment if any

and the purpose of this new option is to make it easier to compare the model against a good reference (and fix if necessary), which is specified via a comment after the assembly instruction.

So for example:

echo "abs D15, D11  /* ABS <V><d>, <V><n>  \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV */" | build/bin/llvm-mca -mtriple=aarch64 -mcpu=neoverse-v1 -scheduling-info
...
Scheduling Info:
[1]: #uOps
[2]: Latency
[3]: Bypass Latency
[4]: Throughput
[5]: Resources
[6]: LLVM OpcodeName
[7]: Instruction
[8]: Comment if any
 [1]    [2]  [3]   [4]      [5]                                                                [6]                [7]                                 [8]
  1    | 2  | 2   | 4.00   | V1UnitV                                                          | ABSv1i64         | abs  d15, d11                       /* ABS <V><d>, <V><n>  \ ASIMD arith, basic  \ 1 2  2  4.0 V1UnitV */

here we can compare 1 | 2 | 2 | 4.00 | V1UnitV against the reference in the comment and see it's ok.
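For scripting against this view, a row can be split into its columns. The following is a minimal sketch, assuming the pipe-separated layout shown in the example above; `parse_row` and its dictionary keys are hypothetical names and not part of the patch. The split is capped at six separators because operand templates in the trailing comment (for example `<Xd|SP>`) also contain `|`.

```python
def parse_row(row):
    """Split one -scheduling-info row into named columns (illustrative only)."""
    # Cap the split at six separators: the last column holds the instruction
    # and its comment, which may itself contain '|' (e.g. '<Xd|SP>').
    parts = [p.strip() for p in row.split("|", 6)]
    if len(parts) != 7:
        raise ValueError(f"expected 7 columns, got {len(parts)}: {row!r}")
    return {
        "uops": int(parts[0]),
        "latency": int(parts[1]),
        "bypass_latency": int(parts[2]),
        "throughput": float(parts[3]),
        "resources": [r.strip() for r in parts[4].split(",") if r.strip()],
        "opcode": parts[5],
        "instruction": parts[6],
    }


# Row taken from the example output above (column padding shortened).
row = (r"  1    | 2  | 2   | 4.00   | V1UnitV   | ABSv1i64   | abs  d15, d11   "
       r"/* ABS <V><d>, <V><n>  \ ASIMD arith, basic  \ 1 2  2  4.0 V1UnitV */")
info = parse_row(row)
print(info["opcode"], info["throughput"], info["resources"])
```

The reference values stay inside the `instruction` field here; comparing them against the model columns is a separate step.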

There's a few things going on here. I can see value in some of the extra info alongside the instruction like bypass latency/resources/LLVM OpcodeName, but this could be added to the existing -instruction-info view if there's agreement this is useful?

For the reference, why not just make the CHECK lines in a test like llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-basic-instructions.s be the reference? It would have to be XFAIL'ed until the model matched the reference, but any difference would be obvious looking at the diff after running the update script.

Collaborator

is everything in this file handwritten in this patch?

Contributor

It's also unclear to me how this test case was constructed. How could we generate this kind of test for other scheduler models?

Contributor

llvm-exegesis can do this with -opcode-index=-1 (assuming exegesis supports generating benchmarks for those opcodes). It wouldn't be difficult to implement something like that in llvm-mca.

std::vector<unsigned> Result;
unsigned NumOpcodes = State.getInstrInfo().getNumOpcodes();
Result.reserve(NumOpcodes);
for (unsigned I = 0, E = NumOpcodes; I < E; ++I) {
  if (!ET.isOpcodeAvailable(I, AvailableFeatures))
    continue;
  Result.push_back(I);
}
return Result;

Contributor Author

Thanks for the review.
I generated this test by parsing the ARM architecture documentation and the Software Optimization Guide.
The next step is to get more precise information with llvm-exegesis.
In all cases, I think it is a good starting point to have the complete list of instructions (from the architecture document) and then get scheduling information from the micro-architecture document or, even better, from llvm-exegesis.

@@ -66,6 +66,7 @@ class InstructionInfoView : public InstructionView {
struct InstructionInfoViewData {
unsigned NumMicroOpcodes = 0;
unsigned Latency = 0;
unsigned Advance = 0; // ReadAvance Bypasses cycles
Collaborator

please improve this comment

///
/// This file implements the instruction scheduling info view.
///
/// The goal fo the instruction scheduling info view is to print the latency,
Collaborator

there's a fair bit of overlap with the InstructionInfoView, did you consider extending that instead?

Contributor

@boomanaiden154 boomanaiden154 left a comment

Haven't looked in detail at the code yet, but I'm not necessarily sure llvm-mca is the best place for this. For validating/updating scheduling models, we already have llvm-exegesis, where this would be a more natural fit.

As a side note, I don't often have much confidence in manufacturer-provided scheduling information. It could be that ARM's information is much better than what we get from AMD/Intel, but on X86/RISCV so far, using llvm-exegesis with benchmarks to validate scheduling information has been more effective than trying to construct scheduling models from manufacturer reference documentation.

@jvillette38
Contributor Author

> There's a few things going on here. I can see value in some of the extra info alongside the instruction like bypass latency/resources/LLVM OpcodeName, but this could be added to the existing -instruction-info view if there's agreement this is useful?
>
> For the reference, why not just make the CHECK lines in a test like llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-basic-instructions.s be the reference? It would have to be XFAIL'ed until the model matched the reference, but any difference would be obvious looking at the diff after running the update script.

Thanks for the review.
I can add this information in InstructionInfoView:

[3]: Bypass Latency
[4]: Throughput
[5]: Resources
[6]: LLVM OpcodeName
[7]: Instruction
[8]: Comment if any

I can keep reciprocal throughput instead of throughput (throughput was easier to compare with the documentation). The goal was to have an easy way to compare LLVM scheduling information with documentation/llvm-exegesis and to be able to fix the LLVM target description using the LLVM opcode name.

Note: If I make these changes in InstructionInfoView, they will modify all tests that use this view.
Do you agree with that?

My point was to extract scheduling information from documentation or llvm-exegesis to get a reference, and then easily fix the LLVM scheduling information in the target description (I have done that using the Software Optimization Guide). I would use llvm-exegesis as a second step.
So it is possible to use comments with scheduling information from documentation/llvm-exegesis while updating the target description (the scripts help to check differences), and then remove them when pushing the updates.

@jvillette38
Contributor Author

> Haven't looked in detail at the code yet, but I'm not necessarily sure llvm-mca is the best place for this. For validating/updating scheduling models, we already have llvm-exegesis, where this would be a more natural fit.
>
> As a side note, I don't often have much confidence in manufacturer-provided scheduling information. It could be that ARM's information is much better than what we get from AMD/Intel, but on X86/RISCV so far, using llvm-exegesis with benchmarks to validate scheduling information has been more effective than trying to construct scheduling models from manufacturer reference documentation.

I agree with you, but this is a first step, and llvm-exegesis is there to refine this information. This patch helps to quickly update the LLVM target description from a scheduling information reference (documentation and/or llvm-exegesis).

const MCWriteProcResEntry *Last = STI.getWriteProcResEnd(&SCDesc);
auto sep = "";
for (; Index != Last; ++Index) {
if (!Index->ReleaseAtCycle)
Contributor

Does AcquireAtCycles need to be handled?

Contributor Author

No, because the goal is to get the reservation cycles to compute the throughput.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/MC/MCSchedule.cpp#L97
But for now, AcquireAtCycle is not reported and cannot be checked against the scheduling model.
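To illustrate why the reservation cycles are enough to evaluate throughput, here is a rough Python sketch of the idea: each resource contributes its number of units divided by the cycles it stays reserved, and the most constrained resource gives the throughput. This is a paraphrase under assumptions, not the code in MCSchedule.cpp. The `UNIT_COUNTS` table is hypothetical and was chosen only so that the results agree with the rows listed in V1-scheduling-info.s; it is not read from the scheduling model.

```python
import re

# Assumed unit counts per resource group, for illustration only.
UNIT_COUNTS = {
    "V1UnitV": 4, "V1UnitV13": 2, "V1UnitV01": 2,
    "V1UnitM0": 1, "V1UnitM": 2, "V1UnitI": 4, "V1UnitFlg": 3,
}


def throughput(resources):
    """Estimate throughput from a resources column such as 'V1UnitV13[2]'."""
    best = None
    for item in resources.split(","):
        match = re.fullmatch(r"\s*(\w+)(?:\[(\d+)\])?\s*", item)
        if match is None:
            raise ValueError(f"unrecognised resource entry: {item!r}")
        name = match.group(1)
        # A missing [N] suffix means the resource is released after 1 cycle.
        cycles = int(match.group(2) or 1)
        value = UNIT_COUNTS[name] / cycles
        best = value if best is None else min(best, value)
    return best


print(throughput("V1UnitV13[2]"))       # reduce, 16B row
print(throughput("V1UnitI,V1UnitFlg"))  # flag-setting ALU row
```

An entry with a non-zero AcquireAtCycle would need the reserved span to be shortened accordingly, which this sketch does not model.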

@boomanaiden154
Contributor

> I agree with you, but this is a first step, and llvm-exegesis is there to refine this information. This patch helps to quickly update the LLVM target description from a scheduling information reference (documentation and/or llvm-exegesis).

Originally my point was that this functionality should be part of exegesis rather than being in MCA, but I can see how this information would also potentially be useful to end users, so I guess I'm fine with it either way.

Outputs micro ops, latency, bypass latency, throughput, LLVM opcode
name, used resources and the parsed assembly instruction with comments.

This option is used to compare scheduling info against micro architecture documents.
Reference scheduling information (from architecture and micro
architecture documents) is given in the comment after each instruction
(// or /* */).

This information may be generated from an Architecture Description Language (ADL).
This way, it is easy to compare information from LLVM with information
from documentation/ADL.

The LLVM opcode name helps to find the right instruction regexp to fix in
the Target Scheduling Info specification.

Example:
  Input:      abs D20, D11  // ABS <V><d>, <V><n>
      \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
  Output:  1 | 2 | 2 | 4.00 | V1UnitV | ABSv1i64 |
           abs     d20, d11              // ABS <V><d>, <V><n>
      \\ ASIMD arith, basic  \\ 1 2  2  4.0 V1UnitV
…g-info new option

When using llvm-mca -scheduling-info, the assembly test may contain
comments with reference values of scheduling info:
<MicroOps> <Latency> <Forward Latency> <Throughput> <Units>

To check coherency between llvm-mca -scheduling-info output
and scheduling references in comments, use --check-sched-info.
It exits with an error if it finds differences and reports them.

This is useful to check new scheduling info patches, as
we can specify source documentation references for each
instruction and so understand differences more easily.

Example of comment in AArch64/Neoverse/V1-scheduling-info.s:
  abs D15, D23  // ABS <V><d>, <V><n> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV

llvm-mca -scheduling-info output:
 1 2 2 4.00 - ABSv1i64 V1UnitSVE01, V1UnitV,
  abs d15, d23  // ABS <V><d>, <V><n> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV

update_mca_test_checks.py searches for 4 values at the beginning of the
line and after the comment marker //, and compares these values. The order
of the values must be the same as in the llvm-mca output. It also checks
that all resources in the comment (reference) are included in the llvm-mca output.
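The rule described above can be sketched as follows. This is an illustration under assumptions, not the code of update_mca_test_checks.py: `check_sched_info` is a hypothetical name, the reference is assumed to be the text after the last `\\` separator in the comment, and lines marked "No scheduling info" are not handled.

```python
def check_sched_info(model_values, model_resources, comment):
    """Return a list of mismatch descriptions; empty when model and reference agree."""
    # The reference values follow the last '\\' separator in the comment.
    reference = comment.rsplit("\\\\", 1)[-1].replace(",", " ").split()
    names = ["uOps", "Latency", "Bypass Latency", "Throughput"]
    errors = []
    # The four numeric values must match, in the same order as the output.
    for name, got, want in zip(names, model_values, reference[:4]):
        if float(got) != float(want):
            errors.append(f"{name}: model {got}, reference {want}")
    # Every reference resource must appear in the model output; the model
    # may list additional resources (e.g. V1UnitSVE01).
    missing = set(reference[4:]) - set(model_resources)
    if missing:
        errors.append("resources missing from model: " + ", ".join(sorted(missing)))
    return errors


comment = r"// ABS <V><d>, <V><n> \\ ASIMD arith, basic \\ 1 2 2 4.0 V1UnitV"
print(check_sched_info(["1", "2", "2", "4.00"], ["V1UnitSVE01", "V1UnitV"], comment))
```

With the values from the example above, the model lists an extra resource but contains every reference resource, so no difference is reported.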

It is possible to update the scheduling information references in the test
source using the --update-sched-info option. If you want to update both the
test source references and the llvm-mca output references, you have to run
update_mca_test_checks.py --update-sched-info twice: the first run updates
the scheduling information references and the second run updates the new
llvm-mca output reference.
jvillette38 pushed a commit to SiPearl/llvm-project that referenced this pull request Feb 13, 2025
…ireAtCycle

MR llvm#126703:
 - ReadAdvance cycles can be negative, so add ForwardingDelayCycles to Latency (computeInstrLatency).
 - Resource reservation cycles given with ReleaseAtCycle - AcquireAtCycle.
…ireAtCycle

MR llvm#126703:
 - ReadAdvance cycles can be negative, so add ForwardingDelayCycles to Latency (computeInstrLatency).
@RKSimon
Collaborator

RKSimon commented Feb 13, 2025

I must agree with @boomanaiden154 on this - I'd expect a better approach to be to generate llvm-exegesis yaml for each instruction and use the analysis tools to compare against the scheduler models.

I do something similar in my "uops_to_exegesis" script here: https://github.com/RKSimon/llvm-scripts

Member

@mshockwave mshockwave left a comment

> I must agree with @boomanaiden154 on this - I'd expect a better approach to be to generate llvm-exegesis yaml for each instruction and use the analysis tools to compare against the scheduler models.

The majority of the RISCV MCA tests are actually there to make sure our TableGen changes to scheduling models are reflected correctly, especially for vector instructions, which use lots of non-trivial TableGen machinery.

So while I agree that an llvm-exegesis-generated YAML should be a better golden reference for scheduling models, I think the existing MCA tests still have their value.

In general, I think printing just <uOps>, <Latency>, <Bypass Latency>, <Throughput>, <Resources units> for scheduling model tests makes sense to me, as the MCA view we use in scheduling model tests right now is a little too verbose.
But I would expect update_mca_tests_checks.py to generate something like

// CHECK: <uOps>, <Latency>, <Bypass Latency>, <Throughput>, <Resources units>

Which effectively does the same things as the current scheduling model tests, that is, doing the validation as part of the test, rather than doing the validation using the script via --check-sched-info as you propose.

@jvillette38
Contributor Author

> I must agree with @boomanaiden154 on this - I'd expect a better approach to be to generate llvm-exegesis yaml for each instruction and use the analysis tools to compare against the scheduler models.
>
> The majority of the RISCV MCA tests are actually there to make sure our TableGen changes to scheduling models are reflected correctly, especially for vector instructions, which use lots of non-trivial TableGen machinery.
>
> So while I agree that an llvm-exegesis-generated YAML should be a better golden reference for scheduling models, I think the existing MCA tests still have their value.
>
> In general, I think printing just <uOps>, <Latency>, <Bypass Latency>, <Throughput>, <Resources units> for scheduling model tests makes sense to me, as the MCA view we use in scheduling model tests right now is a little too verbose. But I would expect update_mca_tests_checks.py to generate something like
>
> // CHECK: <uOps>, <Latency>, <Bypass Latency>, <Throughput>, <Resources units>
>
> Which effectively does the same things as the current scheduling model tests, that is, doing the validation as part of the test, rather than doing the validation using the script via --check-sched-info as you propose.

Thanks for the review!
I would propose adding the LLVM OpcodeName to help fix TableGen in case of diffs:

// CHECK: <uOps>, <Latency>, <Bypass Latency>, <Throughput>, <Resources units>, <OpcodeName>

I am going to remove changes to update_mca_tests_checks.py.

@jvillette38
Contributor Author

New MR #128892 integrates the instructions from the V1-scheduling-info.s test into the existing AArch64 Neoverse V1 tests. It adds just one test, V1-misc-instructions.s, for instructions that are not documented in the ARM Software Optimization Guide.
If MR #128892 is accepted, I will create a new MR for the llvm-mca -scheduling-info option implemented in InstructionInfoView.
Then I will close the current MR (#126703).

davemgreen pushed a commit that referenced this pull request Mar 10, 2025
Added missing instructions for LLVM Opcodes coverage. It will help to
maintain TableGen scheduling information of AArch64 Neoverse V1.

Follow-up of MR #126703.
This distributes the new instructions from the big test
V1-scheduling-info.s across the existing tests.
I have created a new test for special instructions without scheduling
info in the Software Optimization Guide: V1-misc-instructions.s

No more asm instruction comments to maintain.
mshockwave pushed a commit that referenced this pull request Mar 25, 2025
Option becomes: -instruction-tables=`<level>`

The choice of `<level>` controls how much information is printed.
`<level>` may be `none` (default), `normal`, or `full`.
Note: If the option is used without `<level>`, the default is `normal`
(legacy).

When `<level>` is `full`, the additional information is:
- `<Bypass Latency>`: Latency when a bypass is implemented between
  operands in pipelines (see SchedReadAdvance).
- `<LLVM Opcode Name>`: mnemonic plus operands identifier.
- `<Resources units>`: Used resources associated with LLVM Opcode.
- `<instruction comment>`: reports comment if any from source assembly.

Level `full` can be used to better check scheduling info when TableGen
is modified.
The LLVM opcode name helps to find the right instruction regexp to fix in
the TableGen scheduling info.

-instruction-tables=full option is validated on
AArch64/Neoverse/V1-sve-instructions.s

Follow up of MR #126703

---------

Co-authored-by: Julien Villette <julien.villette@sipearl.com>
@jvillette38
Contributor Author

Integrated with MRs #128892 #130574 #132972.

8 participants