[AArch64] Neoverse V1 scheduling info #126707
This PR fixes the scheduling model for the Neoverse V1. All information is taken from the Neoverse V1 Software Optimisation Guide: https://developer.arm.com/documentation/pjdoc466751330-9685/6-0

Changes:
- micro-operations are reduced to a maximum of 3
- use ReleaseAtCycles to specify throughput (see the sketch below)
- fix bypass latencies
- fix some latencies/throughputs
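As a sketch of how ReleaseAtCycles is used here (one definition taken from the patch): the listed unit stays reserved until the given cycle, so it models inverse throughput independently of latency.

// Latency = 4: dependent instructions see the result after 4 cycles.
// ReleaseAtCycles = [6]: V1UnitM stays reserved for 6 cycles, so the
// modeled throughput of this write is one op every 6 cycles on that unit.
def V1Write_4c6_1M : SchedWriteRes<[V1UnitM]> { let Latency = 4;
                                                let ReleaseAtCycles = [6]; }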
It also considers conflicts between SVE and ASIMD instructions. From the Software Optimisation Guide:

Maximum issue bandwidth is sustained using one of the following combinations:
• 2 SVE Uops.
• 4 ASIMD Uops.
• 1 SVE Uop on V0 and 2 ASIMD Uops on VX13.
• 1 SVE Uop on V1 and 2 ASIMD Uops on V02.
@llvm/pr-subscribers-backend-aarch64

Author: Julien Villette (jvillette38)

This merge request depends on #126703 due to the new test llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s, which reports all scheduling information changes from this patch when compared with the version in #126703. @Rin18 may be interested.

Patch is 2.95 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/126707.diff

11 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
index 368665467859f5f..99ca28bc4151dad 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
@@ -66,6 +66,11 @@ def V1UnitV : ProcResGroup<[V1UnitV0, V1UnitV1,
def V1UnitV01 : ProcResGroup<[V1UnitV0, V1UnitV1]>; // FP/ASIMD 0/1 units
def V1UnitV02 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0/2 units
def V1UnitV13 : ProcResGroup<[V1UnitV1, V1UnitV3]>; // FP/ASIMD 1/3 units
+// Select V0 + V2 or V1 + V3 by issuing 2 micro operations
+def V1UnitSVE01 : ProcResGroup<[V1UnitV0, V1UnitV1, // FP/ASIMD 0,2/1,3 units
+ V1UnitV2, V1UnitV3]>;
+def V1UnitSVE0 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0,2 units
+def V1UnitSVE1 : ProcResGroup<[V1UnitV1, V1UnitV3]>; // FP/ASIMD 1,3 units
// Define commonly used read types.
@@ -98,377 +103,487 @@ def V1Write_0c_0Z : SchedWriteRes<[]>;
def V1Write_1c_1B : SchedWriteRes<[V1UnitB]> { let Latency = 1; }
def V1Write_1c_1I : SchedWriteRes<[V1UnitI]> { let Latency = 1; }
-def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1; }
+def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1;
+ let NumMicroOps = 2; }
def V1Write_4c_1L : SchedWriteRes<[V1UnitL]> { let Latency = 4; }
+def V1Write_4c3_1L : SchedWriteRes<[V1UnitL]> { let Latency = 4;
+ let ReleaseAtCycles = [3]; }
+def V1Write_5c3_1L : SchedWriteRes<[V1UnitL]> { let Latency = 5;
+ let ReleaseAtCycles = [3]; }
+
def V1Write_6c_1L : SchedWriteRes<[V1UnitL]> { let Latency = 6; }
+def V1Write_6c2_1L : SchedWriteRes<[V1UnitL]> { let Latency = 6;
+ let ReleaseAtCycles = [2]; }
+def V1Write_6c3_1L : SchedWriteRes<[V1UnitL]> { let Latency = 6;
+ let ReleaseAtCycles = [3]; }
+def V1Write_7c4_1L : SchedWriteRes<[V1UnitL]> { let Latency = 7;
+ let ReleaseAtCycles = [4]; }
def V1Write_1c_1L01 : SchedWriteRes<[V1UnitL01]> { let Latency = 1; }
def V1Write_4c_1L01 : SchedWriteRes<[V1UnitL01]> { let Latency = 4; }
def V1Write_6c_1L01 : SchedWriteRes<[V1UnitL01]> { let Latency = 6; }
def V1Write_2c_1M : SchedWriteRes<[V1UnitM]> { let Latency = 2; }
-def V1Write_2c_1M_1Flg : SchedWriteRes<[V1UnitM, V1UnitFlg]> { let Latency = 2; }
+def V1Write_2c_1M_1Flg : SchedWriteRes<[V1UnitM, V1UnitFlg]> { let Latency = 2;
+ let NumMicroOps = 2; }
def V1Write_3c_1M : SchedWriteRes<[V1UnitM]> { let Latency = 3; }
-def V1Write_4c_1M : SchedWriteRes<[V1UnitM]> { let Latency = 4; }
+def V1Write_4c6_1M : SchedWriteRes<[V1UnitM]> { let Latency = 4;
+ let ReleaseAtCycles = [6]; }
def V1Write_1c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 1; }
def V1Write_2c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 2; }
+def V1Write_2c2_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 2;
+ let ReleaseAtCycles = [2]; }
+def V1Write_3c2_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 3;
+ let ReleaseAtCycles = [2]; }
def V1Write_3c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 3; }
def V1Write_5c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 5; }
-def V1Write_12c5_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 12;
- let ReleaseAtCycles = [5]; }
-def V1Write_20c5_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 20;
- let ReleaseAtCycles = [5]; }
+def V1Write_12c12_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 12;
+ let ReleaseAtCycles = [12]; }
+def V1Write_20c20_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 20;
+ let ReleaseAtCycles = [20]; }
def V1Write_2c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 2; }
+def V1Write_2c4_1V : SchedWriteRes<[V1UnitV]> { let Latency = 2;
+ let ReleaseAtCycles = [4]; }
def V1Write_3c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 3; }
def V1Write_4c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 4; }
+def V1Write_4c2_1V : SchedWriteRes<[V1UnitV]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
def V1Write_5c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 5; }
+def V1Write_6c3_1V : SchedWriteRes<[V1UnitV]> { let Latency = 6;
+ let ReleaseAtCycles = [3]; }
+def V1Write_12c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 12;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
+def V1Write_14c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 14;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
+
def V1Write_2c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 2; }
+def V1Write_2c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 2;
+ let NumMicroOps = 2; }
def V1Write_3c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 3; }
+def V1Write_3c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 3;
+ let NumMicroOps = 2; }
def V1Write_4c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 4; }
-def V1Write_6c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 6; }
-def V1Write_10c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 10;
- let ReleaseAtCycles = [7]; }
-def V1Write_12c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 12;
- let ReleaseAtCycles = [7]; }
-def V1Write_13c10_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 13;
- let ReleaseAtCycles = [10]; }
-def V1Write_15c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 15;
- let ReleaseAtCycles = [7]; }
-def V1Write_16c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 16;
- let ReleaseAtCycles = [7]; }
-def V1Write_20c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 20;
- let ReleaseAtCycles = [7]; }
+def V1Write_4c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 4;
+ let NumMicroOps = 2; }
+def V1Write_5c4_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 5;
+ let ReleaseAtCycles = [4];
+ let NumMicroOps = 2; }
+def V1Write_6c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 6;
+ let NumMicroOps = 2; }
+def V1Write_6c4_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 6;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4,4]; }
+def V1Write_10c18_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 10;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [18]; }
+def V1Write_11c20_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 11;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [20]; }
+def V1Write_12c22_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 12;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [22]; }
+def V1Write_13c24_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 13;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [24]; }
+def V1Write_15c28_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 15;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [28]; }
+def V1Write_16c28_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 16;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [28]; }
+def V1Write_19c36_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 19;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [36]; }
+def V1Write_20c40_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 20;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [40]; }
+
def V1Write_2c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 2; }
+def V1Write_2c_1SVE01 : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 2;
+ let NumMicroOps = 2; }
def V1Write_3c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 3; }
-def V1Write_4c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 4; }
-def V1Write_5c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 5; }
+def V1Write_3c_1SVE01 : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 3;
+ let NumMicroOps = 2; }
+def V1Write_4c_1SVE01 : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 4;
+ let NumMicroOps = 2; }
+def V1Write_4c2_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
+def V1Write_4c3_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 4;
+ let ReleaseAtCycles = [3]; }
+def V1Write_6c3_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 6;
+ let ReleaseAtCycles = [3]; }
+def V1Write_6c5_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 6;
+ let ReleaseAtCycles = [5]; }
+def V1Write_8c6_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 8;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [6]; }
+def V1Write_9c8_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 9;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [8]; }
+def V1Write_12c8_1SVE01: SchedWriteRes<[V1UnitSVE01]> { let Latency = 12;
+ let ReleaseAtCycles = [8];
+ let NumMicroOps = 2; }
+def V1Write_13c6_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 13;
+ let ReleaseAtCycles = [12];
+ let NumMicroOps = 2; }
+def V1Write_11c10_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 11;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [10]; }
def V1Write_3c_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 3; }
def V1Write_4c_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 4; }
+def V1Write_4c2_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
+def V1Write_6c4_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 6;
+ let ReleaseAtCycles = [4]; }
+def V1Write_7c2_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 7;
+ let ReleaseAtCycles = [2]; }
def V1Write_7c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 7;
let ReleaseAtCycles = [7]; }
-def V1Write_10c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
- let ReleaseAtCycles = [7]; }
-def V1Write_13c5_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 13;
+def V1Write_9c3_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 9;
+ let ReleaseAtCycles = [2]; }
+def V1Write_10c3_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
+ let ReleaseAtCycles = [3]; }
+def V1Write_10c5_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
let ReleaseAtCycles = [5]; }
-def V1Write_13c11_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 13;
- let ReleaseAtCycles = [11]; }
+def V1Write_10c9_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
+ let ReleaseAtCycles = [9]; }
+def V1Write_13c13_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 13;
+ let ReleaseAtCycles = [13]; }
def V1Write_15c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 15;
let ReleaseAtCycles = [7]; }
-def V1Write_16c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
- let ReleaseAtCycles = [7]; }
+def V1Write_15c14_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 15;
+ let ReleaseAtCycles = [14]; }
+def V1Write_16c8_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
+ let ReleaseAtCycles = [8]; }
+def V1Write_16c15_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
+ let ReleaseAtCycles = [15]; }
def V1Write_2c_1V1 : SchedWriteRes<[V1UnitV1]> { let Latency = 2; }
-def V1Write_3c_1V1 : SchedWriteRes<[V1UnitV1]> { let Latency = 3; }
-def V1Write_4c_1V1 : SchedWriteRes<[V1UnitV1]> { let Latency = 4; }
+def V1Write_2c_1SVE1 : SchedWriteRes<[V1UnitSVE1,V1UnitSVE1]> { let Latency = 2;
+ let NumMicroOps = 2; }
+def V1Write_3c_1SVE1 : SchedWriteRes<[V1UnitSVE1,V1UnitSVE1]> { let Latency = 3;
+ let NumMicroOps = 2; }
+def V1Write_4c_1SVE1 : SchedWriteRes<[V1UnitSVE1,V1UnitSVE1]> { let Latency = 4;
+ let NumMicroOps = 2; }
+def V1Write_8c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 8;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
+def V1Write_10c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 10;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
def V1Write_2c_1V13 : SchedWriteRes<[V1UnitV13]> { let Latency = 2; }
def V1Write_4c_1V13 : SchedWriteRes<[V1UnitV13]> { let Latency = 4; }
+def V1Write_4c2_1V13 : SchedWriteRes<[V1UnitV13]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
+
//===----------------------------------------------------------------------===//
// Define generic 2 micro-op types
-let Latency = 1, NumMicroOps = 2 in
-def V1Write_1c_1B_1S : SchedWriteRes<[V1UnitB, V1UnitS]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1B_1M0 : SchedWriteRes<[V1UnitB, V1UnitM0]>;
-let Latency = 3, NumMicroOps = 2 in
-def V1Write_3c_1I_1M : SchedWriteRes<[V1UnitI, V1UnitM]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_1I_1L : SchedWriteRes<[V1UnitI, V1UnitL]>;
-let Latency = 7, NumMicroOps = 2 in
-def V1Write_7c_1I_1L : SchedWriteRes<[V1UnitI, V1UnitL]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_2L : SchedWriteRes<[V1UnitL, V1UnitL]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1L_1M : SchedWriteRes<[V1UnitL, V1UnitM]>;
-let Latency = 8, NumMicroOps = 2 in
-def V1Write_8c_1L_1V : SchedWriteRes<[V1UnitL, V1UnitV]>;
-let Latency = 9, NumMicroOps = 2 in
-def V1Write_9c_1L_1V : SchedWriteRes<[V1UnitL, V1UnitV]>;
-let Latency = 11, NumMicroOps = 2 in
-def V1Write_11c_1L_1V : SchedWriteRes<[V1UnitL, V1UnitV]>;
-let Latency = 1, NumMicroOps = 2 in
-def V1Write_1c_1L01_1D : SchedWriteRes<[V1UnitL01, V1UnitD]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1L01_1S : SchedWriteRes<[V1UnitL01, V1UnitS]>;
-let Latency = 7, NumMicroOps = 2 in
-def V1Write_7c_1L01_1S : SchedWriteRes<[V1UnitL01, V1UnitS]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_1L01_1V : SchedWriteRes<[V1UnitL01, V1UnitV]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1L01_1V : SchedWriteRes<[V1UnitL01, V1UnitV]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1L01_1V : SchedWriteRes<[V1UnitL01, V1UnitV]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_1L01_1V01 : SchedWriteRes<[V1UnitL01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1L01_1V01 : SchedWriteRes<[V1UnitL01, V1UnitV01]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_2M0 : SchedWriteRes<[V1UnitM0, V1UnitM0]>;
-let Latency = 3, NumMicroOps = 2 in
-def V1Write_3c_2M0 : SchedWriteRes<[V1UnitM0, V1UnitM0]>;
-let Latency = 9, NumMicroOps = 2 in
-def V1Write_9c_1M0_1L : SchedWriteRes<[V1UnitM0, V1UnitL]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_1M0_1V : SchedWriteRes<[V1UnitM0, V1UnitV]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1M0_1V0 : SchedWriteRes<[V1UnitM0, V1UnitV0]>;
-let Latency = 7, NumMicroOps = 2 in
-def V1Write_7c_1M0_1V0 : SchedWriteRes<[V1UnitM0, V1UnitV1]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_1M0_1V01 : SchedWriteRes<[V1UnitM0, V1UnitV01]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1M0_1V1 : SchedWriteRes<[V1UnitM0, V1UnitV1]>;
-let Latency = 9, NumMicroOps = 2 in
-def V1Write_9c_1M0_1V1 : SchedWriteRes<[V1UnitM0, V1UnitV1]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V : SchedWriteRes<[V1UnitV, V1UnitV]>;
-let Latency = 8, NumMicroOps = 2 in
-def V1Write_8c_1V_1V01 : SchedWriteRes<[V1UnitV, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V0 : SchedWriteRes<[V1UnitV0, V1UnitV0]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_2V0 : SchedWriteRes<[V1UnitV0, V1UnitV0]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_2V01 : SchedWriteRes<[V1UnitV01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V01 : SchedWriteRes<[V1UnitV01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V02 : SchedWriteRes<[V1UnitV02, V1UnitV02]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_2V02 : SchedWriteRes<[V1UnitV02, V1UnitV02]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1V13_1V : SchedWriteRes<[V1UnitV13, V1UnitV]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V13 : SchedWriteRes<[V1UnitV13, V1UnitV13]>;
+def V1Write_1c_1B_1S : SchedWriteRes<[V1UnitB, V1UnitS]> {
+ let Latency = 1;
+ let NumMicroOps = 2;
+}
-//===----------------------------------------------------------------------===//
-// Define generic 3 micro-op types
+def V1Write_6c_1B_1M0 : SchedWriteRes<[V1UnitB, V1UnitM0]> {
+ let Latency = 6;
+ let NumMicroOps = 2;
+}
-let Latency = 2, NumMicroOps = 3 in
-def V1Write_2c_1I_1L01_1V01 : SchedWriteRes<[V1UnitI, V1UnitL01, V1UnitV01]>;
-let Latency = 7, NumMicroOps = 3 in
-def V1Write_7c_2M0_1V01 : SchedWriteRes<[V1UnitM0, V1UnitM0, V1UnitV01]>;
-let Latency = 8, NumMicroOps = 3 in
-def V1Write_8c_1L_2V : SchedWriteRes<[V1UnitL, V1UnitV, V1UnitV]>;
-let Latency = 6, NumMicroOps = 3 in
-def V1Write_6c_3L : SchedWriteRes<[V1UnitL, V1UnitL, V1UnitL]>;
-let Latency = 2, NumMicroOps = 3 in
-def V1Write_2c_1L01_1S_1V : SchedWriteRes<[V1UnitL01, V1UnitS, V1UnitV]>;
-let Latency = 4, NumMicroOps = 3 in
-def V1Write_4c_1L01_1S_1V : SchedWriteRes<[V1UnitL01, V1UnitS, V1UnitV]>;
-let Latency = 2, NumMicroOps = 3 in
-def V1Write_2c_2L01_1V01 : SchedWriteRes<[V1UnitL01, V1UnitL01, V1UnitV01]>;
-let Latency = 6, NumMicroOps = 3 in
-def V1Write_6c_3V : SchedWriteRes<[V1UnitV, V1UnitV, V1UnitV]>;
-let Latency = 4, NumMicroOps = 3 in
-def V1Write_4c_3V01 : SchedWriteRes<[V1UnitV01, V1UnitV01, V1UnitV01]>;
-let Latency = 6, NumMicroOps = 3 in
-def V1Write_6c_3V01 : SchedWriteRes<[V1UnitV01, V1UnitV01, V1UnitV01]>;
-let Latency = 8, NumMicroOps = 3 in
-def V1Write_8c_3V01 : SchedWriteRes<[V1UnitV01, V1UnitV01, V1UnitV01]>;
+def V1Write_5c_1I_1L : SchedWriteRes<[V1UnitI, V1UnitL]> {
+ let Latency = 5;
+ let NumMicroOps = 2;
+}
-//===----------------------------------------------------------------------===//
-// Define generic 4 micro-op types
-
-let Latency = 8, NumMicroOps = 4 in
-def V1Write_8c_2M0_2V0 : SchedWriteRes<[V1UnitM0, V1UnitM0,
- V1UnitV0, V1UnitV0]>;
-let Latency = 7, NumMicroOps = 4 in
-def V1Write_7c_4L : SchedWriteRes<[V1UnitL, V1UnitL, V1UnitL, V1UnitL]>;
-let Latency = 8, NumMicroOps = 4 in
-def V1Write_8c_2L_2V : SchedWriteRes<[V1UnitL, V1UnitL,
- V1UnitV, V1UnitV]>;
-let Latency = 9, NumMicroOps = 4 in
-def V1Write_9c_2L_2V : SchedWriteRes<[V1UnitL, V1UnitL,
- V1UnitV, V1UnitV]>;
-let Latency = 11, NumMicroOps = 4 in
-def V1Write_11c_2L_2V : SchedWriteRes<[V1UnitL, V1UnitL,
- V1UnitV, V1UnitV]>;
-let Latency = 10, NumMicroOps = 4 in
-def V1Write_10c_2L01_2V : SchedWriteRes<[V1UnitL01, V1UnitL01,
- V1UnitV, V1UnitV]>;
-let Latency = 2, NumMicroOps = 4 in
-def V1Write_2c_2L01_2V01 : SchedWriteRes<[V1UnitL01, V1UnitL01,
- V1UnitV01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 4 in
-def V1Write_4c_2L01_2V01 : SchedWriteRes<[V1UnitL01, V1UnitL01,
- V1UnitV01, V1UnitV01]>;
-let Latency = 8, NumMicroOps = 4 in
-def V1Write_8c_2L01_2V01 : SchedWriteRes<[V1UnitL01, V1UnitL01,
- ...
[truncated]
Hi - quite a lot is changing here and I'm not sure about some of the details. It is probably worth splitting it up into smaller patches or pulling some of the more obvious parts out so they can be reviewed separately.
As quite a lot is changing, do you have performance results?
@@ -66,6 +66,11 @@ def V1UnitV : ProcResGroup<[V1UnitV0, V1UnitV1,
 def V1UnitV01 : ProcResGroup<[V1UnitV0, V1UnitV1]>; // FP/ASIMD 0/1 units
 def V1UnitV02 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0/2 units
 def V1UnitV13 : ProcResGroup<[V1UnitV1, V1UnitV3]>; // FP/ASIMD 1/3 units
+// Select V0 + V2 or V1 + V3 by issuing 2 micro operations
+def V1UnitSVE01 : ProcResGroup<[V1UnitV0, V1UnitV1, // FP/ASIMD 0,2/1,3 units
These look the same as the V unit groups above?
Yes, but they are used to reserve V units for SVE instructions. This helps model the conflicts between SVE and ASIMD instructions:

Maximum issue bandwidth is sustained using one of the following combinations:
• 2 SVE Uops.
• 4 ASIMD Uops.
• 1 SVE Uop on V0 and 2 ASIMD Uops on VX13.
• 1 SVE Uop on V1 and 2 ASIMD Uops on V02.

My understanding is that SVE instructions reserve V0+V2 or V1+V3.
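Concretely, the patch encodes this by listing a paired group twice in a write, so one SVE instruction consumes both units of the pair (definitions taken from the diff above):

// The V0/V2 pair as one resource group.
def V1UnitSVE0 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0,2 units
// One SVE op issued as 2 micro-ops on that pair: both V0 and V2 are
// consumed, so no ASIMD op can issue to either unit in the same cycle.
def V1Write_2c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 2;
                                                                let NumMicroOps = 2; }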
My understanding is that SVE instructions reserve V0+V2 or V1+V3.

For units V0+V2 and V1+V3 there already exist V1UnitV02 and V1UnitV13. And if you want to use V0+V1+V2+V3, there is V1UnitV. The resource V1UnitSVE01 looks to be the same as V1UnitV. Why not use the existing V unit groups, as David mentioned?
Yes! OK. It was to keep the changes clear and avoid confusion between SVE and the other instructions.
@@ -98,377 +103,487 @@ def V1Write_0c_0Z : SchedWriteRes<[]>;

 def V1Write_1c_1B : SchedWriteRes<[V1UnitB]> { let Latency = 1; }
 def V1Write_1c_1I : SchedWriteRes<[V1UnitI]> { let Latency = 1; }
-def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1; }
+def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1;
+                                                               let NumMicroOps = 2; }
I wouldn't expect these to use 2 micro-ops. Can you explain why?
Ah yes, you are right. This is wrong with respect to the processor's issue width. I am going to fix it.
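Presumably the fix is to go back to a single micro-op that consumes both resources, as in the original definition:

// One micro-op using the I pipeline plus the flag resource;
// V1UnitFlg is not an issue slot, so NumMicroOps keeps its default of 1.
def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1; }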
# CHECK-NEXT: - - - - - - - - - - 5.00 - - - - - - - sdiv w12, w21, w0
# CHECK-NEXT: - - - - - - - - - - 5.00 - - - - - - - sdiv x13, x2, x1
# CHECK-NEXT: - - - - - - - - - - 12.00 - - - - - - - udiv w0, w7, w10
# CHECK-NEXT: - - - - - - - - - - 20.00 - - - - - - - udiv x9, x22, x4
This looks like it is going from the best-case to the worst-case time. The average is probably somewhere in the middle, closer to 6-8 depending on how you weight it.
Yes, I took the worst case... I agree with you. I am going to fix it, and will probably do some experiments with a small benchmark (llvm-exegesis). We have seen that FDIV tends to be faster in reality than in the SOG.

For a quick fix, do you prefer the best case, or 1/3 of the way between the best and worst cases, for example (udiv w: 6, udiv x: 10)?
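For reference, the 12.00 and 20.00 resource cycles in the llvm-mca output above come from the worst-case writes in this patch:

// Worst-case UDIV occupancy from the SOG: M0 is blocked for 12 cycles
// (32-bit) or 20 cycles (64-bit), matching the resource columns above.
def V1Write_12c12_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 12;
                                                    let ReleaseAtCycles = [12]; }
def V1Write_20c20_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 20;
                                                    let ReleaseAtCycles = [20]; }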
There are a lot of changes in this patch and it is hard to see how they influence each other.
Changes:
- micro operations are reduced to maximum 3 and respect the number of max issues.
- use ReleaseAtCycles to specify throughput
- fix bypass latencies
- fix some latencies/throughput
Could you separate those changes into individual patches? It would make it easier to review them.
def V1Wr_ZFMA : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 4;
                                                           let NumMicroOps = 2; }
From what I can see, you have added let NumMicroOps = 2 to most SVE forwarded types. Could you explain where this comes from? I'd expect those instructions to use one micro-op.
It is to account for the following constraint, explained in chapter 4.16 of the SOG (see the sketch after the quote):

Maximum issue bandwidth is sustained using one of the following combinations:
• 2 SVE Uops.
• 4 ASIMD Uops.
• 1 SVE Uop on V0 and 2 ASIMD Uops on VX13.
• 1 SVE Uop on V1 and 2 ASIMD Uops on V02.
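With all four V units in a single group, charging each SVE write two micro-ops caps SVE issue at 4/2 = 2 per cycle, which matches the first combination above:

// All four FP/ASIMD units as one group.
def V1UnitSVE01 : ProcResGroup<[V1UnitV0, V1UnitV1,
                                V1UnitV2, V1UnitV3]>;
// Each SVE op takes 2 of the 4 slots per cycle, so at most 2 SVE ops
// (or 4 ASIMD ops, which take 1 slot each) can issue per cycle.
def V1Wr_ZFMA : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 4;
                                                           let NumMicroOps = 2; }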
@@ -911,7 +911,7 @@ bfmlalb z0.s, z0.h, z1.h

 # CHECK-NEXT: [1,0] D============eeeeeER. .. mul z0.d, p0/m, z0.d, z0.d
 # CHECK-NEXT: [1,1] D=================eeeER .. sdot z0.s, z1.b, z2.b
 # CHECK-NEXT: [1,2] D==================eeeER .. sdot z0.s, z1.b, z2.b
-# CHECK-NEXT: [1,3] D=====================eeeER sdot z0.s, z0.b, z1.b
+# CHECK-NEXT: [1,3] .D====================eeeER sdot z0.s, z0.b, z1.b
I don't see why there would be a need for a cycle stall here (or for the other cycle stalls added in this test). Could you explain this change?
def V1Write_16c8_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
                                                     let ReleaseAtCycles = [8]; }
I see you've changed the ReleaseAtCycles of some resources. Could you explain what you based this change on? For example, I see you've changed this from ReleaseAtCycles = [7] to ReleaseAtCycles = [8]; could you explain why?
This one is used for FSQRTDr. Again, I used the worst-case throughput. The SOG gives:

FP square root, D-form | FSQRT | 7 to 16 | 4/15 to 4/7 | V02

So the worst-case throughput is 4/15. The instruction can be issued on V0 or V2, so the throughput in each pipeline is half of that, and the number of cycles a micro-op occupies its pipeline is the reciprocal.
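Written out, the computation behind ReleaseAtCycles = [8] is:

$$\text{per-pipeline throughput} = \frac{4/15}{2} = \frac{2}{15}, \qquad \text{ReleaseAtCycles} = \left\lceil \frac{15}{2} \right\rceil = \lceil 7.5 \rceil = 8$$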
It was computed with a script used to generate the references. I agree that it would be better to also consider the best case, or 1/3 of the way between the best and worst cases. And probably, after benchmarking, only the best case...

I can skip this kind of change in new patch versions.

Sorry.
Thanks for the review!