[AArch64] Neoverse V1 scheduling info #126707
This PR fixes the scheduling model for the Neoverse V1. All information is taken from the Neoverse V1 Software Optimisation Guide: https://developer.arm.com/documentation/pjdoc466751330-9685/6-0

Changes:
- micro-operations are reduced to a maximum of 3
- use ReleaseAtCycles to specify throughput (see the sketch below)
- fix bypass latencies
- fix some latencies/throughputs
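As a sketch of how ReleaseAtCycles is used here (one definition taken from the patch): the listed unit stays reserved until the given cycle, so it models inverse throughput independently of latency.

// Latency = 4: dependent instructions see the result after 4 cycles.
// ReleaseAtCycles = [6]: V1UnitM stays reserved for 6 cycles, so the
// modeled throughput of this write is one op every 6 cycles on that unit.
def V1Write_4c6_1M : SchedWriteRes<[V1UnitM]> { let Latency = 4;
                                                let ReleaseAtCycles = [6]; }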
It also considers conflicts between SVE and ASIMD instructions. From the Software Optimisation Guide:

Maximum issue bandwidth is sustained using one of the following combinations:
• 2 SVE Uops.
• 4 ASIMD Uops.
• 1 SVE Uop on V0 and 2 ASIMD Uops on VX13.
• 1 SVE Uop on V1 and 2 ASIMD Uops on V02.
@llvm/pr-subscribers-backend-aarch64

Author: Julien Villette (jvillette38)

This merge request depends on #126703 due to the new test llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-scheduling-info.s, which reports all scheduling information changes from this patch when compared with the version in #126703. @Rin18 may be interested.

Patch is 2.95 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/126707.diff

11 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
index 368665467859f5f..99ca28bc4151dad 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
@@ -66,6 +66,11 @@ def V1UnitV : ProcResGroup<[V1UnitV0, V1UnitV1,
def V1UnitV01 : ProcResGroup<[V1UnitV0, V1UnitV1]>; // FP/ASIMD 0/1 units
def V1UnitV02 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0/2 units
def V1UnitV13 : ProcResGroup<[V1UnitV1, V1UnitV3]>; // FP/ASIMD 1/3 units
+// Select V0 + V2 or V1 + V3 by issuing 2 micro operations
+def V1UnitSVE01 : ProcResGroup<[V1UnitV0, V1UnitV1, // FP/ASIMD 0,2/1,3 units
+ V1UnitV2, V1UnitV3]>;
+def V1UnitSVE0 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0,2 units
+def V1UnitSVE1 : ProcResGroup<[V1UnitV1, V1UnitV3]>; // FP/ASIMD 1,3 units
// Define commonly used read types.
@@ -98,377 +103,487 @@ def V1Write_0c_0Z : SchedWriteRes<[]>;
def V1Write_1c_1B : SchedWriteRes<[V1UnitB]> { let Latency = 1; }
def V1Write_1c_1I : SchedWriteRes<[V1UnitI]> { let Latency = 1; }
-def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1; }
+def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1;
+ let NumMicroOps = 2; }
def V1Write_4c_1L : SchedWriteRes<[V1UnitL]> { let Latency = 4; }
+def V1Write_4c3_1L : SchedWriteRes<[V1UnitL]> { let Latency = 4;
+ let ReleaseAtCycles = [3]; }
+def V1Write_5c3_1L : SchedWriteRes<[V1UnitL]> { let Latency = 5;
+ let ReleaseAtCycles = [3]; }
+
def V1Write_6c_1L : SchedWriteRes<[V1UnitL]> { let Latency = 6; }
+def V1Write_6c2_1L : SchedWriteRes<[V1UnitL]> { let Latency = 6;
+ let ReleaseAtCycles = [2]; }
+def V1Write_6c3_1L : SchedWriteRes<[V1UnitL]> { let Latency = 6;
+ let ReleaseAtCycles = [3]; }
+def V1Write_7c4_1L : SchedWriteRes<[V1UnitL]> { let Latency = 7;
+ let ReleaseAtCycles = [4]; }
def V1Write_1c_1L01 : SchedWriteRes<[V1UnitL01]> { let Latency = 1; }
def V1Write_4c_1L01 : SchedWriteRes<[V1UnitL01]> { let Latency = 4; }
def V1Write_6c_1L01 : SchedWriteRes<[V1UnitL01]> { let Latency = 6; }
def V1Write_2c_1M : SchedWriteRes<[V1UnitM]> { let Latency = 2; }
-def V1Write_2c_1M_1Flg : SchedWriteRes<[V1UnitM, V1UnitFlg]> { let Latency = 2; }
+def V1Write_2c_1M_1Flg : SchedWriteRes<[V1UnitM, V1UnitFlg]> { let Latency = 2;
+ let NumMicroOps = 2; }
def V1Write_3c_1M : SchedWriteRes<[V1UnitM]> { let Latency = 3; }
-def V1Write_4c_1M : SchedWriteRes<[V1UnitM]> { let Latency = 4; }
+def V1Write_4c6_1M : SchedWriteRes<[V1UnitM]> { let Latency = 4;
+ let ReleaseAtCycles = [6]; }
def V1Write_1c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 1; }
def V1Write_2c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 2; }
+def V1Write_2c2_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 2;
+ let ReleaseAtCycles = [2]; }
+def V1Write_3c2_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 3;
+ let ReleaseAtCycles = [2]; }
def V1Write_3c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 3; }
def V1Write_5c_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 5; }
-def V1Write_12c5_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 12;
- let ReleaseAtCycles = [5]; }
-def V1Write_20c5_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 20;
- let ReleaseAtCycles = [5]; }
+def V1Write_12c12_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 12;
+ let ReleaseAtCycles = [12]; }
+def V1Write_20c20_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 20;
+ let ReleaseAtCycles = [20]; }
def V1Write_2c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 2; }
+def V1Write_2c4_1V : SchedWriteRes<[V1UnitV]> { let Latency = 2;
+ let ReleaseAtCycles = [4]; }
def V1Write_3c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 3; }
def V1Write_4c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 4; }
+def V1Write_4c2_1V : SchedWriteRes<[V1UnitV]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
def V1Write_5c_1V : SchedWriteRes<[V1UnitV]> { let Latency = 5; }
+def V1Write_6c3_1V : SchedWriteRes<[V1UnitV]> { let Latency = 6;
+ let ReleaseAtCycles = [3]; }
+def V1Write_12c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 12;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
+def V1Write_14c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 14;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
+
def V1Write_2c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 2; }
+def V1Write_2c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 2;
+ let NumMicroOps = 2; }
def V1Write_3c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 3; }
+def V1Write_3c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 3;
+ let NumMicroOps = 2; }
def V1Write_4c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 4; }
-def V1Write_6c_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 6; }
-def V1Write_10c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 10;
- let ReleaseAtCycles = [7]; }
-def V1Write_12c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 12;
- let ReleaseAtCycles = [7]; }
-def V1Write_13c10_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 13;
- let ReleaseAtCycles = [10]; }
-def V1Write_15c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 15;
- let ReleaseAtCycles = [7]; }
-def V1Write_16c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 16;
- let ReleaseAtCycles = [7]; }
-def V1Write_20c7_1V0 : SchedWriteRes<[V1UnitV0]> { let Latency = 20;
- let ReleaseAtCycles = [7]; }
+def V1Write_4c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 4;
+ let NumMicroOps = 2; }
+def V1Write_5c4_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 5;
+ let ReleaseAtCycles = [4];
+ let NumMicroOps = 2; }
+def V1Write_6c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 6;
+ let NumMicroOps = 2; }
+def V1Write_6c4_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 6;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4,4]; }
+def V1Write_10c18_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 10;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [18]; }
+def V1Write_11c20_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 11;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [20]; }
+def V1Write_12c22_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 12;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [22]; }
+def V1Write_13c24_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 13;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [24]; }
+def V1Write_15c28_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 15;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [28]; }
+def V1Write_16c28_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 16;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [28]; }
+def V1Write_19c36_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 19;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [36]; }
+def V1Write_20c40_1SVE0 : SchedWriteRes<[V1UnitSVE0]> { let Latency = 20;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [40]; }
+
def V1Write_2c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 2; }
+def V1Write_2c_1SVE01 : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 2;
+ let NumMicroOps = 2; }
def V1Write_3c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 3; }
-def V1Write_4c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 4; }
-def V1Write_5c_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 5; }
+def V1Write_3c_1SVE01 : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 3;
+ let NumMicroOps = 2; }
+def V1Write_4c_1SVE01 : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 4;
+ let NumMicroOps = 2; }
+def V1Write_4c2_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
+def V1Write_4c3_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 4;
+ let ReleaseAtCycles = [3]; }
+def V1Write_6c3_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 6;
+ let ReleaseAtCycles = [3]; }
+def V1Write_6c5_1V01 : SchedWriteRes<[V1UnitV01]> { let Latency = 6;
+ let ReleaseAtCycles = [5]; }
+def V1Write_8c6_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 8;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [6]; }
+def V1Write_9c8_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 9;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [8]; }
+def V1Write_12c8_1SVE01: SchedWriteRes<[V1UnitSVE01]> { let Latency = 12;
+ let ReleaseAtCycles = [8];
+ let NumMicroOps = 2; }
+def V1Write_13c6_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 13;
+ let ReleaseAtCycles = [12];
+ let NumMicroOps = 2; }
+def V1Write_11c10_1SVE01 : SchedWriteRes<[V1UnitSVE01]> { let Latency = 11;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [10]; }
def V1Write_3c_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 3; }
def V1Write_4c_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 4; }
+def V1Write_4c2_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
+def V1Write_6c4_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 6;
+ let ReleaseAtCycles = [4]; }
+def V1Write_7c2_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 7;
+ let ReleaseAtCycles = [2]; }
def V1Write_7c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 7;
let ReleaseAtCycles = [7]; }
-def V1Write_10c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
- let ReleaseAtCycles = [7]; }
-def V1Write_13c5_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 13;
+def V1Write_9c3_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 9;
+ let ReleaseAtCycles = [2]; }
+def V1Write_10c3_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
+ let ReleaseAtCycles = [3]; }
+def V1Write_10c5_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
let ReleaseAtCycles = [5]; }
-def V1Write_13c11_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 13;
- let ReleaseAtCycles = [11]; }
+def V1Write_10c9_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 10;
+ let ReleaseAtCycles = [9]; }
+def V1Write_13c13_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 13;
+ let ReleaseAtCycles = [13]; }
def V1Write_15c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 15;
let ReleaseAtCycles = [7]; }
-def V1Write_16c7_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
- let ReleaseAtCycles = [7]; }
+def V1Write_15c14_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 15;
+ let ReleaseAtCycles = [14]; }
+def V1Write_16c8_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
+ let ReleaseAtCycles = [8]; }
+def V1Write_16c15_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
+ let ReleaseAtCycles = [15]; }
def V1Write_2c_1V1 : SchedWriteRes<[V1UnitV1]> { let Latency = 2; }
-def V1Write_3c_1V1 : SchedWriteRes<[V1UnitV1]> { let Latency = 3; }
-def V1Write_4c_1V1 : SchedWriteRes<[V1UnitV1]> { let Latency = 4; }
+def V1Write_2c_1SVE1 : SchedWriteRes<[V1UnitSVE1,V1UnitSVE1]> { let Latency = 2;
+ let NumMicroOps = 2; }
+def V1Write_3c_1SVE1 : SchedWriteRes<[V1UnitSVE1,V1UnitSVE1]> { let Latency = 3;
+ let NumMicroOps = 2; }
+def V1Write_4c_1SVE1 : SchedWriteRes<[V1UnitSVE1,V1UnitSVE1]> { let Latency = 4;
+ let NumMicroOps = 2; }
+def V1Write_8c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 8;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
+def V1Write_10c4_1SVE1 : SchedWriteRes<[V1UnitSVE1]> { let Latency = 10;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [4]; }
def V1Write_2c_1V13 : SchedWriteRes<[V1UnitV13]> { let Latency = 2; }
def V1Write_4c_1V13 : SchedWriteRes<[V1UnitV13]> { let Latency = 4; }
+def V1Write_4c2_1V13 : SchedWriteRes<[V1UnitV13]> { let Latency = 4;
+ let ReleaseAtCycles = [2]; }
+
//===----------------------------------------------------------------------===//
// Define generic 2 micro-op types
-let Latency = 1, NumMicroOps = 2 in
-def V1Write_1c_1B_1S : SchedWriteRes<[V1UnitB, V1UnitS]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1B_1M0 : SchedWriteRes<[V1UnitB, V1UnitM0]>;
-let Latency = 3, NumMicroOps = 2 in
-def V1Write_3c_1I_1M : SchedWriteRes<[V1UnitI, V1UnitM]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_1I_1L : SchedWriteRes<[V1UnitI, V1UnitL]>;
-let Latency = 7, NumMicroOps = 2 in
-def V1Write_7c_1I_1L : SchedWriteRes<[V1UnitI, V1UnitL]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_2L : SchedWriteRes<[V1UnitL, V1UnitL]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1L_1M : SchedWriteRes<[V1UnitL, V1UnitM]>;
-let Latency = 8, NumMicroOps = 2 in
-def V1Write_8c_1L_1V : SchedWriteRes<[V1UnitL, V1UnitV]>;
-let Latency = 9, NumMicroOps = 2 in
-def V1Write_9c_1L_1V : SchedWriteRes<[V1UnitL, V1UnitV]>;
-let Latency = 11, NumMicroOps = 2 in
-def V1Write_11c_1L_1V : SchedWriteRes<[V1UnitL, V1UnitV]>;
-let Latency = 1, NumMicroOps = 2 in
-def V1Write_1c_1L01_1D : SchedWriteRes<[V1UnitL01, V1UnitD]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1L01_1S : SchedWriteRes<[V1UnitL01, V1UnitS]>;
-let Latency = 7, NumMicroOps = 2 in
-def V1Write_7c_1L01_1S : SchedWriteRes<[V1UnitL01, V1UnitS]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_1L01_1V : SchedWriteRes<[V1UnitL01, V1UnitV]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1L01_1V : SchedWriteRes<[V1UnitL01, V1UnitV]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1L01_1V : SchedWriteRes<[V1UnitL01, V1UnitV]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_1L01_1V01 : SchedWriteRes<[V1UnitL01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1L01_1V01 : SchedWriteRes<[V1UnitL01, V1UnitV01]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_2M0 : SchedWriteRes<[V1UnitM0, V1UnitM0]>;
-let Latency = 3, NumMicroOps = 2 in
-def V1Write_3c_2M0 : SchedWriteRes<[V1UnitM0, V1UnitM0]>;
-let Latency = 9, NumMicroOps = 2 in
-def V1Write_9c_1M0_1L : SchedWriteRes<[V1UnitM0, V1UnitL]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_1M0_1V : SchedWriteRes<[V1UnitM0, V1UnitV]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1M0_1V0 : SchedWriteRes<[V1UnitM0, V1UnitV0]>;
-let Latency = 7, NumMicroOps = 2 in
-def V1Write_7c_1M0_1V0 : SchedWriteRes<[V1UnitM0, V1UnitV1]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_1M0_1V01 : SchedWriteRes<[V1UnitM0, V1UnitV01]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_1M0_1V1 : SchedWriteRes<[V1UnitM0, V1UnitV1]>;
-let Latency = 9, NumMicroOps = 2 in
-def V1Write_9c_1M0_1V1 : SchedWriteRes<[V1UnitM0, V1UnitV1]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V : SchedWriteRes<[V1UnitV, V1UnitV]>;
-let Latency = 8, NumMicroOps = 2 in
-def V1Write_8c_1V_1V01 : SchedWriteRes<[V1UnitV, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V0 : SchedWriteRes<[V1UnitV0, V1UnitV0]>;
-let Latency = 5, NumMicroOps = 2 in
-def V1Write_5c_2V0 : SchedWriteRes<[V1UnitV0, V1UnitV0]>;
-let Latency = 2, NumMicroOps = 2 in
-def V1Write_2c_2V01 : SchedWriteRes<[V1UnitV01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V01 : SchedWriteRes<[V1UnitV01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V02 : SchedWriteRes<[V1UnitV02, V1UnitV02]>;
-let Latency = 6, NumMicroOps = 2 in
-def V1Write_6c_2V02 : SchedWriteRes<[V1UnitV02, V1UnitV02]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_1V13_1V : SchedWriteRes<[V1UnitV13, V1UnitV]>;
-let Latency = 4, NumMicroOps = 2 in
-def V1Write_4c_2V13 : SchedWriteRes<[V1UnitV13, V1UnitV13]>;
+def V1Write_1c_1B_1S : SchedWriteRes<[V1UnitB, V1UnitS]> {
+ let Latency = 1;
+ let NumMicroOps = 2;
+}
-//===----------------------------------------------------------------------===//
-// Define generic 3 micro-op types
+def V1Write_6c_1B_1M0 : SchedWriteRes<[V1UnitB, V1UnitM0]> {
+ let Latency = 6;
+ let NumMicroOps = 2;
+}
-let Latency = 2, NumMicroOps = 3 in
-def V1Write_2c_1I_1L01_1V01 : SchedWriteRes<[V1UnitI, V1UnitL01, V1UnitV01]>;
-let Latency = 7, NumMicroOps = 3 in
-def V1Write_7c_2M0_1V01 : SchedWriteRes<[V1UnitM0, V1UnitM0, V1UnitV01]>;
-let Latency = 8, NumMicroOps = 3 in
-def V1Write_8c_1L_2V : SchedWriteRes<[V1UnitL, V1UnitV, V1UnitV]>;
-let Latency = 6, NumMicroOps = 3 in
-def V1Write_6c_3L : SchedWriteRes<[V1UnitL, V1UnitL, V1UnitL]>;
-let Latency = 2, NumMicroOps = 3 in
-def V1Write_2c_1L01_1S_1V : SchedWriteRes<[V1UnitL01, V1UnitS, V1UnitV]>;
-let Latency = 4, NumMicroOps = 3 in
-def V1Write_4c_1L01_1S_1V : SchedWriteRes<[V1UnitL01, V1UnitS, V1UnitV]>;
-let Latency = 2, NumMicroOps = 3 in
-def V1Write_2c_2L01_1V01 : SchedWriteRes<[V1UnitL01, V1UnitL01, V1UnitV01]>;
-let Latency = 6, NumMicroOps = 3 in
-def V1Write_6c_3V : SchedWriteRes<[V1UnitV, V1UnitV, V1UnitV]>;
-let Latency = 4, NumMicroOps = 3 in
-def V1Write_4c_3V01 : SchedWriteRes<[V1UnitV01, V1UnitV01, V1UnitV01]>;
-let Latency = 6, NumMicroOps = 3 in
-def V1Write_6c_3V01 : SchedWriteRes<[V1UnitV01, V1UnitV01, V1UnitV01]>;
-let Latency = 8, NumMicroOps = 3 in
-def V1Write_8c_3V01 : SchedWriteRes<[V1UnitV01, V1UnitV01, V1UnitV01]>;
+def V1Write_5c_1I_1L : SchedWriteRes<[V1UnitI, V1UnitL]> {
+ let Latency = 5;
+ let NumMicroOps = 2;
+}
-//===----------------------------------------------------------------------===//
-// Define generic 4 micro-op types
-
-let Latency = 8, NumMicroOps = 4 in
-def V1Write_8c_2M0_2V0 : SchedWriteRes<[V1UnitM0, V1UnitM0,
- V1UnitV0, V1UnitV0]>;
-let Latency = 7, NumMicroOps = 4 in
-def V1Write_7c_4L : SchedWriteRes<[V1UnitL, V1UnitL, V1UnitL, V1UnitL]>;
-let Latency = 8, NumMicroOps = 4 in
-def V1Write_8c_2L_2V : SchedWriteRes<[V1UnitL, V1UnitL,
- V1UnitV, V1UnitV]>;
-let Latency = 9, NumMicroOps = 4 in
-def V1Write_9c_2L_2V : SchedWriteRes<[V1UnitL, V1UnitL,
- V1UnitV, V1UnitV]>;
-let Latency = 11, NumMicroOps = 4 in
-def V1Write_11c_2L_2V : SchedWriteRes<[V1UnitL, V1UnitL,
- V1UnitV, V1UnitV]>;
-let Latency = 10, NumMicroOps = 4 in
-def V1Write_10c_2L01_2V : SchedWriteRes<[V1UnitL01, V1UnitL01,
- V1UnitV, V1UnitV]>;
-let Latency = 2, NumMicroOps = 4 in
-def V1Write_2c_2L01_2V01 : SchedWriteRes<[V1UnitL01, V1UnitL01,
- V1UnitV01, V1UnitV01]>;
-let Latency = 4, NumMicroOps = 4 in
-def V1Write_4c_2L01_2V01 : SchedWriteRes<[V1UnitL01, V1UnitL01,
- V1UnitV01, V1UnitV01]>;
-let Latency = 8, NumMicroOps = 4 in
-def V1Write_8c_2L01_2V01 : SchedWriteRes<[V1UnitL01, V1UnitL01,
- ...
[truncated]
Hi - quite a lot is changing here and I'm not sure about some of the details. It is probably worth splitting it up into smaller patches or pulling some of the more obvious parts out so they can be reviewed separately.
As quite a lot is changing, do you have performance results?
@@ -66,6 +66,11 @@ def V1UnitV : ProcResGroup<[V1UnitV0, V1UnitV1,
 def V1UnitV01 : ProcResGroup<[V1UnitV0, V1UnitV1]>; // FP/ASIMD 0/1 units
 def V1UnitV02 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0/2 units
 def V1UnitV13 : ProcResGroup<[V1UnitV1, V1UnitV3]>; // FP/ASIMD 1/3 units
+// Select V0 + V2 or V1 + V3 by issuing 2 micro operations
+def V1UnitSVE01 : ProcResGroup<[V1UnitV0, V1UnitV1, // FP/ASIMD 0,2/1,3 units
These look the same as the V unit groups above?
Yes, but they are used to reserve V units for SVE instructions. This helps model the conflicts between SVE and ASIMD instructions:

Maximum issue bandwidth is sustained using one of the following combinations:
• 2 SVE Uops.
• 4 ASIMD Uops.
• 1 SVE Uop on V0 and 2 ASIMD Uops on VX13.
• 1 SVE Uop on V1 and 2 ASIMD Uops on V02.

My understanding is that SVE instructions reserve V0+V2 or V1+V3.
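Concretely, the patch encodes this by listing a paired group twice in a write, so one SVE instruction consumes both units of the pair (definitions taken from the diff above):

// The V0/V2 pair as one resource group.
def V1UnitSVE0 : ProcResGroup<[V1UnitV0, V1UnitV2]>; // FP/ASIMD 0,2 units
// One SVE op issued as 2 micro-ops on that pair: both V0 and V2 are
// consumed, so no ASIMD op can issue to either unit in the same cycle.
def V1Write_2c_1SVE0 : SchedWriteRes<[V1UnitSVE0,V1UnitSVE0]> { let Latency = 2;
                                                                let NumMicroOps = 2; }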
My understanding is that SVE instructions reserve V0+V2 or V1+V3.

For units V0+V2 and V1+V3 there already exist V1UnitV02 and V1UnitV13. And if you want to use V0+V1+V2+V3, there is V1UnitV. The resource V1UnitSVE01 looks to be the same as V1UnitV. Why not use the existing V unit groups, as David mentioned?
Yes! OK. It was to keep the changes clear and avoid confusion between SVE and the other instructions.
@@ -98,377 +103,487 @@ def V1Write_0c_0Z : SchedWriteRes<[]>;

 def V1Write_1c_1B : SchedWriteRes<[V1UnitB]> { let Latency = 1; }
 def V1Write_1c_1I : SchedWriteRes<[V1UnitI]> { let Latency = 1; }
-def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1; }
+def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1;
+                                                               let NumMicroOps = 2; }
I wouldn't expect these to use 2 micro-ops. Can you explain why?
Ah yes, you are right. This is wrong with respect to the processor's issue width. I am going to fix it.
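Presumably the fix is to go back to a single micro-op that consumes both resources, as in the original definition:

// One micro-op using the I pipeline plus the flag resource;
// V1UnitFlg is not an issue slot, so NumMicroOps keeps its default of 1.
def V1Write_1c_1I_1Flg : SchedWriteRes<[V1UnitI, V1UnitFlg]> { let Latency = 1; }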
# CHECK-NEXT: - - - - - - - - - - 5.00 - - - - - - - sdiv w12, w21, w0
# CHECK-NEXT: - - - - - - - - - - 5.00 - - - - - - - sdiv x13, x2, x1
# CHECK-NEXT: - - - - - - - - - - 12.00 - - - - - - - udiv w0, w7, w10
# CHECK-NEXT: - - - - - - - - - - 20.00 - - - - - - - udiv x9, x22, x4
This looks like it is going from the best-case to the worst-case time. The average is probably somewhere in the middle, closer to 6-8 depending on how you weight it.
Yes, I took the worst case... I agree with you. I am going to fix it, and will probably do some experiments with a small benchmark (llvm-exegesis). We have seen that FDIV tends to be faster in reality than in the SOG.

For a quick fix, do you prefer the best case, or 1/3 of the way between the best and worst cases, for example (udiv w: 6, udiv x: 10)?
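For reference, the 12.00 and 20.00 resource cycles in the llvm-mca output above come from the worst-case writes in this patch:

// Worst-case UDIV occupancy from the SOG: M0 is blocked for 12 cycles
// (32-bit) or 20 cycles (64-bit), matching the resource columns above.
def V1Write_12c12_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 12;
                                                    let ReleaseAtCycles = [12]; }
def V1Write_20c20_1M0 : SchedWriteRes<[V1UnitM0]> { let Latency = 20;
                                                    let ReleaseAtCycles = [20]; }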
There are a lot of changes in this patch and it is hard to see how they influence each other.
Changes:
- micro operations are reduced to maximum 3 and respect the number of max issues.
- use ReleaseAtCycles to specify throughput
- fix bypass latencies
- fix some latencies/throughput
Could you separate those changes into individual patches? It would make it easier to review them.
def V1Wr_ZFMA : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 4;
                                                           let NumMicroOps = 2; }
From what I can see, you have added let NumMicroOps = 2 to most SVE forwarded types. Could you explain where this comes from? I'd expect those instructions to use one micro-op.
It is to account for the following constraint, explained in chapter 4.16 of the SOG (see the sketch after the quote):

Maximum issue bandwidth is sustained using one of the following combinations:
• 2 SVE Uops.
• 4 ASIMD Uops.
• 1 SVE Uop on V0 and 2 ASIMD Uops on VX13.
• 1 SVE Uop on V1 and 2 ASIMD Uops on V02.
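With all four V units in a single group, charging each SVE write two micro-ops caps SVE issue at 4/2 = 2 per cycle, which matches the first combination above:

// All four FP/ASIMD units as one group.
def V1UnitSVE01 : ProcResGroup<[V1UnitV0, V1UnitV1,
                                V1UnitV2, V1UnitV3]>;
// Each SVE op takes 2 of the 4 slots per cycle, so at most 2 SVE ops
// (or 4 ASIMD ops, which take 1 slot each) can issue per cycle.
def V1Wr_ZFMA : SchedWriteRes<[V1UnitSVE01,V1UnitSVE01]> { let Latency = 4;
                                                           let NumMicroOps = 2; }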
@@ -911,7 +911,7 @@ bfmlalb z0.s, z0.h, z1.h

 # CHECK-NEXT: [1,0] D============eeeeeER. .. mul z0.d, p0/m, z0.d, z0.d
 # CHECK-NEXT: [1,1] D=================eeeER .. sdot z0.s, z1.b, z2.b
 # CHECK-NEXT: [1,2] D==================eeeER .. sdot z0.s, z1.b, z2.b
-# CHECK-NEXT: [1,3] D=====================eeeER sdot z0.s, z0.b, z1.b
+# CHECK-NEXT: [1,3] .D====================eeeER sdot z0.s, z0.b, z1.b
I don't see why there would be a need for a cycle stall here (or for the other cycle stalls added in this test). Could you explain this change?
def V1Write_16c8_1V02 : SchedWriteRes<[V1UnitV02]> { let Latency = 16;
                                                     let ReleaseAtCycles = [8]; }
I see you've changed the ReleaseAtCycles of some resources. Could you explain what you based this change on? For example, I see you've changed this from ReleaseAtCycles = [7] to ReleaseAtCycles = [8]; could you explain why?
This one is used for FSQRTDr. Again, I used the worst-case throughput. The SOG gives:

FP square root, D-form | FSQRT | 7 to 16 | 4/15 to 4/7 | V02

So the worst-case throughput is 4/15. The instruction can be issued on V0 or V2, so the throughput in each pipeline is half of that, and the number of cycles a micro-op occupies its pipeline is the reciprocal.
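Written out, the computation behind ReleaseAtCycles = [8] is:

$$\text{per-pipeline throughput} = \frac{4/15}{2} = \frac{2}{15}, \qquad \text{ReleaseAtCycles} = \left\lceil \frac{15}{2} \right\rceil = \lceil 7.5 \rceil = 8$$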
It was computed with a script used to generate the references. I agree that it would be better to also consider the best case, or 1/3 of the way between the best and worst cases. And probably, after benchmarking, only the best case...

I can skip this kind of change in new patch versions.

Sorry.
Thanks for the review!