[RISC-V] Update SpacemiT X60 vector scheduling model with measured latencies #144564

Open · wants to merge 92 commits into main
Conversation

mikhailramalho
Member

@mikhailramalho mikhailramalho commented Jun 17, 2025

Updates the SpacemiT X60 scheduling model with actual latencies measured on BPi-F3 hardware.

tl;dr: execution time is neutral on SPEC after this patch. There is a code size regression described in issue #146407

Changes:

  • Added 10 new latency classes
  • Updated latencies for ~30 instruction categories based on hardware measurements

Completed:

  • Basic integer ALU, min/max, saturating/averaging arithmetic
  • Carry operations, mask operations, comparisons
  • Integer/FP division (split simple/complex based on LMUL)
  • Widening operations
  • FP operations including add/sub, mul, FMA
  • FP conversions (widening/narrowing)
  • FP reductions including vfredmax/min/usum (fixed fractional LMUL latencies)
  • FP ordered reductions vfredosum (split simple/complex)
  • FP widening reductions vfwredosum/vfwredusum (split simple/complex)
  • Integer reductions
  • Mask manipulation operations
  • Permutation operations (gather/compress/slide)
  • Narrowing shifts and clips (split simple/complex)

Missing:

  • All vector load/store uops are missing their measured latency values. The values in this PR are estimates while I'm still collecting the real numbers

Performance Impact:

  • https://lnt.lukelau.me/db_default/v4/nts/674?compare_to=673
  • This change is mostly NFC
  • The two benchmarks with improvement/regression on execution time are known to be noisy
  • There is a code size increase in two benchmarks: reviewing the generated code, we see many more vector load/store instructions

Known Issues:

  • Code size regression on two SPEC benchmarks
  • Some grouped operations use worst-case latency
  • TableGen !cond expressions not working as expected for vector single-width FMA instructions
  • All compromises I've made are documented as TODO in the code

Planned follow-up PRs:

  • Debug the code size regressions
  • Add latencies for vector loads/stores
  • Address the TODOs introduced by this PR. This should require splitting some WriteRes groups and changing other scheduling models, but it should be NFC for them

Signed-off-by: Mikhail R. Gadelha <mikhail@igalia.com>
// Single issue for vector store/load instructions
def SMX60_VLS : ProcResource<1>;

def SMX60_VIEU : ProcResource<1>;
Collaborator
Is there actually a separate VIEU? Or is this a single int and float unit?

Member Author

We had the same question. From the C908 manual, this is the VFPU section:

2.2.3 VFPU

FPUs include the floating-point arithmetic logic unit (FALU), floating-point fused multiply-add unit (FMAU), and floating-point divide and square root unit (FDSU). They support half-precision and single-precision operations. The FALU performs operations such as addition, subtraction, comparison, conversion,
register data transmission, sign injection, and classification.

The FMAU performs operations such as common multiplication and fused multiply-add operations. The FDSU performs operations such as floating-point division and square root operations. The vector execution unit is developed by extending the floating-point unit. On the basis of the original scalar floating-point computation, floating-point units can be extended to vector floating-point units. Vector floating-point units include the vector floating-point arithmetic logic unit (VFALU), vector floating-point fused multiply-add unit (VFMAU), and vector floating-point divide and square root unit (VFDSU).

Vector floating-point units support vector floating-point computation of different bits. In addition, vector integer units are added. Vector integer units include the vector arithmetic logic unit (VALU), vector shift unit (VSHIFT), vector multiplication unit (VMUL), vector division unit (VDIVU), vector permutation unit (VPERM), vector reduction unit (VREDU), and vector logical operation unit (VMISC).

Emphasis mine. We read this as saying the core has both vector floating-point and vector integer units, rather than FP units that subsume the integer ones.

@zqb-all, could you help us clarify this?

Contributor

Yes, the X60 has a separate VIEU.

}

// Simple division and remainder operations
// Pattern of vdivu: 11/11/11
Collaborator
The split here between simple and complex seems mildly complex. This looks like this might actually be NumDLEN * 12?

Collaborator

@preames preames left a comment

Another batch of mostly stylistic comments, and a few cases where the code structure is mildly suspicious and should be checked against the latency data.


// Arithmetic scaling pattern (4,4,4,4,4,5,8): minimal increase at M4
// Used for: arithmetic (add/sub/min/max), saturating/averaging, FP add/sub/min/max
class SMX60GetArithmeticLatency<string mx> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation triggered by something you said offline.

These look a lot like a dual port, pipelined design with the unit being released after the instruction is complete.

That is, Latency = 4, ReleaseAtCycle =

8 isn't an exact match - we'd expect 7, but it's awfully close. A number of your other sets look like single or double ported pipelined variants too.
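The dual-port, pipelined reading can be sketched numerically. This is a hypothetical model, not anything from the patch: it assumes two DLEN-wide uops per LMUL at M1 and above, with two uops issuing per cycle into a fully pipelined unit.

```python
import math

def modeled_latency(uops, base_latency=4, ports=2):
    # The last uop issues at cycle ceil(uops/ports) - 1 and completes
    # base_latency cycles later (fully pipelined unit).
    return base_latency + math.ceil(uops / ports) - 1

# Hypothetical uop counts, assuming two DLEN-wide uops at M1:
for mx, uops in [("M1", 2), ("M2", 4), ("M4", 8)]:
    print(mx, modeled_latency(uops))
# M1 -> 4, M2 -> 5, M4 -> 7: matches the measured 4/5/... pattern
# except at M4, where 8 was measured but the model predicts 7.
```

This reproduces the "we'd expect 7" observation above; the uop counts and port width are assumptions for illustration only.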

Member Author
Maybe we can use the data from camel-cdr to set the ReleaseAtCycle for all instructions?

// Pattern for vfwsub/vfwadd.wv, vfwsub/vfwadd.wf: 5/5/9/17
// TODO: Split .wf/.wv variants into separate scheduling classes to use 5/5/9/17
defvar LMulLat = SMX60GetLMulCycles<mx>.c;
let Latency = !mul(LMulLat, 4) in {
Collaborator
This code doesn't appear to match the comment just above.

Member Author

The problem here is that the comment is a bit confusing; I'll improve it. These are the latencies for:

  • vfwsub/vfwadd.vv, vfwsub/vfwadd.vf: e16mf4=4, e16mf2=4, e16m1=4, e16m2=5, e16m4=8, e32mf2=4, e32m1=4, e32m2=5, e32m4=8
  • vfwsub/vfwadd.wv, vfwsub/vfwadd.wf: e16mf4=5, e16mf2=5, e16m1=5, e16m2=9, e16m4=17, e32mf2=5, e32m1=5, e32m2=9, e32m4=17

SMX60GetLMulCycles returns the following: MF4=1, MF2=1, M1=1, M2=2, M4=4, which, when multiplied by 4, gives us: MF4=4, MF2=4, M1=4, M2=8, M4=16.

Unfortunately, vfwsub/vfwadd.vv and vfwsub/vfwadd.vf end up with a greater latency than measured at higher LMULs, because all the variants are grouped together.
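As a sanity check of the mismatch, here is a small Python table comparing what `!mul(LMulLat, 4)` produces against the measured numbers quoted in this thread (the dictionary names are illustrative, not from the patch):

```python
lmul_cycles = {"MF4": 1, "MF2": 1, "M1": 1, "M2": 2, "M4": 4}  # SMX60GetLMulCycles
modeled = {mx: 4 * c for mx, c in lmul_cycles.items()}         # !mul(LMulLat, 4)

# Measured latencies quoted in the comment above:
measured_vv_vf = {"MF4": 4, "MF2": 4, "M1": 4, "M2": 5, "M4": 8}
measured_wv_wf = {"MF4": 5, "MF2": 5, "M1": 5, "M2": 9, "M4": 17}

for mx in lmul_cycles:
    print(mx, "model:", modeled[mx],
          ".vv/.vf:", measured_vv_vf[mx],
          ".wv/.wf:", measured_wv_wf[mx])
# The grouped class overshoots .vv/.vf at M2/M4 (8 vs 5, 16 vs 8) and
# slightly undershoots .wv/.wf there (8 vs 9, 16 vs 17).
```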


// Pattern for vfwmacc, vfwnmacc, etc: e16 = 5/5/5/8; e32 = 6/6/7/8
// Use existing 6,6,7,8 as close approximation
let Latency = SMX60GetComplexFPLatency<mx>.c in {
Collaborator

Stylistic idea - rather than giving your helper functions names, maybe just do Get6678 and variants? The numbers seem to actually be the unique bits, and your comments approximate.

Another option - have a table lookup helper, and embed the interesting part (the m1 and above) as an array in the callsite? You repeat the numbers in the comment anyways.
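A rough Python analogue of the suggested table-lookup helper might look like this; the helper name and signature are invented for illustration and would need to be expressed in TableGen for the real model:

```python
def lookup_latency(mx, m1_and_up, fractional=None):
    """Index a per-LMUL latency table [M1, M2, M4, M8]; fractional
    LMULs fall back to the M1 entry unless given explicitly."""
    order = {"M1": 0, "M2": 1, "M4": 2, "M8": 3}
    if mx in order:
        return m1_and_up[order[mx]]
    return fractional if fractional is not None else m1_and_up[0]

# The "6,6,7,8" pattern from the comment, embedded at the callsite:
print(lookup_latency("M4", [6, 6, 7, 8]))   # -> 7
print(lookup_latency("MF2", [6, 6, 7, 8]))  # -> 6
```

The interesting per-LMUL numbers then live at the callsite, next to the comment that already repeats them.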

foreach sew = SchedSEWSet<mx, 1>.val in {
defvar IsWorstCase = SMX60IsWorstCaseMXSEW<mx, sew, SchedMxListF, 1>.c;

// Slightly increased latencies for e32mf2=24 (should be 12)
Collaborator

Is this actually the same for all SEW at fractional LMUL?

Member Author

It isn't:

  • e16mf4 = 12
  • e16mf2 = 24
  • e32mf2 = 12

This code generates:

  • e16mf4 = 12
  • e16mf2 = 24
  • e32mf2 = 24 (increased latency)

!eq(sew, 64) : 12 // e64: 12*LMUL
);

let Latency = !mul(SMX60GetLMulCycles<mx>.c, LatencyMul) in {
Collaborator

As written, this would seem to say that an m1 is cheaper than an mf2 at SEW64. That seems suspect?

This looks like it might be an unpipelined DLEN unit w/SEW sensitive latency?

@mikhailramalho mikhailramalho changed the title [WIP][RISCV] Update SpacemiT X60 vector scheduling model with measured latencies [RISC-V] Update SpacemiT X60 vector scheduling model with measured latencies Jun 30, 2025
@mikhailramalho mikhailramalho requested a review from topperc June 30, 2025 19:36

// Pattern of vmacc, vmadd, vmul, vmulh, etc.: e8/e16 = 4/4/5/8, e32 = 5/5/5/8,
// e64 = 7/8/16/32. We use the worst case until we can split by SEW.
// TODO: change WriteVIMulV, etc to be defined with LMULSEWSchedWrites
Member

Personally, I tend to agree that we can make multiplication's SchedWrite SEW-dependent.
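The worst-case grouping described in the code comment can be made concrete with a short sketch. The table below just transcribes the per-SEW latencies quoted in that comment; nothing here is newly measured:

```python
# Per-SEW multiply latencies from the code comment, as [M1, M2, M4, M8]:
mul_latency = {
    8:  [4, 4, 5, 8],
    16: [4, 4, 5, 8],
    32: [5, 5, 5, 8],
    64: [7, 8, 16, 32],
}

# Until WriteVIMulV et al. are defined with LMULSEWSchedWrites, one class
# must cover every SEW, so the model takes the per-LMUL maximum:
worst_case = [max(col) for col in zip(*mul_latency.values())]
print(worst_case)  # -> [7, 8, 16, 32], i.e. e64 dominates at every LMUL
```

Splitting the SchedWrite by SEW would let e8/e16/e32 keep their much smaller measured values.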

@LiqinWeng LiqinWeng self-requested a review July 2, 2025 02:19
4 participants