[RISC-V] Update SpacemiT X60 vector scheduling model with measured latencies #144564

Open · wants to merge 92 commits into main
Conversation

mikhailramalho
Member

@mikhailramalho mikhailramalho commented Jun 17, 2025

Updates the SpacemiT X60 scheduling model with actual latencies measured on BPi-F3 hardware.

tl;dr: execution time is neutral on SPEC after this patch. There is a code size regression described in issue #146407

Changes:

  • Added 10 new latency classes
  • Updated latencies for ~30 instruction categories based on hardware measurements

Completed:

  • Basic integer ALU, min/max, saturating/averaging arithmetic
  • Carry operations, mask operations, comparisons
  • Integer/FP division (split simple/complex based on LMUL)
  • Widening operations
  • FP operations including add/sub, mul, FMA
  • FP conversions (widening/narrowing)
  • FP reductions including vfredmax/min/usum (fixed fractional LMUL latencies)
  • FP ordered reductions vfredosum (split simple/complex)
  • FP widening reductions vfwredosum/vfwredusum (split simple/complex)
  • Integer reductions
  • Mask manipulation operations
  • Permutation operations (gather/compress/slide)
  • Narrowing shifts and clips (split simple/complex)

Missing:

  • All vector load/store uops are missing their measured latency values. The values in this PR are estimates while I'm still collecting the real numbers

Performance Impact:

  • https://lnt.lukelau.me/db_default/v4/nts/674?compare_to=673
  • This change is mostly NFC
  • The two benchmarks with improvement/regression on execution time are known to be noisy
  • There is a code size increase in two benchmarks: reviewing the generated code, we see many more vector load/store instructions

Known Issues:

  • Code size regression on two SPEC benchmarks
  • Some grouped operations use worst-case latency
  • TableGen !cond expressions not working as expected for vector single-width FMA instructions
  • All compromises I've made are documented as TODO in the code

Planned follow-up PRs:

  • Debug the code size regressions
  • Add latencies for vector loads/stores
  • Address the TODOs introduced by this PR. This should require splitting some WriteRes groups and changing other scheduling models, but it should be NFC for them

Signed-off-by: Mikhail R. Gadelha <mikhail@igalia.com>
// Single issue for vector store/load instructions
def SMX60_VLS : ProcResource<1>;

def SMX60_VIEU : ProcResource<1>;
Collaborator
Is there actually a separate VIEU? Or is this a single int and float unit?

Member Author

We had the same question. From the C908 manual, this is the VFPU section:

2.2.3 VFPU

FPUs include the floating-point arithmetic logic unit (FALU), floating-point fused multiply-add unit (FMAU), and floating-point divide and square root unit (FDSU). They support half-precision and single-precision operations. The FALU performs operations such as addition, subtraction, comparison, conversion,
register data transmission, sign injection, and classification.

The FMAU performs operations such as common multiplication and fused multiply-add operations. The FDSU performs operations such as floating-point division and square root operations. The vector execution unit is developed by extending the floating-point unit. On the basis of the original scalar floating-point computation, floating-point units can be extended to vector floating-point units. Vector floating-point units include the vector floating-point arithmetic logic unit (VFALU), vector floating-point fused multiply-add unit (VFMAU), and vector floating-point divide and square root unit (VFDSU).

Vector floating-point units support vector floating-point computation of different bits. In addition, vector integer units are added. Vector integer units include the vector arithmetic logic unit (VALU), vector shift unit (VSHIFT), vector multiplication unit (VMUL), vector division unit (VDIVU), vector permutation unit (VPERM), vector reduction unit (VREDU), and vector logical operation unit (VMISC).

Emphasis mine. We read this as saying the core has both vector floating-point and vector integer units, rather than FP units that subsume the integer ones.

@zqb-all, could you help us clarify this?

Contributor

Yes, the X60 has a separate VIEU.

}

// Simple division and remainder operations
// Pattern of vdivu: 11/11/11
Collaborator
The split here between simple and complex seems mildly complex. This looks like this might actually be NumDLEN * 12?

Collaborator

@preames preames left a comment

Another batch of mostly stylistic comments, and a few cases where the code structure is mildly suspicious and should be checked against the latency data.


// Arithmetic scaling pattern (4,4,4,4,4,5,8): minimal increase at M4
// Used for: arithmetic (add/sub/min/max), saturating/averaging, FP add/sub/min/max
class SMX60GetArithmeticLatency<string mx> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation triggered by something you said offline.

These look a lot like a dual port, pipelined design with the unit being released after the instruction is complete.

That is, Latency = 4, ReleaseAtCycle =

8 isn't an exact match - we'd expect 7, but it's awfully close. A number of your other sets look like single or double ported pipelined variants too.
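The dual-port, pipelined reading can be sketched numerically. This is a hypothetical model, not anything from the patch: it assumes two DLEN-wide uops per LMUL at M1 and above, with two uops issuing per cycle into a fully pipelined unit.

```python
import math

def modeled_latency(uops, base_latency=4, ports=2):
    # The last uop issues at cycle ceil(uops/ports) - 1 and completes
    # base_latency cycles later (fully pipelined unit).
    return base_latency + math.ceil(uops / ports) - 1

# Hypothetical uop counts, assuming two DLEN-wide uops at M1:
for mx, uops in [("M1", 2), ("M2", 4), ("M4", 8)]:
    print(mx, modeled_latency(uops))
# M1 -> 4, M2 -> 5, M4 -> 7: matches the measured 4/5/... pattern
# except at M4, where 8 was measured but the model predicts 7.
```

This reproduces the "we'd expect 7" observation above; the uop counts and port width are assumptions for illustration only.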

Member Author
Maybe we can use the data from camel-cdr to set the ReleaseAtCycle for all instructions?

// Pattern for vfwsub/vfwadd.wv, vfwsub/vfwadd.wf: 5/5/9/17
// TODO: Split .wf/.wv variants into separate scheduling classes to use 5/5/9/17
defvar LMulLat = SMX60GetLMulCycles<mx>.c;
let Latency = !mul(LMulLat, 4) in {
Collaborator
This code doesn't appear to match the comment just above.

Member Author

The problem here is that the comment is a bit confusing; I'll improve it. These are the latencies for:

  • vfwsub/vfwadd.vv, vfwsub/vfwadd.vf: e16mf4=4, e16mf2=4, e16m1=4, e16m2=5, e16m4=8, e32mf2=4, e32m1=4, e32m2=5, e32m4=8
  • vfwsub/vfwadd.wv, vfwsub/vfwadd.wf: e16mf4=5, e16mf2=5, e16m1=5, e16m2=9, e16m4=17, e32mf2=5, e32m1=5, e32m2=9, e32m4=17

SMX60GetLMulCycles returns the following: MF4=1, MF2=1, M1=1, M2=2, M4=4, which, when multiplied by 4, gives us: MF4=4, MF2=4, M1=4, M2=8, M4=16.

Unfortunately, vfwsub/vfwadd.vv and vfwsub/vfwadd.vf end up with a greater latency than measured at higher LMULs, because all the variants are grouped together.
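As a sanity check of the mismatch, here is a small Python table comparing what `!mul(LMulLat, 4)` produces against the measured numbers quoted in this thread (the dictionary names are illustrative, not from the patch):

```python
lmul_cycles = {"MF4": 1, "MF2": 1, "M1": 1, "M2": 2, "M4": 4}  # SMX60GetLMulCycles
modeled = {mx: 4 * c for mx, c in lmul_cycles.items()}         # !mul(LMulLat, 4)

# Measured latencies quoted in the comment above:
measured_vv_vf = {"MF4": 4, "MF2": 4, "M1": 4, "M2": 5, "M4": 8}
measured_wv_wf = {"MF4": 5, "MF2": 5, "M1": 5, "M2": 9, "M4": 17}

for mx in lmul_cycles:
    print(mx, "model:", modeled[mx],
          ".vv/.vf:", measured_vv_vf[mx],
          ".wv/.wf:", measured_wv_wf[mx])
# The grouped class overshoots .vv/.vf at M2/M4 (8 vs 5, 16 vs 8) and
# slightly undershoots .wv/.wf there (8 vs 9, 16 vs 17).
```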


// Pattern for vfwmacc, vfwnmacc, etc: e16 = 5/5/5/8; e32 = 6/6/7/8
// Use existing 6,6,7,8 as close approximation
let Latency = SMX60GetComplexFPLatency<mx>.c in {
Collaborator

Stylistic idea - rather than giving your helper functions names, maybe just do Get6678 and variants? The numbers seem to actually be the unique bits, and your comments approximate.

Another option - have a table lookup helper, and embed the interesting part (the m1 and above) as an array in the callsite? You repeat the numbers in the comment anyways.
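A rough Python analogue of the suggested table-lookup helper might look like this; the helper name and signature are invented for illustration and would need to be expressed in TableGen for the real model:

```python
def lookup_latency(mx, m1_and_up, fractional=None):
    """Index a per-LMUL latency table [M1, M2, M4, M8]; fractional
    LMULs fall back to the M1 entry unless given explicitly."""
    order = {"M1": 0, "M2": 1, "M4": 2, "M8": 3}
    if mx in order:
        return m1_and_up[order[mx]]
    return fractional if fractional is not None else m1_and_up[0]

# The "6,6,7,8" pattern from the comment, embedded at the callsite:
print(lookup_latency("M4", [6, 6, 7, 8]))   # -> 7
print(lookup_latency("MF2", [6, 6, 7, 8]))  # -> 6
```

The interesting per-LMUL numbers then live at the callsite, next to the comment that already repeats them.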

foreach sew = SchedSEWSet<mx, 1>.val in {
defvar IsWorstCase = SMX60IsWorstCaseMXSEW<mx, sew, SchedMxListF, 1>.c;

// Slightly increased latencies for e32mf2=24 (should be 12)
Collaborator

Is this actually the same for all SEW at fractional LMUL?

Member Author

It isn't:

  • e16mf4 = 12
  • e16mf2 = 24
  • e32mf2 = 12

This code generates:

  • e16mf4 = 12
  • e16mf2 = 24
  • e32mf2 = 24 (increased latency)

!eq(sew, 64) : 12 // e64: 12*LMUL
);

let Latency = !mul(SMX60GetLMulCycles<mx>.c, LatencyMul) in {
Collaborator

As written, this would seem to say that an m1 is cheaper than an mf2 at SEW64. That seems suspect?

This looks like it might be an unpipelined DLEN unit w/SEW sensitive latency?

@mikhailramalho mikhailramalho changed the title [WIP][RISCV] Update SpacemiT X60 vector scheduling model with measured latencies [RISC-V] Update SpacemiT X60 vector scheduling model with measured latencies Jun 30, 2025
@mikhailramalho mikhailramalho requested a review from topperc June 30, 2025 19:36

// Pattern of vmacc, vmadd, vmul, vmulh, etc.: e8/e16 = 4/4/5/8, e32 = 5/5/5/8,
// e64 = 7/8/16/32. We use the worst case until we can split by SEW.
// TODO: change WriteVIMulV, etc to be defined with LMULSEWSchedWrites
Member

Personally, I tend to agree that we can make multiplication's SchedWrite SEW-dependent.
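The worst-case grouping described in the code comment can be made concrete with a short sketch. The table below just transcribes the per-SEW latencies quoted in that comment; nothing here is newly measured:

```python
# Per-SEW multiply latencies from the code comment, as [M1, M2, M4, M8]:
mul_latency = {
    8:  [4, 4, 5, 8],
    16: [4, 4, 5, 8],
    32: [5, 5, 5, 8],
    64: [7, 8, 16, 32],
}

# Until WriteVIMulV et al. are defined with LMULSEWSchedWrites, one class
# must cover every SEW, so the model takes the per-LMUL maximum:
worst_case = [max(col) for col in zip(*mul_latency.values())]
print(worst_case)  # -> [7, 8, 16, 32], i.e. e64 dominates at every LMUL
```

Splitting the SchedWrite by SEW would let e8/e16/e32 keep their much smaller measured values.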

@LiqinWeng LiqinWeng self-requested a review July 2, 2025 02:19
4 participants