[RISC-V] Update SpacemiT X60 vector scheduling model with measured latencies #144564
Conversation
// Single issue for vector store/load instructions
def SMX60_VLS : ProcResource<1>;

def SMX60_VIEU : ProcResource<1>;
Is there actually a separate VIEU? Or is this a single int and float unit?
We had the same question. From the C908 manual, this is the VFPU section:
2.2.3 VFPU
FPUs include the floating-point arithmetic logic unit (FALU), floating-point fused multiply-add unit (FMAU), and floating-point divide and square root unit (FDSU). They support half-precision and single-precision operations. The FALU performs operations such as addition, subtraction, comparison, conversion, register data transmission, sign injection, and classification. The FMAU performs operations such as common multiplication and fused multiply-add operations. The FDSU performs operations such as floating-point division and square root operations. The vector execution unit is developed by extending the floating-point unit. On the basis of the original scalar floating-point computation, floating-point units can be extended to vector floating-point units. Vector floating-point units include the vector floating-point arithmetic logic unit (VFALU), vector floating-point fused multiply-add unit (VFMAU), and vector floating-point divide and square root unit (VFDSU).
Vector floating-point units support vector floating-point computation of different bits. In addition, vector integer units are added. Vector integer units include the vector arithmetic logic unit (VALU), vector shift unit (VSHIFT), vector multiplication unit (VMUL), vector division unit (VDIVU), vector permutation unit (VPERM), vector reduction unit (VREDU), and vector logical operation unit (VMISC).
Bolded part (the sentences about vector integer units being added) by me. We assumed it's saying the core has both floating-point and integer vector units, rather than FP units that also cover integer operations.
@zqb-all, could you help us clarify this?
Yes, the X60 has a separate VIEU.
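For readers following along, a minimal sketch of what "separate units" means for the model; SMX60_VIEU is the resource from the patch, while SMX60_VFP is an assumed illustrative name for the floating-point side, not necessarily what the patch uses:

// Vector integer and vector floating-point work bind to distinct resources.
// SMX60_VIEU is from the patch; SMX60_VFP is an assumed name for this sketch.
def SMX60_VIEU : ProcResource<1>;  // vector integer execution unit
def SMX60_VFP  : ProcResource<1>;  // vector floating-point unit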
}

// Simple division and remainder operations
// Pattern of vdivu: 11/11/11
The split here between simple and complex seems mildly convoluted. It looks like this might actually be NumDLEN * 12?
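A hedged sketch of that reading, purely for illustration: SMX60GetDLENChunks is a hypothetical helper (not defined by the patch) that would return the number of DLEN-wide chunks an operation covers, and the mx/sew/IsWorstCase context is assumed from the surrounding foreach loops; the multiclass usage follows the existing SEW-dependent division handling in RISCVScheduleV.td.

// Hypothetical NumDLEN * 12 formulation: ~12 cycles per DLEN-wide chunk,
// instead of splitting division into "simple" and "complex" classes.
// SMX60GetDLENChunks is an assumed helper name, not something in the patch.
let Latency = !mul(SMX60GetDLENChunks<mx, sew>.c, 12) in {
  defm "" : LMULSEWWriteResMXSEW<"WriteVIDivV", [SMX60_VIEU], mx, sew,
                                 IsWorstCase>;
}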
Another batch of mostly stylistic comments, and a few cases where the code structure is mildly suspicious and should be checked against the latency data.
// Arithmetic scaling pattern (4,4,4,4,4,5,8): minimal increase at M4
// Used for: arithmetic (add/sub/min/max), saturating/averaging, FP add/sub/min/max
class SMX60GetArithmeticLatency<string mx> {
Observation triggered by something you said offline.
These look a lot like a dual-ported, pipelined design with the unit being released after the instruction is complete.
That is, Latency = 4, ReleaseAtCycle =
8 isn't an exact match - we'd expect 7, but it's awfully close. A number of your other sets look like single- or double-ported pipelined variants too.
Maybe we can use the data from camel-cdr to set the ReleaseAtCycle for all instructions?
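To make the hypothesis concrete, a hedged sketch under the dual-ported reading. The ReleaseAtCycles value of roughly LMUL/2 is an assumption derived from the comment above, not a measurement, and the SMX60IsWorstCaseMX helper name is assumed to match the patch's style (only SMX60IsWorstCaseMXSEW appears in the quoted diff); LMULWriteResMX and SchedMxList are from RISCVScheduleV.td.

// Dual-ported, pipelined reading of the 4/4/4/4/4/5/8 pattern: a fixed
// latency of 4 plus a resource hold that grows with LMUL (about LMUL/2
// cycles for two ports). The exact ReleaseAtCycles values are assumptions.
foreach mx = SchedMxList in {
  defvar LMulLat = SMX60GetLMulCycles<mx>.c;
  defvar IsWorstCase = SMX60IsWorstCaseMX<mx, SchedMxList>.c;
  let Latency = 4, ReleaseAtCycles = [!div(!add(LMulLat, 1), 2)] in
    defm "" : LMULWriteResMX<"WriteVIALUV", [SMX60_VIEU], mx, IsWorstCase>;
}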
// Pattern for vfwsub/vfwadd.wv, vfwsub/vfwadd.wf: 5/5/9/17
// TODO: Split .wf/.wv variants into separate scheduling classes to use 5/5/9/17
defvar LMulLat = SMX60GetLMulCycles<mx>.c;
let Latency = !mul(LMulLat, 4) in {
This code doesn't appear to match the comment just above.
The problem here is that the comment is a bit confusing; I'll improve it. These are the latencies for:
- vfwsub/vfwadd.vv, vfwsub/vfwadd.vf: e16mf4=4, e16mf2=4, e16m1=4, e16m2=5, e16m4=8, e32mf2=4, e32m1=4, e32m2=5, e32m4=8
- vfwsub/vfwadd.wv, vfwsub/vfwadd.wf: e16mf4=5, e16mf2=5, e16m1=5, e16m2=9, e16m4=17, e32mf2=5, e32m1=5, e32m2=9, e32m4=17
SMX60GetLMulCycles returns the following: MF4=1, MF2=1, M1=1, M2=2, M4=4, which, when multiplied by 4, gives us MF4=4, MF2=4, M1=4, M2=8, M4=16.
Unfortunately, vfwsub/vfwadd.vv and vfwsub/vfwadd.vf end up modeled with greater latency than measured at higher LMUL, because they are all grouped together.
// Pattern for vfwmacc, vfwnmacc, etc: e16 = 5/5/5/8; e32 = 6/6/7/8
// Use existing 6,6,7,8 as close approximation
let Latency = SMX60GetComplexFPLatency<mx>.c in {
Stylistic idea - rather than giving your helper functions names, maybe just do Get6678 and variants? The numbers seem to actually be the unique bits, and your comments are approximations anyway.
Another option - have a table-lookup helper, and embed the interesting part (M1 and above) as an array at the call site? You repeat the numbers in the comment anyway.
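Something like the following is presumably what the table-lookup idea would look like in TableGen - a hedged sketch, with SMX60LatencyTable as an illustrative name; only the M1-and-above entries are passed, and fractional LMULs fall back to the M1 value:

// Hypothetical table-lookup helper: the measured M1/M2/M4/M8 latencies are
// spelled out at the call site instead of being hidden behind a named class.
class SMX60LatencyTable<string mx, int m1, int m2, int m4, int m8> {
  int c = !cond(
    !eq(mx, "M2") : m2,
    !eq(mx, "M4") : m4,
    !eq(mx, "M8") : m8,
    true          : m1  // M1 and all fractional LMULs
  );
}

// Call site for the 6/6/7/8 pattern quoted above:
//   let Latency = SMX60LatencyTable<mx, 6, 6, 7, 8>.c in { ... }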
foreach sew = SchedSEWSet<mx, 1>.val in {
  defvar IsWorstCase = SMX60IsWorstCaseMXSEW<mx, sew, SchedMxListF, 1>.c;

  // Slightly increased latencies for e32mf2=24 (should be 12)
Is this actually the same for all SEW at fractional LMUL?
It isn't:
- e16mf4 = 12
- e16mf2 = 24
- e32mf2 = 12
This code generates:
- e16mf4 = 12
- e16mf2 = 24
- e32mf2 = 24 (increased latency)
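For reference, a hedged sketch of one way the generated values could follow the measurements above: distinguish e16mf2 explicitly instead of keying on LMUL alone. The mx/sew variables are assumed from the surrounding foreach loops in the patch.

// e16mf2 measured 24; e16mf4 and e32mf2 measured 12. Checking SEW together
// with LMUL avoids inflating e32mf2. (Sketch only; variable names assumed.)
defvar FracLat = !if(!and(!eq(mx, "MF2"), !eq(sew, 16)), 24, 12);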
  !eq(sew, 64) : 12 // e64: 12*LMUL
);

let Latency = !mul(SMX60GetLMulCycles<mx>.c, LatencyMul) in {
As written, this would seem to say that an M1 is cheaper than an MF2 at SEW=64. That seems suspect?
This looks like it might be an unpipelined DLEN-wide unit with SEW-sensitive latency?
I double-checked the data and re-ran the experiments, but got the same result: https://docs.google.com/spreadsheets/d/1u2LF8Uux0BS2_U9zJsG6DE1zUsoPxBd_zH31ze1313o/edit?gid=2036469797#gid=2036469797&range=1210:1224
// Pattern of vmacc, vmadd, vmul, vmulh, etc.: e8/e16 = 4/4/5/8, e32 = 5,5,5,8,
// e64 = 7,8,16,32. We use the worst-case until we can split the SEW.
// TODO: change WriteVIMulV, etc to be defined with LMULSEWSchedWrites
Personally, I kind of agree that we can make multiplication's SchedWrite SEW-dependent.
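A hedged sketch of what that SEW split could look like once WriteVIMulV is declared with LMULSEWSchedWrites. Only the SEW-dependent base latency is shown; the LMUL scaling from the existing code would still need to be folded in, and the helper/multiclass names are assumed to follow the patch's existing division handling rather than copied from it.

// Per-SEW base latencies from the measurements quoted above (e8/e16 start at
// 4, e32 at 5, e64 at 7). LMUL scaling is intentionally omitted for brevity.
foreach mx = SchedMxList in {
  foreach sew = SchedSEWSet<mx>.val in {
    defvar IsWorstCase = SMX60IsWorstCaseMXSEW<mx, sew, SchedMxList>.c;
    defvar MulLat = !cond(
      !eq(sew, 64) : 7,
      !eq(sew, 32) : 5,
      true         : 4
    );
    let Latency = MulLat in
      defm "" : LMULSEWWriteResMXSEW<"WriteVIMulV", [SMX60_VIEU], mx, sew,
                                     IsWorstCase>;
  }
}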
Updates the SpacemiT X60 scheduling model with actual latencies measured on BPi-F3 hardware.
tl;dr: execution time is neutral on SPEC after this patch. There is a code size regression, described in issue #146407.
Changes:
Completed:
Missing:
Performance Impact:
Known Issues:
Planned follow-up PRs: