Arm Neoverse scheduling models have a way too large decode bandwidth (about 2x the actual) #136374

Open
@camel-cdr

Description

I noticed that the Arm Neoverse scheduling models have a way too large decode bandwidth: https://godbolt.org/z/54hPqeqdK

I tested how many independent adds llvm-mca thinks the cores can decode per cycle and compared that with the actual decode width from the Arm Software Optimization Guides:

| CPU | llvm-mca | Arm Software Optimization Guide, §4.1 "Dispatch constraints" |
|---|---|---|
| Neoverse-V1 | 15 | 8 |
| Neoverse-V2 | 16 | 8 |
| Neoverse-V3 | 16 | 10 |
| Neoverse-N1 | 8 | 4 |
| Neoverse-N2 | 10 | 5 |
| Neoverse-N3 | 10 | 5 |
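A minimal llvm-mca input for such a measurement might look like the following. This is a sketch, not the exact snippet from the godbolt link above; the register choices are arbitrary:

```asm
// A block of independent 64-bit adds with no data dependencies
// between them, so the sustained IPC reported by llvm-mca is
// limited only by the modeled decode/dispatch width.
add x0, x0, #1
add x1, x1, #1
add x2, x2, #1
add x3, x3, #1
add x4, x4, #1
add x5, x5, #1
add x6, x6, #1
add x7, x7, #1
add x8, x8, #1
add x9, x9, #1
add x10, x10, #1
add x11, x11, #1
add x12, x12, #1
add x13, x13, #1
add x14, x14, #1
add x15, x15, #1
```

Run with e.g. `llvm-mca -mtriple=aarch64 -mcpu=neoverse-v1 test.s`; the reported IPC reflects the modeled width rather than the hardware's decode limit.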

The decode/issue width currently used in the scheduling models seems to correspond to the number of uops that can be dispatched, not to the number of MOPs that are decoded or read from the op cache.
Still, unless the cores are capable of fusing independent additions, they shouldn't be able to decode the instructions this quickly.

Here is a code snippet where the additional decode capabilities cause an impossible result: https://godbolt.org/z/GbGrKWxsq
Here the V1 executes a loop with 13 instructions at 13 IPC, even though it should only be able to decode up to 8 instructions per cycle.
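For illustration, one plausible shape of such a loop (assumed; the actual snippet is in the godbolt link and may differ):

```asm
// Sketch: a loop body of 13 instructions -- 11 independent adds,
// a counter decrement, and the backward branch. With enough
// execution units all 13 can issue per cycle in the model, so
// llvm-mca reports ~13 IPC for -mcpu=neoverse-v1, exceeding the
// core's real 8-wide decode.
loop:
  add x1, x1, #1
  add x2, x2, #1
  add x3, x3, #1
  add x4, x4, #1
  add x5, x5, #1
  add x6, x6, #1
  add x7, x7, #1
  add x8, x8, #1
  add x9, x9, #1
  add x10, x10, #1
  add x11, x11, #1
  subs x0, x0, #1   // decrement loop counter, set flags
  b.ne loop         // branch while counter != 0
```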
