Arm Neoverse scheduling models have a way too large decode bandwidth (about 2x the actual) #136374

Open
@camel-cdr

Description

I noticed that the Arm Neoverse scheduling models have a way too large decode bandwidth: https://godbolt.org/z/54hPqeqdK

I tested how many independent adds llvm-mca thinks the cores can decode per cycle and compared that with the actual decode width from the Arm Software Optimization Guides:

| CPU | llvm-mca | Arm Software Optimization Guide, §4.1 "Dispatch constraints" |
|---|---|---|
| Neoverse-V1 | 15 | 8 |
| Neoverse-V2 | 16 | 8 |
| Neoverse-V3 | 16 | 10 |
| Neoverse-N1 | 8 | 4 |
| Neoverse-N2 | 10 | 5 |
| Neoverse-N3 | 10 | 5 |
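A minimal llvm-mca input for such a measurement might look like the following. This is a sketch, not the exact snippet from the godbolt link above; the register choices are arbitrary:

```asm
// A block of independent 64-bit adds with no data dependencies
// between them, so the sustained IPC reported by llvm-mca is
// limited only by the modeled decode/dispatch width.
add x0, x0, #1
add x1, x1, #1
add x2, x2, #1
add x3, x3, #1
add x4, x4, #1
add x5, x5, #1
add x6, x6, #1
add x7, x7, #1
add x8, x8, #1
add x9, x9, #1
add x10, x10, #1
add x11, x11, #1
add x12, x12, #1
add x13, x13, #1
add x14, x14, #1
add x15, x15, #1
```

Run with e.g. `llvm-mca -mtriple=aarch64 -mcpu=neoverse-v1 test.s`; the reported IPC reflects the modeled width rather than the hardware's decode limit.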

The decode/issue width currently used in the scheduling models seems to correspond to the number of uops that can be dispatched, not to the number of MOPs that are decoded or read from the op cache.
Still, unless the cores are capable of fusing independent additions, they shouldn't be able to decode the instructions this quickly.

Here is a code snippet where the additional decode capabilities cause an impossible result: https://godbolt.org/z/GbGrKWxsq
Here the V1 executes a loop with 13 instructions at 13 IPC, even though it should only be able to decode up to 8 instructions per cycle.
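For illustration, one plausible shape of such a loop (assumed; the actual snippet is in the godbolt link and may differ):

```asm
// Sketch: a loop body of 13 instructions -- 11 independent adds,
// a counter decrement, and the backward branch. With enough
// execution units all 13 can issue per cycle in the model, so
// llvm-mca reports ~13 IPC for -mcpu=neoverse-v1, exceeding the
// core's real 8-wide decode.
loop:
  add x1, x1, #1
  add x2, x2, #1
  add x3, x3, #1
  add x4, x4, #1
  add x5, x5, #1
  add x6, x6, #1
  add x7, x7, #1
  add x8, x8, #1
  add x9, x9, #1
  add x10, x10, #1
  add x11, x11, #1
  subs x0, x0, #1   // decrement loop counter, set flags
  b.ne loop         // branch while counter != 0
```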
