ARM: Pair Memory Instructions Microops #1545
mahyarsamani
started this conversation in
gem5-dev
Replies: 2 comments
-
@ivanaamit can you add this to the agenda for the dev meeting on Thursday?
-
Thanks @mahyarsamani; can I ask where the HW information is coming from? Usually uops are HW-specific and shouldn't be visible to the software (profiler).
-
Hi,
I am working on building a reliable model of the Neoverse N1 platform found in the Ampere Altra Q80 processor. I've been using the NAS Parallel Benchmarks to compare measurements in detail between gem5 and real hardware. I use PAPI to measure things like the number of cycles and the number of load and store instructions.
I started my experiments with the model of the Neoverse N1 core found here:
https://github.com/binebrank/gem5/blob/neoverse_model/configs/common/cores/arm/O3_ARM_Neoverse_N1.py
Using the CHI protocol, I have configured a cache system "similar" to CMN-600 in gem5.
Looking at the stats in gem5, I see up to a 4x difference between gem5 and real hardware in the number of committed store instructions. Digging deeper, I found the source of the issue to be the way pair memory instructions are microcoded in gem5. Below is data from 4 workloads, each simulated in gem5 and measured on real hardware.
Workload 0:
Number of store instructions on gem5: 9134773
Number of store instructions on real hardware: 9125397
Details of pair memory instructions:
8243x ['strxi_uop x30, [ureg0, #8].', 'strxi_uop x29, [ureg0].']
Workload 1:
Number of store instructions on gem5: 222628
Number of store instructions on real hardware: 150693
Details of pair memory instructions:
65983x ['strxi_uop x30, [ureg0].', 'strxi_uop x19, [ureg0, #8].']
Workload 2:
Number of store instructions on gem5: 4258729
Number of store instructions on real hardware: 2967315
Details of pair memory instructions:
987698x ['strqbfpxr_uop w1, [w11, w24].', 'strqtfpxr_uop w1, [w11, w24].']
Workload 3:
Number of store instructions on gem5: 39952364
Number of store instructions on real hardware: 10040649
Details of pair memory instructions:
9949696x ['strqtfpxi_uop x0, [ureg0, #16].', 'strqbfpxi_uop x0, [ureg0, #16].', 'strqbfpxi_uop x1, [ureg0].', 'strqtfpxi_uop x1, [ureg0].']
From this data, it seems that most pair memory instructions take 1 microop on real hardware, as opposed to 1, 2, or 4 in gem5. Given that these instructions could cause a significant performance difference between gem5 and real hardware, I wonder whether there are any recommendations or plans to change the way these instructions are microcoded.
For easier navigation in gem5's code, here are the links to where the microops are defined:
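As a rough sanity check of that reading, the sketch below recomputes the gem5 store counts as if each pair instruction were counted once instead of once per microop. It uses only the numbers listed above. The helper name `collapse_pairs` is made up for illustration, and the sketch assumes that the sequence shown for each workload is the only pair-instruction pattern contributing extra stores and that gem5 counts every microop in it as a store instruction.

```python
# Counts copied from the four workloads above:
# (gem5 stores, hardware stores, pair-instruction occurrences, microops per pair instruction)
workloads = [
    (9134773, 9125397, 8243, 2),
    (222628, 150693, 65983, 2),
    (4258729, 2967315, 987698, 2),
    (39952364, 10040649, 9949696, 4),
]


def collapse_pairs(gem5_stores, pair_count, uops_per_pair):
    """Estimate the gem5 store count if each pair instruction counted as one store.

    Assumption: every microop in the listed sequence is counted as a store,
    so each occurrence contributes (uops_per_pair - 1) extra stores.
    """
    return gem5_stores - pair_count * (uops_per_pair - 1)


for i, (gem5, hw, pairs, uops) in enumerate(workloads):
    adjusted = collapse_pairs(gem5, pairs, uops)
    print(
        f"Workload {i}: gem5/hw = {gem5 / hw:.2f}, "
        f"adjusted = {adjusted}, adjusted/hw = {adjusted / hw:.2f}"
    )
```

Under these assumptions the adjusted counts land much closer to the hardware numbers, which is what one would expect if the hardware counts a pair store once. It is only an estimate from the aggregate counts, not a measurement.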
https://github.com/gem5/gem5/blob/stable/src/arch/arm/insts/macromem.hh#L473
https://github.com/gem5/gem5/blob/stable/src/arch/arm/insts/macromem.cc#L244
Mentioning Giacomo and Tiago to get their opinion. @giactra @tiagormk