I wanted to reproduce a few performance numbers, in case we wanted to include some in the vCHEP paper. So I went back to older commits and tried to reproduce some numbers. In particular, I went back to commit 715d26f, because this is the first one where I added some numbers to the commit logs.
I was very surprised to find that I could not reproduce the C++ throughput on my itscrd03 node. Eventually, I understood that this node gives VERY different performance (a factor 3 to 4 slower!) than an lxbatch GPU node (like the one I had used for the performance recorded in the commit log) or my physical node pmpe04.
The CPUs are similar Intel models. I wondered if there was a problem with memory access or swappiness, but it does not look like it. Eventually, I played a bit with flamegraphs and found that itscrd03 seems to spend 70% of its CPU time in clock_gettime, unlike the other nodes.
This is itscrd03:
```
time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 3.760869e+00 sec
MeanTimeInWaveFuncs   = 9.564384e-06 sec
StdDevTimeInWaveFuncs = 4.128436e-06 sec
MinTimeInWaveFuncs    = 7.822000e-06 sec
MaxTimeInWaveFuncs    = 6.076190e-04 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 1.045546e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m7.065s
user    0m2.196s
sys     0m4.864s
```
And this is pmpe04, using the exact same executable (shared on AFS):
```
time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 1.143175e+00 sec
MeanTimeInWaveFuncs   = 2.907244e-06 sec
StdDevTimeInWaveFuncs = 6.344392e-07 sec
MinTimeInWaveFuncs    = 2.530000e-06 sec
MaxTimeInWaveFuncs    = 8.822100e-05 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 3.439683e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m1.777s
user    0m1.711s
sys     0m0.063s
```
Note that the pmpe04 throughput of 3.5E5 is compatible with the 3.0E5 I had recorded on lxbatch (a bit higher, as it is also a physical machine). The itscrd03 throughput, instead, is a factor 3 lower, which is why I started this investigation. Note also the very different sys times above (4.9 s on itscrd03 vs 0.06 s on pmpe04), which hints that on itscrd03 the clock_gettime calls are going through real system calls rather than a fast user-space path.
Here are two flamegraphs (SVG converted to PNG for GitHub):
itscrd03 has 70% of clock_gettime

pmpe04 does not have that overhead

This is about old code; however, there are two things that it may be useful to understand:
- Why is itscrd03 behaving differently? Is there something we need to configure differently?
- Also, do these calls come from the timer.h inside the code? If they do, the issue may well still be present today; I do not think we changed that part of the code.
Nothing urgent, but useful to keep in mind...