High CPU use in clock_gettime (TSC clocksource unavailable on itscrd03) #116

@valassi

Description

I wanted to reproduce a few performance numbers, in case we wanted to include some in the vCHEP paper. So I went back in time to older commits and tried to reproduce some numbers. In particular, I went back to commit 715d26f, because this is the first one where I added some numbers to the logs.

I was very surprised to find that I could not reproduce the C++ throughput on my itscrd03. Eventually, I understood that this node gives a VERY different performance (a factor of 3 to 4 slower!) than an lxbatch GPU node (like the one I had used for the performance recorded in the commit log) or my physical node pmpe04.

The CPUs are similar Intel models. I wondered if there was a problem with memory access or swappiness, but it does not look like it. Eventually I played a bit with flamegraphs and found that itscrd03 seems to spend 70% of its CPU time in clock_gettime, unlike the other nodes.

This is itscrd03:

time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 3.760869e+00 sec
MeanTimeInWaveFuncs   = 9.564384e-06 sec
StdDevTimeInWaveFuncs = 4.128436e-06 sec
MinTimeInWaveFuncs    = 7.822000e-06 sec
MaxTimeInWaveFuncs    = 6.076190e-04 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 1.045546e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m7.065s
user    0m2.196s
sys     0m4.864s

And this is pmpe04 using the exact same executable (shared on AFS):

time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 1.143175e+00 sec
MeanTimeInWaveFuncs   = 2.907244e-06 sec
StdDevTimeInWaveFuncs = 6.344392e-07 sec
MinTimeInWaveFuncs    = 2.530000e-06 sec
MaxTimeInWaveFuncs    = 8.822100e-05 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 3.439683e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m1.777s
user    0m1.711s
sys     0m0.063s

Note that the pmpe04 throughput of 3.5E5 is compatible with the 3.0E5 I had recorded on lxbatch (a bit higher, but pmpe04 is also a physical machine). The itscrd03 throughput, instead, is a factor of 3 lower, which is why I started this investigation.

Here are two flamegraphs (SVG converted to PNG for GitHub):

itscrd03 has 70% of its time in clock_gettime:
[flamegraph: perf-itscrd03]

pmpe04 does not have that overhead:
[flamegraph: perf-pmpe04]

This is about old code; however, there are two things that may be useful to understand:

  • Why is itscrd03 behaving differently? Is there something we need to configure differently?
  • Also, do these calls come from the timer.h inside the code? If they do, the issue may well be present even today: I do not think we have changed that part of the code.

Nothing urgent, but useful to keep in mind...
