When benchmarking files of different sizes, I saw a huge variation in `memcpy` performance. On my machine, "large" `memcpy` (i.e., much larger than L3, like 100 MB) runs at about 10-11 GB/s, and lzbench often reports that, but for even larger files the performance frequently drops by an order of magnitude (e.g., 1 GB/s). The effect isn't consistent: for very large files (say 1 GB) it usually happens, and for smaller files it usually doesn't, but there are exceptions on both sides (e.g., if you run it a few times with smaller files, you'll get some runs with bad performance).
Back-to-back runs often show gradual improvement, e.g., run 1 might give you 1 GB/s, then 2 GB/s, then 5 GB/s, and then it stays there.
Similarly, the slowdown sometimes affected only the "compression" side of `memcpy`, sometimes only the "decompression" side (i.e., you'd get something like 1 GB/s compression and 10 GB/s decompression, or vice versa), and often both.
I traced this down to the way the buffers are allocated: the file and compression buffers use `malloc` and the decompression buffer uses `calloc`. The issue is that for large `malloc`s (and sometimes for large `calloc`s) the memory isn't committed by the OS up front; it is only committed on first access. So the first algorithm to run pays a large penalty to page in the entire buffer.
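
To illustrate the commit-on-first-touch behavior, here's a minimal standalone C program (my own demo, not lzbench code; the 1 GB size is arbitrary) that times a first pass over a freshly `malloc`'d buffer against a second pass over the same, now-resident, memory:

```c
/* Demo: first touch of a large malloc'd buffer vs. a second pass.
 * On Linux/glibc, large allocations come from mmap'd zero pages that
 * are only committed on first write, so pass 1 pays the page-fault cost. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1024UL * 1024 * 1024)  /* 1 GB, large enough to see the effect */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;

    double t0 = now_sec();
    memset(buf, 1, BUF_SIZE);   /* first touch: faults in every page */
    double t1 = now_sec();
    memset(buf, 2, BUF_SIZE);   /* second touch: pages already resident */
    double t2 = now_sec();

    printf("first touch:  %.2f GB/s\n", BUF_SIZE / (t1 - t0) / 1e9);
    printf("second touch: %.2f GB/s\n", BUF_SIZE / (t2 - t1) / 1e9);

    free(buf);
    return 0;
}
```

On a typical Linux box the first pass is dramatically slower, because every page has to be faulted in and zeroed by the kernel before the write can proceed.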
So why doesn't this always bite, and why does the performance differ from run to run? It comes down to `DEFAULT_LOOP_TIME` (100 ms): if an algorithm executes in less than that, it gets a second run, which runs at full speed, and since `FASTEST` is the default mode for picking a time, you get a full-speed result. Somewhere between 100 MB and 1,000 MB on my box, the first `memcpy` run starts taking more than 100 ms, so it doesn't get a second run and the slow time is reported.
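
Here's a rough sketch of that policy as I understand it; `DEFAULT_LOOP_TIME` and the `FASTEST` mode are real lzbench names, but everything else below (`bench_fastest`, `nanotime`, the loop structure) is my own simplified reconstruction, not the actual source:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define DEFAULT_LOOP_TIME_NS (100ULL * 1000 * 1000)  /* the 100 ms budget */

static uint64_t nanotime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Re-run `run` until the time budget is spent and report the fastest
 * iteration. A first run that alone exceeds 100 ms exits the loop
 * immediately, so the cold, page-faulting time is the one reported. */
static uint64_t bench_fastest(void (*run)(void))
{
    uint64_t best = UINT64_MAX, total = 0;
    do {
        uint64_t t0 = nanotime();
        run();
        uint64_t elapsed = nanotime() - t0;
        if (elapsed < best)
            best = elapsed;
        total += elapsed;
    } while (total < DEFAULT_LOOP_TIME_NS);
    return best;
}

static void dummy_work(void) { /* stand-in for one compress/decompress pass */ }

int main(void)
{
    printf("fastest iteration: %llu ns\n",
           (unsigned long long)bench_fastest(dummy_work));
    return 0;
}
```

With this structure, a fast first run leaves budget for warm re-runs that mask the page-in cost, while a slow first run consumes the whole budget and its cold time is what `FASTEST` ends up picking.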