Description
Across a couple of workloads (Node-DC-EIS and Ghost), I noticed that practically all the page walks are for 4K pages.
Here is a specific example from Node-DC-EIS (normalized per transaction) on a Xeon Platinum 8180 server.
Event | Count per transaction |
---|---|
ITLB_MISSES.WALK_COMPLETED | 6,872.3739 |
ITLB_MISSES.WALK_COMPLETED_2M_4M | 2.3691 |
ITLB_MISSES.WALK_COMPLETED_4K | 6,869.9723 |
According to TMAM (Top-down Microarchitecture Analysis Method), this results in about 16% of cycles being stalled in the CPU front end performing page walks.
Several runtimes (Java JVM, PHP, HHVM) support large pages. They place the hot static code segments, the dynamic JIT code segments, or both in 2M large pages. This typically yields a performance improvement of several percent, depending on how many cycles are stalled on page walks.
I would like to hear what the community thinks of this, and I would also be interested in seeing more data from other workloads. The following perf commands are an easy way to get this data for your workload.
perf stat -e cpu/event=0x85,umask=0xe,name=itlb_misses_walk_completed/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed_4k/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_completed_2m_4m/ -- sleep 30
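If it helps when comparing workloads, here is a small Python sketch that turns the three counters into 4K and 2M/4M shares. It assumes the counters were collected in one run with perf's CSV output (`-x,`) written to a file (`-o`), and that the event names match the `name=` values used in the commands above. The function names are made up for illustration. Keep in mind that `perf stat ... -- sleep 30` on its own only counts the `sleep` process; you would likely want to add `-a` (system-wide) or `-p <pid>` to measure the workload itself.

```python
# Assumed collection step (one run, CSV output written to counts.csv):
#   perf stat -a -x, -o counts.csv \
#     -e cpu/event=0x85,umask=0xe,name=itlb_misses_walk_completed/ \
#     -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed_4k/ \
#     -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_completed_2m_4m/ \
#     -- sleep 30

TOTAL = "itlb_misses_walk_completed"
WALK_4K = "itlb_misses_walk_completed_4k"
WALK_2M_4M = "itlb_misses_walk_completed_2m_4m"


def parse_perf_csv(lines):
    """Return {event_name: count} from `perf stat -x,` output lines."""
    counts = {}
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 3:
            continue  # header/comment or blank line
        value, event = fields[0], fields[2]  # fields[1] is the unit
        try:
            counts[event] = float(value)
        except ValueError:
            # Rows such as "<not counted>" carry no numeric value.
            continue
    return counts


def walk_shares(counts):
    """Fraction of completed ITLB walks per page size, or {} if no total."""
    total = counts.get(TOTAL, 0.0)
    if total <= 0:
        return {}
    return {
        "4k": counts.get(WALK_4K, 0.0) / total,
        "2m_4m": counts.get(WALK_2M_4M, 0.0) / total,
    }
```

For example, `walk_shares(parse_perf_csv(open("counts.csv")))` returns a dict with the two fractions, which you can print with something like `f"{share:.3%}"`.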
A simple implementation would start by mapping the entire `.text` segment into large pages (this would be about 20 lines of code on Linux), and it would work reasonably well on modern CPUs. On older CPUs (such as Sandy Bridge), which have only a single-level 2M TLB, this is not efficient; a more efficient implementation would map only the hot part of the `.text` segment to large pages.
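To get a feel for how much of the code could be covered, the following Python sketch estimates the 2M-aligned part of a process's executable mapping from `/proc/<pid>/maps`. This is not the remapping itself (that would need native `mmap`/`madvise` code inside the runtime); it is only a rough sizing aid. It assumes the text segment shows up as an executable, file-backed mapping whose path equals the binary's path, and the function names are hypothetical.

```python
import os

HUGE_PAGE = 2 * 1024 * 1024  # 2 MiB large page size on x86-64


def text_mappings(maps_lines, binary_path):
    """Yield (start, end) for executable mappings backed by binary_path."""
    for line in maps_lines:
        # Format: "start-end perms offset dev inode path"
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping, no path column
        addr_range, perms, path = parts[0], parts[1], parts[5].strip()
        if path == binary_path and "x" in perms:
            start, end = (int(x, 16) for x in addr_range.split("-"))
            yield start, end


def huge_page_span(start, end, page=HUGE_PAGE):
    """Largest page-aligned sub-range of [start, end), or None if none fits."""
    aligned_start = (start + page - 1) & ~(page - 1)  # round up
    aligned_end = end & ~(page - 1)                   # round down
    if aligned_end <= aligned_start:
        return None
    return aligned_start, aligned_end


def report(pid="self"):
    """Print how many 2 MiB pages fit in each text mapping of a process."""
    binary = os.path.realpath(f"/proc/{pid}/exe")
    with open(f"/proc/{pid}/maps") as maps:
        for start, end in text_mappings(maps, binary):
            span = huge_page_span(start, end)
            pages = 0 if span is None else (span[1] - span[0]) // HUGE_PAGE
            print(f"{start:#x}-{end:#x}: {pages} x 2M pages")
```

Calling `report(<pid of the node process>)` on Linux prints one line per text mapping. Only the aligned sub-range can be backed by 2M pages, so the unaligned head and tail of the segment would stay on 4K pages.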