design/59960-heap-hugepage-util.md: add design
For golang/go#59960.

Change-Id: I4c6ee87db54952ccacdfa6c66b419356e5842620
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/492018
Reviewed-by: Michael Pratt <mpratt@google.com>

# A more hugepage-aware Go heap

Authors: Michael Knyszek, Michael Pratt

## Background

[Transparent huge pages (THP) admin
guide](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html).

[Go scavenging
policy](30333-smarter-scavenging.md#which-memory-should-we-scavenge).
(The implementation details are out-of-date, but the linked policy is relevant.)

[THP flag behavior](#appendix_thp-flag-behavior).

## Motivation

Currently, Go's hugepage-related policies [do not play well
together](https://github.com/golang/go/issues/55328) and have bit-rotted.[^1]
The result is that the memory regions the Go runtime chooses to mark as
`MADV_NOHUGEPAGE` and `MADV_HUGEPAGE` are somewhat haphazard, resulting in
memory overuse for small heaps.
This overuse amounts to upwards of 40% memory overhead in some cases.
Turning off huge pages entirely fixes the problem, but leaves CPU performance on
the table.
This policy also means large heaps might have dense sections that are
erroneously mapped as `MADV_NOHUGEPAGE`, costing up to 1% throughput.

The goal of this work is to eliminate this overhead for small heaps while
improving huge page utilization for large heaps.

[^1]: [Large allocations](https://cs.opensource.google/go/go/+/master:src/runtime/mheap.go;l=1344;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb)
    will force [a call to `MADV_HUGEPAGE` for any aligned huge pages
    within](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=148;drc=9839668b5619f45e293dd40339bf0ac614ea6bee),
    while small allocations tend to leave memory in an undetermined state for
    huge pages.
    The scavenger will try to release entire aligned huge pages at a time.
    Also, when any memory is released, [we `MADV_NOHUGEPAGE` any aligned pages
    in the range we
    release](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=40;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).
    However, the scavenger will [only release 64 KiB at a time unless it finds
    an aligned huge page to
    release](https://cs.opensource.google/go/go/+/master:src/runtime/mgcscavenge.go;l=564;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb),
    and even then it'll [only `MADV_NOHUGEPAGE` the corresponding huge pages if
    the region it's scavenging crosses a huge page
    boundary](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=70;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).

## Proposal

One key insight in the design of the scavenger is that the runtime always has a
good idea of how much memory will be used soon: the total heap footprint for a
GC cycle is determined by the heap goal. [^2]

[^2]: The runtime also has a first-fit page allocator so that the scavenger can
    take pages from the high addresses in the heap, again to reduce the chance
    of conflict.
    The scavenger tries to return memory to the OS such that it leaves enough
    paged-in memory around to reach the heap goal (adjusted for fragmentation
    within spans and a 10% buffer for fragmentation outside of spans, or capped
    by the memory limit).
    The purpose behind this is to reduce the chance that the scavenger will
    return memory to the OS that will be used soon.
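
For concreteness, the retention target the scavenger aims for (see [^2]) can be
sketched roughly as follows; the names and the exact fragmentation adjustment
are illustrative, not the runtime's actual computation.

```go
// scavengeRetainTarget is an illustrative sketch, not the runtime's code:
// retain enough paged-in memory to reach the heap goal, adjusted for
// fragmentation within spans, plus a 10% buffer for fragmentation outside of
// spans, but never plan to retain more than the memory limit.
func scavengeRetainTarget(heapGoal, inSpanFragmentation, memoryLimit uint64) uint64 {
	target := heapGoal + inSpanFragmentation // fragmentation within spans
	target += target / 10                    // 10% buffer for fragmentation outside of spans
	if target > memoryLimit {
		target = memoryLimit // capped by the memory limit
	}
	return target
}
```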

Indeed, by [tracing page allocations and watching page state over
time](#appendix_page-traces) we can see that Go heaps tend to get very dense
toward the end of a GC cycle; this makes all of that memory a decent candidate
for huge pages from the perspective of fragmentation.
However, it's also clear that this density fluctuates significantly within a GC
cycle.

Therefore, I propose the following policy (a code sketch restating these rules
follows the list):
1. All new memory is initially marked as `MADV_HUGEPAGE` with the expectation
   that it will be used.
1. Before the scavenger releases pages in an aligned 4 MiB region of memory, [^3]
   it [first](#appendix_thp-flag-behavior) marks the region as `MADV_NOHUGEPAGE`
   if it isn't already marked as such.
   - If `max_ptes_none` is 0, then skip this step.
1. Aligned 4 MiB regions of memory are only available to scavenge if they
   weren't at least 96% [^4] full at the end of the last GC cycle. [^5]
   - Scavenging for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` ignores this
     rule.
1. Any aligned 4 MiB region of memory that exceeds 96% occupancy is immediately
   marked as `MADV_HUGEPAGE`.
   - If `max_ptes_none` is 0, then use `MADV_COLLAPSE` instead, if available.
   - Memory scavenged for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` is not
     marked `MADV_HUGEPAGE` until the next allocation that causes this
     condition after the end of the current GC cycle. [^6]
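
The sketch below restates these rules in code. The thresholds mirror the list
above, but the names are hypothetical; this is not the runtime's
implementation, only a restatement of the policy.

```go
type thpAction int

const (
	doNothing thpAction = iota
	markHuge          // madvise(MADV_HUGEPAGE)
	collapseHuge      // madvise(MADV_COLLAPSE), only when max_ptes_none == 0
	markNoHugeRelease // madvise(MADV_NOHUGEPAGE), then release with MADV_DONTNEED
	releaseOnly       // release with MADV_DONTNEED, leaving the THP state alone
)

// onDense reports what to do when an allocation pushes an aligned 4 MiB region
// past the density threshold (rule 4). Newly mapped regions are assumed to
// already be MADV_HUGEPAGE per rule 1.
func onDense(occupancy float64, maxPtesNone int) thpAction {
	if occupancy <= 0.96 {
		return doNothing
	}
	if maxPtesNone == 0 {
		return collapseHuge
	}
	return markHuge
}

// onScavenge reports what to do when the scavenger considers an aligned 4 MiB
// region. denseAtLastGC is whether the region was at least 96% full at the end
// of the last GC cycle (rule 3); force is true when scavenging for GOMEMLIMIT
// or runtime/debug.FreeOSMemory.
func onScavenge(denseAtLastGC, force bool, maxPtesNone int) thpAction {
	if denseAtLastGC && !force {
		return doNothing // not eligible for background scavenging (rule 3)
	}
	if maxPtesNone == 0 {
		return releaseOnly // rule 2's MADV_NOHUGEPAGE step is skipped
	}
	return markNoHugeRelease // rule 2: MADV_NOHUGEPAGE before releasing
}
```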

[^3]: 4 MiB doesn't align with linux/amd64 huge page sizes, but is a very
    convenient number for the runtime because the page allocator manages memory
    in 4 MiB chunks.

[^4]: The bar for explicit (non-default) backing by huge pages must be very
    high.
    The main issue is the default value of
    `/sys/kernel/mm/transparent_hugepage/defrag` on Linux: it forces regions
    marked as `MADV_HUGEPAGE` to be immediately backed, stalling in the kernel
    until it can compact and rearrange things to provide a huge page.
    Meanwhile the combination of `MADV_NOHUGEPAGE` and `MADV_DONTNEED` does the
    opposite.
    Switching between these two states often creates really expensive churn.

[^5]: Note that `runtime/debug.FreeOSMemory` and the mechanism to maintain
    `GOMEMLIMIT` must still be able to release all memory to be effective.
    For that reason, this rule does not apply to those two situations.
    Basically, these cases get to skip waiting until the end of the GC cycle,
    optimistically assuming that memory won't be used.

[^6]: It might happen that the wrong memory was scavenged (memory that soon
    after exceeds 96% occupancy).
    This delay helps reduce churn.

The goal of these changes is to ensure that when sparse regions of the heap have
their memory returned to the OS, it stays that way regardless of
`max_ptes_none`.
Meanwhile, the policy avoids expensive churn by delaying the release of pages
that were part of dense memory regions by at least a full GC cycle.

Note that there's potentially quite a lot of hysteresis here, which could impact
memory reclaim for, for example, a brief memory spike followed by a long-lived
idle low-memory state.
In the worst case, the time between GC cycles is 2 minutes, and the scavenger's
slowest return rate is ~256 MiB/sec. [^7] I suspect this isn't slow enough to be
a problem in practice.
Furthermore, `GOMEMLIMIT` can still be employed to maintain a memory maximum.
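
As an illustrative back-of-the-envelope check, take a hypothetical 4 GiB spike
that becomes entirely free (the spike size is an assumption; the cycle length
and return rate are the figures above):

```go
// Worst-case latency to reclaim a hypothetical 4 GiB spike: wait up to one
// full GC cycle for the memory to become eligible, then release it at the
// scavenger's slowest rate. Purely illustrative arithmetic.
const (
	spikeBytes      = 4 << 30   // assumed spike size
	gcCycleSec      = 120.0     // worst-case time between GC cycles, in seconds
	slowBytesPerSec = 256 << 20 // scavenger's slowest return rate
)

// worstCaseReclaimSec works out to roughly 120s + 16s = 136s for these numbers.
var worstCaseReclaimSec = gcCycleSec + float64(spikeBytes)/float64(slowBytesPerSec)
```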

[^7]: The scavenger is much more aggressive than it once was, targeting 1% of
    total CPU usage.
    Spending 1% of one CPU core in 2018 on `MADV_DONTNEED` meant roughly 8 KiB
    released per millisecond in the worst case.
    For a `GOMAXPROCS=32` process, this worst case is now approximately 256 KiB
    per millisecond.
    In the best case, wherein the scavenger can identify whole unreleased huge
    pages, it would release 2 MiB per millisecond in 2018, so 64 MiB per
    millisecond today.

## Alternative attempts

Initially, I attempted a design where all heap memory up to the heap goal
(address-ordered) is marked as `MADV_HUGEPAGE` and ineligible for scavenging.
The rest is always eligible for scavenging, and the scavenger marks that memory
as `MADV_NOHUGEPAGE`.

This approach had a few problems:
1. The heap goal tends to fluctuate, creating churn at the boundary.
1. When the heap is actively growing, the aftermath of this churn actually ends
   up in the middle of the fully-grown heap, as the scavenger works on memory
   beyond the boundary in between GC cycles.
1. Any fragmentation that does exist in the middle of the heap, for example if
   most allocations are large, is never looked at by the scavenger.

I also tried a simple heuristic to turn off the scavenger when it looks like the
heap is growing, but not all heaps grow monotonically, so a small amount of
churn still occurred.
It's difficult to come up with a good heuristic without assuming monotonicity.

My next attempt was more direct: mark high density chunks as `MADV_HUGEPAGE`,
and allow low density chunks to be scavenged and set as `MADV_NOHUGEPAGE`.
A chunk would become high density if it was observed to have at least 80%
occupancy, and would later switch back to low density if it had less than 20%
occupancy.
This gap existed for hysteresis, to reduce churn.
Unfortunately, this also didn't work: GC-heavy programs often have memory
regions that go from extremely low (near 0%) occupancy to 100% within a single
GC cycle, creating a lot of churn.

The design above is ultimately a combination of these two designs: assume that
the heap gets generally dense within a GC cycle, but handle it on a
chunk-by-chunk basis.

Where all this differs from other huge page efforts, such as [what TCMalloc
did](https://google.github.io/tcmalloc/temeraire.html), is the lack of
bin-packing of allocated memory in huge pages (which is really the majority and
key part of that design).
Bin-packing increases the likelihood that an entire huge page will be free by
placing new memory in existing huge pages, rather than following some global
policy, such as "best-fit", that may put it anywhere.
This not only improves the efficiency of releasing memory, but also makes the
overall footprint smaller due to less fragmentation.

This is unlikely to be that useful for Go since Go's heap already, at least
transiently, gets very dense.
Another thing that gets in the way of doing the same kind of bin-packing for Go
is that the allocator's slow path gets hit much harder than TCMalloc's slow
path.
The reason for this boils down to the GC memory reuse pattern (essentially, FIFO
vs. LIFO reuse).
Slowdowns in this path will likely create scalability problems.
## Appendix: THP flag behavior
191+
192+
Whether or not pages are eligible for THP is controlled by a combination of
193+
settings:
194+
195+
`/sys/kernel/mm/transparent_hugepage/enabled`: system-wide control, possible
196+
values:
197+
- `never`: THP disabled
198+
- `madvise`: Only pages with `MADV_HUGEPAGE` are eligible
199+
- `always`: All pages are eligible, unless marked `MADV_NOHUGEPAGE`
200+
201+
`prctl(PR_SET_THP_DISABLE)`: process-wide control to disable THP
202+
203+
`madvise`: per-mapping control, possible values:
204+
- `MADV_NOHUGEPAGE`: mapping not eligible for THP
205+
- Note that existing huge pages will not be split if this flag is set.
206+
- `MADV_HUGEPAGE`: mapping eligible for THP unless there is a process- or
207+
system-wide disable.
208+
- Unset: mapping eligible for THP if system-wide control is set to “always”.
209+
210+
`/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none`: system-wide
211+
control that specifies how many extra small pages can be allocated when
212+
collapsing a group of pages into a huge page.
213+
In other words, how many small pages in a candidate huge page can be
214+
not-faulted-in or faulted-in zero pages.
215+
216+
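
For reference, this setting can be read directly from sysfs; below is a small
sketch using only the standard library (the function name is illustrative, and
the Go runtime would not necessarily read it this way):

```go
package thpexample

import (
	"os"
	"strconv"
	"strings"
)

// maxPtesNone reads khugepaged's max_ptes_none setting from sysfs.
func maxPtesNone() (int, error) {
	b, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none")
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}
```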

`MADV_DONTNEED` on a smaller range within a huge page will split the huge page
to zero the range.
However, the full huge page range will still be immediately eligible for
coalescing by `khugepaged` if `max_ptes_none > 0`, which is true for the default
open source Linux configuration.
Thus, to both disable future THP and split an existing huge page race-free, you
must first set `MADV_NOHUGEPAGE` and then call `MADV_DONTNEED`.
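
As an illustration of that ordering (not the runtime's code), using the
`Madvise` wrapper from `golang.org/x/sys/unix` on Linux:

```go
package thpexample

import "golang.org/x/sys/unix"

// splitAndDisable releases the pages of region so that they stay released:
// mark the mapping MADV_NOHUGEPAGE first, then MADV_DONTNEED, so khugepaged
// cannot immediately re-coalesce the range. Illustrative only.
func splitAndDisable(region []byte) error {
	if err := unix.Madvise(region, unix.MADV_NOHUGEPAGE); err != nil {
		return err
	}
	return unix.Madvise(region, unix.MADV_DONTNEED)
}
```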

Another consideration is the newly-upstreamed `MADV_COLLAPSE`, which collapses
memory regions into huge pages unconditionally.
`MADV_DONTNEED` can then be used to break them up.
This scheme represents effectively complete control over huge pages, provided
`khugepaged` doesn't coalesce pages in a way that undoes the `MADV_DONTNEED`
(for example, by setting `max_ptes_none` to zero).

## Appendix: Page traces

To investigate this issue I built a
[low-overhead](https://perf.golang.org/search?q=upload:20221024.9) [page event
tracer](https://go.dev/cl/444157) and [visualization
utility](https://go.dev/cl/444158) to check assumptions about application and GC
behavior.
Below are a number of traces and the conclusions drawn from them.
- [Tile38 K-Nearest benchmark](./59960/tile38.png): GC-heavy benchmark.
  Note the fluctuation between very low occupancy and very high occupancy.
  During a single GC cycle, the page heap gets at least transiently very dense.
  This benchmark caused me the most trouble when trying out ideas.
- [Go compiler building a massive package](./59960/compiler.png): Note again the
  high density.
