# A more hugepage-aware Go heap

Authors: Michael Knyszek, Michael Pratt

## Background

[Transparent huge pages (THP) admin
guide](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html).

[Go scavenging
policy](30333-smarter-scavenging.md#which-memory-should-we-scavenge).
(Implementation details are out-of-date, but the linked policy is still
relevant.)

[THP flag behavior](#appendix_thp-flag-behavior).

## Motivation

Currently, Go's hugepage-related policies [do not play well
together](https://github.com/golang/go/issues/55328) and have bit-rotted.[^1]
The result is that the memory regions the Go runtime chooses to mark as
`MADV_NOHUGEPAGE` and `MADV_HUGEPAGE` are somewhat haphazard, resulting in
memory overuse for small heaps.
This overuse amounts to upwards of 40% memory overhead in some cases.
Turning off huge pages entirely fixes the problem, but leaves CPU performance
on the table.
This policy also means large heaps might have dense sections that are
erroneously mapped as `MADV_NOHUGEPAGE`, costing up to 1% throughput.

The goal of this work is to eliminate this overhead for small heaps while
improving huge page utilization for large heaps.

[^1]: [Large allocations](https://cs.opensource.google/go/go/+/master:src/runtime/mheap.go;l=1344;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb)
    will force [a call to `MADV_HUGEPAGE` for any aligned huge pages
    within](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=148;drc=9839668b5619f45e293dd40339bf0ac614ea6bee),
    while small allocations tend to leave memory in an undetermined state for
    huge pages.
    The scavenger will try to release entire aligned huge pages at a time.
    Also, when any memory is released, [we `MADV_NOHUGEPAGE` any aligned pages
    in the range we
    release](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=40;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).
    However, the scavenger will [only release 64 KiB at a time unless it finds
    an aligned huge page to
    release](https://cs.opensource.google/go/go/+/master:src/runtime/mgcscavenge.go;l=564;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb),
    and even then it'll [only `MADV_NOHUGEPAGE` the corresponding huge pages if
    the region it's scavenging crosses a huge page
    boundary](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=70;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).

## Proposal

One key insight in the design of the scavenger is that the runtime always has a
good idea of how much memory will be used soon: the total heap footprint for a
GC cycle is determined by the heap goal. [^2]

[^2]: The scavenger tries to return memory to the OS such that it leaves enough
    paged-in memory around to reach the heap goal (adjusted for fragmentation
    within spans and a 10% buffer for fragmentation outside of spans, or capped
    by the memory limit).
    The purpose behind this is to reduce the chance that the scavenger will
    return memory to the OS that will be used soon.
    The runtime also has a first-fit page allocator so that the scavenger can
    take pages from the high addresses in the heap, again to reduce the chance
    of conflict.
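
The retention target described in the footnote amounts to roughly the following
computation.
This is a hedged, illustrative sketch: `scavengeRetainTarget`, its parameters,
and the exact fragmentation adjustment are assumptions for exposition, not the
runtime's actual code.

```go
// Illustrative sketch only; names and the exact fragmentation adjustment are
// assumptions, not the runtime's implementation.
package policy

// scavengeRetainTarget computes how much paged-in memory the scavenger should
// leave around: the heap goal, scaled up for fragmentation within spans
// (spans in use are larger than the bytes allocated from them), plus a ~10%
// buffer for fragmentation outside of spans, capped by the memory limit.
func scavengeRetainTarget(heapGoal, spanInUseBytes, allocBytes, memoryLimit uint64) uint64 {
	goal := heapGoal
	if allocBytes > 0 {
		goal = uint64(float64(heapGoal) * float64(spanInUseBytes) / float64(allocBytes))
	}
	goal += goal / 10 // buffer for fragmentation outside of spans
	if goal > memoryLimit {
		goal = memoryLimit
	}
	return goal
}
```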

Indeed, by [tracing page allocations and watching page state over
time](#appendix_page-traces) we can see that Go heaps tend to get very dense
toward the end of a GC cycle; this makes all of that memory a decent candidate
for huge pages from the perspective of fragmentation.
However, it's also clear this density fluctuates significantly within a GC
cycle.

Therefore, I propose the following policy (a sketch of the per-chunk decision
logic follows the list):
1. All new memory is initially marked as `MADV_HUGEPAGE` with the expectation
   that it will be used.
1. Before the scavenger releases pages in an aligned 4 MiB region of memory
   [^3], it [first](#appendix_thp-flag-behavior) marks it as `MADV_NOHUGEPAGE`
   if it isn't already marked as such.
   - If `max_ptes_none` is 0, then skip this step.
1. Aligned 4 MiB regions of memory are only available to scavenge if they
   weren't at least 96% [^4] full at the end of the last GC cycle. [^5]
   - Scavenging for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` ignores this
     rule.
1. Any aligned 4 MiB region of memory that exceeds 96% occupancy is immediately
   marked as `MADV_HUGEPAGE`.
   - If `max_ptes_none` is 0, then use `MADV_COLLAPSE` instead, if available.
   - Memory scavenged for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` is not
     marked `MADV_HUGEPAGE` until the next allocation that causes this
     condition after the end of the current GC cycle. [^6]
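
The sketch below illustrates the per-chunk decision logic described above.
All names (`chunk`, `scavengeEligible`, the madvise helpers) are hypothetical;
the real runtime would track this state in the page allocator's chunk metadata
and issue raw `madvise` syscalls.

```go
// Hypothetical sketch of the proposed policy, not the runtime's implementation.
package policy

const (
	chunkBytes     = 4 << 20 // aligned region managed by the page allocator
	denseThreshold = 0.96    // occupancy at or above which a chunk counts as dense
)

// chunk is a stand-in for the page allocator's per-chunk metadata.
type chunk struct {
	occupancy   float64 // fraction of the chunk's pages currently in use
	denseLastGC bool    // was >= denseThreshold at the end of the last GC cycle
	hugeBacked  bool    // currently marked MADV_HUGEPAGE
}

// scavengeEligible reports whether the background scavenger may release pages
// from this chunk. force is true when scavenging for GOMEMLIMIT or
// runtime/debug.FreeOSMemory, which ignore the density rule.
func scavengeEligible(c *chunk, force bool) bool {
	return force || !c.denseLastGC
}

// prepareScavenge runs before releasing pages from a chunk: mark it
// MADV_NOHUGEPAGE so khugepaged cannot immediately re-collapse the released
// range. With max_ptes_none == 0 that re-collapse cannot happen, so the step
// is skipped.
func prepareScavenge(c *chunk, maxPtesNoneIsZero bool) {
	if !maxPtesNoneIsZero && c.hugeBacked {
		madviseNoHugepage(c)
		c.hugeBacked = false
	}
}

// noteDense runs when an allocation pushes a chunk past the density threshold:
// the chunk immediately becomes a huge page candidate again.
func noteDense(c *chunk, maxPtesNoneIsZero bool) {
	if c.occupancy >= denseThreshold && !c.hugeBacked {
		if maxPtesNoneIsZero {
			madviseCollapse(c) // MADV_COLLAPSE, if the kernel supports it
		} else {
			madviseHugepage(c) // MADV_HUGEPAGE
		}
		c.hugeBacked = true
	}
}

// The helpers below stand in for madvise(2) calls on the chunk's address range.
func madviseHugepage(c *chunk)   {}
func madviseNoHugepage(c *chunk) {}
func madviseCollapse(c *chunk)   {}
```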

[^3]: 4 MiB is not itself a huge page size on linux/amd64, but it is a very
    convenient unit for the runtime because the page allocator manages memory
    in 4 MiB chunks.

[^4]: The bar for explicit (non-default) backing by huge pages must be very
    high.
    The main issue is the default value of
    `/sys/kernel/mm/transparent_hugepage/defrag` on Linux: it forces regions
    marked as `MADV_HUGEPAGE` to be immediately backed, stalling in the kernel
    until it can compact and rearrange things to provide a huge page.
    Meanwhile the combination of `MADV_NOHUGEPAGE` and `MADV_DONTNEED` does the
    opposite.
    Switching between these two states often creates really expensive churn.

[^5]: Note that `runtime/debug.FreeOSMemory` and the mechanism to maintain
    `GOMEMLIMIT` must still be able to release all memory to be effective.
    For that reason, this rule does not apply to those two situations.
    Basically, these cases get to skip waiting until the end of the GC cycle,
    optimistically assuming that memory won't be used.

[^6]: It might happen that the wrong memory was scavenged (memory that soon
    after exceeds 96% occupancy).
    This delay helps reduce churn.

The goal of these changes is to ensure that when sparse regions of the heap have
their memory returned to the OS, it stays that way regardless of
`max_ptes_none`.
Meanwhile, the policy avoids expensive churn by delaying the release of pages
that were part of dense memory regions by at least a full GC cycle.

Note that there's potentially quite a lot of hysteresis here, which could impact
memory reclaim in, for example, the case of a brief memory spike followed by a
long-lived, idle, low-memory state.
In the worst case, the time between GC cycles is 2 minutes, and the scavenger's
slowest return rate is ~256 MiB/sec. [^7] I suspect this is fast enough not to
be a problem in practice.
Furthermore, `GOMEMLIMIT` can still be employed to maintain a memory maximum.

[^7]: The scavenger is much more aggressive than it once was, targeting 1% of
    total CPU usage.
    Spending 1% of one CPU core in 2018 on `MADV_DONTNEED` meant roughly 8 KiB
    released per millisecond in the worst case.
    For a `GOMAXPROCS=32` process, this worst case is now approximately 256 KiB
    per millisecond.
    In the best case, wherein the scavenger can identify whole unreleased huge
    pages, it would release 2 MiB per millisecond in 2018, so 64 MiB per
    millisecond today.

## Alternative attempts

Initially, I attempted a design where all heap memory up to the heap goal
(address-ordered) is marked as `MADV_HUGEPAGE` and ineligible for scavenging.
The rest is always eligible for scavenging, and the scavenger marks that memory
as `MADV_NOHUGEPAGE`.

This approach had a few problems:
1. The heap goal tends to fluctuate, creating churn at the boundary.
1. When the heap is actively growing, the aftermath of this churn actually ends
   up in the middle of the fully-grown heap, as the scavenger works on memory
   beyond the boundary in between GC cycles.
1. Any fragmentation that does exist in the middle of the heap, for example if
   most allocations are large, is never looked at by the scavenger.

I also tried a simple heuristic to turn off the scavenger when it looks like the
heap is growing, but not all heaps grow monotonically, so a small amount of
churn still occurred.
It's difficult to come up with a good heuristic without assuming monotonicity.

My next attempt was more direct: mark high-density chunks as `MADV_HUGEPAGE`,
and allow low-density chunks to be scavenged and set as `MADV_NOHUGEPAGE`.
A chunk would become high density if it was observed to have at least 80%
occupancy, and would later switch back to low density if it had less than 20%
occupancy.
This gap existed for hysteresis, to reduce churn.
Unfortunately, this also didn't work: GC-heavy programs often have memory
regions that go from extremely low (near 0%) occupancy to 100% within a single
GC cycle, creating a lot of churn.

The design above is ultimately a combination of these two designs: assume that
the heap gets generally dense within a GC cycle, but handle it on a
chunk-by-chunk basis.

Where all this differs from other huge page efforts, such as [what TCMalloc
did](https://google.github.io/tcmalloc/temeraire.html), is the lack of
bin-packing of allocated memory into huge pages (which is really the key part
of that design).
Bin-packing increases the likelihood that an entire huge page will become free
by placing new memory in existing huge pages, rather than following some global
placement policy, like "best-fit," that may put it anywhere.
This not only improves the efficiency of releasing memory, but makes the overall
footprint smaller due to less fragmentation.

This is unlikely to be that useful for Go since Go's heap already, at least
transiently, gets very dense.
Another thing that gets in the way of doing the same kind of bin-packing for Go
is that the allocator's slow path gets hit much harder than TCMalloc's slow
path.
The reason for this boils down to the GC memory reuse pattern (essentially, FIFO
vs. LIFO reuse).
Slowdowns in this path will likely create scalability problems.

## Appendix: THP flag behavior

Whether or not pages are eligible for THP is controlled by a combination of
settings:

`/sys/kernel/mm/transparent_hugepage/enabled`: system-wide control, possible
values:
- `never`: THP disabled
- `madvise`: Only pages with `MADV_HUGEPAGE` are eligible
- `always`: All pages are eligible, unless marked `MADV_NOHUGEPAGE`

`prctl(PR_SET_THP_DISABLE)`: process-wide control to disable THP

`madvise`: per-mapping control, possible values:
- `MADV_NOHUGEPAGE`: mapping not eligible for THP
  - Note that existing huge pages will not be split if this flag is set.
- `MADV_HUGEPAGE`: mapping eligible for THP unless there is a process- or
  system-wide disable.
- Unset: mapping eligible for THP if the system-wide control is set to
  `always`.

`/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none`: system-wide
control that specifies how many extra small pages can be allocated when
collapsing a group of pages into a huge page.
In other words, it bounds how many small pages in a candidate huge page may be
not-faulted-in or faulted-in zero pages.
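
As a concrete illustration, here is a minimal, hedged sketch (not runtime code;
the helper names are hypothetical) of reading these two settings from their
standard sysfs locations.
The proposed policy branches on whether `max_ptes_none` is zero.

```go
// Hypothetical helper package for reading system-wide THP settings.
package thp

import (
	"os"
	"strconv"
	"strings"
)

// enabledMode parses /sys/kernel/mm/transparent_hugepage/enabled, whose
// contents look like "always [madvise] never", and returns the bracketed
// (active) value. It returns "never" if the file cannot be read or parsed.
func enabledMode() string {
	b, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/enabled")
	if err != nil {
		return "never"
	}
	s := string(b)
	i := strings.IndexByte(s, '[')
	j := strings.IndexByte(s, ']')
	if i < 0 || j < i {
		return "never"
	}
	return s[i+1 : j]
}

// maxPtesNone reads how many not-present or zero PTEs khugepaged may fill in
// when collapsing a huge page.
func maxPtesNone() (int, error) {
	b, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none")
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}
```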

`MADV_DONTNEED` on a smaller range within a huge page will split the huge page
to zero the range.
However, the full huge page range will still be immediately eligible for
coalescing by `khugepaged` if `max_ptes_none > 0`, which is true in the default
open source Linux configuration.
Thus, to both disable future THP and split an existing huge page race-free, you
must first set `MADV_NOHUGEPAGE` and then call `MADV_DONTNEED`.
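
A hedged sketch of that sequence, using `golang.org/x/sys/unix` for
illustration (the runtime itself issues raw `madvise` syscalls; the helper name
is made up):

```go
// Hypothetical helper demonstrating the race-free release sequence.
package thp

import "golang.org/x/sys/unix"

// releaseNoHuge returns the pages backing mem to the OS and keeps the range
// ineligible for future THP. Marking MADV_NOHUGEPAGE first means khugepaged
// cannot re-collapse the range between the two calls. mem must come from
// mmap (e.g. unix.Mmap) and be page-aligned.
func releaseNoHuge(mem []byte) error {
	if err := unix.Madvise(mem, unix.MADV_NOHUGEPAGE); err != nil {
		return err
	}
	return unix.Madvise(mem, unix.MADV_DONTNEED)
}
```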

Another consideration is the newly-upstreamed `MADV_COLLAPSE`, which collapses
memory regions into huge pages unconditionally.
`MADV_DONTNEED` can then be used to break them up.
This scheme provides effectively complete control over huge pages, provided
`khugepaged` doesn't coalesce pages in a way that undoes the `MADV_DONTNEED`
(which can be ensured, for example, by setting `max_ptes_none` to zero).
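
For illustration, a minimal sketch of the collapse side of that scheme.
The constant is defined locally on the assumption that the installed
`golang.org/x/sys/unix` may predate `MADV_COLLAPSE`; 25 is the Linux uapi value
(kernel 6.1+).

```go
// Hypothetical helper demonstrating MADV_COLLAPSE.
package thp

import "golang.org/x/sys/unix"

// madvCollapse is the Linux uapi value of MADV_COLLAPSE (kernel 6.1+),
// defined locally in case the unix package in use predates it.
const madvCollapse = 25

// collapse asks the kernel to back mem with huge pages right away, regardless
// of the THP madvise flags on the mapping. A later MADV_DONTNEED on a
// subrange splits the huge pages again.
func collapse(mem []byte) error {
	return unix.Madvise(mem, madvCollapse)
}
```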

## Appendix: Page traces

To investigate this issue I built a
[low-overhead](https://perf.golang.org/search?q=upload:20221024.9) [page event
tracer](https://go.dev/cl/444157) and [visualization
utility](https://go.dev/cl/444158) to check assumptions about application and
GC behavior.
Below are several traces and the conclusions drawn from them.
- [Tile38 K-Nearest benchmark](./59960/tile38.png): GC-heavy benchmark.
  Note the fluctuation between very low occupancy and very high occupancy.
  During a single GC cycle, the page heap gets at least transiently very dense.
  This benchmark caused me the most trouble when trying out ideas.
- [Go compiler building a massive package](./59960/compiler.png): Note again the
  high density.