-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: purge jemalloc when cgo mem overhead is high #141575
Conversation
Epic: none Release note: None
Whenever we sample the environment (every 10s by default), we check if the CGo overhead is above 20%; in which case we tell jemalloc to purge dirty pages. We limit the frequency of the purges to one every 2 minutes (these settings are configurable). This is meant to be a backup safety mechanism to avoid OOMs. With the updated jemalloc default configuration, I don't expect this to happen much in practice (if at all). Fixes: cockroachdb#141379 Release note: None
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @dhartunian)
pkg/server/status/runtime.go
line 803 at r2 (raw file):
minPeriod time.Duration, ) { if cgoMemMaybePurge != nil {
just wondering why we need this indirection with function pointers that can be nil.
pkg/server/status/runtime_jemalloc.go
line 172 at r2 (raw file):
// We don't use just cgoTargetMem because we don't want to purge unnecessarily // if we actually allocate more memory than our target (e.g. because of // internal fragmentation).
Nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR+
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @sumeerbhola)
pkg/server/status/runtime.go
line 803 at r2 (raw file):
Previously, sumeerbhola wrote…
just wondering why we need this indirection with function pointers that can be nil.
Just followed the existing pattern, since jemalloc can be disabled by a build tag. I don't think we ever use that build tag so maybe we can remove the mechanism altogether.
bors r+ |
server: pass BaseConfig to startSampleEnvironment
Epic: none
Release note: None
server: purge jemalloc when cgo mem overhead is high
Whenever we sample the environment (every 10s by default), we check if
the CGo overhead is above 20%; in which case we tell jemalloc to purge
dirty pages. We limit the frequency of the purges to one every 2
minutes (these settings are configurable).
This is meant to be a backup safety mechanism to avoid OOMs. With the
updated jemalloc default configuration, I don't expect this to happen
much in practice (if at all).
Fixes: #141379
Release note: None
Here is an example of a 30min TPCC run (left is master, right is with the change). I used a terrible malloc conf (
background_thread:false,dirty_decay_ms:1000000,muzzy_decay_ms:500000,narenas:128
) to get a lot of overhead. The overhead graphs show the CGo Total minus 7.6GB which is the high watermark for CGO Alloc. The purge happens when the overhead goes above ~1.5Gb.Purge logs from one of the nodes: