
server: purge jemalloc when cgo mem overhead is high #141575

Merged (2 commits, Feb 19, 2025)

Conversation

RaduBerinde (Member)

server: pass BaseConfig to startSampleEnvironment

Epic: none
Release note: None

server: purge jemalloc when cgo mem overhead is high

Whenever we sample the environment (every 10s by default), we check whether
the CGo overhead is above 20%; if so, we tell jemalloc to purge dirty
pages. We limit the frequency of the purges to at most one every 2
minutes (both settings are configurable).

This is meant to be a backup safety mechanism to avoid OOMs. With the
updated jemalloc default configuration, I don't expect this to happen
much in practice (if at all).

Fixes: #141379
Release note: None


Here is an example of a 30-minute TPCC run (left: master, right: with the change). I used a deliberately bad malloc configuration (background_thread:false,dirty_decay_ms:1000000,muzzy_decay_ms:500000,narenas:128) to generate a lot of overhead. The overhead graphs show CGo Total minus 7.6 GB, the high-water mark for CGo Alloc. The purge happens when the overhead goes above ~1.5 GB.

[image: CGo overhead graphs, master vs. this change]

Purge logs from one of the nodes:

295:I250216 18:24:27.988224 546 server/status/runtime_jemalloc.go:191 ⋮ [T1,Vsystem,n1] 185  jemalloc arenas purged (took 203.915292ms)
303:I250216 18:29:18.462705 546 server/status/runtime_jemalloc.go:191 ⋮ [T1,Vsystem,n1] 193  jemalloc arenas purged (took 206.945658ms)
307:I250216 18:35:38.638441 546 server/status/runtime_jemalloc.go:191 ⋮ [T1,Vsystem,n1] 197  jemalloc arenas purged (took 235.908998ms)
311:I250216 18:42:28.758720 546 server/status/runtime_jemalloc.go:191 ⋮ [T1,Vsystem,n1] 201  jemalloc arenas purged (took 245.213056ms)
323:I250216 18:48:58.986148 546 server/status/runtime_jemalloc.go:191 ⋮ [T1,Vsystem,n1] 213  jemalloc arenas purged (took 98.60313ms)

@RaduBerinde RaduBerinde requested review from a team as code owners February 16, 2025 18:58
@RaduBerinde RaduBerinde requested review from dhartunian and removed request for a team February 16, 2025 18:58

blathers-crl bot commented Feb 16, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@sumeerbhola (Collaborator) left a comment:

:lgtm:

Reviewed 3 of 3 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/server/status/runtime.go line 803 at r2 (raw file):

	minPeriod time.Duration,
) {
	if cgoMemMaybePurge != nil {

just wondering why we need this indirection with function pointers that can be nil.


pkg/server/status/runtime_jemalloc.go line 172 at r2 (raw file):

	// We don't use just cgoTargetMem because we don't want to purge unnecessarily
	// if we actually allocate more memory than our target (e.g. because of
	// internal fragmentation).

Nice

@RaduBerinde (Member, Author) left a comment:

TFTR+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @sumeerbhola)


pkg/server/status/runtime.go line 803 at r2 (raw file):

Previously, sumeerbhola wrote…

just wondering why we need this indirection with function pointers that can be nil.

Just followed the existing pattern, since jemalloc can be disabled by a build tag. I don't think we ever use that build tag so maybe we can remove the mechanism altogether.

@RaduBerinde (Member, Author):

bors r+

@craig craig bot merged commit 6e6f5fe into cockroachdb:master Feb 19, 2025
23 of 24 checks passed
Linked issue: jemalloc: monitor overhead and purge dirty pages