feat: gated tracemalloc heap-dump for one-pod prod profiling #96
Closed
ServerSideHannes wants to merge 1 commit into
Diagnostic to find what actually holds the resident memory under backup load (the OOM), instead of inferring it. Enabled only when S3PROXY_TRACEMALLOC is set (zero overhead otherwise): it starts tracemalloc at startup and logs the top live Python allocations (size + call site) every S3PROXY_TRACEMALLOC_INTERVAL seconds and on SIGUSR1. The chart gains an extraConfig passthrough so one replica can set the flag via values; revert after capture.
Why
We've been inferring the cause of the prod OOM (fragmentation, allocators, copies), and every one of those guesses measured as wrong or only partial. This adds a definitive tool: log the top live Python allocations under real backup load, so we can see exactly which call sites hold the resident memory.
What
- S3PROXY_TRACEMALLOC=1 enables it (unset = zero overhead, no tracing). On startup it calls tracemalloc.start() and logs TRACEMALLOC_SNAPSHOT + top-N TRACEMALLOC_TOP (size_mb, count, file:line) every S3PROXY_TRACEMALLOC_INTERVAL (default 15s) and on SIGUSR1.
- Chart: extraConfig map → injected into the config ConfigMap (envFrom), so a single profiling replica can set the flag via values, then revert.

Use (one-pod, time-boxed)
- Set extraConfig: { S3PROXY_TRACEMALLOC: "1" } and a raised memory limit (so it dumps before OOM).
- Read TRACEMALLOC_TOP from kubectl logs.

Diagnostic only; no behavior change when the flag is unset. 449 unit tests pass.
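For reviewers who want a feel for the gating and dump behaviour described above, here is a minimal sketch. It is not the code in this PR. The function names (`maybe_enable`, `log_top`, `top_allocations`), the logger name, the exact log field layout, and the use of a daemon thread for the interval are all assumptions made for illustration; only the environment variable names, the `TRACEMALLOC_SNAPSHOT` / `TRACEMALLOC_TOP` markers, the 15s default, and the SIGUSR1 trigger come from the description.

```python
import logging
import os
import signal
import threading
import tracemalloc

log = logging.getLogger("s3proxy.tracemalloc")  # hypothetical logger name


def top_allocations(limit=10):
    """Return (size_mb, count, "file:line") for the largest live allocations."""
    snapshot = tracemalloc.take_snapshot()
    rows = []
    # statistics("lineno") groups by call site, largest first.
    for stat in snapshot.statistics("lineno")[:limit]:
        frame = stat.traceback[0]
        site = f"{frame.filename}:{frame.lineno}"
        rows.append((stat.size / 2**20, stat.count, site))
    return rows


def log_top(limit=10):
    """Log one snapshot header plus the top-N call sites (field layout assumed)."""
    current, peak = tracemalloc.get_traced_memory()
    log.info("TRACEMALLOC_SNAPSHOT current_mb=%.1f peak_mb=%.1f",
             current / 2**20, peak / 2**20)
    for size_mb, count, site in top_allocations(limit):
        log.info("TRACEMALLOC_TOP size_mb=%.3f count=%d site=%s",
                 size_mb, count, site)


def maybe_enable(environ=os.environ):
    """Start tracing only when the flag is set; otherwise do nothing at all."""
    if not environ.get("S3PROXY_TRACEMALLOC"):
        return False  # unset: tracemalloc is never started, so no overhead
    interval = float(environ.get("S3PROXY_TRACEMALLOC_INTERVAL", "15"))
    tracemalloc.start()

    # On-demand dump. signal.signal() only works from the main thread, and
    # SIGUSR1 does not exist on every platform, so both are checked first.
    in_main = threading.current_thread() is threading.main_thread()
    if hasattr(signal, "SIGUSR1") and in_main:
        signal.signal(signal.SIGUSR1, lambda signum, frame: log_top())

    # Periodic dump on a daemon thread so it never blocks shutdown.
    stop = threading.Event()

    def loop():
        while not stop.wait(interval):
            log_top()

    threading.Thread(target=loop, name="tracemalloc-dump", daemon=True).start()
    return True
```

Calling `maybe_enable()` once at startup would be enough under these assumptions; an asyncio-based service might prefer registering the SIGUSR1 callback through the event loop rather than `signal.signal`. With the flag set, `kill -USR1 <pid>` inside the pod should trigger an immediate dump alongside the periodic ones.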