feat: gated tracemalloc heap-dump for one-pod prod profiling #96

Closed
ServerSideHannes wants to merge 1 commit into main from profile/tracemalloc-heap-dump
@ServerSideHannes
Owner

Why

So far we have only inferred the cause of the prod OOM (fragmentation, allocators, copies), and each of those measurements turned out wrong or partial. This PR adds a direct measurement instead: it logs the top live Python allocations under real backup load, so we can see exactly which call sites hold the resident memory.

What

  • S3PROXY_TRACEMALLOC=1 enables it (unset = zero overhead, no tracing). On startup the proxy calls tracemalloc.start(), then logs a TRACEMALLOC_SNAPSHOT line followed by the top-N TRACEMALLOC_TOP lines (size_mb, count, file:line) every S3PROXY_TRACEMALLOC_INTERVAL (default 15s) and on SIGUSR1.
  • Chart: extraConfig map → injected into the config ConfigMap (envFrom), so a single profiling replica can set the flag via values, then revert.
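
For readers without the diff at hand, the gating and dump loop described above could look roughly like the sketch below. It is not the code in this PR: the function names, the logger name, the log line layout, and the choice to wake a background thread from the SIGUSR1 handler are all assumptions made for illustration. Only the environment variable names, the 15s default, and the TRACEMALLOC_SNAPSHOT / TRACEMALLOC_TOP markers come from the description.

```python
import logging
import os
import signal
import threading
import tracemalloc

# Hypothetical logger name; the real module may use a different one.
log = logging.getLogger("s3proxy.tracemalloc")


def top_allocations(snapshot, limit=10):
    """Return (size_mb, count, 'file:line') for the largest live allocation sites."""
    rows = []
    # statistics("lineno") groups by file and line, largest total size first.
    for stat in snapshot.statistics("lineno")[:limit]:
        frame = stat.traceback[0]
        site = f"{frame.filename}:{frame.lineno}"
        rows.append((stat.size / (1024 * 1024), stat.count, site))
    return rows


def log_snapshot(limit=10):
    """Log one snapshot header and the top allocation sites (assumed line layout)."""
    snapshot = tracemalloc.take_snapshot()
    current, peak = tracemalloc.get_traced_memory()
    log.info("TRACEMALLOC_SNAPSHOT current_mb=%.1f peak_mb=%.1f",
             current / (1024 * 1024), peak / (1024 * 1024))
    for size_mb, count, site in top_allocations(snapshot, limit):
        log.info("TRACEMALLOC_TOP size_mb=%.2f count=%d %s", size_mb, count, site)


def maybe_start_tracemalloc(environ=os.environ):
    """Start tracing only when the flag is set; otherwise do nothing at all."""
    # Assumption: only the literal value "1" enables tracing.
    if environ.get("S3PROXY_TRACEMALLOC") != "1":
        return False
    interval = float(environ.get("S3PROXY_TRACEMALLOC_INTERVAL", "15"))
    tracemalloc.start()
    wake = threading.Event()

    def loop():
        while True:
            # Returns early when SIGUSR1 sets the event, else after `interval`.
            wake.wait(timeout=interval)
            wake.clear()
            log_snapshot()

    threading.Thread(target=loop, name="tracemalloc-dump", daemon=True).start()
    # Signal handlers can only be installed from the main thread; the handler
    # just wakes the dump thread so no logging happens inside the handler.
    if hasattr(signal, "SIGUSR1") and threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGUSR1, lambda signum, frame: wake.set())
    return True
```

With something like this in place, an on-demand dump from inside the pod would be a `kill -USR1` sent to the proxy process. When the flag is unset the function returns before touching tracemalloc, which is what keeps the default path free of tracing overhead.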

Use (one-pod, time-boxed)

  1. Deploy this image to one replica with extraConfig: { S3PROXY_TRACEMALLOC: "1" } and a raised memory limit (so it dumps before OOM).
  2. Let backup load hit it; read TRACEMALLOC_TOP from kubectl logs.
  3. Revert.

Diagnostic only; no behavior change when the flag is unset. 449 unit tests pass.

Diagnostic to find what actually holds the resident memory under backup load
(the OOM), instead of inferring. Enabled only when S3PROXY_TRACEMALLOC is set
(zero overhead otherwise): starts tracemalloc at startup and logs the top live
Python allocations (size + call site) every S3PROXY_TRACEMALLOC_INTERVAL secs
and on SIGUSR1. Chart gains an extraConfig passthrough so one replica can set
the flag via values; revert after capture.
@ServerSideHannes ServerSideHannes deleted the profile/tracemalloc-heap-dump branch June 30, 2026 13:51