fix: govern copy memory + fix passthrough-copy ClientResponse.read crash #97
Merged
Server-side copies (CopyObject / UploadPartCopy) carry no request body, so the request-level memory limiter reserved ~nothing for them -- yet each copy decrypts the source and re-encrypts it in RAM. Under a Scylla dedup flood, dozens ran concurrently with nothing throttling them and the pod was OOMKilled (exit 137) -> backup failures.

Gate both copy paths through concurrency.reserve_memory(copy_pipeline_peak) so copies are bounded by the same budget as uploads. copy_pipeline_peak returns the real peak: streamed copies ~4x MAX_BUFFER_SIZE (size-independent), small buffered copies ~3x the object.

Also fix _iter_copy_source: it called body.read(8MB) but body is an aiohttp ClientResponse whose read() takes no size arg, so every passthrough copy 500'd with TypeError. Stream via body.content.read(n).

Verified locally: 64-concurrent copy flood at a 256MiB cap OOM-killed the pod (exit 137, 0/64 ok) before; after, peak ~195MiB and 64/64 copies succeed.
What
Two fixes in the server-side copy path, both root causes of the
concurrent-backup failures.
1. Govern copy memory (the OOM fix)
A `CopyObject` / `UploadPartCopy` carries no request body, so the request-level
memory limiter reserved ~nothing for it, yet each copy decrypts the source
and re-encrypts it in RAM. Under a Scylla dedup flood, dozens of these ran at
once with nothing throttling them → pod blew past its memory cap → kernel
OOM-kill (exit 137) → backup failures.
Both copy paths (`_copy_encrypted` and `_streaming_copy_part`) are now wrapped in
`concurrency.reserve_memory(copy_pipeline_peak(size))`, so copies are throttled
by the same budget as uploads. When the budget is full they wait instead of
piling up. New `crypto.copy_pipeline_peak()` reports the real peak:
- streamed copies: `4 × MAX_BUFFER_SIZE`, pipeline-bounded, independent of object size
- small buffered copies: `~3 ×` the object size (floored at one buffer)
2. Fix the passthrough-copy crash
`_iter_copy_source` called `body.read(8MB)`, but `body` is an aiohttp
`ClientResponse` whose `.read()` takes no size argument → every such copy
500'd with
`TypeError: ClientResponse.read() takes 1 positional argument but 2 were given`.
Changed to `body.content.read(8MB)` (stream via the StreamReader).
This crash was seen in prod.
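The fixed read loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: `FakeResponse` and `FakeContent` are hypothetical stand-ins for aiohttp's `ClientResponse` and its `.content` StreamReader, and `iter_copy_source` and `collect_chunk_sizes` are illustrative names. The only behaviour relied on is that `.read()` on the response takes no size argument, while `.content.read(n)` returns up to `n` bytes and empty bytes at end of stream.

```python
import asyncio
import io
from typing import AsyncIterator

CHUNK_SIZE = 8 * 1024 * 1024  # the 8MB read size mentioned in the description


class FakeContent:
    """Hypothetical stand-in for the StreamReader: read(n) gives up to n bytes."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    async def read(self, n: int = -1) -> bytes:
        return self._buf.read(n)  # returns b"" once the data is exhausted


class FakeResponse:
    """Hypothetical stand-in for a ClientResponse-like object."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.content = FakeContent(data)

    async def read(self) -> bytes:
        # No size parameter: calling read(8MB) on this raises TypeError.
        return self._data


async def iter_copy_source(body, chunk_size: int = CHUNK_SIZE) -> AsyncIterator[bytes]:
    """Yield the body in bounded chunks via body.content.read(n)."""
    while True:
        chunk = await body.content.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        yield chunk


async def collect_chunk_sizes(data: bytes, chunk_size: int) -> list[int]:
    """Helper for the sketch: report the size of each chunk that was yielded."""
    return [len(c) async for c in iter_copy_source(FakeResponse(data), chunk_size)]
```

Reading through `.content` keeps at most one chunk of the source in memory per iteration, whereas the whole-body `.read()` would buffer the full object.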
Proof (local repro, 256MiB cap, 48MiB governor budget)
Same 64-concurrent `CopyObject` flood over 8×64MB encrypted objects:
- Before: `Running=false`, `OOMKilled=true`
- After: `Running=true`, `OOMKilled=false`
Tests
- `tests/unit/test_copy_memory_governing.py`: pins `copy_pipeline_peak` sizing
  and proves `reserve_memory` bounds concurrent copies to the budget (≤2 at once
  for a 64MB budget / 32MB peak; active never overruns; all released).
- `tests/conftest.py` mock `MockS3Response` now exposes `.content` (mirrors real
  aiohttp), so the streaming-copy tests exercise the fixed `content.read(n)` path.
- `ruff check` + `ruff format --check` clean.
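The bound that the governing test checks (at most two copies at once for a 64MB budget and a 32MB peak) can be illustrated with a small asyncio sketch. Everything here is an assumption made for illustration: `MemoryBudget` and `run_flood` are hypothetical names, not the project's `concurrency.reserve_memory`; the 8 MiB `MAX_BUFFER_SIZE` and the cutoff between buffered and streamed copies are guesses. Only the two sizing rules (4 × `MAX_BUFFER_SIZE` for streamed copies, about 3 × the object size floored at one buffer for small ones) come from the description.

```python
import asyncio
from contextlib import asynccontextmanager

MiB = 1024 * 1024
MAX_BUFFER_SIZE = 8 * MiB              # assumed value, not confirmed by the PR
STREAMING_THRESHOLD = MAX_BUFFER_SIZE  # assumed cutoff between buffered and streamed


def copy_pipeline_peak(size: int) -> int:
    """Estimate peak RAM for one copy, following the two rules in the description."""
    if size > STREAMING_THRESHOLD:
        return 4 * MAX_BUFFER_SIZE         # streamed: independent of object size
    return max(3 * size, MAX_BUFFER_SIZE)  # buffered: ~3x object, floored at one buffer


class MemoryBudget:
    """Hypothetical byte-budget limiter: reserve() waits while the budget is full."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0
        self._cond = asyncio.Condition()

    @asynccontextmanager
    async def reserve(self, nbytes: int):
        nbytes = min(nbytes, self.limit)  # an oversized request must not wait forever
        async with self._cond:
            await self._cond.wait_for(lambda: self.used + nbytes <= self.limit)
            self.used += nbytes
        try:
            yield
        finally:
            async with self._cond:
                self.used -= nbytes
                self._cond.notify_all()  # wake waiters so they re-check the budget


async def run_flood(n_copies: int, budget_bytes: int, object_size: int):
    """Run n_copies simulated copies; return (max concurrent copies, bytes still reserved)."""
    budget = MemoryBudget(budget_bytes)  # created inside the running event loop
    peak = copy_pipeline_peak(object_size)
    active = 0
    max_active = 0

    async def one_copy() -> None:
        nonlocal active, max_active
        async with budget.reserve(peak):
            active += 1
            max_active = max(max_active, active)
            await asyncio.sleep(0.01)  # stands in for decrypt + re-encrypt work
            active -= 1

    await asyncio.gather(*(one_copy() for _ in range(n_copies)))
    return max_active, budget.used
```

Under these assumed constants, `asyncio.run(run_flood(8, 64 * MiB, 64 * MiB))` admits at most two copies at a time and ends with nothing reserved; the remaining copies wait on the condition instead of allocating.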