Eliminate OOM during VCR recording: streaming, zero-copy body handling, and full test suite #1
Merged
matyas-jirat-keboola merged 21 commits into main on Feb 24, 2026
Conversation
DefaultSanitizer now accepts an optional `config` parameter that auto-extracts #-prefixed Keboola secrets. record_debug_run reads config.json from KBC_DATADIR and passes it directly — no separate ConfigSecretsSanitizer needed in the chain. Remove unused build_sanitizer_chain function. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
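The secret-extraction idea can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `extract_hash_secrets` is hypothetical, and the PR wires the equivalent logic into `DefaultSanitizer(config=...)` internally.

```python
def extract_hash_secrets(config):
    """Collect values of '#'-prefixed keys anywhere in a Keboola config
    dict; these mark encrypted fields whose values must be redacted."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(key, str) and key.startswith("#") and isinstance(value, str):
                    found.append(value)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(config)
    return found


config = {"parameters": {"#api_token": "s3cr3t", "debug": True,
                         "accounts": [{"#password": "pw1"}]}}
secrets = extract_hash_secrets(config)
```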
Replace vcrpy's in-memory accumulation with a streaming approach: each recorded interaction is immediately serialized to a JSONL temp file and popped from cassette.data, keeping memory usage constant regardless of job size. The final cassette JSON is reconstructed from the temp file after the component finishes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
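The streaming approach can be sketched like this (function names are illustrative, not the PR's API): each interaction becomes one JSONL line as soon as it is recorded, and the final cassette JSON is stitched together line by line afterwards, so at most one interaction is in memory at any time.

```python
import json
import os
import tempfile


def write_cassette(jsonl_path, out_path):
    """Stream the JSONL temp file into the final cassette JSON line by
    line, never loading all interactions at once. (The real cassette
    also carries metadata keys; omitted here for brevity.)"""
    with open(jsonl_path) as src, open(out_path, "w") as dst:
        dst.write('{"version": 1, "interactions": [\n')
        first = True
        for line in src:
            line = line.strip()
            if not line:
                continue
            if not first:
                dst.write(",\n")
            dst.write(line)
            first = False
        dst.write("\n]}\n")


# record phase: append each interaction as one JSONL line, then drop it
tmp = tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False)
for interaction in ({"request": "/a", "status": 200},
                    {"request": "/b", "status": 404}):
    json.dump(interaction, tmp)
    tmp.write("\n")
tmp.close()

out_path = tmp.name + ".json"
write_cassette(tmp.name, out_path)
with open(out_path) as fh:
    cassette = json.load(fh)
os.unlink(tmp.name)
os.unlink(out_path)
```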
Primary fix: bypass original_append in streaming_append to avoid copy.deepcopy on every response body. Instead, apply sanitizer hooks directly with a shallow copy of the response dict — new dict containers are allocated but the (immutable) body bytes object is shared, cutting per-interaction overhead from 3× to 2× body size. Secondary fix: stream-write the final cassette JSON line-by-line from the JSONL temp file instead of loading all interactions into memory at once. Metadata is now written inline as the first key, so the _inject_metadata() read-entire-cassette pass is eliminated entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
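The shallow-copy trick can be shown in isolation (a minimal illustration, not the PR's exact code):

```python
payload = b"x" * 1_000_000  # a large recorded response body

response = {"status": {"code": 200}, "body": {"string": payload}}

# targeted shallow copy: fresh outer and "body" dict containers so the
# sanitizer hooks can rewrite keys safely, while the immutable bytes
# object is shared with the original (no body-sized allocation)
sanitized = {**response, "body": {**response["body"]}}
```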
VCRHTTPResponse previously created a BytesIO copy of every response body (copy 1) on top of the original bytes already held in the cassette dict (copy 0), then urllib3.read() produced a third copy in requests.Response. Peak persistent bodies per interaction: 3×. Fix: add _VCRRecordingReader, a minimal BytesIO-compatible wrapper that holds the original bytes without copying. Its full read() returns data[self._pos:] which, due to CPython's slice optimisation, is the same bytes object when pos==0 — so urllib3 → requests.Response gets the same object, not a new allocation. The VCRHTTPResponse.__init__ is monkey-patched inside record() only and restored in a finally block, leaving replay completely unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
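The slice behaviour this relies on is a CPython implementation detail worth demonstrating: a full-range slice of an immutable `bytes` object returns the object itself rather than a copy.

```python
data = b"response body " * 4096  # a recorded body, ~57 KB

# full-range slice of immutable bytes: CPython hands back the same
# object, so read() at position 0 forwards the original buffer uncopied
assert data[0:] is data

# a partial slice still allocates, but only for the remaining tail
tail = data[5:]
assert tail is not data and len(tail) == len(data) - 5
```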
1. Replace `compat.convert_to_unicode` + `json.dumps` with `json.dump(cls=_BytesEncoder)` in `streaming_append`. The old path created (a) a mutated unicode response dict and (b) a full JSON string in memory simultaneously — 2× the body size extra. `_BytesEncoder` decodes bytes inline during the streaming `json.dump`, so neither intermediate object is allocated.
2. Remove `json.loads` + `json.dumps(sort_keys=True)` from `_build_cassette_from_jsonl`. For each JSONL line this created a full Python dict AND a full JSON string at the same time (3× the line size). For a large response (e.g. 50 MB) this spiked 150 MB at the end of recording. Lines are now written as-is directly to the output file (no key sorting, but identical semantics for replay).

Combined with the earlier `_VCRRecordingReader` fix, peak VCR overhead per response is now ~1× the body size (down from ~5-6× before all fixes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
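The inline-decoding encoder can be sketched as below. This is an assumption-laden simplification of the PR's `_BytesEncoder` (the UTF-8/base64 fallback strategy is illustrative): `json.JSONEncoder.default` is only invoked when the encoder meets a non-serializable value, so bytes are decoded one at a time while `json.dump` streams, and no unicode-converted copy of the whole interaction is ever built.

```python
import base64
import io
import json


class BytesEncoder(json.JSONEncoder):
    """Decode bytes values lazily during a streaming json.dump, so
    neither a mutated unicode dict nor a full JSON string is allocated."""

    def default(self, o):
        if isinstance(o, bytes):
            try:
                return o.decode("utf-8")
            except UnicodeDecodeError:
                # non-text bodies: fall back to base64 rather than crash
                return base64.b64encode(o).decode("ascii")
        return super().default(o)


buf = io.StringIO()
json.dump({"body": {"string": b"<wsdl/>"}}, buf, cls=BytesEncoder)
```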
- Add `from __future__ import annotations` to all four files; replace
`Dict`/`List`/`Optional` with built-in generics (`dict`, `list`, `X | None`)
- recorder.py: promote deferred imports (`base64`, `os`, `datetime/timezone`,
`VCRHTTPResponse`) to module-level; move `_VCRHTTPResponse` into the
`try: import vcr` guard block
- sanitizers.py: promote `pathlib.Path` to module-level; remove redundant
`import json` inside `BodyFieldSanitizer._sanitize_body`; fix
`QueryParameterTokenSanitizer` default annotation (`list[str] | None`);
pre-compile `ResponseUrlSanitizer._url_re` in `__init__`; add deprecation
notice to `ConfigSecretsSanitizer` pointing to `DefaultSanitizer(config=...)`
- validator.py: promote `import fnmatch` to module-level; narrow
`ValidationDiff.diff_type` to `Literal["hash_mismatch", "missing", "unexpected"]`
- scaffolder.py: promote `import sys` and `from runpy import run_path` to
module-level; remove dead `_ = definition.get("description", "")` assignment;
remove empty `TestScaffolder.__init__`
- Add `.flake8` config with `extend-ignore = E203` (ruff slice-spacing false positive)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During recording, urllib3 reads each response body in 64 KB chunks via `stream()`. Previously `_VCRRecordingReader._data` was never cleared, so every `VCRHTTPResponse` held two copies of its body simultaneously:
- `_VCRRecordingReader._data` (our reader)
- `requests.Response._content` (assembled by `b"".join(chunks)`)

For a job with 17 large responses (~15 MB each) this caused ~170 MB of extra resident memory — enough to OOM at 512 MB. Fix: clear `_data = b""` once all bytes are consumed (either a full `read()` or chunked reads reaching the end). After that point urllib3/requests already holds the assembled `response._content`, so `_data` can be released. Net overhead vs a plain run: zero persistent extra copies per response. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
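A simplified sketch of the reader with the release behaviour (modelled on the PR's `_VCRRecordingReader` but not its exact code): it never copies the recorded bytes, and drops its own reference as soon as the consumer has read everything, leaving the consumer's assembled copy as the only resident one.

```python
class RecordingReader:
    """Read-only wrapper over an already-recorded bytes body."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, amt=None):
        if amt is None:
            chunk = self._data[self._pos:]  # same object when _pos == 0
            self._pos = len(self._data)
        else:
            chunk = self._data[self._pos:self._pos + amt]
            self._pos += len(chunk)
        if self._pos >= len(self._data) and self._data:
            self._data = b""  # release: the consumer now owns the only copy
        return chunk


body = b"abcdefghij" * 3
reader = RecordingReader(body)
full = reader.read()          # zero-copy full read

chunked = RecordingReader(body)
chunks = []
while True:
    c = chunked.read(7)       # urllib3-style chunked consumption
    if not c:
        break
    chunks.append(c)
```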
- recorder.py: extract `_zero_copy_vcr_init` as module-level `_zero_copy_vcr_response_init` (patched via `functools.partial`); extract `streaming_append` as `VCRRecorder._append_interaction` (assigned via `functools.partial`); rename `_build_cassette_from_jsonl` -> `_write_cassette`; reorder `VCRRecorder` methods and module-level classes most-important-first
- sanitizers.py: rename `QueryParameterTokenSanitizer` -> `QueryParamSanitizer` and `_walk_for_hash_keys` -> `_collect_hash_values`; reorder classes (`BaseSanitizer`, `DefaultSanitizer`, `CompositeSanitizer` first)
- validator.py: reorder dataclasses (`ValidationResult` first) and `OutputSnapshot` methods (validate/capture before private helpers)
- scaffolder.py: rename `_detect_and_transform_raw_configs` -> `_normalize_definitions` and `_replace_secrets_with_placeholders` -> `_mask_secrets`; extract nested `replace_in_value` as static `_replace_value`; reorder `TestScaffolder` methods public-first
- __init__.py: update `QueryParameterTokenSanitizer` -> `QueryParamSanitizer`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
functools.partial does not implement the descriptor protocol (__get__), so assigning it as __init__ meant Python called it without automatically binding the instance as the first argument. This caused: _zero_copy_vcr_response_init() missing 1 required positional argument: 'recorded_response' functools.partialmethod is a descriptor and handles self-binding correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
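The failure mode is easy to reproduce in isolation (names here are illustrative): because `functools.partial` has no `__get__`, attribute lookup on the class returns it unbound and the instance is never inserted as the first argument, while `functools.partialmethod` is a descriptor and binds correctly.

```python
from functools import partial, partialmethod


def patched_init(self, recorded_response):
    self.recorded = recorded_response


class WithPartial:
    # partial is not a descriptor: Python calls it WITHOUT
    # inserting the instance as the first argument
    __init__ = partial(patched_init)


class WithPartialmethod:
    # partialmethod implements __get__ and binds self correctly
    __init__ = partialmethod(patched_init)


failed = False
try:
    WithPartial("resp")
except TypeError:
    # patched_init() missing 1 required positional argument:
    # 'recorded_response' -- the same error the PR hit
    failed = True

obj = WithPartialmethod("resp")
```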
Fixes:
- seek(): raise ValueError for invalid whence (issue 4)
- getbuffer(): expand docstring for post-release empty memoryview (issue 5)
- extract_values(): make `values` param optional with None default (issue 6)
- _sanitize_body(): preserve original JSON formatting when no redaction (issue 7)
- record_debug_run(): load config.secrets.json and pass values to DefaultSanitizer (issue 8)

Tests (119 passing):
- tests/conftest.py: shared fixtures
- tests/test_sanitizers.py: all 11 sanitizer classes + extract_values (51 tests)
- tests/test_recorder.py: _VCRRecordingReader, JsonIndentedSerializer, VCRRecorder (35 tests)
- tests/test_validator.py: OutputSnapshot, ValidationResult, _get_csv_metadata (16 tests)
- tests/test_scaffolder.py: TestScaffolder methods, scaffold_tests (17 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ed change tracking
- pyproject.toml: pin vcrpy<9 with a comment explaining the private-API dependency
- record(): wrap `_write_cassette` in try/finally so `temp_path` is always cleaned up
- record(): add an inline comment noting the global `VCRHTTPResponse` patch is not thread-safe
- `_zero_copy_vcr_response_init`: add a CRITICAL comment on `body["string"] = b""` explaining why the empty placeholder must not be changed (it prevents a BytesIO copy of every body)
- `_append_interaction`: add `sort_keys=True` so interaction JSON keys are always sorted, keeping cassette diffs stable across recordings
- `_sanitize_json_value`: replace the post-hoc `sanitized == data` O(n) equality check with a mutable `changed=[False]` flag; eliminates a full second traversal of clean responses

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
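The changed-flag pattern can be sketched as follows (an illustrative simplification of the real `_sanitize_json_value`): the recursion sets a mutable flag the moment it redacts anything, so clean payloads are never traversed a second time for an equality check.

```python
def sanitize_json_value(value, secrets, changed):
    """Recursively mask secret strings; record mutation via changed[0]
    instead of comparing the sanitized tree to the original afterwards."""
    if isinstance(value, dict):
        return {k: sanitize_json_value(v, secrets, changed) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_json_value(v, secrets, changed) for v in value]
    if isinstance(value, str) and value in secrets:
        changed[0] = True
        return "*****"
    return value


clean_flag = [False]
clean = sanitize_json_value({"rows": [1, 2], "name": "job"}, {"s3cr3t"}, clean_flag)

dirty_flag = [False]
dirty = sanitize_json_value({"token": "s3cr3t"}, {"s3cr3t"}, dirty_flag)
```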
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- .github/workflows/ci.yml: run pytest on every push and PR across Python 3.10, 3.11, 3.12, and 3.13 using `uv sync --frozen --extra dev`
- pyproject.toml: add `[dev]` extras (pytest, pytest-cov); tighten `requires-python` to `>=3.10` to match vcrpy 8.1.1's own requirement; drop the 3.8/3.9 classifiers that were already unsatisfiable
- uv.lock: generated lockfile for reproducible CI installs
- .gitignore: unblock uv.lock from the `*.lock` glob

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When cassette.append is patched with _append_interaction, vcrpy's Cassette.append (which houses the full _before_record_response chain, including the decode_compressed_response decompression hook) is bypassed. As a result, gzip response bodies were stored un-decompressed as corrupted UTF-8 strings in cassettes, causing replay failures (e.g. zeep failing to parse a SOAP WSDL). Fix: capture cassette._before_record_response at the point of patching and pass it as an argument to _append_interaction. _append_interaction now calls this chain (which includes both decode_compressed_response and our sanitizer) instead of calling _before_record_response directly. Verified with CEPS component: WSDL cassette entry now stores valid XML and all 9 functional tests pass in replay mode. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
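The shape of the fix can be sketched with a stand-in sink in place of the real JSONL writer (names illustrative; the lambda stands in for vcrpy's composed `_before_record_response` chain described above):

```python
import functools


def append_interaction(before_record_response, sink, request, response):
    """Patched append: run the captured response-filter chain
    (decompression + sanitizers) before serialising, mirroring what
    vcrpy's Cassette.append would have done."""
    filtered = before_record_response(response)
    if filtered is None:  # a filter may veto the interaction entirely
        return
    sink.append((request, filtered))


# capture the chain at patch time, as the fix does with
# cassette._before_record_response, then bind it via functools.partial
decode_and_sanitize = lambda r: {**r, "body": r["body"].replace("gzip:", "")}
sink = []
patched_append = functools.partial(append_interaction, decode_and_sanitize, sink)

patched_append("GET /wsdl", {"body": "gzip:<wsdl/>"})
```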
Hardcoding a past date as the default freeze time was misleading and became stale over time. Change all scaffold_from_json / scaffold_from_dict / scaffold_tests defaults to None and resolve to datetime.now() at call time. Users can still pass an explicit --freeze-time to pin to a specific timestamp. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace ci.yml with push.yml (branches-ignore: main; ruff + flake8 steps; `--all-groups`)
- Add deploy.yml: publishes to TestPyPI on workflow_dispatch and to PyPI on release
- Move dev deps to `[dependency-groups]`; add ruff and flake8
- Add `[[tool.uv.index]]` for TestPyPI
- Reset version to 0.0.0 (replaced at release time from the git tag via `uv version`)
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
soustruh
approved these changes
Feb 24, 2026
soustruh
left a comment
Approved with suggestions for future improvements; implement now or later, it's up to you!
.pytest_cache/
.ruff_cache/
*.lock
!uv.lock
Collaborator
Author
I will get rid of it. It takes longer to explain than to get rid of it.
I know this is inspired by another of our projects, but I think we can safely ditch flake8 in 2026, it was great while it lasted! 🫡 😁
Collaborator
Author
Yup, that will be the next update, where I test the pipeline.
Summary
Recording large jobs (Facebook Ads/Pages, 100+ queries, 15 MB+ responses) would OOM at even 512 MB. This PR fixes the root causes and adds a comprehensive unit test suite.
Memory fixes
- `_VCRRecordingReader`: replaces `BytesIO(data)` (which copies bytes into its own buffer) with a wrapper that reads directly from the original bytes; drops VCR's extra body copy from 3× to ~1× per response
- Clear `_data` after read: releases `_VCRRecordingReader._data` once urllib3 finishes consuming the response, so only `requests.Response._content` holds the body (same as a plain run)
- `_BytesEncoder` + `json.dump` streaming: eliminates the two extra allocations per interaction that `compat.convert_to_unicode` + `json.dumps` created
- `_build_cassette_from_jsonl`: no longer re-parses and re-serializes each line (eliminates the 3× spike at the end of recording)
- `copy.deepcopy`: replaced with a targeted shallow copy in `streaming_append`

Profiler results (component-ceps, 6 years of 15-min electricity data, 4 endpoints, 36.8 MB cassette): +4.2 MB overhead, essentially noise.
Bug fix
`functools.partial` assigned as `__init__` doesn't implement the descriptor protocol (`__get__`), so `self` was never bound. Fixed by using `functools.partialmethod`.
Code quality
Tests
`test_recorder.py`, `test_sanitizers.py`, `test_scaffolder.py`, `test_validator.py` (1142 lines total)
Test plan
`_metadata`, `version`, and `interactions` keys
🤖 Generated with Claude Code