Eliminate OOM during VCR recording: streaming, zero-copy body handling, and full test suite #1

Merged
matyas-jirat-keboola merged 21 commits into main from feature/default-sanitizer-config
Feb 24, 2026
Conversation

@matyas-jirat-keboola
Collaborator

Summary

Recording large jobs (Facebook Ads/Pages, 100+ queries, 15 MB+ responses) would OOM at even 512 MB. This PR fixes the root causes and adds a comprehensive unit test suite.

Memory fixes

  • Stream interactions to JSONL during recording — each HTTP interaction is written to a temp file immediately after capture and removed from vcrpy's in-memory buffer, keeping memory flat regardless of job size
  • Zero-copy response body with _VCRRecordingReader — replaces BytesIO(data) (which copies bytes into its own buffer) with a wrapper that reads directly from the original bytes; drops VCR's extra body copy from 3× to ~1× per response
  • Release _data after read — clears _VCRRecordingReader._data once urllib3 finishes consuming the response, so only requests.Response._content holds the body (same as a plain run)
  • _BytesEncoder + json.dump streaming — eliminates two extra allocations per interaction that compat.convert_to_unicode + json.dumps created
  • Write JSONL lines as-is — _build_cassette_from_jsonl no longer re-parses + re-serializes each line (eliminated the 3× spike at recording end)
  • Eliminate copy.deepcopy — replaced with a targeted shallow copy in streaming_append
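
The streaming-to-JSONL approach from the first bullet can be sketched as follows (a minimal illustration, not the PR's actual code; the `StreamingRecorder` name and its interface are hypothetical):

```python
import json


class StreamingRecorder:
    """Sketch: write each interaction to a JSONL temp file immediately,
    so the cassette never accumulates interactions in memory."""

    def __init__(self, jsonl_path: str):
        self._fh = open(jsonl_path, "a", encoding="utf-8")

    def append(self, interaction: dict) -> None:
        # Serialize straight to the file handle; no JSON string for the
        # whole cassette is ever held in memory at once.
        json.dump(interaction, self._fh, sort_keys=True)
        self._fh.write("\n")
        self._fh.flush()

    def close(self) -> None:
        self._fh.close()
```

The final cassette can then be assembled by copying the JSONL lines into the output file one at a time, keeping peak memory proportional to a single interaction rather than the whole job.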

Profiler results (component-ceps, 6 years of 15-min electricity data, 4 endpoints, 36.8 MB cassette)

Run              Peak RSS   Elapsed
Plain (no VCR)   127.9 MB   67.3 s
VCR recording    132.1 MB   72.4 s

+4.2 MB overhead — essentially noise.

Bug fix

  • functools.partial assigned as __init__ doesn't implement the descriptor protocol (__get__), so self was never bound. Fixed by using functools.partialmethod.
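
The difference can be demonstrated in a few lines (a self-contained repro, not the PR's code; `custom_init` stands in for `_zero_copy_vcr_response_init`):

```python
import functools


def custom_init(self, recorded_response):
    self.recorded_response = recorded_response


class WithPartial:
    # functools.partial has no __get__, so during instantiation Python
    # calls it without binding the instance: custom_init receives the
    # constructor argument as `self` and `recorded_response` is missing.
    __init__ = functools.partial(custom_init)


class WithPartialmethod:
    # functools.partialmethod is a descriptor, so `self` is bound.
    __init__ = functools.partialmethod(custom_init)
```

`WithPartialmethod("resp")` works, while `WithPartial("resp")` raises `TypeError: custom_init() missing 1 required positional argument: 'recorded_response'` — the same failure mode described above.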

Code quality

  • Extracted nested functions to module level, renamed symbols for clarity, reordered by importance
  • Applied flake8/code review fixes across all four source modules

Tests

  • Added comprehensive unit tests: test_recorder.py, test_sanitizers.py, test_scaffolder.py, test_validator.py (1142 lines total)

Test plan

  • Unit tests pass locally
  • Memory profiler confirms ~0 MB overhead vs plain run (component-ceps)
  • Facebook Ads and Pages recording jobs complete successfully at 512 MB (component-meta)
  • Cassette is valid JSON with correct _metadata, version, and interactions keys

🤖 Generated with Claude Code

matyas-jirat-keboola and others added 11 commits February 17, 2026 18:27
DefaultSanitizer now accepts an optional `config` parameter that
auto-extracts #-prefixed Keboola secrets. record_debug_run reads
config.json from KBC_DATADIR and passes it directly — no separate
ConfigSecretsSanitizer needed in the chain.

Remove unused build_sanitizer_chain function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace vcrpy's in-memory accumulation with a streaming approach:
each recorded interaction is immediately serialized to a JSONL temp
file and popped from cassette.data, keeping memory usage constant
regardless of job size. The final cassette JSON is reconstructed
from the temp file after the component finishes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Primary fix: bypass original_append in streaming_append to avoid
copy.deepcopy on every response body. Instead, apply sanitizer hooks
directly with a shallow copy of the response dict — new dict containers
are allocated but the (immutable) body bytes object is shared, cutting
per-interaction overhead from 3× to 2× body size.

Secondary fix: stream-write the final cassette JSON line-by-line from
the JSONL temp file instead of loading all interactions into memory at
once. Metadata is now written inline as the first key, so the
_inject_metadata() read-entire-cassette pass is eliminated entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
VCRHTTPResponse previously created a BytesIO copy of every response body
(copy 1) on top of the original bytes already held in the cassette dict
(copy 0), then urllib3.read() produced a third copy in requests.Response.
Peak persistent bodies per interaction: 3×.

Fix: add _VCRRecordingReader, a minimal BytesIO-compatible wrapper that
holds the original bytes without copying.  Its full read() returns
data[self._pos:] which, due to CPython's slice optimisation, is the same
bytes object when pos==0 — so urllib3 → requests.Response gets the same
object, not a new allocation.

The VCRHTTPResponse.__init__ is monkey-patched inside record() only and
restored in a finally block, leaving replay completely unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Replace compat.convert_to_unicode + json.dumps with json.dump(cls=_BytesEncoder)
   in streaming_append.  The old path created (a) a mutated unicode response dict
   and (b) a full JSON string in memory simultaneously — 2× the body size extra.
   _BytesEncoder decodes bytes inline during streaming json.dump so neither
   intermediate object is allocated.

2. Remove json.loads + json.dumps(sort_keys=True) from _build_cassette_from_jsonl.
   For each JSONL line this created a full Python dict AND a full JSON string at the
   same time (3× the line size).  For a large response (e.g. 50 MB) this spiked
   150 MB at the end of recording.  Now lines are written as-is directly to the
   output file (no key sorting, but identical semantics for replay).

Combined with the earlier _VCRRecordingReader fix, peak VCR overhead per response
is now ~1× the body size (down from ~5-6× before all fixes).
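
An encoder of the shape point 1 describes might look like this (a sketch assuming UTF-8 text bodies; the actual _BytesEncoder may handle binary payloads differently):

```python
import json


class BytesEncoder(json.JSONEncoder):
    """Sketch: decode bytes lazily while json.dump streams to the file,
    so no pre-converted copy of the interaction dict is materialised."""

    def default(self, o):
        if isinstance(o, bytes):
            # Assumption: bodies are UTF-8 text. A real encoder would
            # need a fallback (e.g. base64) for non-UTF-8 payloads.
            return o.decode("utf-8")
        return super().default(o)
```

`json.dump(interaction, fh, cls=BytesEncoder)` then writes directly to the file handle instead of first building the full JSON string via `json.dumps`.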

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `from __future__ import annotations` to all four files; replace
  `Dict`/`List`/`Optional` with built-in generics (`dict`, `list`, `X | None`)
- recorder.py: promote deferred imports (`base64`, `os`, `datetime/timezone`,
  `VCRHTTPResponse`) to module-level; move `_VCRHTTPResponse` into the
  `try: import vcr` guard block
- sanitizers.py: promote `pathlib.Path` to module-level; remove redundant
  `import json` inside `BodyFieldSanitizer._sanitize_body`; fix
  `QueryParameterTokenSanitizer` default annotation (`list[str] | None`);
  pre-compile `ResponseUrlSanitizer._url_re` in `__init__`; add deprecation
  notice to `ConfigSecretsSanitizer` pointing to `DefaultSanitizer(config=...)`
- validator.py: promote `import fnmatch` to module-level; narrow
  `ValidationDiff.diff_type` to `Literal["hash_mismatch", "missing", "unexpected"]`
- scaffolder.py: promote `import sys` and `from runpy import run_path` to
  module-level; remove dead `_ = definition.get("description", "")` assignment;
  remove empty `TestScaffolder.__init__`
- Add `.flake8` config with `extend-ignore = E203` (ruff slice-spacing false positive)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During recording, urllib3 reads each response body in 64 KB chunks via
stream().  Previously _VCRRecordingReader._data was never cleared, so
every VCRHTTPResponse held two copies of its body simultaneously:
  - _VCRRecordingReader._data  (our reader)
  - requests.Response._content  (assembled by b"".join(chunks))

For a job with 17 large responses (~15 MB each) this caused ~170 MB
of extra resident memory — enough to OOM at 512 MB.

Fix: clear _data = b"" once all bytes are consumed (either full read()
or chunked reads reaching the end).  After that point urllib3/requests
already holds the assembled response._content, so _data can be released.
Net overhead vs a plain run: zero persistent extra copies per response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- recorder.py: extract _zero_copy_vcr_init as module-level
  _zero_copy_vcr_response_init (patched via functools.partial);
  extract streaming_append as VCRRecorder._append_interaction
  (assigned via functools.partial); rename _build_cassette_from_jsonl
  -> _write_cassette; reorder VCRRecorder methods and module-level
  classes with most-important-first ordering
- sanitizers.py: rename QueryParameterTokenSanitizer -> QueryParamSanitizer
  and _walk_for_hash_keys -> _collect_hash_values; reorder classes
  (BaseSanitizer, DefaultSanitizer, CompositeSanitizer first)
- validator.py: reorder dataclasses (ValidationResult first) and
  OutputSnapshot methods (validate/capture before private helpers)
- scaffolder.py: rename _detect_and_transform_raw_configs ->
  _normalize_definitions and _replace_secrets_with_placeholders ->
  _mask_secrets; extract nested replace_in_value as static
  _replace_value; reorder TestScaffolder methods public-first
- __init__.py: update QueryParameterTokenSanitizer -> QueryParamSanitizer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
functools.partial does not implement the descriptor protocol (__get__),
so assigning it as __init__ meant Python called it without automatically
binding the instance as the first argument.  This caused:

  _zero_copy_vcr_response_init() missing 1 required positional argument:
  'recorded_response'

functools.partialmethod is a descriptor and handles self-binding correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes:
- seek(): raise ValueError for invalid whence (issue 4)
- getbuffer(): expand docstring for post-release empty memoryview (issue 5)
- extract_values(): make `values` param optional with None default (issue 6)
- _sanitize_body(): preserve original JSON formatting when no redaction (issue 7)
- record_debug_run(): load config.secrets.json and pass values to DefaultSanitizer (issue 8)

Tests (119 passing):
- tests/conftest.py: shared fixtures
- tests/test_sanitizers.py: all 11 sanitizer classes + extract_values (51 tests)
- tests/test_recorder.py: _VCRRecordingReader, JsonIndentedSerializer, VCRRecorder (35 tests)
- tests/test_validator.py: OutputSnapshot, ValidationResult, _get_csv_metadata (16 tests)
- tests/test_scaffolder.py: TestScaffolder methods, scaffold_tests (17 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
matyas-jirat-keboola and others added 10 commits February 23, 2026 08:46
…ed change tracking

- pyproject.toml: pin vcrpy<9 with comment explaining private API dependency
- record(): wrap _write_cassette in try/finally so temp_path is always cleaned up
- record(): add inline comment noting global VCRHTTPResponse patch is not thread-safe
- _zero_copy_vcr_response_init: add CRITICAL comment on body["string"] = b"" explaining
  why the empty placeholder must not be changed (prevents BytesIO copy of every body)
- _append_interaction: add sort_keys=True so interaction JSON keys are always sorted,
  keeping cassette diffs stable across recordings
- _sanitize_json_value: replace post-hoc sanitized == data O(n) equality check with
  a mutable changed=[False] flag; eliminates a full second traversal of clean responses
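
The changed-flag pattern from the last bullet can be sketched like this (a standalone illustration; the real _sanitize_json_value signature may differ):

```python
def sanitize_json_value(data, secrets, changed):
    """Recursively redact secret string values, recording whether
    anything was touched in `changed` (a one-element list used as a
    mutable flag) instead of comparing the full result to the input."""
    if isinstance(data, dict):
        return {k: sanitize_json_value(v, secrets, changed) for k, v in data.items()}
    if isinstance(data, list):
        return [sanitize_json_value(v, secrets, changed) for v in data]
    if isinstance(data, str) and data in secrets:
        changed[0] = True
        return "***REDACTED***"
    return data
```

The caller checks `changed[0]` afterwards and, when it is still False, keeps the original body text untouched, skipping re-serialization of clean responses.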

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- .github/workflows/ci.yml: run pytest on every push and PR across
  Python 3.10, 3.11, 3.12, 3.13 using uv sync --frozen --extra dev
- pyproject.toml: add [dev] extras (pytest, pytest-cov); tighten
  requires-python to >=3.10 to match vcrpy 8.1.1's own requirement;
  drop 3.8/3.9 classifiers that were already unsatisfiable
- uv.lock: generated lockfile for reproducible CI installs
- .gitignore: unblock uv.lock from the *.lock glob

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When cassette.append is patched with _append_interaction, vcrpy's
Cassette.append (which houses the full _before_record_response chain,
including the decode_compressed_response decompression hook) is
bypassed. As a result, gzip response bodies were stored un-decompressed
as corrupted UTF-8 strings in cassettes, causing replay failures (e.g.
zeep failing to parse a SOAP WSDL).

Fix: capture cassette._before_record_response at the point of patching
and pass it as an argument to _append_interaction. _append_interaction
now calls this chain (which includes both decode_compressed_response and
our sanitizer) instead of calling _before_record_response directly.
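
The capture-and-pass pattern can be sketched like this (hypothetical helper names; it assumes, as the commit describes, that the captured chain is a single callable composing decompression and sanitization):

```python
def make_streaming_append(before_record_response, write_line):
    """Sketch: build the streaming append with the response filter
    chain captured at patch time, so hooks such as
    decode_compressed_response still run before serialization."""

    def append(request, response):
        if before_record_response is not None:
            # Runs the full chain: decompression first, then sanitizers.
            response = before_record_response(response)
        write_line({"request": request, "response": response})

    return append
```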

Verified with CEPS component: WSDL cassette entry now stores valid XML
and all 9 functional tests pass in replay mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardcoding a past date as the default freeze time was misleading and
became stale over time. Change all scaffold_from_json / scaffold_from_dict
/ scaffold_tests defaults to None and resolve to datetime.now() at call
time. Users can still pass an explicit --freeze-time to pin to a
specific timestamp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace ci.yml with push.yml (branches-ignore: main, ruff + flake8 steps, --all-groups)
- Add deploy.yml: publishes to TestPyPI on workflow_dispatch, to PyPI on release
- Move dev deps to [dependency-groups], add ruff and flake8
- Add [[tool.uv.index]] for TestPyPI
- Reset version to 0.0.0 (replaced at release time from git tag via uv version)
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@soustruh soustruh left a comment

Approved with suggestions for future improvements, implement now or later, it's ↑2U!

.pytest_cache/
.ruff_cache/
*.lock
!uv.lock


Why? 🤔

Collaborator Author

I will get rid of it. It takes longer to explain than to get rid of it.


I know this is inspired by another of our projects, but I think we can safely ditch flake8 in 2026, it was great while it lasted! 🫡 😁

Collaborator Author

Yup, that will be the next update, where I test the pipeline.

@matyas-jirat-keboola matyas-jirat-keboola merged commit 173d382 into main Feb 24, 2026
3 checks passed