Eliminate OOM during VCR recording: streaming, zero-copy body handling, and full test suite #1

Merged
matyas-jirat-keboola merged 21 commits into main from feature/default-sanitizer-config
Feb 24, 2026
Conversation

@matyas-jirat-keboola
Collaborator

Summary

Recording large jobs (Facebook Ads/Pages, 100+ queries, 15 MB+ responses) would OOM at even 512 MB. This PR fixes the root causes and adds a comprehensive unit test suite.

Memory fixes

  • Stream interactions to JSONL during recording — each HTTP interaction is written to a temp file immediately after capture and removed from vcrpy's in-memory buffer, keeping memory flat regardless of job size
  • Zero-copy response body with _VCRRecordingReader — replaces BytesIO(data) (which copies bytes into its own buffer) with a wrapper that reads directly from the original bytes; drops VCR's extra body copy from 3× to ~1× per response
  • Release _data after read — clears _VCRRecordingReader._data once urllib3 finishes consuming the response, so only requests.Response._content holds the body (same as a plain run)
  • _BytesEncoder + json.dump streaming — eliminates two extra allocations per interaction that compat.convert_to_unicode + json.dumps created
  • Write JSONL lines as-is — _build_cassette_from_jsonl no longer re-parses + re-serializes each line (eliminated the 3× spike at recording end)
  • Eliminate copy.deepcopy — replaced with a targeted shallow copy in streaming_append
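
The streaming-to-JSONL approach from the first bullet can be sketched as follows (a minimal illustration, not the PR's actual code; the `StreamingRecorder` name and its interface are hypothetical):

```python
import json


class StreamingRecorder:
    """Sketch: write each interaction to a JSONL temp file immediately,
    so the cassette never accumulates interactions in memory."""

    def __init__(self, jsonl_path: str):
        self._fh = open(jsonl_path, "a", encoding="utf-8")

    def append(self, interaction: dict) -> None:
        # Serialize straight to the file handle; no JSON string for the
        # whole cassette is ever held in memory at once.
        json.dump(interaction, self._fh, sort_keys=True)
        self._fh.write("\n")
        self._fh.flush()

    def close(self) -> None:
        self._fh.close()
```

The final cassette can then be assembled by copying the JSONL lines into the output file one at a time, keeping peak memory proportional to a single interaction rather than the whole job.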

Profiler results (component-ceps, 6 years of 15-min electricity data, 4 endpoints, 36.8 MB cassette)

Run              Peak RSS   Elapsed
Plain (no VCR)   127.9 MB   67.3 s
VCR recording    132.1 MB   72.4 s

+4.2 MB overhead — essentially noise.

Bug fix

  • functools.partial assigned as __init__ doesn't implement the descriptor protocol (__get__), so self was never bound. Fixed by using functools.partialmethod.
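
The difference can be demonstrated in a few lines (a self-contained repro, not the PR's code; `custom_init` stands in for `_zero_copy_vcr_response_init`):

```python
import functools


def custom_init(self, recorded_response):
    self.recorded_response = recorded_response


class WithPartial:
    # functools.partial has no __get__, so during instantiation Python
    # calls it without binding the instance: custom_init receives the
    # constructor argument as `self` and `recorded_response` is missing.
    __init__ = functools.partial(custom_init)


class WithPartialmethod:
    # functools.partialmethod is a descriptor, so `self` is bound.
    __init__ = functools.partialmethod(custom_init)
```

`WithPartialmethod("resp")` works, while `WithPartial("resp")` raises `TypeError: custom_init() missing 1 required positional argument: 'recorded_response'` — the same failure mode described above.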

Code quality

  • Extracted nested functions to module level, renamed symbols for clarity, reordered by importance
  • Applied flake8/code review fixes across all four source modules

Tests

  • Added comprehensive unit tests: test_recorder.py, test_sanitizers.py, test_scaffolder.py, test_validator.py (1142 lines total)

Test plan

  • Unit tests pass locally
  • Memory profiler confirms ~0 MB overhead vs plain run (component-ceps)
  • Facebook Ads and Pages recording jobs complete successfully at 512 MB (component-meta)
  • Cassette is valid JSON with correct _metadata, version, and interactions keys

🤖 Generated with Claude Code

matyas-jirat-keboola and others added 11 commits February 17, 2026 18:27
DefaultSanitizer now accepts an optional `config` parameter that
auto-extracts #-prefixed Keboola secrets. record_debug_run reads
config.json from KBC_DATADIR and passes it directly — no separate
ConfigSecretsSanitizer needed in the chain.

Remove unused build_sanitizer_chain function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace vcrpy's in-memory accumulation with a streaming approach:
each recorded interaction is immediately serialized to a JSONL temp
file and popped from cassette.data, keeping memory usage constant
regardless of job size. The final cassette JSON is reconstructed
from the temp file after the component finishes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Primary fix: bypass original_append in streaming_append to avoid
copy.deepcopy on every response body. Instead, apply sanitizer hooks
directly with a shallow copy of the response dict — new dict containers
are allocated but the (immutable) body bytes object is shared, cutting
per-interaction overhead from 3× to 2× body size.

Secondary fix: stream-write the final cassette JSON line-by-line from
the JSONL temp file instead of loading all interactions into memory at
once. Metadata is now written inline as the first key, so the
_inject_metadata() read-entire-cassette pass is eliminated entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
VCRHTTPResponse previously created a BytesIO copy of every response body
(copy 1) on top of the original bytes already held in the cassette dict
(copy 0), then urllib3.read() produced a third copy in requests.Response.
Peak persistent bodies per interaction: 3×.

Fix: add _VCRRecordingReader, a minimal BytesIO-compatible wrapper that
holds the original bytes without copying.  Its full read() returns
data[self._pos:] which, due to CPython's slice optimisation, is the same
bytes object when pos==0 — so urllib3 → requests.Response gets the same
object, not a new allocation.

The VCRHTTPResponse.__init__ is monkey-patched inside record() only and
restored in a finally block, leaving replay completely unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Replace compat.convert_to_unicode + json.dumps with json.dump(cls=_BytesEncoder)
   in streaming_append.  The old path created (a) a mutated unicode response dict
   and (b) a full JSON string in memory simultaneously — 2× the body size extra.
   _BytesEncoder decodes bytes inline during streaming json.dump so neither
   intermediate object is allocated.

2. Remove json.loads + json.dumps(sort_keys=True) from _build_cassette_from_jsonl.
   For each JSONL line this created a full Python dict AND a full JSON string at the
   same time (3× the line size).  For a large response (e.g. 50 MB) this spiked
   150 MB at the end of recording.  Now lines are written as-is directly to the
   output file (no key sorting, but identical semantics for replay).

Combined with the earlier _VCRRecordingReader fix, peak VCR overhead per response
is now ~1× the body size (down from ~5-6× before all fixes).
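
An encoder of the shape point 1 describes might look like this (a sketch assuming UTF-8 text bodies; the actual _BytesEncoder may handle binary payloads differently):

```python
import json


class BytesEncoder(json.JSONEncoder):
    """Sketch: decode bytes lazily while json.dump streams to the file,
    so no pre-converted copy of the interaction dict is materialised."""

    def default(self, o):
        if isinstance(o, bytes):
            # Assumption: bodies are UTF-8 text. A real encoder would
            # need a fallback (e.g. base64) for non-UTF-8 payloads.
            return o.decode("utf-8")
        return super().default(o)
```

`json.dump(interaction, fh, cls=BytesEncoder)` then writes directly to the file handle instead of first building the full JSON string via `json.dumps`.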

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `from __future__ import annotations` to all four files; replace
  `Dict`/`List`/`Optional` with built-in generics (`dict`, `list`, `X | None`)
- recorder.py: promote deferred imports (`base64`, `os`, `datetime/timezone`,
  `VCRHTTPResponse`) to module-level; move `_VCRHTTPResponse` into the
  `try: import vcr` guard block
- sanitizers.py: promote `pathlib.Path` to module-level; remove redundant
  `import json` inside `BodyFieldSanitizer._sanitize_body`; fix
  `QueryParameterTokenSanitizer` default annotation (`list[str] | None`);
  pre-compile `ResponseUrlSanitizer._url_re` in `__init__`; add deprecation
  notice to `ConfigSecretsSanitizer` pointing to `DefaultSanitizer(config=...)`
- validator.py: promote `import fnmatch` to module-level; narrow
  `ValidationDiff.diff_type` to `Literal["hash_mismatch", "missing", "unexpected"]`
- scaffolder.py: promote `import sys` and `from runpy import run_path` to
  module-level; remove dead `_ = definition.get("description", "")` assignment;
  remove empty `TestScaffolder.__init__`
- Add `.flake8` config with `extend-ignore = E203` (ruff slice-spacing false positive)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During recording, urllib3 reads each response body in 64 KB chunks via
stream().  Previously _VCRRecordingReader._data was never cleared, so
every VCRHTTPResponse held two copies of its body simultaneously:
  - _VCRRecordingReader._data  (our reader)
  - requests.Response._content  (assembled by b"".join(chunks))

For a job with 17 large responses (~15 MB each) this caused ~170 MB
of extra resident memory — enough to OOM at 512 MB.

Fix: clear _data = b"" once all bytes are consumed (either full read()
or chunked reads reaching the end).  After that point urllib3/requests
already holds the assembled response._content, so _data can be released.
Net overhead vs a plain run: zero persistent extra copies per response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- recorder.py: extract _zero_copy_vcr_init as module-level
  _zero_copy_vcr_response_init (patched via functools.partial);
  extract streaming_append as VCRRecorder._append_interaction
  (assigned via functools.partial); rename _build_cassette_from_jsonl
  -> _write_cassette; reorder VCRRecorder methods and module-level
  classes with most-important-first ordering
- sanitizers.py: rename QueryParameterTokenSanitizer -> QueryParamSanitizer
  and _walk_for_hash_keys -> _collect_hash_values; reorder classes
  (BaseSanitizer, DefaultSanitizer, CompositeSanitizer first)
- validator.py: reorder dataclasses (ValidationResult first) and
  OutputSnapshot methods (validate/capture before private helpers)
- scaffolder.py: rename _detect_and_transform_raw_configs ->
  _normalize_definitions and _replace_secrets_with_placeholders ->
  _mask_secrets; extract nested replace_in_value as static
  _replace_value; reorder TestScaffolder methods public-first
- __init__.py: update QueryParameterTokenSanitizer -> QueryParamSanitizer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
functools.partial does not implement the descriptor protocol (__get__),
so assigning it as __init__ meant Python called it without automatically
binding the instance as the first argument.  This caused:

  _zero_copy_vcr_response_init() missing 1 required positional argument:
  'recorded_response'

functools.partialmethod is a descriptor and handles self-binding correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes:
- seek(): raise ValueError for invalid whence (issue 4)
- getbuffer(): expand docstring for post-release empty memoryview (issue 5)
- extract_values(): make `values` param optional with None default (issue 6)
- _sanitize_body(): preserve original JSON formatting when no redaction (issue 7)
- record_debug_run(): load config.secrets.json and pass values to DefaultSanitizer (issue 8)

Tests (119 passing):
- tests/conftest.py: shared fixtures
- tests/test_sanitizers.py: all 11 sanitizer classes + extract_values (51 tests)
- tests/test_recorder.py: _VCRRecordingReader, JsonIndentedSerializer, VCRRecorder (35 tests)
- tests/test_validator.py: OutputSnapshot, ValidationResult, _get_csv_metadata (16 tests)
- tests/test_scaffolder.py: TestScaffolder methods, scaffold_tests (17 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
matyas-jirat-keboola and others added 10 commits February 23, 2026 08:46
…ed change tracking

- pyproject.toml: pin vcrpy<9 with comment explaining private API dependency
- record(): wrap _write_cassette in try/finally so temp_path is always cleaned up
- record(): add inline comment noting global VCRHTTPResponse patch is not thread-safe
- _zero_copy_vcr_response_init: add CRITICAL comment on body["string"] = b"" explaining
  why the empty placeholder must not be changed (prevents BytesIO copy of every body)
- _append_interaction: add sort_keys=True so interaction JSON keys are always sorted,
  keeping cassette diffs stable across recordings
- _sanitize_json_value: replace post-hoc sanitized == data O(n) equality check with
  a mutable changed=[False] flag; eliminates a full second traversal of clean responses
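
The changed-flag pattern from the last bullet can be sketched like this (a standalone illustration; the real _sanitize_json_value signature may differ):

```python
def sanitize_json_value(data, secrets, changed):
    """Recursively redact secret string values, recording whether
    anything was touched in `changed` (a one-element list used as a
    mutable flag) instead of comparing the full result to the input."""
    if isinstance(data, dict):
        return {k: sanitize_json_value(v, secrets, changed) for k, v in data.items()}
    if isinstance(data, list):
        return [sanitize_json_value(v, secrets, changed) for v in data]
    if isinstance(data, str) and data in secrets:
        changed[0] = True
        return "***REDACTED***"
    return data
```

The caller checks `changed[0]` afterwards and, when it is still False, keeps the original body text untouched, skipping re-serialization of clean responses.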

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- .github/workflows/ci.yml: run pytest on every push and PR across
  Python 3.10, 3.11, 3.12, 3.13 using uv sync --frozen --extra dev
- pyproject.toml: add [dev] extras (pytest, pytest-cov); tighten
  requires-python to >=3.10 to match vcrpy 8.1.1's own requirement;
  drop 3.8/3.9 classifiers that were already unsatisfiable
- uv.lock: generated lockfile for reproducible CI installs
- .gitignore: unblock uv.lock from the *.lock glob

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When cassette.append is patched with _append_interaction, vcrpy's
Cassette.append (which houses the full _before_record_response chain,
including the decode_compressed_response decompression hook) is
bypassed. As a result, gzip response bodies were stored un-decompressed
as corrupted UTF-8 strings in cassettes, causing replay failures (e.g.
zeep failing to parse a SOAP WSDL).

Fix: capture cassette._before_record_response at the point of patching
and pass it as an argument to _append_interaction. _append_interaction
now calls this chain (which includes both decode_compressed_response and
our sanitizer) instead of calling _before_record_response directly.
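
The capture-and-pass pattern can be sketched like this (hypothetical helper names; it assumes, as the commit describes, that the captured chain is a single callable composing decompression and sanitization):

```python
def make_streaming_append(before_record_response, write_line):
    """Sketch: build the streaming append with the response filter
    chain captured at patch time, so hooks such as
    decode_compressed_response still run before serialization."""

    def append(request, response):
        if before_record_response is not None:
            # Runs the full chain: decompression first, then sanitizers.
            response = before_record_response(response)
        write_line({"request": request, "response": response})

    return append
```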

Verified with CEPS component: WSDL cassette entry now stores valid XML
and all 9 functional tests pass in replay mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardcoding a past date as the default freeze time was misleading and
became stale over time. Change all scaffold_from_json / scaffold_from_dict
/ scaffold_tests defaults to None and resolve to datetime.now() at call
time. Users can still pass an explicit --freeze-time to pin to a
specific timestamp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace ci.yml with push.yml (branches-ignore: main, ruff + flake8 steps, --all-groups)
- Add deploy.yml: publishes to TestPyPI on workflow_dispatch, to PyPI on release
- Move dev deps to [dependency-groups], add ruff and flake8
- Add [[tool.uv.index]] for TestPyPI
- Reset version to 0.0.0 (replaced at release time from git tag via uv version)
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@soustruh soustruh left a comment

Approved with suggestions for future improvements, implement now or later, it's ↑2U!

.pytest_cache/
.ruff_cache/
*.lock
!uv.lock


Why? 🤔

Collaborator Author

I will get rid of it. It takes longer to explain than to get rid of it.


I know this is inspired by another of our projects, but I think we can safely ditch flake8 in 2026, it was great while it lasted! 🫡 😁

Collaborator Author

Yup, that will be the next update, where I test the pipeline.

@matyas-jirat-keboola matyas-jirat-keboola merged commit 173d382 into main Feb 24, 2026
3 checks passed