feat: precompute per_node.db in recce init --cloud [DRC-3295]#1334
Conversation
Introduces recce/util/per_node_db.py: a dependency-light SQLite emitter module for DRC-3295 Phase III (lineage API data transfer optimization). The emitter reads a dbt manifest (and optional catalog) and produces a per_node.db SQLite file with meta/nodes/columns/edges/node_tests tables, WAL journaling, and user_version = SCHEMA_VERSION. This is PR 1 of 3 (stacked): - PR 1 (this): pure emitter module + unit tests, no CLI wiring. - PR 2: wire the emitter into `recce init --cloud`. - PR 3: scale tests against real jaffle-shop-expand artifacts. Primary-key and test derivation mirror DbtAdapter.get_model() (recce/adapter/dbt_adapter/__init__.py:447-499) so PR 2 can wire the emitter into the existing upload path with no behavioural drift. node_json uses JSON bytes rather than msgpack because msgpack is not currently a runtime dep of recce; the encoder is isolated behind a private helper so we can swap it later without touching the schema. Tested in isolation with 8 unit tests (schema/WAL, round-trips for each table, primary_key derivation, null catalog, empty manifest). No CLI or cloud code is touched. Refs: DRC-3295 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: even-wei <evenwei@infuseai.io>
Codecov Report❌ Patch coverage is
🚀 New features to boost your workflow:
|
PR 2/3 of DRC-3295. Wires PR 1's per-node SQLite emitter into the `recce init --cloud` command so Cloud can stream lineage rows without proxying to an ephemeral Recce instance. Key changes: - In cloud mode, write `cll_cache.db` to a `tempfile.mkdtemp(prefix= "recce-cll-")` scratch dir so build_full_cll_map keeps its warm-cache perf, but the cache.db never leaves the container (ECS task GC on exit). A user-provided --cache-db is intentionally ignored in cloud mode. - After build_full_cll_map(), emit per_node.db (manifest + catalog rows, not_null/unique tests, primary_key, edges) to the same scratch dir and upload it via the new `per_node_db_url` upload key. - Graceful degradation: if Cloud has not yet added per_node_db_url, log a warning and continue — old CLI + new Cloud and vice versa stay compatible. - Remove the cll_cache.db cloud upload block; cache.db is now local-only in cloud mode. - Clean up the scratch dir on successful upload; leave it on failure for debugging (ECS reclaims it anyway). Tests: - tests/test_cli_per_node_db.py: integration tests for per_node.db upload, local-mode non-regression, tempdir cleanup, mode-switching safety, and a contract test vs. DbtAdapter.get_model() using the jaffle_shop fixture (manifest + catalog round-trip → reconstructed get_model() payload matches live adapter output). - Update test_cli_cache.py::test_init_cloud_upload_partial_failure to use per_node_db_url instead of the removed cll_cache_url upload path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: even-wei <evenwei@infuseai.io>
Original PR 2 removed the cll_cache.db cloud upload path under the mistaken assumption that cache.db was unwanted in cloud mode. cache.db is load-bearing for cross-session warm-cache reuse (OSS local users and cloud sessions alike — downloaded at init, uploaded at completion, reused by build_full_cll_map). Restore the cll_cache_upload_url path and return cache_db to its pre-PR-2 default path. per_node.db remains tempdir-scoped (genuinely throwaway per-task artifact) and continues to upload via per_node_db_url. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: even-wei <evenwei@infuseai.io>
PR 3 scope for DRC-3295. Gated via RECCE_SCALE_TESTS=1 and marked `@pytest.mark.scale` so CI and default `make test` runs skip it. Validates size + extraction/query latency on the 1918-node jaffle-shop-expand project; skips gracefully if the on-disk fixture directory is absent. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: even-wei <evenwei@infuseai.io>
Code Review: PR #1334 (paired with DataRecce/recce-cloud-infra#1239)Scope: feat — precompute Reviewed alongside the cloud-infra side (PR #1239) for contract parity. Validation ResultsPass A: Correctness & Logic — PASS
Pass C: Cross-Reference Consistency — PASS
Pass D: Error Handling & Edge Cases — PASS (with one NOTE)
Pass E: Test Coverage & Quality — PASS (with one ISSUE)
ISSUE — hardcoded user-specific path in scale test ( _SCALE_FIXTURE_DIR = Path("/Users/evenwei/InfuseAI/workspace/jaffle-shop-expand-main")This path exists only on the author's workstation. The test is gated behind _SCALE_FIXTURE_DIR = Path(os.getenv("RECCE_SCALE_FIXTURE_DIR", Path(__file__).parent / "fixtures" / "jaffle-shop-expand"))Non-blocking; functional today, but the test is a permanent dead path for everyone else. Pass F: Diff-Specific Checks — PASS
Pass G: Performance — PASS
Additional Notes
Verification Results
Verdict: GONotes
Ready to merge after the cloud-infra companion (#1239) lands, since the CLI gracefully tolerates either merge order. |
There was a problem hiding this comment.
Pull request overview
Adds a new per_node.db SQLite artifact emitted by recce init --cloud, precomputing per-node lineage/model data from dbt artifacts so Recce Cloud can serve model/lineage APIs without proxying to an ephemeral Recce instance.
Changes:
- Introduces
recce/util/per_node_db.pyto extract rows from dbt manifest/catalog and write a denormalized SQLite schema. - Wires per-node DB emission + upload into
recce init --cloudwhile preserving existingcll_cache.dbupload behavior. - Adds unit + integration/contract tests (including an opt-in scale test marker).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
recce/util/per_node_db.py |
New SQLite emitter + artifact-to-row extraction logic. |
recce/cli.py |
Emits per_node.db in cloud init and uploads it via per_node_db_url. |
tests/test_per_node_db.py |
Unit tests for schema creation and row extraction/writes. |
tests/test_cli_per_node_db.py |
Cloud/local CLI integration tests + contract test vs DbtAdapter.get_model() + opt-in scale tests. |
tests/test_cli_cache.py |
Updates a comment to reflect cll_cache.db upload failure scenario. |
pyproject.toml |
Adds scale pytest marker registration. |
Comments suppressed due to low confidence (1)
recce/cli.py:367
per_node_scratchtempdir is created before validating--cloud-token/--session-id, but the function callsexit(1)on missing/invalid inputs. That will leak therecce-per-node-*directory on those early-exit paths. Create the tempdir only after all required cloud args/session checks succeed, or ensure it is cleaned up via atry/finally(or Click exception handling) on all exit paths.
if is_cloud:
from recce.util.recce_cloud import RecceCloud, RecceCloudException
per_node_scratch = Path(tempfile.mkdtemp(prefix="recce-per-node-"))
cloud_token = kwargs.get("cloud_token") or kwargs.get("api_token")
if not cloud_token:
console.print("[[red]Error[/red]] --cloud requires --cloud-token or --api-token (or GITHUB_TOKEN env var).")
exit(1)
Code Review — PR #1334SummaryAdds a Verified locally: Findings[Critical]
|
gcko
left a comment
There was a problem hiding this comment.
Claude Code Review: Critical issues found — 3 Critical + 4 Warning. See #1334 (comment) for details.
Headline findings: (1) primary_key tiebreaker in per_node_db.py disagrees with DbtAdapter.get_model() on models with multiple unique tests; (2) scratch tempdir leaks on every cloud upload failure path; (3) scale test fixture is hardcoded to the author's workstation and never runs for anyone else.
Generated by Claude Code
Critical fixes: - primary_key divergence from DbtAdapter.get_model: derive primary_key by scanning columns in catalog order (which matches warehouse column order from `dbt docs generate`) rather than manifest child_map iteration order. Adds tests with multiple unique tests to pin the tiebreaker. - Scratch tempdir leaked on every cloud upload failure path. Move creation to right before the emit block, wrap emit+upload in try/finally, and unconditionally `shutil.rmtree` the scratch. Adds tests covering HTTP 500 on per_node.db PUT and get_upload_urls_by_session_id raising. - Scale-fixture path was hardcoded to a developer workstation, making the scale class dead code for everyone else. Read RECCE_SCALE_FIXTURE_DIR with a committed-fallback at tests/fixtures/jaffle-shop-expand. Warnings: - Dropped `PRAGMA journal_mode = WAL` on the writer. One-shot writers that are immediately uploaded as a single file must be self-contained at close time; WAL would require an explicit truncating checkpoint for that and offers no benefit without concurrent readers. Test now asserts rollback journal mode and that no -wal/-shm sidecars survive. - Removed `assert per_node_scratch is not None` (stripped under `python -O`) in favor of structurally guaranteeing the assignment inside `if is_cloud`. - Added a rollback test for PerNodeDbWriter.__exit__. - Added a CLI test for the emit-failure graceful-degradation path. Docstring: dropped hard-coded line numbers pointing at DbtAdapter.get_model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: even-wei <evenwei@infuseai.io>
|
@gcko thanks for the careful pass — all three Criticals and the four Warnings addressed in 29bbc99. Critical findings1.
Two new tests in
2. Scratch tempdir leak on upload failure. Fully agree — the
Three new tests in
3. Dead scale fixture path. Replaced the hardcoded path with: _SCALE_FIXTURE_DIR = Path(
os.getenv("RECCE_SCALE_FIXTURE_DIR") or (Path(__file__).parent / "fixtures" / "jaffle-shop-expand")
)Contributors can point Warnings4. WAL sidecars not checkpointed before upload. Took your first option — dropped WAL entirely. One-shot writer has no concurrent-reader use case, and the default rollback journal keeps the main file self-contained at close with no extra ceremony. Test now asserts 5. 6. Rollback branch of 7. No test for CLI emit-failure graceful degradation. Covered by 8. Scratch cleanup on failure untested. Covered by the two new scratch-cleanup tests above. Additional notes you raised
VerificationCoverage on |
| envs_to_emit = [ | ||
| ("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog), | ||
| ("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog), | ||
| ] |
There was a problem hiding this comment.
In cloud mode, envs_to_emit includes ("base", dbt_adapter.base_manifest, ...) whenever base_manifest is non-None. Earlier in init you may intentionally point target_base_path at target_path when base artifacts are missing (so load_context doesn’t fail). In that case base_manifest can exist but actually represents the current env, causing per_node.db to incorrectly contain a fabricated env='base' copy of current data. Consider building envs_to_emit from the has_target / has_base flags (or the earlier envs list) so you only emit rows for environments that truly have artifacts.
| envs_to_emit = [ | |
| ("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog), | |
| ("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog), | |
| ] | |
| envs_to_emit = [] | |
| has_target_artifacts = locals().get("has_target") | |
| has_base_artifacts = locals().get("has_base") | |
| if has_target_artifacts is None: | |
| has_target_artifacts = dbt_adapter.curr_manifest is not None | |
| if has_base_artifacts is None: | |
| has_base_artifacts = dbt_adapter.base_manifest is not None | |
| if has_target_artifacts: | |
| envs_to_emit.append(("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog)) | |
| if has_base_artifacts: | |
| envs_to_emit.append(("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog)) |
There was a problem hiding this comment.
Addressed in 6766eca. The emit loop now builds envs_to_emit from has_target / has_base (matching the CLL cache loop above). This kills the phantom-base-env case where the earlier path-swap leaves both manifests populated but pointing at the same env.
Added a CLI test test_no_base_rows_when_only_target_has_artifacts that reproduces the path-swap outcome (curr_manifest and base_manifest set to the same object, base_download_urls empty) and asserts SELECT COUNT(*) FROM nodes WHERE env='base' is 0 in the uploaded per_node.db.
| if is_cloud: | ||
| per_node_scratch = Path(tempfile.mkdtemp(prefix="recce-per-node-")) | ||
| try: | ||
| upload_urls = cloud_client.get_upload_urls_by_session_id(cloud_org_id, cloud_project_id, session_id) | ||
|
|
||
| # Upload CLL map | ||
| cll_map_upload_url = upload_urls.get("cll_map_url") | ||
| if cll_map_upload_url and cll_map_path.is_file(): | ||
| try: | ||
| with open(cll_map_path, "rb") as f: | ||
| resp = requests.put( | ||
| cll_map_upload_url, | ||
| data=f, | ||
| headers={"Content-Type": "application/json"}, | ||
| timeout=_UPLOAD_TIMEOUT, | ||
| per_node_db_path: Optional[Path] = per_node_scratch / "per_node.db" | ||
| console.print("\n[bold]Emitting per-node SQLite...[/bold]") | ||
| t_pn_start = time.perf_counter() | ||
| try: |
There was a problem hiding this comment.
per_node.db is emitted before fetching upload URLs, so when running against an older Cloud server that doesn’t return per_node_db_url the CLI still spends time generating the SQLite file only to warn and discard it. If the goal is graceful backward compatibility, consider fetching upload_urls first and skipping emission entirely when per_node_db_url is absent (still uploading cll_map.json / cll_cache.db).
There was a problem hiding this comment.
Addressed in 6766eca. Upload URLs are now fetched before emission, and per_node.db is emitted only when per_node_db_url is present in the response. Against an older Cloud without the key, the CLI now prints "No per_node_db_url in upload URLs (Cloud server may need update) — skipping per_node.db emit" and doesn't do the emit work at all. cll_map.json and cll_cache.db still upload independently.
| The default rollback-journal mode is used rather than WAL: this file is | ||
| written once by a single process and immediately uploaded, so the main | ||
| file must be self-contained at close time. WAL would require an explicit | ||
| truncating checkpoint to guarantee that, with no benefit for a one-shot | ||
| writer that has no concurrent readers. |
There was a problem hiding this comment.
The PR description states per_node.db is written in WAL mode, but the implementation (and tests) deliberately rely on rollback-journal mode to avoid -wal/-shm sidecars. Please reconcile the PR description with the actual behavior so reviewers/users don’t get conflicting information.
There was a problem hiding this comment.
Fixed in 6766eca — updated the PR description to say rollback-journal mode, noting that the main file is self-contained at close (no -wal/-shm sidecars) so the uploader can't miss them.
| - ``manifest`` must be a raw dict (e.g. ``json.load(manifest.json)``). | ||
| - ``catalog`` may be ``None``; when absent, all ``data_type`` are ``None``. | ||
| - Column names come from the catalog when available; otherwise from the | ||
| manifest's ``columns`` section (tests-as-columns). |
There was a problem hiding this comment.
The extract_rows_from_artifacts docstring says columns come from the manifest's columns section "(tests-as-columns)", but this function is actually reading node.get("columns") (model/source column metadata), not tests. Please update the docstring to describe the real behavior to avoid confusing future maintainers.
| manifest's ``columns`` section (tests-as-columns). | |
| node's manifest ``columns`` metadata for model/source columns. | |
| - Tests are derived separately into ``node_tests`` rows via ``_derive_tests``. |
There was a problem hiding this comment.
Fixed in 6766eca. Docstring now reads:
- Column names come from the catalog when available; otherwise from each
node's ``columns`` metadata in the manifest (model/source column
definitions). Tests are NOT part of columns — they are derived
separately into ``node_tests`` rows via ``_derive_tests``.
No more "tests-as-columns" — they're clearly two separate row streams.
Updated Review — PR #1334SummaryHostile-pass review of the Status of prior review findings (from earlier comment): all 3 Criticals and 4 Warnings appear genuinely fixed in
Verified locally at head: But one new correctness bug has surfaced and is unaddressed, plus two documentation drifts that Copilot also flagged. Findings[Critical] Phantom
|
gcko
left a comment
There was a problem hiding this comment.
Claude Code Review (hostile pass): Critical issue found. See #1334 (comment) for details.
Prior 3 Criticals + 4 Warnings from the earlier review were genuinely fixed in 29bbc993 (verified locally — 62 passed, 5 skipped; flake8 clean). But one new correctness bug surfaced:
- Phantom
env='base'rows when only one env has artifacts (recce/cli.py:700-717). Whenhas_base == False,context_kwargsswapstarget_base_pathto point attarget_pathsoload_contextdoes not fail, leavingbase_manifestnon-Nonebut identical tocurr_manifest. The emit loop only guards onif manifest is None: continue, so every row is emitted twice with a fakeenv='base'label. Copilot independently flagged this (r3135356810, still unresolved). All 7 CLI tests setbase_manifest = Noneon the mock adapter, short-circuiting the path production hits.
Two doc drifts also carried over from the Copilot pass and remain open: PR description still says "WAL mode" after 29bbc99 dropped it, and the extract_rows_from_artifacts docstring still has a misleading (tests-as-columns) parenthetical.
Generated by Claude Code
Critical: - Phantom env='base' rows when only one env has artifacts. The emit loop keyed off `manifest is None`, but in the single-env path-swap case both curr_manifest and base_manifest are loaded pointing at the SAME env — so the missing-env guard never fired and we shipped duplicate rows under a fabricated env label. The emit loop now uses `has_target` / `has_base`, matching the CLL cache loop's truth source. Added a CLI test that reproduces the path-swap scenario and asserts no env='base' rows ship. Addressed Copilot suggestion: - Skip per_node.db emission entirely when Cloud doesn't advertise `per_node_db_url`. Fetch upload URLs first; only emit the SQLite file when there's somewhere to upload it. Saves 1-2s of pure-waste emit work against older Cloud servers during the migration window. Docs: - `extract_rows_from_artifacts` docstring no longer says "tests-as-columns" — columns come from each node's manifest `columns` metadata; tests are derived separately into `node_tests` via `_derive_tests`. - Updated PR description: schema is rollback-journal (not WAL) mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: even-wei <evenwei@infuseai.io>
Critical: phantom
|
Updated Review (round 3) — PR #1334SummaryHostile re-pass at head Verified locally: Round 2 findings — all addressed
Refactor-review: new upload blockThe fix pulled the emit step inside the upload block and made it conditional on What still works:
[Nit] Non-
|
gcko
left a comment
There was a problem hiding this comment.
Claude Code Review (hostile re-pass, round 3): All prior Criticals and Warnings are legitimately addressed in 6766eca. See #1334 (comment) for details.
- Phantom
env='base'bug fixed —envs_to_emitnow built fromhas_target/has_base(notmanifest is None), andtest_no_base_rows_when_only_target_has_artifactsreproduces the production path-swap scenario withcurr_manifestandbase_manifestset to the same object. - PR description corrected (rollback-journal, not WAL).
extract_rows_from_artifactsdocstring corrected.- Refactor pulled emit inside the upload block and made it conditional on
per_node_db_url— saves wasted emit against older Cloud servers.
Two non-blocking nits noted (non-RequestException during upload no longer degrades to warnings; redundant if cloud_client: superstition check) — neither blocks merge.
Verified: 63 passed, 5 skipped; flake8 clean.
Generated by Claude Code
PR checklist
What type of PR is this?
feat— new precompute artifact emitted byrecce init --cloud.What this PR does / why we need it:
Adds a new
per_node.dbSQLite artifact to therecce init --cloudoutput. Given dbt manifest + catalog (both envs), emits a per-session SQLite file keyed by(node_id, env)with denormalized columns/edges/tests tables. Upcoming Recce Cloud work (Linear B1 / DRC-3297) will serve/models/{id}from this file instead of proxying to an ephemeral instance — eliminating a recurring class of Snowflake002003 / 42S02500s caused by dropped PR base schemas (see DRC-3294 for the full incident + design).Local (non-cloud)
recce initbehavior is unchanged.Please review commit-by-commit (Commits tab):
0bf28753recce/util/per_node_db.py) + 8 unit tests. No CLI wiring.cb8d74f1recce init --cloud. Initial attempt also suppressedcll_cache.dbin cloud mode — turned out to be wrong.eff86d6ecll_cache.dbmust still be uploaded in cloud mode since it serves cross-session warm-cache reuse.per_node.dbis the only new thing produced, scoped to its own throwaway tempdir.Final behavior
recce init --cloudnow uploads:cll_map.json+cll_cache.db(unchanged) +per_node.db(new)per_node.db:tempfile.mkdtemp(prefix="recce-per-node-")— genuinely throwaway per-task artifactper_node_db_urlpresigned URL keyshutil.rmtree'd unconditionally in atry/finally(covers success, upload failure, andget_upload_urlsraising)cll_cache.db: unchanged — stays at~/.recce/cll_cache.db(or--cache-db), still uploaded viacll_cache_upload_url. Cross-session warm-cache behavior preserved for both OSS local users and Cloud sessions.Schema (
per_node.db, 5 tables, rollback-journal mode — no-wal/-shmsidecars, so the main file is self-contained at close and upload):meta(k/v version tracking)nodesPK(node_id, env)— name, resource_type, package_name, raw_code, compiled_code, primary_key, node_json BLOB (JSON-encoded manifest dict)columnsPK(node_id, env, column_name)— data_type (from catalog when present, else NULL)edgesPK(parent_id, child_id, env)— withidx_edges_child_envfor reverse lookupsnode_testsPK(node_id, env, column_name, test_name)— withtest_typeparsed fromnot_null_/unique_prefixesWhich issue(s) this PR fixes:
DRC-3295 (A1 of DRC-3294, project "Lineage API Data Transfer Optimization" Phase III).
Special notes for your reviewer:
node_jsonuses JSON-encoded bytes, not msgpack —msgpackisn't inpyproject.tomland I didn't want to add a dep. Isolated behind a private helper for future swap.api_server.recce_cloudneeds to addper_node_db_urltoget_upload_urls_by_session_id's response (same pattern ascll_map_url, content typeapplication/octet-stream). Tracked on DRC-3295 as prerequisite for B1. CLI handles missing key gracefully so OSS can ship independently.tests/test_per_node_db.pyandtests/test_cli_per_node_db.pyrespectively — including contract parity vs.DbtAdapter.get_model()on the jaffle_shop fixture. Full suite: 1106 passed, 5 unchanged pre-existing SPA failures (test_spa_route_*— requires a built frontend, unrelated).Does this PR introduce a user-facing change?:
Cloud users (implicit): running
recce init --cloudnow uploads an additionalper_node.dbartifact.cll_cache.dbupload behavior unchanged. Nothing breaks — Cloud side consumes the new file once the URL key is wired.Local users: no change.
NONE user-visible, but the cloud artifact set is now a strict superset.