Skip to content

feat: precompute per_node.db in recce init --cloud [DRC-3295]#1334

Merged
even-wei merged 8 commits into
mainfrom
feature/drc-3295-pr1-per-node-db-emitter
Apr 24, 2026
Merged

feat: precompute per_node.db in recce init --cloud [DRC-3295]#1334
even-wei merged 8 commits into
mainfrom
feature/drc-3295-pr1-per-node-db-emitter

Conversation

@even-wei
Copy link
Copy Markdown
Contributor

@even-wei even-wei commented Apr 23, 2026

PR checklist

  • Ensure you have added or ran the appropriate tests for your PR.
  • DCO signed

What type of PR is this?

feat — new precompute artifact emitted by recce init --cloud.

What this PR does / why we need it:

Adds a new per_node.db SQLite artifact to the recce init --cloud output. Given dbt manifest + catalog (both envs), emits a per-session SQLite file keyed by (node_id, env) with denormalized columns/edges/tests tables. Upcoming Recce Cloud work (Linear B1 / DRC-3297) will serve /models/{id} from this file instead of proxying to an ephemeral instance — eliminating a recurring class of Snowflake 002003 / 42S02 500s caused by dropped PR base schemas (see DRC-3294 for the full incident + design).

Local (non-cloud) recce init behavior is unchanged.

Please review commit-by-commit (Commits tab):

# Commit What it does
1 0bf28753 Pure SQLite emitter module (recce/util/per_node_db.py) + 8 unit tests. No CLI wiring.
2 cb8d74f1 Wires the emitter into recce init --cloud. Initial attempt also suppressed cll_cache.db in cloud mode — turned out to be wrong.
3 eff86d6e Fixes commit 2: cll_cache.db must still be uploaded in cloud mode since it serves cross-session warm-cache reuse. per_node.db is the only new thing produced, scoped to its own throwaway tempdir.

Final behavior

recce init --cloud now uploads: cll_map.json + cll_cache.db (unchanged) + per_node.db (new)

per_node.db:

  • Written to tempfile.mkdtemp(prefix="recce-per-node-") — genuinely throwaway per-task artifact
  • Uploaded via a new per_node_db_url presigned URL key
  • Gracefully degrades if Cloud hasn't added the key yet (logs warning, exits 0 — emission is skipped entirely against older Cloud, not just discarded)
  • Tempdir shutil.rmtree'd unconditionally in a try/finally (covers success, upload failure, and get_upload_urls raising)

cll_cache.db: unchanged — stays at ~/.recce/cll_cache.db (or --cache-db), still uploaded via cll_cache_upload_url. Cross-session warm-cache behavior preserved for both OSS local users and Cloud sessions.

Schema (per_node.db, 5 tables, rollback-journal mode — no -wal/-shm sidecars, so the main file is self-contained at close and upload):

  • meta (k/v version tracking)
  • nodes PK (node_id, env) — name, resource_type, package_name, raw_code, compiled_code, primary_key, node_json BLOB (JSON-encoded manifest dict)
  • columns PK (node_id, env, column_name) — data_type (from catalog when present, else NULL)
  • edges PK (parent_id, child_id, env) — with idx_edges_child_env for reverse lookups
  • node_tests PK (node_id, env, column_name, test_name) — with test_type parsed from not_null_ / unique_ prefixes

Which issue(s) this PR fixes:

DRC-3295 (A1 of DRC-3294, project "Lineage API Data Transfer Optimization" Phase III).

Special notes for your reviewer:

  • Commits kept separate on purpose — commit 3 tells reviewers what the cache.db constraint actually is. Happy to squash before merge if you prefer linear history.
  • node_json uses JSON-encoded bytes, not msgpack — msgpack isn't in pyproject.toml and I didn't want to add a dep. Isolated behind a private helper for future swap.
  • Cloud-side follow-up: api_server.recce_cloud needs to add per_node_db_url to get_upload_urls_by_session_id's response (same pattern as cll_map_url, content type application/octet-stream). Tracked on DRC-3295 as prerequisite for B1. CLI handles missing key gracefully so OSS can ship independently.
  • Tests: 8 + 7 new in tests/test_per_node_db.py and tests/test_cli_per_node_db.py respectively — including contract parity vs. DbtAdapter.get_model() on the jaffle_shop fixture. Full suite: 1106 passed, 5 unchanged pre-existing SPA failures (test_spa_route_* — requires a built frontend, unrelated).

Does this PR introduce a user-facing change?:

Cloud users (implicit): running recce init --cloud now uploads an additional per_node.db artifact. cll_cache.db upload behavior unchanged. Nothing breaks — Cloud side consumes the new file once the URL key is wired.

Local users: no change.

NONE user-visible, but the cloud artifact set is now a strict superset.

Introduces recce/util/per_node_db.py: a dependency-light SQLite emitter
module for DRC-3295 Phase III (lineage API data transfer optimization).
The emitter reads a dbt manifest (and optional catalog) and produces a
per_node.db SQLite file with meta/nodes/columns/edges/node_tests tables,
WAL journaling, and user_version = SCHEMA_VERSION.

This is PR 1 of 3 (stacked):
  - PR 1 (this): pure emitter module + unit tests, no CLI wiring.
  - PR 2: wire the emitter into `recce init --cloud`.
  - PR 3: scale tests against real jaffle-shop-expand artifacts.

Primary-key and test derivation mirror DbtAdapter.get_model()
(recce/adapter/dbt_adapter/__init__.py:447-499) so PR 2 can wire the
emitter into the existing upload path with no behavioural drift.

node_json uses JSON bytes rather than msgpack because msgpack is not
currently a runtime dep of recce; the encoder is isolated behind a
private helper so we can swap it later without touching the schema.

Tested in isolation with 8 unit tests (schema/WAL, round-trips for each
table, primary_key derivation, null catalog, empty manifest). No CLI
or cloud code is touched.

Refs: DRC-3295

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 90.17857% with 88 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
tests/test_cli_per_node_db.py 85.54% 75 Missing ⚠️
recce/cli.py 89.88% 9 Missing ⚠️
recce/util/per_node_db.py 97.27% 4 Missing ⚠️
Files with missing lines Coverage Δ
tests/test_cli_cache.py 99.34% <ø> (ø)
tests/test_per_node_db.py 100.00% <100.00%> (ø)
recce/util/per_node_db.py 97.27% <97.27%> (ø)
recce/cli.py 66.13% <89.88%> (+0.18%) ⬆️
tests/test_cli_per_node_db.py 85.54% <85.54%> (+1.30%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

PR 2/3 of DRC-3295. Wires PR 1's per-node SQLite emitter into the
`recce init --cloud` command so Cloud can stream lineage rows without
proxying to an ephemeral Recce instance.

Key changes:
- In cloud mode, write `cll_cache.db` to a `tempfile.mkdtemp(prefix=
  "recce-cll-")` scratch dir so build_full_cll_map keeps its warm-cache
  perf, but the cache.db never leaves the container (ECS task GC on
  exit). A user-provided --cache-db is intentionally ignored in cloud
  mode.
- After build_full_cll_map(), emit per_node.db (manifest + catalog rows,
  not_null/unique tests, primary_key, edges) to the same scratch dir
  and upload it via the new `per_node_db_url` upload key.
- Graceful degradation: if Cloud has not yet added per_node_db_url, log
  a warning and continue — old CLI + new Cloud and vice versa stay
  compatible.
- Remove the cll_cache.db cloud upload block; cache.db is now local-only
  in cloud mode.
- Clean up the scratch dir on successful upload; leave it on failure for
  debugging (ECS reclaims it anyway).

Tests:
- tests/test_cli_per_node_db.py: integration tests for per_node.db
  upload, local-mode non-regression, tempdir cleanup, mode-switching
  safety, and a contract test vs. DbtAdapter.get_model() using the
  jaffle_shop fixture (manifest + catalog round-trip → reconstructed
  get_model() payload matches live adapter output).
- Update test_cli_cache.py::test_init_cloud_upload_partial_failure to
  use per_node_db_url instead of the removed cll_cache_url upload path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
Original PR 2 removed the cll_cache.db cloud upload path
under the mistaken assumption that cache.db was unwanted
in cloud mode. cache.db is load-bearing for cross-session
warm-cache reuse (OSS local users and cloud sessions alike
— downloaded at init, uploaded at completion, reused by
build_full_cll_map). Restore the cll_cache_upload_url path
and return cache_db to its pre-PR-2 default path.

per_node.db remains tempdir-scoped (genuinely throwaway
per-task artifact) and continues to upload via
per_node_db_url.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei even-wei changed the title feat(cll): add per-node SQLite emitter for cloud precompute [PR 1/3] feat: precompute per_node.db in recce init --cloud [DRC-3295] Apr 23, 2026
PR 3 scope for DRC-3295. Gated via RECCE_SCALE_TESTS=1 and marked
`@pytest.mark.scale` so CI and default `make test` runs skip it.
Validates size + extraction/query latency on the 1918-node
jaffle-shop-expand project; skips gracefully if the on-disk fixture
directory is absent.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei
Copy link
Copy Markdown
Contributor Author

Code Review: PR #1334 (paired with DataRecce/recce-cloud-infra#1239)

Scope: feat — precompute per_node.db in recce init --cloud.
Files reviewed: 6 (pyproject.toml, recce/cli.py, recce/util/per_node_db.py, tests/test_cli_cache.py, tests/test_cli_per_node_db.py, tests/test_per_node_db.py).
Categories: business logic, CLI integration, tests, config. Passes run: A, C, D, E, F, G.

Reviewed alongside the cloud-infra side (PR #1239) for contract parity.

Validation Results

Pass A: Correctness & Logic — PASS

  • Emitter transactions: __enter__ commits schema DDL; __exit__ commits on clean exit and rolls back on exception. _txn() is a connection accessor (not a real SQLite transaction — SQLite autocommits outside BEGIN).
  • _iter_manifest_nodes iterates nodes section which includes tests/seeds/snapshots — intentional, consumers filter by resource_type.
  • _derive_tests_and_primary_key correctly parses not_null_<name>_<col> / unique_<name>_<col> children — mirrors DbtAdapter.get_model as documented at recce/util/per_node_db.py:258-262.
  • Column merge in extract_rows_from_artifacts is deterministic: catalog first (authoritative dtype), manifest after for unseen names.
  • PRAGMA user_version = {SCHEMA_VERSION} uses an int module constant — no injection risk.
  • All INSERT OR REPLACE statements use ? placeholders.

Pass C: Cross-Reference Consistency — PASS

  • API field per_node_db_url (api_server/apis/sessions_api.py:135) matches CLI .get("per_node_db_url") (recce/cli.py in commit cb8d74f1).
  • S3 key per_node.db matches CLI tempfile name per_node_scratch / "per_node.db".
  • _UPLOAD_TIMEOUT and requests.put follow the same pattern used for cll_cache.db / cll_map.json uploads — no drift.
  • Deprecated snapshot routes forward via await get_session_download_url(...), inheriting the new field automatically — no orphan compat shim.

Pass D: Error Handling & Edge Cases — PASS (with one NOTE)

  • Emit failure falls through cleanly: per_node_db_path = None → upload loop skips it.
  • Upload failure is caught per-artifact, appended to upload_failures, non-fatal.
  • NOTE: if get_upload_urls_by_session_id raises, the outer except Exception at the cloud-upload block sets upload_succeeded = False and the per_node_scratch tempdir is not removed. That's a tempdir leak in tempfile.gettempdir() on each failed upload. Trivial to fix by moving cleanup out of the upload_succeeded guard (or letting OS cleanup handle it — /tmp is bounded enough to not block merge).

Pass E: Test Coverage & Quality — PASS (with one ISSUE)

  • Emitter: 8 unit tests covering schema, write ops, rollback, WAL, meta, each row type. All pass (tests/test_per_node_db.py).
  • CLI integration: 7 tests covering cloud upload, missing per_node_db_url (graceful warning), scratch cleanup, local-mode no-op, mode switching. All pass.
  • Contract test against live DbtAdapter.get_model() structurally verifies parity for columns / primary_key / raw_code / test flags — excellent depth.
  • tests/test_cli_cache.py:1258 comment-only update clarifying which upload is asserted.

ISSUE — hardcoded user-specific path in scale test (tests/test_cli_per_node_db.py:685):

_SCALE_FIXTURE_DIR = Path("/Users/evenwei/InfuseAI/workspace/jaffle-shop-expand-main")

This path exists only on the author's workstation. The test is gated behind RECCE_SCALE_TESTS=1 and skips when the directory is missing, so it will never execute for other contributors or CI — which makes the scale test effectively dead code upstream. Suggest replacing with an env-var override with sensible fallback, e.g.

_SCALE_FIXTURE_DIR = Path(os.getenv("RECCE_SCALE_FIXTURE_DIR", Path(__file__).parent / "fixtures" / "jaffle-shop-expand"))

Non-blocking; functional today, but the test is a permanent dead path for everyone else.

Pass F: Diff-Specific Checks — PASS

  • tests/test_cli_cache.py kept in sync (comment nuance only).
  • pyproject.toml adds the scale pytest marker — consistent with @pytest.mark.scale gating in the scale class.
  • PR splits well across 4 commits; commit 3 (eff86d6e) correctly restores cll_cache.db upload after commit 2's incorrect suppression — the "why" the cache must keep uploading is captured both in the commit message and inline comment at recce/cli.py.

Pass G: Performance — PASS

  • executemany for bulk inserts; single connection kept open across writes.
  • WAL mode enables concurrent readers (future Cloud worker reads).
  • Scale budgets (test_db_file_size_under_budget, test_point_read_p95_under_5ms, test_range_query_under_50ms) are concrete and reproduce-able when fixture is available.

Additional Notes

  • NOTE: Column case-sensitivity — if catalog has CUSTOMER_ID (Snowflake/BigQuery convention) and manifest has customer_id, both will be inserted as separate rows under PK (node_id, env, column_name). Mirrors existing DbtAdapter behavior so it isn't a regression, but future Cloud consumers of per_node.db should case-fold on lookup (or a later PR can normalize at write time).
  • NOTE: _txn() name suggests a DB transaction; it's really a guarded connection accessor. Cosmetic.

Verification Results

  • black --check — clean (all 4 files).
  • flake8 — clean.
  • pytest tests/test_per_node_db.py tests/test_cli_per_node_db.py tests/test_cli_cache.py55 passed, 5 skipped (scale + contract-w/-fixture skips). No unexpected failures.

Verdict: GO

Notes

  1. Tempdir leak on upload-URL fetch failure — when get_upload_urls_by_session_id raises, recce-per-node-* tempdir survives. Non-blocking.
  2. Hardcoded scale-fixture pathtests/test_cli_per_node_db.py:685 uses /Users/evenwei/...; dead for all other contributors. Suggest env-var override.
  3. Column case-folding — consumers of per_node.db should be aware of case-sensitive column rows (Snowflake/BigQuery).
  4. _txn() naming — cosmetic; not a real transaction guard.

Ready to merge after the cloud-infra companion (#1239) lands, since the CLI gracefully tolerates either merge order.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new per_node.db SQLite artifact emitted by recce init --cloud, precomputing per-node lineage/model data from dbt artifacts so Recce Cloud can serve model/lineage APIs without proxying to an ephemeral Recce instance.

Changes:

  • Introduces recce/util/per_node_db.py to extract rows from dbt manifest/catalog and write a denormalized SQLite schema.
  • Wires per-node DB emission + upload into recce init --cloud while preserving existing cll_cache.db upload behavior.
  • Adds unit + integration/contract tests (including an opt-in scale test marker).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
recce/util/per_node_db.py New SQLite emitter + artifact-to-row extraction logic.
recce/cli.py Emits per_node.db in cloud init and uploads it via per_node_db_url.
tests/test_per_node_db.py Unit tests for schema creation and row extraction/writes.
tests/test_cli_per_node_db.py Cloud/local CLI integration tests + contract test vs DbtAdapter.get_model() + opt-in scale tests.
tests/test_cli_cache.py Updates a comment to reflect cll_cache.db upload failure scenario.
pyproject.toml Adds scale pytest marker registration.
Comments suppressed due to low confidence (1)

recce/cli.py:367

  • per_node_scratch tempdir is created before validating --cloud-token / --session-id, but the function calls exit(1) on missing/invalid inputs. That will leak the recce-per-node-* directory on those early-exit paths. Create the tempdir only after all required cloud args/session checks succeed, or ensure it is cleaned up via a try/finally (or Click exception handling) on all exit paths.
    if is_cloud:
        from recce.util.recce_cloud import RecceCloud, RecceCloudException

        per_node_scratch = Path(tempfile.mkdtemp(prefix="recce-per-node-"))

        cloud_token = kwargs.get("cloud_token") or kwargs.get("api_token")
        if not cloud_token:
            console.print("[[red]Error[/red]] --cloud requires --cloud-token or --api-token (or GITHUB_TOKEN env var).")
            exit(1)

Comment thread tests/test_cli_per_node_db.py Outdated
Comment thread recce/util/per_node_db.py Outdated
Copy link
Copy Markdown
Contributor

gcko commented Apr 24, 2026

Code Review — PR #1334

Summary

Adds a per_node.db SQLite artifact emitted by recce init --cloud. The emitter module is well-scoped and the happy-path tests pass, but several issues are untested and a few are actively incorrect on realistic inputs. Issues found — not ready as-is.

Verified locally: pytest tests/test_per_node_db.py tests/test_cli_per_node_db.py tests/test_cli_cache.py → 55 passed, 5 skipped. flake8 clean on all modified files.


Findings

[Critical] primary_key derivation silently disagrees with DbtAdapter.get_model() when a model has multiple unique tests

File: recce/util/per_node_db.py:290-302
Issue: The module claims to "mirror DbtAdapter.get_model" (docstring at recce/util/per_node_db.py:258-262), but the tiebreaker differs:

  • Adapter (recce/adapter/dbt_adapter/__init__.py:477-485) picks the first column in warehouse column order (self.get_columns()) that has a unique_ test.
  • Emitter picks the first child in manifest child_map iteration order that has a unique_ test prefix.

These orderings are unrelated. On any model with ≥2 unique tests (composite candidate keys, surrogate-id migrations, etc.), the two code paths will pick different primary_key values. The contract test (tests/test_cli_per_node_db.py:472-540) does not catch this because jaffle_shop has at most one unique test per model.

Suggestion: Either (a) defer primary_key to read time by computing it from column_tests against a real column order source, or (b) add a test with a model having two unique tests, pick the expected value per the adapter's rule (first in warehouse column order), and update the emitter to match — which likely means storing all unique columns and resolving primary_key at the consumer side.


[Critical] Scratch tempdir leaks on every cloud upload failure path

File: recce/cli.py:833-838
Issue: Cleanup is gated on upload_succeeded:

finally:
    if upload_succeeded and per_node_scratch is not None:
        shutil.rmtree(per_node_scratch, ignore_errors=True)

upload_succeeded is only set to True when upload_failures is empty and no exception was caught. That means:

  • get_upload_urls_by_session_id raising → leak
  • Any single requests.put 500 / timeout → leak (upload_failures non-empty)
  • Missing per_node_db_url (graceful-degradation path) actually doesn't leak, but only by accident — if any other upload fails it does
  • Any uncaught exception inside the try → leak

On a Cloud deploy that exercises this code path repeatedly (scheduled cron, retried failures), /tmp fills with multi-MB recce-per-node-* directories that nothing reaps. Self-review dismisses this as "trivial" and "OS will clean /tmp" — it won't on long-lived containers or servers where /tmp is persistent disk.

Suggestion: Always clean up the scratch (it's ours, it's throwaway by definition):

finally:
    if per_node_scratch is not None:
        shutil.rmtree(per_node_scratch, ignore_errors=True)

[Critical] Scale test is permanently dead for everyone except the author

File: tests/test_cli_per_node_db.py:685
Issue: _SCALE_FIXTURE_DIR = Path("/Users/evenwei/InfuseAI/workspace/jaffle-shop-expand-main"). Even with RECCE_SCALE_TESTS=1, the class fixture skips when the directory is missing (tests/test_cli_per_node_db.py:698-699), so the five scale tests never execute on CI, other contributors, or any machine that is not the author's workstation. The P95/size/row-count budgets that are the stated goal of PR 3 ship as dead code upstream.

Suggestion: Env-var override with a committed-fixture fallback, e.g.:

_SCALE_FIXTURE_DIR = Path(
    os.getenv("RECCE_SCALE_FIXTURE_DIR")
    or Path(__file__).parent / "fixtures" / "jaffle-shop-expand"
)

Optionally commit a slim synthetic manifest so the scale class actually exercises something in CI with RECCE_SCALE_TESTS=1.


[Warning] WAL sidecars are not checkpointed before upload

File: recce/util/per_node_db.py:156-172
Issue: __enter__ sets PRAGMA journal_mode = WAL, which persists on the file. __exit__ calls self._conn.commit() then self._conn.close(). On the last connection close SQLite normally runs a passive checkpoint, but it is not guaranteed to be a truncating checkpoint — residual data can remain in per_node.db-wal on some platforms / SQLite builds. The uploader only PUTs per_node.db; the -wal and -shm sidecars are discarded. On the Cloud read side this will be fine in the common case and silently wrong on the edge case where the main file is missing pages that were only reachable via WAL.

For a one-shot writer that is immediately uploaded, there is no reason to use WAL — it only helps concurrent readers on the same file. Either:

  1. Drop WAL for this writer (journal_mode = DELETE or leave default), or
  2. Explicitly PRAGMA wal_checkpoint(TRUNCATE) before close().

tests/test_per_node_db.py:47-66 asserts WAL mode is set but does not verify the main file is self-contained after close — so a silent breakage here would ship.


[Warning] assert per_node_scratch is not None is stripped under python -O

File: recce/cli.py:691
Issue: Using assert for a runtime invariant check is a footgun: python -O (and PYTHONOPTIMIZE=1) strips assertions, and in that mode per_node_db_path = per_node_scratch / "per_node.db" on the next line becomes None / "per_node.db"TypeError. Recce isn't commonly run under -O, but packaging or third-party wheels may enable it.

Suggestion: if per_node_scratch is None: raise RuntimeError(...) or restructure so per_node_scratch is assigned unconditionally inside the is_cloud branch (which it already is — the assert is redundant and fragile).


[Warning] Rollback branch of PerNodeDbWriter.__exit__ is untested

File: recce/util/per_node_db.py:163-172
Issue: __exit__ has two branches (commit on clean, rollback on exception), but tests/test_per_node_db.py has no test that raises inside the with block and asserts no rows persisted. A regression in the rollback path — or in the self._conn.close() call after rollback — would ship silently.

Suggestion: Add a test like:

def test_rollback_on_exception(db_path):
    with pytest.raises(RuntimeError):
        with PerNodeDbWriter(db_path) as w:
            w.write_nodes([_node_row("model.pkg.a")])
            raise RuntimeError("boom")
    conn = _open(db_path)
    try:
        assert conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0] == 0
    finally:
        conn.close()

[Warning] No test exercises the CLI emit-failure graceful-degradation path

File: recce/cli.py:727-730, tests/test_cli_per_node_db.py
Issue: The try/except around the emit (warns, sets per_node_db_path = None, continues) is untested. None of the 7 CLI tests simulates PerNodeDbWriter raising (disk full, corrupt manifest). If a regression reversed this to propagate the exception, cloud init would start failing outright and no test catches it.

Suggestion: One test patching PerNodeDbWriter to raise on __enter__/a write, asserting exit_code == 0, the warning text appears, and cll_cache.db/cll_map.json still upload.


[Warning] Scratch-cleanup-on-failure is untested; the leak bug above lives here

File: tests/test_cli_per_node_db.py:260-336
Issue: The one "cleanup" test (test_per_node_scratch_cleaned_cache_preserved) only covers the happy path where all uploads succeed. There is no test for cleanup when upload_failures is non-empty or when get_upload_urls_by_session_id raises — which is exactly the failure-mode pair where the tempdir currently leaks. Adding a test here would have caught the Critical finding on cleanup above before the PR opened.

Suggestion: Add a test that makes requests.put return HTTP 500 for the per_node.db URL and asserts the scratch dir is cleaned anyway.


Additional Notes

  • compiled_code=None for source nodes is inserted into nodes row (line recce/util/per_node_db.py:335-337). Consumers need to know that sources/exposures/etc. don't round-trip raw_code meaningfully. Worth documenting at the SQL consumer side.
  • manifest.to_dict() (recce/cli.py:711) is called once per env — fine in this PR, but note that on very large manifests this is a significant allocation. The emitter could accept a pre-materialised dict to let callers share one.

Verdict

Issues found — request changes. The two Critical findings (primary_key divergence and tempdir leak) are real bugs that the test suite doesn't catch. The dead scale-fixture path is a separate Critical — it ships tests that can never run for anyone downstream. The Warnings are untested code paths that should have tests before this lands.


Generated by Claude Code

Copy link
Copy Markdown
Contributor

@gcko gcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review: Critical issues found — 3 Critical + 4 Warning. See #1334 (comment) for details.

Headline findings: (1) primary_key tiebreaker in per_node_db.py disagrees with DbtAdapter.get_model() on models with multiple unique tests; (2) scratch tempdir leaks on every cloud upload failure path; (3) scale test fixture is hardcoded to the author's workstation and never runs for anyone else.


Generated by Claude Code

Critical fixes:
- primary_key divergence from DbtAdapter.get_model: derive primary_key by
  scanning columns in catalog order (which matches warehouse column order
  from `dbt docs generate`) rather than manifest child_map iteration order.
  Adds tests with multiple unique tests to pin the tiebreaker.
- Scratch tempdir leaked on every cloud upload failure path. Move creation
  to right before the emit block, wrap emit+upload in try/finally, and
  unconditionally `shutil.rmtree` the scratch. Adds tests covering HTTP 500
  on per_node.db PUT and get_upload_urls_by_session_id raising.
- Scale-fixture path was hardcoded to a developer workstation, making the
  scale class dead code for everyone else. Read RECCE_SCALE_FIXTURE_DIR
  with a committed-fallback at tests/fixtures/jaffle-shop-expand.

Warnings:
- Dropped `PRAGMA journal_mode = WAL` on the writer. One-shot writers that
  are immediately uploaded as a single file must be self-contained at close
  time; WAL would require an explicit truncating checkpoint for that and
  offers no benefit without concurrent readers. Test now asserts rollback
  journal mode and that no -wal/-shm sidecars survive.
- Removed `assert per_node_scratch is not None` (stripped under `python -O`)
  in favor of structurally guaranteeing the assignment inside `if is_cloud`.
- Added a rollback test for PerNodeDbWriter.__exit__.
- Added a CLI test for the emit-failure graceful-degradation path.

Docstring: dropped hard-coded line numbers pointing at DbtAdapter.get_model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei
Copy link
Copy Markdown
Contributor Author

@gcko thanks for the careful pass — all three Criticals and the four Warnings addressed in 29bbc99.

Critical findings

1. primary_key divergence from DbtAdapter.get_model(). You were right that the two code paths disagreed on models with ≥2 unique tests. Refactor:

  • Split _derive_tests_and_primary_key into _derive_tests + _pick_primary_key.
  • extract_rows_from_artifacts now builds an ordered column list (catalog first, then manifest) and picks primary_key by scanning that list — the same order the adapter walks via get_columns(), since catalog is generated from warehouse queries by dbt docs generate.
  • All unique test rows are still persisted in node_tests, so consumers that prefer their own tiebreaker can re-derive from warehouse column order directly.

Two new tests in tests/test_per_node_db.py:

  • test_primary_key_follows_catalog_column_order_with_multiple_unique_tests — model with unique tests on a and b, child_map lists b first, catalog lists a first. Asserts primary_key == "a". This is the case the old code got wrong.
  • test_primary_key_falls_back_to_manifest_column_order_without_catalog — same shape but no catalog; asserts manifest column order wins over child_map order.

2. Scratch tempdir leak on upload failure. Fully agree — the upload_succeeded guard was defensive-theatre and actively harmful under retry pressure. Fix:

  • per_node_scratch is now created right before the emit block (after all auth validation), so early exit(1) paths never create it at all.
  • Emit + upload are wrapped in one try/finally that unconditionally shutil.rmtree's the scratch.

Three new tests in tests/test_cli_per_node_db.py::TestCloudModeEmitsPerNodeDb:

  • test_scratch_cleaned_even_when_per_node_upload_fails — HTTP 500 on per_node.db PUT, assert scratch is gone.
  • test_scratch_cleaned_when_upload_url_fetch_raisesget_upload_urls_by_session_id throws, assert scratch is gone.
  • test_emit_failure_degrades_gracefullyPerNodeDbWriter raises at construction; CLI exits 0, cll_map.json + cll_cache.db still upload, no PUT to per_node.db URL.

3. Dead scale fixture path. Replaced the hardcoded path with:

_SCALE_FIXTURE_DIR = Path(
    os.getenv("RECCE_SCALE_FIXTURE_DIR") or (Path(__file__).parent / "fixtures" / "jaffle-shop-expand")
)

Contributors can point RECCE_SCALE_FIXTURE_DIR at any dbt target directory, or drop artifacts into tests/fixtures/jaffle-shop-expand/. Still skips gracefully when neither is populated.

Warnings

4. WAL sidecars not checkpointed before upload. Took your first option — dropped WAL entirely. One-shot writer has no concurrent-reader use case, and the default rollback journal keeps the main file self-contained at close with no extra ceremony. Test now asserts journal_mode != "wal" and a new test_no_wal_sidecars_after_writes verifies no -wal/-shm files survive after writes + close.

5. assert per_node_scratch is not None stripped under python -O. Removed. With the Critical-2 fix, per_node_scratch is unconditionally assigned inside if is_cloud:, so the assert was redundant and fragile. Now the type is structurally guaranteed.

6. Rollback branch of PerNodeDbWriter.__exit__ untested. Added test_rollback_on_exception — raises inside the with block, asserts zero rows persisted.

7. No test for CLI emit-failure graceful degradation. Covered by test_emit_failure_degrades_gracefully above.

8. Scratch cleanup on failure untested. Covered by the two new scratch-cleanup tests above.

Additional notes you raised

  • compiled_code=None for sources: I'll add a note at the SQL consumer side (PR 2/3 cloud-infra) — outside this PR's scope.
  • manifest.to_dict() called once per env: agree it's a significant allocation on large manifests. Left as-is for this PR; a future optimization could hoist it.

Verification

$ uv run pytest tests/test_per_node_db.py tests/test_cli_per_node_db.py tests/test_cli_cache.py
62 passed, 5 skipped in 29.23s
$ make flake8
(clean)

Coverage on recce/util/per_node_db.py is 97% (matches pre-fix baseline).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread recce/cli.py Outdated
Comment on lines +700 to +703
envs_to_emit = [
("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog),
("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog),
]
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cloud mode, envs_to_emit includes ("base", dbt_adapter.base_manifest, ...) whenever base_manifest is non-None. Earlier in init you may intentionally point target_base_path at target_path when base artifacts are missing (so load_context doesn’t fail). In that case base_manifest can exist but actually represents the current env, causing per_node.db to incorrectly contain a fabricated env='base' copy of current data. Consider building envs_to_emit from the has_target / has_base flags (or the earlier envs list) so you only emit rows for environments that truly have artifacts.

Suggested change
envs_to_emit = [
("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog),
("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog),
]
envs_to_emit = []
has_target_artifacts = locals().get("has_target")
has_base_artifacts = locals().get("has_base")
if has_target_artifacts is None:
has_target_artifacts = dbt_adapter.curr_manifest is not None
if has_base_artifacts is None:
has_base_artifacts = dbt_adapter.base_manifest is not None
if has_target_artifacts:
envs_to_emit.append(("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog))
if has_base_artifacts:
envs_to_emit.append(("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog))

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6766eca. The emit loop now builds envs_to_emit from has_target / has_base (matching the CLL cache loop above). This kills the phantom-base-env case where the earlier path-swap leaves both manifests populated but pointing at the same env.

Added a CLI test test_no_base_rows_when_only_target_has_artifacts that reproduces the path-swap outcome (curr_manifest and base_manifest set to the same object, base_download_urls empty) and asserts SELECT COUNT(*) FROM nodes WHERE env='base' is 0 in the uploaded per_node.db.

Comment thread recce/cli.py Outdated
Comment on lines +686 to +692
if is_cloud:
per_node_scratch = Path(tempfile.mkdtemp(prefix="recce-per-node-"))
try:
upload_urls = cloud_client.get_upload_urls_by_session_id(cloud_org_id, cloud_project_id, session_id)

# Upload CLL map
cll_map_upload_url = upload_urls.get("cll_map_url")
if cll_map_upload_url and cll_map_path.is_file():
try:
with open(cll_map_path, "rb") as f:
resp = requests.put(
cll_map_upload_url,
data=f,
headers={"Content-Type": "application/json"},
timeout=_UPLOAD_TIMEOUT,
per_node_db_path: Optional[Path] = per_node_scratch / "per_node.db"
console.print("\n[bold]Emitting per-node SQLite...[/bold]")
t_pn_start = time.perf_counter()
try:
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per_node.db is emitted before fetching upload URLs, so when running against an older Cloud server that doesn’t return per_node_db_url the CLI still spends time generating the SQLite file only to warn and discard it. If the goal is graceful backward compatibility, consider fetching upload_urls first and skipping emission entirely when per_node_db_url is absent (still uploading cll_map.json / cll_cache.db).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6766eca. Upload URLs are now fetched before emission, and per_node.db is emitted only when per_node_db_url is present in the response. Against an older Cloud without the key, the CLI now prints "No per_node_db_url in upload URLs (Cloud server may need update) — skipping per_node.db emit" and doesn't do the emit work at all. cll_map.json and cll_cache.db still upload independently.

Comment thread recce/util/per_node_db.py
Comment on lines +147 to +151
The default rollback-journal mode is used rather than WAL: this file is
written once by a single process and immediately uploaded, so the main
file must be self-contained at close time. WAL would require an explicit
truncating checkpoint to guarantee that, with no benefit for a one-shot
writer that has no concurrent readers.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states per_node.db is written in WAL mode, but the implementation (and tests) deliberately rely on rollback-journal mode to avoid -wal/-shm sidecars. Please reconcile the PR description with the actual behavior so reviewers/users don’t get conflicting information.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6766eca — updated the PR description to say rollback-journal mode, noting that the main file is self-contained at close (no -wal/-shm sidecars) so the uploader can't miss them.

Comment thread recce/util/per_node_db.py Outdated
- ``manifest`` must be a raw dict (e.g. ``json.load(manifest.json)``).
- ``catalog`` may be ``None``; when absent, all ``data_type`` are ``None``.
- Column names come from the catalog when available; otherwise from the
manifest's ``columns`` section (tests-as-columns).
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extract_rows_from_artifacts docstring says columns come from the manifest's columns section "(tests-as-columns)", but this function is actually reading node.get("columns") (model/source column metadata), not tests. Please update the docstring to describe the real behavior to avoid confusing future maintainers.

Suggested change
manifest's ``columns`` section (tests-as-columns).
node's manifest ``columns`` metadata for model/source columns.
- Tests are derived separately into ``node_tests`` rows via ``_derive_tests``.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6766eca. Docstring now reads:

- Column names come from the catalog when available; otherwise from each
  node's ``columns`` metadata in the manifest (model/source column
  definitions). Tests are NOT part of columns — they are derived
  separately into ``node_tests`` rows via ``_derive_tests``.

No more "tests-as-columns" — they're clearly two separate row streams.

Copy link
Copy Markdown
Contributor

gcko commented Apr 24, 2026

Updated Review — PR #1334

Summary

Hostile-pass review of the per_node.db emitter and its recce init --cloud wiring, verified at head a768670d.

Status of prior review findings (from earlier comment): all 3 Criticals and 4 Warnings appear genuinely fixed in 29bbc993:

  • primary_key tiebreaker now walks the same column order (catalog → manifest) the adapter walks. test_primary_key_follows_catalog_column_order_with_multiple_unique_tests exercises the case the old code got wrong.
  • Scratch tempdir cleanup is now unconditional in a single try/finally (recce/cli.py:686-842). New tests cover both HTTP 500 on PUT and get_upload_urls_by_session_id raising.
  • Scale fixture path uses RECCE_SCALE_FIXTURE_DIR env-var with a tests/fixtures/jaffle-shop-expand/ fallback.
  • WAL dropped in favor of the rollback journal; test_no_wal_sidecars_after_writes verifies the main file is self-contained at close.
  • assert per_node_scratch is not None removed.
  • Rollback branch of PerNodeDbWriter.__exit__ covered by test_rollback_on_exception.
  • Emit-failure graceful-degradation covered by test_emit_failure_degrades_gracefully.

Verified locally at head: pytest tests/test_per_node_db.py tests/test_cli_per_node_db.py tests/test_cli_cache.py → 62 passed, 5 skipped. make flake8 clean.

But one new correctness bug has surfaced and is unaddressed, plus two documentation drifts that Copilot also flagged.


Findings

[Critical] Phantom env='base' rows when only one env has artifacts

File: recce/cli.py:700-717 (emit loop) vs. recce/cli.py:527-532 (path-swap fallback)

Issue: Earlier in init, when only one of target/ or target-base/ is populated, the missing env's path is swapped to point at the present env so load_context does not fail:

if has_target and not has_base:
    context_kwargs["target_base_path"] = kwargs.get("target_path", "target")
elif has_base and not has_target:
    context_kwargs["target_path"] = kwargs.get("target_base_path", "target-base")

After this, dbt_adapter.curr_manifest and dbt_adapter.base_manifest are both loaded but represent the same environment. The CLL cache loop at recce/cli.py:557-572 correctly guards with has_target / has_base, so only one env is cached. The per_node.db emit loop does not:

envs_to_emit = [
    ("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog),
    ("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog),
]
for env_name, manifest, catalog in envs_to_emit:
    if manifest is None:
        continue
    ...

manifest is never None in the fallback case (both were loaded from the same path), so the loop emits every row twice — once with env='current' and once with env='base' containing identical data labeled as the missing env. A downstream Cloud consumer of per_node.db would see a fabricated base environment that is actually a copy of current, and whatever UI/API is built on top would present a diff of zero rather than the correct "no base env" state.

In cloud mode, this triggers whenever a session has no base — the get_base_session_download_urls path silently returns empty, base artifacts are not downloaded, has_base == False, path-swap engages, and per_node.db ships with duplicate rows. None of the 7 CLI tests catches this because they all construct a base_manifest = None fixture (tests/test_cli_per_node_db.py:82), which short-circuits the manifest is None guard. The real production path has a non-None base_manifest that happens to be the same object as curr_manifest — the test scaffolding never reproduces this.

Copilot flagged the same issue at discussion r3135356810 and it is still unresolved.

Suggestion: Build envs_to_emit from has_target / has_base, matching the CLL cache loop:

envs_to_emit = []
if has_target:
    envs_to_emit.append(("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog))
if has_base:
    envs_to_emit.append(("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog))

Add a CLI test that emits with has_base == False (e.g. no base manifest in the downloaded target-base/) and asserts SELECT COUNT(*) FROM nodes WHERE env='base' is zero.


[Warning] PR description still advertises WAL mode

File: PR description / recce/util/per_node_db.py:140-152

Issue: The PR body says:

Schema (per_node.db, 5 tables, WAL mode):

but commit 29bbc993 deliberately dropped WAL in favor of the rollback journal to avoid -wal/-shm sidecars the uploader would miss. The test at tests/test_per_node_db.py:59 even asserts journal_mode.lower() != "wal". Future readers of this PR description (and of any downstream docs that get auto-generated from it) will get conflicting information about the on-disk format.

Copilot also flagged this at discussion r3135356832; still unresolved.

Suggestion: Update the PR description to say "rollback-journal mode" (or just "no WAL sidecars so the main file is self-contained at upload") so the merge commit message doesn't capture the wrong thing.


[Warning] extract_rows_from_artifacts docstring says "tests-as-columns"

File: recce/util/per_node_db.py:336

Issue:

    - Column names come from the catalog when available; otherwise from the
      manifest's ``columns`` section (tests-as-columns).

The (tests-as-columns) parenthetical is wrong. The code reads node.get("columns") (manifest model/source column metadata), not tests. Tests are derived separately into node_tests rows via _derive_tests. A maintainer reading this would reasonably believe column rows are hydrated from test definitions, which is the opposite of what happens.

Copilot flagged this at discussion r3135356841; still unresolved.

Suggestion: Replace with something like "otherwise from each node's manifest columns metadata. Tests are derived separately into node_tests via _derive_tests."


Additional Notes (non-blocking)

  • Edges include test and seed children. recce/util/per_node_db.py:392-395 emits every (parent, child) pair in child_map without filtering. So model.foo -> test.pkg.unique_foo_id edges land in edges alongside model-to-model edges. Schema doesn't promise model-only edges, so this is a documentation issue more than a bug — but it bloats edges by the number of tests and every downstream lineage query must filter by joining against nodes.resource_type. If Cloud consumers assume edges is lineage-only, they will show tests as lineage nodes.
  • write_meta(session_id=session_id or ""). Storing empty string for missing session is harmless today, but future consumers who do WHERE session_id != '' won't expect the sentinel. A NULL would be more honest; INSERT OR REPLACE handles it fine.
  • manifest.to_dict() called inside emit loop (recce/cli.py:707). For an 1918-node project, this is a non-trivial allocation and already happens once at load_context time for dbt's own manifest representation. Not a blocker — author acknowledged in prior thread. Worth hoisting in a follow-up if Cloud starts invoking recce init --cloud frequently.
  • PerNodeDbWriter._txn() (line 178-182) is a connection accessor, not a transaction guard. Cosmetic from the prior review; the naming still invites confusion about atomicity semantics. A simple rename to _conn_ctx or making _txn actually issue BEGIN/COMMIT would close the gap.
  • No schema migration plumbing. PRAGMA user_version = 1 is written but never read anywhere — the writer unconditionally sets it on every open. Fine for this PR (Cloud consumes what this CLI just wrote), but the moment B1 lands and the Cloud reader is versioned independently, you'll want an assertion on the reader side.

Verdict

Request changes. The phantom-base-env bug is a real correctness issue: it ships duplicate rows under a false label in the exact "only one environment" path that the CLI explicitly supports via the context_kwargs path-swap. It is not guarded by the test suite because all CLI tests set base_manifest = None on the mock adapter — short-circuiting the path that production hits. Fix that, pull the two doc drifts into the next push, and this is a GO.


Generated by Claude Code

Copy link
Copy Markdown
Contributor

@gcko gcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review (hostile pass): Critical issue found. See #1334 (comment) for details.

Prior 3 Criticals + 4 Warnings from the earlier review were genuinely fixed in 29bbc993 (verified locally — 62 passed, 5 skipped; flake8 clean). But one new correctness bug surfaced:

  • Phantom env='base' rows when only one env has artifacts (recce/cli.py:700-717). When has_base == False, context_kwargs swaps target_base_path to point at target_path so load_context does not fail, leaving base_manifest non-None but identical to curr_manifest. The emit loop only guards on if manifest is None: continue, so every row is emitted twice with a fake env='base' label. Copilot independently flagged this (r3135356810, still unresolved). All 7 CLI tests set base_manifest = None on the mock adapter, short-circuiting the path production hits.

Two doc drifts also carried over from the Copilot pass and remain open: PR description still says "WAL mode" after 29bbc99 dropped it, and the extract_rows_from_artifacts docstring still has a misleading (tests-as-columns) parenthetical.


Generated by Claude Code

Critical:
- Phantom env='base' rows when only one env has artifacts. The emit loop
  keyed off `manifest is None`, but in the single-env path-swap case both
  curr_manifest and base_manifest are loaded pointing at the SAME env —
  so the missing-env guard never fired and we shipped duplicate rows under
  a fabricated env label. The emit loop now uses `has_target` / `has_base`,
  matching the CLL cache loop's truth source. Added a CLI test that
  reproduces the path-swap scenario and asserts no env='base' rows ship.

Addressed Copilot suggestion:
- Skip per_node.db emission entirely when Cloud doesn't advertise
  `per_node_db_url`. Fetch upload URLs first; only emit the SQLite file
  when there's somewhere to upload it. Saves 1-2s of pure-waste emit work
  against older Cloud servers during the migration window.

Docs:
- `extract_rows_from_artifacts` docstring no longer says "tests-as-columns"
  — columns come from each node's manifest `columns` metadata; tests are
  derived separately into `node_tests` via `_derive_tests`.
- Updated PR description: schema is rollback-journal (not WAL) mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei
Copy link
Copy Markdown
Contributor Author

@gcko fixed in 6766eca.

Critical: phantom env='base' rows

You're right — the path-swap at recce/cli.py:527-532 puts both curr_manifest and base_manifest on the same underlying dict, so manifest is None was the wrong guard. The emit loop now keys off has_target / has_base — the same truth source the CLL cache loop uses:

envs_to_emit = []
if has_target:
    envs_to_emit.append(("current", dbt_adapter.curr_manifest, dbt_adapter.curr_catalog))
if has_base:
    envs_to_emit.append(("base", dbt_adapter.base_manifest, dbt_adapter.base_catalog))

New CLI test test_no_base_rows_when_only_target_has_artifacts reproduces the path-swap exactly:

  • base_download_urls={}target-base/manifest.json never written → has_base == False
  • adapter.base_manifest = adapter.curr_manifest (same object) — what the CLI actually sees after load_context
  • Captures the per_node.db PUT body via a stub requests.put and opens it with sqlite3 to assert SELECT COUNT(*) FROM nodes WHERE env='base' is 0

This is the test scaffolding gap you flagged — all prior CLI tests set base_manifest = None, which short-circuits the bug path entirely. The new test sets them to the same object so the guard actually has to do work.

Warnings

  • PR description: Updated — the schema section now says "rollback-journal mode — no -wal/-shm sidecars, so the main file is self-contained at close and upload".
  • extract_rows_from_artifacts docstring: Replaced the "tests-as-columns" parenthetical with an explicit "Tests are NOT part of columns — they are derived separately into node_tests rows via _derive_tests."

Bonus — addressed Copilot's emit-before-URL concern

While I was in there, reordered the cloud block so upload URLs are fetched before emission. When per_node_db_url is absent (older Cloud), the CLI now skips the emit entirely instead of generating the SQLite file and then discarding it. Saves 1-2s of pure-waste work during the migration window; cll_map.json and cll_cache.db still upload independently.

Non-blocking notes you raised (noted, deferred)

  • Edges include test/seed children — agreed this is a schema-docs issue; I'll document at the Cloud consumer side when B1 (DRC-3297) wires up the reader, and filter there rather than at emit time. Keeps this emitter lossless.
  • write_meta(session_id=session_id or "") — agreed NULL would be more honest; will swap when I add the first consumer query that cares about it.
  • manifest.to_dict() inside emit loop — deferred; one-shot recce init --cloud invocation, not yet on the hot path. If Cloud starts invoking this frequently, will hoist into a shared memo.
  • _txn() naming — will rename to _conn_ctx in a follow-up to avoid confusing it with a real transaction guard.
  • Schema migration plumbing — agreed; will add a PRAGMA user_version read assertion on the Cloud reader side in B1.

Verification

$ uv run pytest tests/test_per_node_db.py tests/test_cli_per_node_db.py tests/test_cli_cache.py
63 passed, 5 skipped in 29.22s
$ make flake8
(clean)

Coverage on recce/util/per_node_db.py still 97%.

@gcko gcko self-requested a review April 24, 2026 06:50
Copy link
Copy Markdown
Contributor

gcko commented Apr 24, 2026

Updated Review (round 3) — PR #1334

Summary

Hostile re-pass at head 6766eca. All three findings from the previous round are legitimately addressed; one minor new regression surfaced from the structural refactor but it is niche enough that I will not block on it.

Verified locally: pytest tests/test_per_node_db.py tests/test_cli_per_node_db.py tests/test_cli_cache.py → 63 passed (+1 new), 5 skipped. make flake8 clean.


Round 2 findings — all addressed

  • Critical (phantom env='base'): Fixed in 6766eca. envs_to_emit is now built from has_target / has_base flags (recce/cli.py:726-732), matching the CLL cache loop's truth source. Critically, the new test test_no_base_rows_when_only_target_has_artifacts (tests/test_cli_per_node_db.py:563-675) actually reproduces the production path-swap scenario — curr_manifest and base_manifest are set to the same MagicMock object, exactly what load_context would produce after context_kwargs path-swaps — and verifies SELECT COUNT(*) FROM nodes WHERE env='base' is 0 in the uploaded bytes. This is real coverage of the bug, not just a happy-path regression test.
  • Warning (WAL in PR description): PR body updated to "rollback-journal mode — no -wal/-shm sidecars, so the main file is self-contained at close and upload."
  • Warning (extract_rows_from_artifacts docstring): (tests-as-columns) parenthetical replaced with clear prose separating column rows (from manifest columns) and test rows (from _derive_tests).

Refactor-review: new upload block

The fix pulled the emit step inside the upload block and made it conditional on upload_urls.get("per_node_db_url"). I re-read the reshaped recce/cli.py:686-857 closely for regressions.

What still works:

  • Scratch tempdir cleanup remains unconditional in the outer try/finally (recce/cli.py:687-857). All five previously-added failure-mode tests still pass (HTTP 500 on PUT, URL fetch raising, emit raising, missing per_node_db_url, success path).
  • cll_map.json and cll_cache.db uploads are still independent of per_node.db emission — if emit fails or is skipped, those two still upload.
  • Saves the emit cost against older Cloud servers that don't yet advertise per_node_db_url.

[Nit] Non-RequestException errors during upload no longer degrade to warnings

File: recce/cli.py:768-846

Issue: The previous structure had one outer try / except Exception as e: around the entire upload block, so any exception — OSError from open(cll_map_path, "rb"), unexpected KeyError, etc. — degraded to a single "Cloud upload failed" warning and the CLI exited 0. The new structure keeps only except requests.RequestException per upload. A non-network error during upload (e.g. file vanished between is_file() and open(), or an adapter-side Exception from stat()) now propagates out of the whole if is_cloud: block and crashes the CLI with non-zero exit.

In practice this is extremely unlikely — open() on a file we just wrote rarely fails — and there's an argument that a true OSError should crash rather than be swallowed. Flagging for awareness, not blocking. If the project's posture is "upload issues are always warnings, never errors" (which the test docstrings imply), consider wrapping each upload's work in a broader try / except Exception.

[Nit] Redundant if cloud_client: check

File: recce/cli.py:689

Issue: When is_cloud is True (the branch we are in), cloud_client is guaranteed non-None by recce/cli.py:365 (the code exit(1)s on earlier auth failures, never reaching here with cloud_client = None). The if cloud_client: guard at line 689 is superstition. Not a bug; just dead defensive code that invites a future reader to wonder what invariant it is protecting.


Additional Notes carried from round 2 (still non-blocking)

  • Edges still include test and seed children (recce/util/per_node_db.py:392-395). Schema doesn't promise model-only, so this is documentation, not a bug — but Cloud consumers of edges must filter by joining against nodes.resource_type or they'll render tests as lineage nodes.
  • write_meta(session_id=session_id or "") stores an empty string for missing session. A NULL would be more honest, but INSERT OR REPLACE handles either fine.
  • manifest.to_dict() per env — O(manifest size) allocation on a 1918-node project; worth hoisting in a follow-up if recce init --cloud gets called frequently.
  • PerNodeDbWriter._txn() naming still suggests a DB transaction when it's a connection accessor. Cosmetic.
  • PRAGMA user_version = 1 is write-only — no reader validates it today. Fine while CLI and Cloud ship together, but the moment B1 lands and the reader versions independently, assert on it.

Verdict

Approved. The phantom-base-env bug is properly fixed and has a test that actually reproduces the production path-swap scenario. The two doc drifts are resolved. The refactor is tightened sensibly — emit work is now avoided against older Cloud servers. The open()/upload-error-propagation nit is edge-case enough that I won't block on it; call it out in the next upload-related change if you want tighter error-to-warning conversion.

Ready to merge once the cloud-infra companion (#1239) lands. CLI still tolerates either merge order.


Generated by Claude Code

Copy link
Copy Markdown
Contributor

@gcko gcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review (hostile re-pass, round 3): All prior Criticals and Warnings are legitimately addressed in 6766eca. See #1334 (comment) for details.

  • Phantom env='base' bug fixed — envs_to_emit now built from has_target/has_base (not manifest is None), and test_no_base_rows_when_only_target_has_artifacts reproduces the production path-swap scenario with curr_manifest and base_manifest set to the same object.
  • PR description corrected (rollback-journal, not WAL).
  • extract_rows_from_artifacts docstring corrected.
  • Refactor pulled emit inside the upload block and made it conditional on per_node_db_url — saves wasted emit against older Cloud servers.

Two non-blocking nits noted (non-RequestException during upload no longer degrades to warnings; redundant if cloud_client: superstition check) — neither blocks merge.

Verified: 63 passed, 5 skipped; flake8 clean.


Generated by Claude Code

@even-wei even-wei merged commit 56478cc into main Apr 24, 2026
22 checks passed
@even-wei even-wei deleted the feature/drc-3295-pr1-per-node-db-emitter branch April 24, 2026 07:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants