feat(profile-distribution): polyglot SQL foundation (DRC-3390 PR 1) by danyelf · Pull Request #1389 · DataRecce/recce

danyelf · 2026-05-21T18:06:35Z

First of four PRs landing the paired-histogram feature on main. Scope is intentionally narrow: this PR adds the wire — RunType, CLI flag, capability detection, per-dialect SQL renderers — and nothing user-visible. PR 2 wires the orchestration; PRs 3–4 ship the frontend cells, schema view integration, in-session cache, and lineage pre-warm.

Full 4-PR plan + locked decisions are in DRC-3390. Data-pull strategy decision (approx_all) is in DRC-3389.

What's in this PR

Backend

RunType.PROFILE_DISTRIBUTION enum + registration in dbt_supported_registry
--inline-profile CLI flag (default off), plumbed to the server flag dict
recce/adapter/capabilities.py: per-adapter AdapterCapabilities lookup
recce/adapter/approx_aggregates.py: per-dialect SQL fragment renderers for approx_count_distinct / approx_top_k / approx_percentile. Raises UnsupportedAggregateError on adapters without native support
ProfileDistributionTask stub returning empty payload so PR 1 flag plumbing is end-to-end verifiable before PR 2 lands

Frontend (minimal — just keeping types coherent)

RecceServerFlags.inline_profile
RunType union + discriminated union + RUN_TYPES list + type guard for "profile_distribution"
Placeholder ProfileDistributionParams / ProfileDistributionResult (PR 2 will populate)
Stub entry in RunRegistry so existing components type-check

Adapter coverage matrix (locked in DRC-3389/DRC-3390)

Tier	Engines	Behavior
Full	Snowflake, BigQuery, DuckDB, Databricks/Spark 3.5+, Athena/Trino/Presto, ClickHouse	Histograms + top-K
Histograms only	Redshift	Continuous columns render; categorical cells blank
Disabled	Postgres, MySQL, SQLite, SQL Server (incl. Azure SQL pre/post-2022)	Feature unavailable; banner if flag is on

Caveats baked into the matrix that capability detection by adapter type can't enforce (deferred to runtime version detection):

sqlserver defaults to disabled. SQL Server 2022+ / Azure SQL added APPROX_PERCENTILE_CONT/_DISC; pre-2022 has neither. A version-aware refinement can move 2022+ users into percentile-only tier.
databricks / spark default to full. Databricks/Spark added approx_top_k in 3.5; older clusters would land in percentile-only.

Test plan

pytest tests/adapter/test_capabilities.py tests/adapter/test_approx_aggregates.py — 69/69 pass (capability mapping + per-dialect SQL snapshots)
pytest tests/tasks/test_{histogram,profile,top_k}.py tests/test_cli.py — 35/35 pass (no regression)
recce server --help — --inline-profile flag present and described
Smoke: RunType.PROFILE_DISTRIBUTION in registry, stub task instantiates and returns {columns: {}, status: "ok"}
pnpm run --filter @datarecce/ui type:check — pre-existing baseline errors only; no new type errors introduced

Why this split

Reviewers can audit each dialect's SQL in isolation. Lots of small dialect mistakes possible here — snapshot tests are load-bearing. Landing the SQL foundation alone makes them easy to catch before PR 2 wires the orchestration around them.

Out of scope (lands in subsequent PRs)

Full ProfileDistributionTask (HLL probe + APPROX_PERCENTILE + APPROX_TOP_K) — PR 2
Frontend cells, hook, schema view integration, in-session memoization — PR 3
Lineage-click pre-warm + GA flip — PR 4

🤖 Generated with Claude Code

Foundation layer for the paired-histogram feature. No user-facing behavior yet — wires the RunType, CLI flag, capability detection, and per-dialect SQL renderers that PR 2 will orchestrate into the approx_all pipeline. Backend - RunType.PROFILE_DISTRIBUTION enum + registration in dbt_supported_registry - --inline-profile CLI flag (default off), plumbed to the server flag dict - recce/adapter/capabilities.py: per-adapter AdapterCapabilities lookup. Snowflake / BigQuery / DuckDB / Databricks / Spark / Athena / Trino / Presto / ClickHouse -> full; Redshift -> percentile-only; everything else (Postgres / MySQL / SQLite / SQL Server) -> disabled tier. - recce/adapter/approx_aggregates.py: per-dialect SQL fragment renderers for approx_count_distinct / approx_top_k / approx_percentile. Raises UnsupportedAggregateError on adapters without native support. - ProfileDistributionTask stub returning an empty payload so PR 1 flag plumbing is end-to-end verifiable before PR 2 lands. Frontend - RecceServerFlags.inline_profile boolean - RunType union + Run discriminated union + RUN_TYPES list + type guard for "profile_distribution"; minimal ProfileDistributionParams / ProfileDistributionResult placeholders (PR 2 will populate the shape) - Stub entry in run RunRegistry so existing components type-check Tests - 50 per-dialect SQL snapshot tests covering every supported adapter for each renderer; +disabled-tier raises; +input-validation - 19 capability-detection tests asserting the tier matrix is intact All 104 tests pass (69 new + 35 existing histogram/profile/top_k/CLI). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>

codecov · 2026-05-21T18:13:44Z

Codecov Report

❌ Patch coverage is 97.61905% with 4 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
recce/tasks/profile_distribution.py	75.00%	3 Missing ⚠️
recce/cli.py	66.66%	1 Missing ⚠️

Files with missing lines	Coverage Δ
recce/adapter/approx_aggregates.py	`100.00% <100.00%> (ø)`
recce/adapter/capabilities.py	`100.00% <100.00%> (ø)`
recce/adapter/dbt_adapter/__init__.py	`79.44% <ø> (ø)`
recce/models/types.py	`98.25% <100.00%> (+0.01%)`	⬆️
recce/tasks/__init__.py	`100.00% <100.00%> (ø)`
tests/adapter/test_approx_aggregates.py	`100.00% <100.00%> (ø)`
tests/adapter/test_capabilities.py	`100.00% <100.00%> (ø)`
recce/cli.py	`68.73% <66.66%> (-0.01%)`	⬇️
recce/tasks/profile_distribution.py	`75.00% <75.00%> (ø)`

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…UI) subset Address review feedback on PR 1: profile_distribution isn't a user-facing checklist run — it's a backend data fetch whose result feeds into schema view cells. Giving it a title/icon/registry entry conflated wire-level run types with user-facing run-with-result-view types. Splits the two concepts: - ``RunType`` (api/types/run.ts) — every wire-level run type the API knows, including profile_distribution. Used by the task dispatcher. - ``RegisteredRunType`` (components/run/registry.ts) — the subset that surfaces in the UI as a checklist item with title / icon / result view. - ``RunTypeWithRef`` (api/types/run.ts) — the further subset whose result view supports forwardRef (screenshots). Concrete changes: - Remove the placeholder profile_distribution entry from RunRegistry + registry const. - Add ``RegisteredRunType``, ``REGISTERED_RUN_TYPES``, and ``isRegisteredRunType`` guard. - Narrow ``findByRunType`` / ``createBoundFindByRunType`` to ``RegisteredRunType`` (was ``RunType``). - Promote ``runTypeHasRef`` to a real type guard returning ``runType is RunTypeWithRef`` (was bare ``boolean``). Two callers (RunResultPaneOss, useRun) get automatic narrowing for free. - Update RunTypeRegistry (in types.ts) to be keyed on registered types. - Add explicit isRegisteredRunType guards at the two remaining callsites (CheckDetailOss, RecceActionAdapter) that pass arbitrary RunType values. - Update registry.test.ts to iterate REGISTERED_RUN_TYPES (was RUN_TYPES). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>

…unblock type:check PR 1 narrowed ``findByRunType``'s parameter from ``RunType`` to ``RegisteredRunType`` (a subset that excludes ``profile_distribution``) but left the test arrays in ``js/src/components/run/__tests__/registry.test.ts`` typed as ``RunType[]``. tsc flags every call site since ``profile_distribution`` isn't a valid ``keyof RunRegistry``. Re-typing the test-side arrays to ``RegisteredRunType[]`` lines them up with what ``findByRunType`` actually accepts, which is what they were already intending — every value in those arrays is already a registered type. Unblocks the pre-push ``pnpm type:check`` hook for the DRC-3390 PR 3 stack. Refs DRC-3390. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>

…mock PR 1 (DRC-3390 polyglot foundation, commit e6b4cf5) added a runtime ``isRegisteredRunType`` guard inside ``RecceActionAdapter`` to filter out backend-only run types like ``profile_distribution`` before passing them to ``findByRunType``. The existing RecceActionAdapter test mock replaces the whole ``@datarecce/ui/components/run`` module, which silently broke 9 tests when ``isRegisteredRunType`` started returning undefined → falsy → the adapter threw "Run type X does not have a result view" before ever calling ``submitRun``. Adding the guard to the mock (defaulted to true, which preserves prior behavior for the ``query_diff`` / ``row_count`` test fixtures) restores all 9. Verified by running ``pnpm vitest run RecceActionAdapter`` — 26/26 pass on this commit, vs 17/26 on the parent branch. Unblocks the pre-push hook for the DRC-3390 PR 3 stack. Refs DRC-3390. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>

… (DRC-3390 PR 3) PR 3 of the paired-histograms GA productionization. Adds the frontend cells, hook, schema-row integration, and unsupported-adapter banner that consume the per-column payloads PR 2 will produce. Stacks on PR 1 (#1389) which already shipped RunType.PROFILE_DISTRIBUTION + the discriminated-union plumbing. Includes: - PairedHistogramContinuous — quantile-bin, constant-area paired bar chart. Heights = density, widths = bin span, so bar area reads as the proportion of rows in that bin. Replaces the prototype's uniform-bin renderer (closed PR #1380). - PairedHistogramDiscrete — top-K paired chart with gap-on-absent semantics. When a value is missing on one side, that bar isn't drawn, leaving a visible empty half-slot that reads as 'absent here'. - useInlineProfileDistribution — TanStack-Query-backed hook that fires a single PROFILE_DISTRIBUTION run per model, parses the discriminated- union payload, and returns {distributions, loading, error, isUnsupported}. Emits Amplitude analytics events at request and result (PostHog wiring noted in DRC-3390 but the codebase uses Amplitude — same trail). - InlineProfileDistributionCell — Compact-mode wrapper that picks between loading / error / empty (per-column failure) / continuous / discrete renders in a fixed-size slot, so adjacent rows don't reflow. Grid mode is intentionally absent at GA (per DRC-3390 lock). - ProfileDistributionUnsupportedBanner — once-per-task MUI Alert for the {status: 'unsupported', reason} envelope (Postgres / MySQL / SQLite / SQL Server pre-2022). Payload contract (frozen — PR 2 produces exactly this): - Continuous: {kind:'histogram', bin_edges:[12], base_density:[11], current_density:[11], base_total, current_total} - Categorical: {kind:'topk', values, base_counts, current_counts, base_total, current_total, trimmed} - Per-column failure: {kind:null} - Unsupported envelope: {status:'unsupported', reason:...} Refines the ProfileDistributionResult type in api/types/run.ts from the PR 1 stub (Record<string, unknown>) to the full discriminated union — PR 1 left this as a placeholder for PR 2/3. Tests + storybook: - 57 unit tests across 5 files covering both cells, the integration wrapper, the banner, and the hook (mocked at the submit boundary so PR 2 doesn't need to be merged). - 20 storybook stories covering both cells (cell + card density, light + dark, gap-on-absent, trimmed, empty payload, degenerate range) and all four integration-cell states; verified visually on port 6011. - pnpm type:check clean; remaining errors are pre-existing baseline (CSS module declarations, lineage hooks) and unrelated to this PR. Stacks on #1389. Mocks PR 2's payload in tests; rebase onto main once PR 2 lands. PR 4 (lineage pre-warm + GA flip) consumes the same hook + cacheKeys.profileDistribution(model) cache key. Refs DRC-3390. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>

Replaces the PR 1 stub of `ProfileDistributionTask` with the full six-query `approx_all` orchestration (HLL probe + APPROX_PERCENTILE + APPROX_TOP_K per env), composed on top of PR 1's per-dialect SQL renderers and capability table. What this PR ships: - Strategy router (`pick_strategy`) → `approx_all` / `percentile_only` / `unsupported`, driven by `recce.adapter.capabilities`. - Batched HLL + min/max probe per env in one SELECT. - Batched `APPROX_PERCENTILE` and `APPROX_TOP_K` per env. - `cap_degenerate` skip: columns with HLL >= 0.95 * row_count emit an empty top-K slot (effectively unique). - Per-column failure isolation: a bad column's batch failure falls back to per-column retry so the rest of the model still renders. - Per-adapter top-K result-shape post-processor (`parse_topk_result`): Snowflake `[[value, count], …]` pairs vs DuckDB / ClickHouse flat lists vs BigQuery / Databricks STRUCTs vs Trino MAP — each normalized to `(values, counts | None)`. - In-memory memoization keyed by `(model, manifest_hash, version_token)` on the active `RecceContext`. - PostHog telemetry via `log_performance("profile_distribution", …)` with `{strategy, total_wall_ms, phase_wall_ms, column_count, error_count, cache_hit}`. Bakes in two prototype-bug fixes: - DRC-3504: `render_epoch_cast` applies a per-adapter in-SQL epoch conversion (DuckDB `epoch()`, Snowflake `date_part(epoch_second, …)`, BigQuery `UNIX_SECONDS(…)`, etc.) before the percentile fragment, so TIMESTAMP histograms render instead of silently emitting empty charts. - DRC-3507: `_execute_with_rollback` issues an explicit ROLLBACK on any per-statement failure, so DuckDB's transaction-cascade behaviour no longer blanks subsequent columns. Payload schemas (frozen with PR 3, per DRC-3390 spec): - Continuous: `{kind: "histogram", bin_edges, base_density, current_density, base_total, current_total}` — 12 edges, 11 densities, quantile-binned constant-area. - Categorical: `{kind: "topk", values, base_counts, current_counts, trimmed}` — counts is `None` when the adapter's sketch doesn't expose them (DuckDB, ClickHouse). - Unsupported tier: single envelope `{status: "unsupported", reason, columns: {}}`. - Per-column failure: `{kind: null}`. Tests cover continuous numeric, low-cardinality categorical, high-cardinality (trimmed) categorical, UUID-cap-degenerate, timestamp (DRC-3504 regression), deliberately-bad column (DRC-3507 regression), strategy router, top-K post-processor across all adapter shapes, and memoization hit/miss. Manual warehouse smoke 2026-05-21: - Snowflake `orders` (999 rows, 9 cols): 8.1s cold, 0ms cached. ORDER_DATE (TIMESTAMP) renders a histogram; STATUS top-K returns pairs. - DuckDB `orders` (999 rows, 9 cols): 93ms cold, 0ms cached. STATUS top-K returns flat list, counts=None. Stacks on #1389. Rebase onto main after #1389 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>

danyelf · 2026-05-26T06:37:50Z

Closing as part of restructure (Ultraplan-approved). The original 4-PR plan optimized for per-dialect SQL audit in isolation; the new staging optimizes for reviewer-runnability — each PR independently demoable.

This PR's content splits across:

Stage B (DuckDB-only): capabilities.py (single-row table), approx_aggregates.py (DuckDB renderers only), TS contract types (RunType.PROFILE_DISTRIBUTION, RegisteredRunType, isRegisteredRunType), --inline-profile CLI flag + RecceServerFlags.inline_profile, and the two cherry-picked test fixes (registry.test.ts narrow, RecceActionAdapter mock stub).
Stage D (polyglot extension): per-dialect SQL renderers for Snowflake / BigQuery / Databricks / Spark / Athena / Trino / Presto / ClickHouse / Redshift; full capability tier table.

Branch feature/drc-3390-pr1-polyglot-foundation retained as a code-stash source for the new stages.

This was referenced May 21, 2026

feat(profile-distribution): full approx_all pipeline (DRC-3390 PR 2) #1390

Closed

feat(profile-distribution): paired histogram cells for inline profile (DRC-3390 PR 3) #1391

Closed

danyelf and others added 2 commits May 21, 2026 13:05

danyelf closed this May 26, 2026

danyelf mentioned this pull request May 26, 2026

feat: DuckDB-only paired-histogram backend (DRC-3390 Stage B) #1399

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(profile-distribution): polyglot SQL foundation (DRC-3390 PR 1)#1389

feat(profile-distribution): polyglot SQL foundation (DRC-3390 PR 1)#1389
danyelf wants to merge 4 commits into
mainfrom
feature/drc-3390-pr1-polyglot-foundation

danyelf commented May 21, 2026

Uh oh!

codecov Bot commented May 21, 2026 •

edited

Loading

Uh oh!

danyelf commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

danyelf commented May 21, 2026

What's in this PR

Adapter coverage matrix (locked in DRC-3389/DRC-3390)

Test plan

Why this split

Out of scope (lands in subsequent PRs)

Uh oh!

codecov Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

danyelf commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented May 21, 2026 •

edited

Loading