feat(profile-distribution): polyglot SQL foundation (DRC-3390 PR 1)#1389
Closed
danyelf wants to merge 4 commits into
Closed
feat(profile-distribution): polyglot SQL foundation (DRC-3390 PR 1)#1389danyelf wants to merge 4 commits into
danyelf wants to merge 4 commits into
Conversation
Foundation layer for the paired-histogram feature. No user-facing behavior yet — wires the RunType, CLI flag, capability detection, and per-dialect SQL renderers that PR 2 will orchestrate into the approx_all pipeline. Backend - RunType.PROFILE_DISTRIBUTION enum + registration in dbt_supported_registry - --inline-profile CLI flag (default off), plumbed to the server flag dict - recce/adapter/capabilities.py: per-adapter AdapterCapabilities lookup. Snowflake / BigQuery / DuckDB / Databricks / Spark / Athena / Trino / Presto / ClickHouse -> full; Redshift -> percentile-only; everything else (Postgres / MySQL / SQLite / SQL Server) -> disabled tier. - recce/adapter/approx_aggregates.py: per-dialect SQL fragment renderers for approx_count_distinct / approx_top_k / approx_percentile. Raises UnsupportedAggregateError on adapters without native support. - ProfileDistributionTask stub returning an empty payload so PR 1 flag plumbing is end-to-end verifiable before PR 2 lands. Frontend - RecceServerFlags.inline_profile boolean - RunType union + Run discriminated union + RUN_TYPES list + type guard for "profile_distribution"; minimal ProfileDistributionParams / ProfileDistributionResult placeholders (PR 2 will populate the shape) - Stub entry in run RunRegistry so existing components type-check Tests - 50 per-dialect SQL snapshot tests covering every supported adapter for each renderer; +disabled-tier raises; +input-validation - 19 capability-detection tests asserting the tier matrix is intact All 104 tests pass (69 new + 35 existing histogram/profile/top_k/CLI). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
Codecov Report❌ Patch coverage is
... and 4 files with indirect coverage changes 🚀 New features to boost your workflow:
|
…UI) subset Address review feedback on PR 1: profile_distribution isn't a user-facing checklist run — it's a backend data fetch whose result feeds into schema view cells. Giving it a title/icon/registry entry conflated wire-level run types with user-facing run-with-result-view types. Splits the two concepts: - ``RunType`` (api/types/run.ts) — every wire-level run type the API knows, including profile_distribution. Used by the task dispatcher. - ``RegisteredRunType`` (components/run/registry.ts) — the subset that surfaces in the UI as a checklist item with title / icon / result view. - ``RunTypeWithRef`` (api/types/run.ts) — the further subset whose result view supports forwardRef (screenshots). Concrete changes: - Remove the placeholder profile_distribution entry from RunRegistry + registry const. - Add ``RegisteredRunType``, ``REGISTERED_RUN_TYPES``, and ``isRegisteredRunType`` guard. - Narrow ``findByRunType`` / ``createBoundFindByRunType`` to ``RegisteredRunType`` (was ``RunType``). - Promote ``runTypeHasRef`` to a real type guard returning ``runType is RunTypeWithRef`` (was bare ``boolean``). Two callers (RunResultPaneOss, useRun) get automatic narrowing for free. - Update RunTypeRegistry (in types.ts) to be keyed on registered types. - Add explicit isRegisteredRunType guards at the two remaining callsites (CheckDetailOss, RecceActionAdapter) that pass arbitrary RunType values. - Update registry.test.ts to iterate REGISTERED_RUN_TYPES (was RUN_TYPES). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
This was referenced May 21, 2026
…unblock type:check PR 1 narrowed ``findByRunType``'s parameter from ``RunType`` to ``RegisteredRunType`` (a subset that excludes ``profile_distribution``) but left the test arrays in ``js/src/components/run/__tests__/registry.test.ts`` typed as ``RunType[]``. tsc flags every call site since ``profile_distribution`` isn't a valid ``keyof RunRegistry``. Re-typing the test-side arrays to ``RegisteredRunType[]`` lines them up with what ``findByRunType`` actually accepts, which is what they were already intending — every value in those arrays is already a registered type. Unblocks the pre-push ``pnpm type:check`` hook for the DRC-3390 PR 3 stack. Refs DRC-3390. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
…mock PR 1 (DRC-3390 polyglot foundation, commit e6b4cf5) added a runtime ``isRegisteredRunType`` guard inside ``RecceActionAdapter`` to filter out backend-only run types like ``profile_distribution`` before passing them to ``findByRunType``. The existing RecceActionAdapter test mock replaces the whole ``@datarecce/ui/components/run`` module, which silently broke 9 tests when ``isRegisteredRunType`` started returning undefined → falsy → the adapter threw "Run type X does not have a result view" before ever calling ``submitRun``. Adding the guard to the mock (defaulted to true, which preserves prior behavior for the ``query_diff`` / ``row_count`` test fixtures) restores all 9. Verified by running ``pnpm vitest run RecceActionAdapter`` — 26/26 pass on this commit, vs 17/26 on the parent branch. Unblocks the pre-push hook for the DRC-3390 PR 3 stack. Refs DRC-3390. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
danyelf
added a commit
that referenced
this pull request
May 21, 2026
… (DRC-3390 PR 3) PR 3 of the paired-histograms GA productionization. Adds the frontend cells, hook, schema-row integration, and unsupported-adapter banner that consume the per-column payloads PR 2 will produce. Stacks on PR 1 (#1389) which already shipped RunType.PROFILE_DISTRIBUTION + the discriminated-union plumbing. Includes: - PairedHistogramContinuous — quantile-bin, constant-area paired bar chart. Heights = density, widths = bin span, so bar area reads as the proportion of rows in that bin. Replaces the prototype's uniform-bin renderer (closed PR #1380). - PairedHistogramDiscrete — top-K paired chart with gap-on-absent semantics. When a value is missing on one side, that bar isn't drawn, leaving a visible empty half-slot that reads as 'absent here'. - useInlineProfileDistribution — TanStack-Query-backed hook that fires a single PROFILE_DISTRIBUTION run per model, parses the discriminated- union payload, and returns {distributions, loading, error, isUnsupported}. Emits Amplitude analytics events at request and result (PostHog wiring noted in DRC-3390 but the codebase uses Amplitude — same trail). - InlineProfileDistributionCell — Compact-mode wrapper that picks between loading / error / empty (per-column failure) / continuous / discrete renders in a fixed-size slot, so adjacent rows don't reflow. Grid mode is intentionally absent at GA (per DRC-3390 lock). - ProfileDistributionUnsupportedBanner — once-per-task MUI Alert for the {status: 'unsupported', reason} envelope (Postgres / MySQL / SQLite / SQL Server pre-2022). Payload contract (frozen — PR 2 produces exactly this): - Continuous: {kind:'histogram', bin_edges:[12], base_density:[11], current_density:[11], base_total, current_total} - Categorical: {kind:'topk', values, base_counts, current_counts, base_total, current_total, trimmed} - Per-column failure: {kind:null} - Unsupported envelope: {status:'unsupported', reason:...} Refines the ProfileDistributionResult type in api/types/run.ts from the PR 1 stub (Record<string, unknown>) to the full discriminated union — PR 1 left this as a placeholder for PR 2/3. Tests + storybook: - 57 unit tests across 5 files covering both cells, the integration wrapper, the banner, and the hook (mocked at the submit boundary so PR 2 doesn't need to be merged). - 20 storybook stories covering both cells (cell + card density, light + dark, gap-on-absent, trimmed, empty payload, degenerate range) and all four integration-cell states; verified visually on port 6011. - pnpm type:check clean; remaining errors are pre-existing baseline (CSS module declarations, lineage hooks) and unrelated to this PR. Stacks on #1389. Mocks PR 2's payload in tests; rebase onto main once PR 2 lands. PR 4 (lineage pre-warm + GA flip) consumes the same hook + cacheKeys.profileDistribution(model) cache key. Refs DRC-3390. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
danyelf
added a commit
that referenced
this pull request
May 21, 2026
Replaces the PR 1 stub of `ProfileDistributionTask` with the full
six-query `approx_all` orchestration (HLL probe + APPROX_PERCENTILE +
APPROX_TOP_K per env), composed on top of PR 1's per-dialect SQL
renderers and capability table.
What this PR ships:
- Strategy router (`pick_strategy`) → `approx_all` / `percentile_only`
/ `unsupported`, driven by `recce.adapter.capabilities`.
- Batched HLL + min/max probe per env in one SELECT.
- Batched `APPROX_PERCENTILE` and `APPROX_TOP_K` per env.
- `cap_degenerate` skip: columns with HLL >= 0.95 * row_count emit an
empty top-K slot (effectively unique).
- Per-column failure isolation: a bad column's batch failure falls back
to per-column retry so the rest of the model still renders.
- Per-adapter top-K result-shape post-processor (`parse_topk_result`):
Snowflake `[[value, count], …]` pairs vs DuckDB / ClickHouse flat
lists vs BigQuery / Databricks STRUCTs vs Trino MAP — each
normalized to `(values, counts | None)`.
- In-memory memoization keyed by `(model, manifest_hash, version_token)`
on the active `RecceContext`.
- PostHog telemetry via `log_performance("profile_distribution", …)`
with `{strategy, total_wall_ms, phase_wall_ms, column_count,
error_count, cache_hit}`.
Bakes in two prototype-bug fixes:
- DRC-3504: `render_epoch_cast` applies a per-adapter in-SQL epoch
conversion (DuckDB `epoch()`, Snowflake `date_part(epoch_second, …)`,
BigQuery `UNIX_SECONDS(…)`, etc.) before the percentile fragment, so
TIMESTAMP histograms render instead of silently emitting empty
charts.
- DRC-3507: `_execute_with_rollback` issues an explicit ROLLBACK on any
per-statement failure, so DuckDB's transaction-cascade behaviour no
longer blanks subsequent columns.
Payload schemas (frozen with PR 3, per DRC-3390 spec):
- Continuous: `{kind: "histogram", bin_edges, base_density,
current_density, base_total, current_total}` — 12 edges, 11 densities,
quantile-binned constant-area.
- Categorical: `{kind: "topk", values, base_counts, current_counts,
trimmed}` — counts is `None` when the adapter's sketch doesn't expose
them (DuckDB, ClickHouse).
- Unsupported tier: single envelope `{status: "unsupported", reason,
columns: {}}`.
- Per-column failure: `{kind: null}`.
Tests cover continuous numeric, low-cardinality categorical,
high-cardinality (trimmed) categorical, UUID-cap-degenerate, timestamp
(DRC-3504 regression), deliberately-bad column (DRC-3507 regression),
strategy router, top-K post-processor across all adapter shapes, and
memoization hit/miss.
Manual warehouse smoke 2026-05-21:
- Snowflake `orders` (999 rows, 9 cols): 8.1s cold, 0ms cached.
ORDER_DATE (TIMESTAMP) renders a histogram; STATUS top-K returns
pairs.
- DuckDB `orders` (999 rows, 9 cols): 93ms cold, 0ms cached. STATUS
top-K returns flat list, counts=None.
Stacks on #1389. Rebase onto main after #1389 merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
Contributor
Author
|
Closing as part of restructure (Ultraplan-approved). The original 4-PR plan optimized for per-dialect SQL audit in isolation; the new staging optimizes for reviewer-runnability — each PR independently demoable. This PR's content splits across:
Branch |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
First of four PRs landing the paired-histogram feature on
main. Scope is intentionally narrow: this PR adds the wire — RunType, CLI flag, capability detection, per-dialect SQL renderers — and nothing user-visible. PR 2 wires the orchestration; PRs 3–4 ship the frontend cells, schema view integration, in-session cache, and lineage pre-warm.Full 4-PR plan + locked decisions are in DRC-3390. Data-pull strategy decision (
approx_all) is in DRC-3389.What's in this PR
Backend
RunType.PROFILE_DISTRIBUTIONenum + registration indbt_supported_registry--inline-profileCLI flag (default off), plumbed to the serverflagdictrecce/adapter/capabilities.py: per-adapterAdapterCapabilitieslookuprecce/adapter/approx_aggregates.py: per-dialect SQL fragment renderers forapprox_count_distinct/approx_top_k/approx_percentile. RaisesUnsupportedAggregateErroron adapters without native supportProfileDistributionTaskstub returning empty payload so PR 1 flag plumbing is end-to-end verifiable before PR 2 landsFrontend (minimal — just keeping types coherent)
RecceServerFlags.inline_profileRunTypeunion + discriminated union +RUN_TYPESlist + type guard for"profile_distribution"ProfileDistributionParams/ProfileDistributionResult(PR 2 will populate)RunRegistryso existing components type-checkAdapter coverage matrix (locked in DRC-3389/DRC-3390)
Caveats baked into the matrix that capability detection by adapter type can't enforce (deferred to runtime version detection):
sqlserverdefaults to disabled. SQL Server 2022+ / Azure SQL addedAPPROX_PERCENTILE_CONT/_DISC; pre-2022 has neither. A version-aware refinement can move 2022+ users into percentile-only tier.databricks/sparkdefault to full. Databricks/Spark addedapprox_top_kin 3.5; older clusters would land in percentile-only.Test plan
pytest tests/adapter/test_capabilities.py tests/adapter/test_approx_aggregates.py— 69/69 pass (capability mapping + per-dialect SQL snapshots)pytest tests/tasks/test_{histogram,profile,top_k}.py tests/test_cli.py— 35/35 pass (no regression)recce server --help—--inline-profileflag present and describedRunType.PROFILE_DISTRIBUTIONin registry, stub task instantiates and returns{columns: {}, status: "ok"}pnpm run --filter @datarecce/ui type:check— pre-existing baseline errors only; no new type errors introducedWhy this split
Reviewers can audit each dialect's SQL in isolation. Lots of small dialect mistakes possible here — snapshot tests are load-bearing. Landing the SQL foundation alone makes them easy to catch before PR 2 wires the orchestration around them.
Out of scope (lands in subsequent PRs)
ProfileDistributionTask(HLL probe + APPROX_PERCENTILE + APPROX_TOP_K) — PR 2🤖 Generated with Claude Code