Skip to content

feat(profile-distribution): polyglot SQL foundation (DRC-3390 PR 1)#1389

Closed
danyelf wants to merge 4 commits into
mainfrom
feature/drc-3390-pr1-polyglot-foundation
Closed

feat(profile-distribution): polyglot SQL foundation (DRC-3390 PR 1)#1389
danyelf wants to merge 4 commits into
mainfrom
feature/drc-3390-pr1-polyglot-foundation

Conversation

@danyelf
Copy link
Copy Markdown
Contributor

@danyelf danyelf commented May 21, 2026

First of four PRs landing the paired-histogram feature on main. Scope is intentionally narrow: this PR adds the wire — RunType, CLI flag, capability detection, per-dialect SQL renderers — and nothing user-visible. PR 2 wires the orchestration; PRs 3–4 ship the frontend cells, schema view integration, in-session cache, and lineage pre-warm.

Full 4-PR plan + locked decisions are in DRC-3390. Data-pull strategy decision (approx_all) is in DRC-3389.

What's in this PR

Backend

  • RunType.PROFILE_DISTRIBUTION enum + registration in dbt_supported_registry
  • --inline-profile CLI flag (default off), plumbed to the server flag dict
  • recce/adapter/capabilities.py: per-adapter AdapterCapabilities lookup
  • recce/adapter/approx_aggregates.py: per-dialect SQL fragment renderers for approx_count_distinct / approx_top_k / approx_percentile. Raises UnsupportedAggregateError on adapters without native support
  • ProfileDistributionTask stub returning empty payload so PR 1 flag plumbing is end-to-end verifiable before PR 2 lands

Frontend (minimal — just keeping types coherent)

  • RecceServerFlags.inline_profile
  • RunType union + discriminated union + RUN_TYPES list + type guard for "profile_distribution"
  • Placeholder ProfileDistributionParams / ProfileDistributionResult (PR 2 will populate)
  • Stub entry in RunRegistry so existing components type-check

Adapter coverage matrix (locked in DRC-3389/DRC-3390)

Tier Engines Behavior
Full Snowflake, BigQuery, DuckDB, Databricks/Spark 3.5+, Athena/Trino/Presto, ClickHouse Histograms + top-K
Histograms only Redshift Continuous columns render; categorical cells blank
Disabled Postgres, MySQL, SQLite, SQL Server (incl. Azure SQL pre/post-2022) Feature unavailable; banner if flag is on

Caveats baked into the matrix that capability detection by adapter type can't enforce (deferred to runtime version detection):

  • sqlserver defaults to disabled. SQL Server 2022+ / Azure SQL added APPROX_PERCENTILE_CONT/_DISC; pre-2022 has neither. A version-aware refinement can move 2022+ users into percentile-only tier.
  • databricks / spark default to full. Databricks/Spark added approx_top_k in 3.5; older clusters would land in percentile-only.

Test plan

  • pytest tests/adapter/test_capabilities.py tests/adapter/test_approx_aggregates.py69/69 pass (capability mapping + per-dialect SQL snapshots)
  • pytest tests/tasks/test_{histogram,profile,top_k}.py tests/test_cli.py35/35 pass (no regression)
  • recce server --help--inline-profile flag present and described
  • Smoke: RunType.PROFILE_DISTRIBUTION in registry, stub task instantiates and returns {columns: {}, status: "ok"}
  • pnpm run --filter @datarecce/ui type:check — pre-existing baseline errors only; no new type errors introduced

Why this split

Reviewers can audit each dialect's SQL in isolation. Lots of small dialect mistakes possible here — snapshot tests are load-bearing. Landing the SQL foundation alone makes them easy to catch before PR 2 wires the orchestration around them.

Out of scope (lands in subsequent PRs)

  • Full ProfileDistributionTask (HLL probe + APPROX_PERCENTILE + APPROX_TOP_K) — PR 2
  • Frontend cells, hook, schema view integration, in-session memoization — PR 3
  • Lineage-click pre-warm + GA flip — PR 4

🤖 Generated with Claude Code

Foundation layer for the paired-histogram feature. No user-facing behavior
yet — wires the RunType, CLI flag, capability detection, and per-dialect
SQL renderers that PR 2 will orchestrate into the approx_all pipeline.

Backend
- RunType.PROFILE_DISTRIBUTION enum + registration in dbt_supported_registry
- --inline-profile CLI flag (default off), plumbed to the server flag dict
- recce/adapter/capabilities.py: per-adapter AdapterCapabilities lookup.
  Snowflake / BigQuery / DuckDB / Databricks / Spark / Athena / Trino /
  Presto / ClickHouse -> full; Redshift -> percentile-only; everything
  else (Postgres / MySQL / SQLite / SQL Server) -> disabled tier.
- recce/adapter/approx_aggregates.py: per-dialect SQL fragment renderers
  for approx_count_distinct / approx_top_k / approx_percentile. Raises
  UnsupportedAggregateError on adapters without native support.
- ProfileDistributionTask stub returning an empty payload so PR 1 flag
  plumbing is end-to-end verifiable before PR 2 lands.

Frontend
- RecceServerFlags.inline_profile boolean
- RunType union + Run discriminated union + RUN_TYPES list + type guard
  for "profile_distribution"; minimal ProfileDistributionParams /
  ProfileDistributionResult placeholders (PR 2 will populate the shape)
- Stub entry in run RunRegistry so existing components type-check

Tests
- 50 per-dialect SQL snapshot tests covering every supported adapter
  for each renderer; +disabled-tier raises; +input-validation
- 19 capability-detection tests asserting the tier matrix is intact

All 104 tests pass (69 new + 35 existing histogram/profile/top_k/CLI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 97.61905% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
recce/tasks/profile_distribution.py 75.00% 3 Missing ⚠️
recce/cli.py 66.66% 1 Missing ⚠️
Files with missing lines Coverage Δ
recce/adapter/approx_aggregates.py 100.00% <100.00%> (ø)
recce/adapter/capabilities.py 100.00% <100.00%> (ø)
recce/adapter/dbt_adapter/__init__.py 79.44% <ø> (ø)
recce/models/types.py 98.25% <100.00%> (+0.01%) ⬆️
recce/tasks/__init__.py 100.00% <100.00%> (ø)
tests/adapter/test_approx_aggregates.py 100.00% <100.00%> (ø)
tests/adapter/test_capabilities.py 100.00% <100.00%> (ø)
recce/cli.py 68.73% <66.66%> (-0.01%) ⬇️
recce/tasks/profile_distribution.py 75.00% <75.00%> (ø)

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…UI) subset

Address review feedback on PR 1: profile_distribution isn't a user-facing
checklist run — it's a backend data fetch whose result feeds into schema
view cells. Giving it a title/icon/registry entry conflated wire-level run
types with user-facing run-with-result-view types.

Splits the two concepts:
- ``RunType`` (api/types/run.ts) — every wire-level run type the API knows,
  including profile_distribution. Used by the task dispatcher.
- ``RegisteredRunType`` (components/run/registry.ts) — the subset that
  surfaces in the UI as a checklist item with title / icon / result view.
- ``RunTypeWithRef`` (api/types/run.ts) — the further subset whose result
  view supports forwardRef (screenshots).

Concrete changes:
- Remove the placeholder profile_distribution entry from RunRegistry +
  registry const.
- Add ``RegisteredRunType``, ``REGISTERED_RUN_TYPES``, and ``isRegisteredRunType``
  guard.
- Narrow ``findByRunType`` / ``createBoundFindByRunType`` to
  ``RegisteredRunType`` (was ``RunType``).
- Promote ``runTypeHasRef`` to a real type guard returning ``runType is
  RunTypeWithRef`` (was bare ``boolean``). Two callers (RunResultPaneOss,
  useRun) get automatic narrowing for free.
- Update RunTypeRegistry (in types.ts) to be keyed on registered types.
- Add explicit isRegisteredRunType guards at the two remaining callsites
  (CheckDetailOss, RecceActionAdapter) that pass arbitrary RunType values.
- Update registry.test.ts to iterate REGISTERED_RUN_TYPES (was RUN_TYPES).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
danyelf and others added 2 commits May 21, 2026 13:05
…unblock type:check

PR 1 narrowed ``findByRunType``'s parameter from ``RunType`` to ``RegisteredRunType``
(a subset that excludes ``profile_distribution``) but left the test arrays in
``js/src/components/run/__tests__/registry.test.ts`` typed as ``RunType[]``. tsc
flags every call site since ``profile_distribution`` isn't a valid ``keyof
RunRegistry``. Re-typing the test-side arrays to ``RegisteredRunType[]`` lines
them up with what ``findByRunType`` actually accepts, which is what they were
already intending — every value in those arrays is already a registered type.

Unblocks the pre-push ``pnpm type:check`` hook for the DRC-3390 PR 3 stack.

Refs DRC-3390.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
…mock

PR 1 (DRC-3390 polyglot foundation, commit e6b4cf5) added a runtime
``isRegisteredRunType`` guard inside ``RecceActionAdapter`` to filter
out backend-only run types like ``profile_distribution`` before passing
them to ``findByRunType``. The existing RecceActionAdapter test mock
replaces the whole ``@datarecce/ui/components/run`` module, which
silently broke 9 tests when ``isRegisteredRunType`` started returning
undefined → falsy → the adapter threw "Run type X does not have a
result view" before ever calling ``submitRun``. Adding the guard to
the mock (defaulted to true, which preserves prior behavior for the
``query_diff`` / ``row_count`` test fixtures) restores all 9.

Verified by running ``pnpm vitest run RecceActionAdapter`` — 26/26
pass on this commit, vs 17/26 on the parent branch.

Unblocks the pre-push hook for the DRC-3390 PR 3 stack.

Refs DRC-3390.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
danyelf added a commit that referenced this pull request May 21, 2026
… (DRC-3390 PR 3)

PR 3 of the paired-histograms GA productionization. Adds the frontend
cells, hook, schema-row integration, and unsupported-adapter banner
that consume the per-column payloads PR 2 will produce. Stacks on PR 1
(#1389) which already shipped RunType.PROFILE_DISTRIBUTION + the
discriminated-union plumbing.

Includes:

- PairedHistogramContinuous — quantile-bin, constant-area paired bar
  chart. Heights = density, widths = bin span, so bar area reads as
  the proportion of rows in that bin. Replaces the prototype's
  uniform-bin renderer (closed PR #1380).
- PairedHistogramDiscrete — top-K paired chart with gap-on-absent
  semantics. When a value is missing on one side, that bar isn't drawn,
  leaving a visible empty half-slot that reads as 'absent here'.
- useInlineProfileDistribution — TanStack-Query-backed hook that fires
  a single PROFILE_DISTRIBUTION run per model, parses the discriminated-
  union payload, and returns {distributions, loading, error,
  isUnsupported}. Emits Amplitude analytics events at request and
  result (PostHog wiring noted in DRC-3390 but the codebase uses
  Amplitude — same trail).
- InlineProfileDistributionCell — Compact-mode wrapper that picks
  between loading / error / empty (per-column failure) / continuous /
  discrete renders in a fixed-size slot, so adjacent rows don't reflow.
  Grid mode is intentionally absent at GA (per DRC-3390 lock).
- ProfileDistributionUnsupportedBanner — once-per-task MUI Alert for
  the {status: 'unsupported', reason} envelope (Postgres / MySQL /
  SQLite / SQL Server pre-2022).

Payload contract (frozen — PR 2 produces exactly this):
- Continuous: {kind:'histogram', bin_edges:[12], base_density:[11],
  current_density:[11], base_total, current_total}
- Categorical: {kind:'topk', values, base_counts, current_counts,
  base_total, current_total, trimmed}
- Per-column failure: {kind:null}
- Unsupported envelope: {status:'unsupported', reason:...}

Refines the ProfileDistributionResult type in api/types/run.ts from
the PR 1 stub (Record<string, unknown>) to the full discriminated
union — PR 1 left this as a placeholder for PR 2/3.

Tests + storybook:
- 57 unit tests across 5 files covering both cells, the integration
  wrapper, the banner, and the hook (mocked at the submit boundary so
  PR 2 doesn't need to be merged).
- 20 storybook stories covering both cells (cell + card density,
  light + dark, gap-on-absent, trimmed, empty payload, degenerate
  range) and all four integration-cell states; verified visually on
  port 6011.
- pnpm type:check clean; remaining errors are pre-existing baseline
  (CSS module declarations, lineage hooks) and unrelated to this PR.

Stacks on #1389. Mocks PR 2's payload in tests; rebase onto main once
PR 2 lands. PR 4 (lineage pre-warm + GA flip) consumes the same hook +
cacheKeys.profileDistribution(model) cache key.

Refs DRC-3390.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
danyelf added a commit that referenced this pull request May 21, 2026
Replaces the PR 1 stub of `ProfileDistributionTask` with the full
six-query `approx_all` orchestration (HLL probe + APPROX_PERCENTILE +
APPROX_TOP_K per env), composed on top of PR 1's per-dialect SQL
renderers and capability table.

What this PR ships:

- Strategy router (`pick_strategy`) → `approx_all` / `percentile_only`
  / `unsupported`, driven by `recce.adapter.capabilities`.
- Batched HLL + min/max probe per env in one SELECT.
- Batched `APPROX_PERCENTILE` and `APPROX_TOP_K` per env.
- `cap_degenerate` skip: columns with HLL >= 0.95 * row_count emit an
  empty top-K slot (effectively unique).
- Per-column failure isolation: a bad column's batch failure falls back
  to per-column retry so the rest of the model still renders.
- Per-adapter top-K result-shape post-processor (`parse_topk_result`):
  Snowflake `[[value, count], …]` pairs vs DuckDB / ClickHouse flat
  lists vs BigQuery / Databricks STRUCTs vs Trino MAP — each
  normalized to `(values, counts | None)`.
- In-memory memoization keyed by `(model, manifest_hash, version_token)`
  on the active `RecceContext`.
- PostHog telemetry via `log_performance("profile_distribution", …)`
  with `{strategy, total_wall_ms, phase_wall_ms, column_count,
  error_count, cache_hit}`.

Bakes in two prototype-bug fixes:

- DRC-3504: `render_epoch_cast` applies a per-adapter in-SQL epoch
  conversion (DuckDB `epoch()`, Snowflake `date_part(epoch_second, …)`,
  BigQuery `UNIX_SECONDS(…)`, etc.) before the percentile fragment, so
  TIMESTAMP histograms render instead of silently emitting empty
  charts.
- DRC-3507: `_execute_with_rollback` issues an explicit ROLLBACK on any
  per-statement failure, so DuckDB's transaction-cascade behaviour no
  longer blanks subsequent columns.

Payload schemas (frozen with PR 3, per DRC-3390 spec):

- Continuous: `{kind: "histogram", bin_edges, base_density,
  current_density, base_total, current_total}` — 12 edges, 11 densities,
  quantile-binned constant-area.
- Categorical: `{kind: "topk", values, base_counts, current_counts,
  trimmed}` — counts is `None` when the adapter's sketch doesn't expose
  them (DuckDB, ClickHouse).
- Unsupported tier: single envelope `{status: "unsupported", reason,
  columns: {}}`.
- Per-column failure: `{kind: null}`.

Tests cover continuous numeric, low-cardinality categorical,
high-cardinality (trimmed) categorical, UUID-cap-degenerate, timestamp
(DRC-3504 regression), deliberately-bad column (DRC-3507 regression),
strategy router, top-K post-processor across all adapter shapes, and
memoization hit/miss.

Manual warehouse smoke 2026-05-21:

- Snowflake `orders` (999 rows, 9 cols): 8.1s cold, 0ms cached.
  ORDER_DATE (TIMESTAMP) renders a histogram; STATUS top-K returns
  pairs.
- DuckDB `orders` (999 rows, 9 cols): 93ms cold, 0ms cached. STATUS
  top-K returns flat list, counts=None.

Stacks on #1389. Rebase onto main after #1389 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
@danyelf
Copy link
Copy Markdown
Contributor Author

danyelf commented May 26, 2026

Closing as part of restructure (Ultraplan-approved). The original 4-PR plan optimized for per-dialect SQL audit in isolation; the new staging optimizes for reviewer-runnability — each PR independently demoable.

This PR's content splits across:

  • Stage B (DuckDB-only): capabilities.py (single-row table), approx_aggregates.py (DuckDB renderers only), TS contract types (RunType.PROFILE_DISTRIBUTION, RegisteredRunType, isRegisteredRunType), --inline-profile CLI flag + RecceServerFlags.inline_profile, and the two cherry-picked test fixes (registry.test.ts narrow, RecceActionAdapter mock stub).
  • Stage D (polyglot extension): per-dialect SQL renderers for Snowflake / BigQuery / Databricks / Spark / Athena / Trino / Presto / ClickHouse / Redshift; full capability tier table.

Branch feature/drc-3390-pr1-polyglot-foundation retained as a code-stash source for the new stages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant