feat(profile-distribution): full approx_all pipeline (DRC-3390 PR 2) #1390

Closed
danyelf wants to merge 1 commit into feature/drc-3390-pr1-polyglot-foundation from feature/drc-3390-pr2-backend-task

danyelf commented May 21, 2026

Summary

PR 2 of the DRC-3390 paired-histograms GA productionization. Replaces the PR 1 stub of ProfileDistributionTask with the full six-query approx_all orchestration: batched HLL probe + APPROX_PERCENTILE per env (continuous) + APPROX_TOP_K per env (categorical), composed on top of PR 1's per-dialect SQL renderers and capability table.

Stacks on #1389. Rebase onto main after #1389 merges.

Linear: DRC-3390

What this ships

  • Strategy router pick_strategy(capabilities) → 'approx_all' | 'percentile_only' | 'unsupported', driven by recce.adapter.capabilities.
  • Phase orchestration:
    • Batched HLL + min/max probe per env (one SELECT per env).
    • Batched APPROX_PERCENTILE per env (one SELECT per env).
    • Batched APPROX_TOP_K per env (one SELECT per env).
  • cap_degenerate skip: columns with HLL ≥ 0.95 × row_count get an empty top-K slot (effectively unique).
  • Per-column failure isolation: if a batched SELECT fails, fall back to per-column retries so one bad column doesn't blank the rest.
  • Per-adapter top-K result-shape post-processor (parse_topk_result): Snowflake [[value, count], …] pairs vs DuckDB / ClickHouse flat lists vs BigQuery / Databricks STRUCTs vs Trino MAP — each normalized to (values, counts | None).
  • In-memory memoization keyed by (model_unique_id, manifest_hash, version_token) on the active RecceContext. No file-backed persistence at GA per spec.
  • PostHog telemetry via log_performance("profile_distribution", {strategy, total_wall_ms, phase_wall_ms, column_count, error_count, cache_hit}).
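
A minimal sketch of the strategy router and the cap_degenerate check described in the list above. The `Capabilities` dataclass and its three flags are hypothetical stand-ins for whatever `recce.adapter.capabilities` actually exposes, and the rule that `percentile_only` needs only a percentile function is an assumption. The 0.95 threshold and the three strategy names come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical flags; the real capability table may be shaped differently.
    approx_percentile: bool
    approx_top_k: bool
    hll: bool


def pick_strategy(caps: Capabilities) -> str:
    """Map adapter capabilities to one of the three strategy names."""
    if caps.approx_percentile and caps.approx_top_k and caps.hll:
        return "approx_all"
    if caps.approx_percentile:
        # Assumption: histograms alone are still worth shipping.
        return "percentile_only"
    return "unsupported"


def is_cap_degenerate(hll_estimate: int, row_count: int, threshold: float = 0.95) -> bool:
    """True when a column is effectively unique, so top-K is skipped."""
    # An empty table is never treated as degenerate (assumption).
    return row_count > 0 and hll_estimate >= threshold * row_count
```

With the 999-row `orders` table, an HLL estimate of 990 exceeds 0.95 × 999 = 949.05, so such a column would get the empty top-K slot.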

Prototype-bug fixes baked in

  • DRC-3504 (timestamp epoch cast): render_epoch_cast applies a per-adapter in-SQL conversion (DuckDB epoch(), Snowflake date_part(epoch_second, …), BigQuery UNIX_SECONDS(…), etc.) before the percentile fragment. TIMESTAMP histograms render instead of silently emitting empty charts.
  • DRC-3507 (DuckDB rollback): _execute_with_rollback issues an explicit ROLLBACK on any per-statement failure so DuckDB's transaction-cascade behaviour no longer blanks subsequent columns.
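
The DRC-3507 fix can be pictured roughly as follows. This is a sketch, not the PR's `_execute_with_rollback`: it assumes only a connection object with an `execute(sql)` method, and the `(result, error)` return shape is an illustrative choice.

```python
from typing import Any, Optional, Tuple


def execute_with_rollback(conn: Any, sql: str) -> Tuple[Optional[Any], Optional[Exception]]:
    """Run one statement; on failure, roll back so later statements can run.

    Returns (result, None) on success and (None, error) on failure.
    """
    try:
        return conn.execute(sql), None
    except Exception as exc:  # deliberately broad: any statement failure
        try:
            # Clear the aborted transaction before the next statement.
            conn.execute("ROLLBACK")
        except Exception:
            # Assumption: a failed ROLLBACK (for example, no open
            # transaction) is safe to ignore here.
            pass
        return None, exc
```

Returning the error instead of re-raising lets the caller record a per-column failure and move on to the next column.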

Payload schemas (frozen contract with PR 3)

  • Continuous: {kind: "histogram", bin_edges (12), base_density (11), current_density (11), base_total, current_total} — quantile-binned, constant-area.
  • Categorical: {kind: "topk", values, base_counts, current_counts, trimmed}; counts is None when the adapter's sketch doesn't expose them (DuckDB, ClickHouse).
  • Unsupported tier: single envelope {status: "unsupported", reason, columns: {}}.
  • Per-column failure: {kind: null}.
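
One reading of "quantile-binned, constant-area" for a single environment: the 12 edges are equally spaced quantiles, so each of the 11 bins holds the same share of rows, and density is that share divided by bin width. The sketch below assumes that reading. How base and current are reconciled onto one shared `bin_edges` array is not stated in this PR and is not modeled here.

```python
from typing import List, Sequence


def constant_area_density(bin_edges: Sequence[float]) -> List[float]:
    """Density per bin when every bin carries equal mass (1 / n_bins)."""
    if len(bin_edges) < 2:
        raise ValueError("need at least two edges")
    n_bins = len(bin_edges) - 1
    mass_per_bin = 1.0 / n_bins
    densities = []
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        width = hi - lo
        # Zero-width bins (repeated quantiles) get density 0.0 in this
        # sketch, which drops their mass; a real implementation would
        # need a deliberate policy for them.
        densities.append(mass_per_bin / width if width > 0 else 0.0)
    return densities
```

Narrow bins come out tall and wide bins come out short, which is what makes the bar areas equal.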

Test plan

  • Unit tests for pick_strategy (full / percentile-only / unsupported).
  • Unit tests for classify_column_type across numeric / datetime / categorical / skip.
  • Unit tests for render_epoch_cast per adapter (DRC-3504 fix snapshots).
  • Unit tests for parse_topk_result against every adapter shape (Snowflake pairs, DuckDB flat, ClickHouse flat, BigQuery struct, Databricks struct, Trino map).
  • DuckDB integration: continuous numeric column → histogram payload with correct schema.
  • DuckDB integration: low-cardinality categorical → topk payload.
  • DuckDB integration: high-cardinality categorical → topk trimmed.
  • DuckDB integration: UUID-cap-degenerate column → empty top-K slot.
  • DuckDB integration: timestamp column (DRC-3504 regression) — emits non-empty histogram.
  • DuckDB integration: deliberately-bad column (DRC-3507 regression) — other columns still succeed.
  • Memoization hit/miss tests.
  • Unsupported-tier short-circuit returns single envelope.
  • All 45 new tests green; 278 across tests/tasks + tests/adapter pass.
  • black ./recce ./tests + isort ./recce ./tests clean; flake8 clean.
  • Manual smoke against Snowflake orders (8.1s cold, 0ms cached, 9 columns: 8 histogram + 1 topk).
  • Manual smoke against DuckDB orders (93ms cold, 0ms cached; DuckDB top-K returns counts=None as expected).

Replaces the PR 1 stub of `ProfileDistributionTask` with the full
six-query `approx_all` orchestration (HLL probe + APPROX_PERCENTILE +
APPROX_TOP_K per env), composed on top of PR 1's per-dialect SQL
renderers and capability table.

What this PR ships:

- Strategy router (`pick_strategy`) → `approx_all` / `percentile_only`
  / `unsupported`, driven by `recce.adapter.capabilities`.
- Batched HLL + min/max probe per env in one SELECT.
- Batched `APPROX_PERCENTILE` and `APPROX_TOP_K` per env.
- `cap_degenerate` skip: columns with HLL >= 0.95 * row_count emit an
  empty top-K slot (effectively unique).
- Per-column failure isolation: a bad column's batch failure falls back
  to per-column retry so the rest of the model still renders.
- Per-adapter top-K result-shape post-processor (`parse_topk_result`):
  Snowflake `[[value, count], …]` pairs vs DuckDB / ClickHouse flat
  lists vs BigQuery / Databricks STRUCTs vs Trino MAP — each
  normalized to `(values, counts | None)`.
- In-memory memoization keyed by `(model, manifest_hash, version_token)`
  on the active `RecceContext`.
- PostHog telemetry via `log_performance("profile_distribution", …)`
  with `{strategy, total_wall_ms, phase_wall_ms, column_count,
  error_count, cache_hit}`.
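
A sketch of the top-K normalization, dispatching on result shape rather than adapter name. The shape labels and the struct field names (`value`, `count`) are assumptions made for illustration; the real `parse_topk_result` and the actual per-adapter field names may differ.

```python
from typing import Any, List, Optional, Tuple

TopK = Tuple[List[Any], Optional[List[int]]]


def parse_topk_result(raw: Any, shape: str,
                      value_key: str = "value", count_key: str = "count") -> TopK:
    """Normalize one raw top-K result to (values, counts | None)."""
    if shape == "pairs":  # [[value, count], ...]
        return [p[0] for p in raw], [int(p[1]) for p in raw]
    if shape == "flat":  # [value, ...] with no counts exposed
        return list(raw), None
    if shape == "struct":  # [{value_key: ..., count_key: ...}, ...]
        return [r[value_key] for r in raw], [int(r[count_key]) for r in raw]
    if shape == "map":  # {value: count}
        # Sort by descending count so the order matches the other shapes.
        items = sorted(raw.items(), key=lambda kv: kv[1], reverse=True)
        return [k for k, _ in items], [int(v) for _, v in items]
    raise ValueError(f"unknown top-K shape: {shape!r}")
```

Keeping `counts` as `None` for the flat shape, rather than inventing zeros, lets the consumer distinguish "no counts available" from "count is zero".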

Bakes in two prototype-bug fixes:

- DRC-3504: `render_epoch_cast` applies a per-adapter in-SQL epoch
  conversion (DuckDB `epoch()`, Snowflake `date_part(epoch_second, …)`,
  BigQuery `UNIX_SECONDS(…)`, etc.) before the percentile fragment, so
  TIMESTAMP histograms render instead of silently emitting empty
  charts.
- DRC-3507: `_execute_with_rollback` issues an explicit ROLLBACK on any
  per-statement failure, so DuckDB's transaction-cascade behaviour no
  longer blanks subsequent columns.
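
The DRC-3504 conversion could be organized as a per-adapter template table, as in the sketch below. Only the three expressions named in this commit message are included; the adapter keys and the function signature are assumptions, and the other dialects covered by the "etc." are left out rather than guessed.

```python
# SQL templates for converting a TIMESTAMP expression to epoch seconds.
_EPOCH_TEMPLATES = {
    "duckdb": "epoch({col})",
    "snowflake": "date_part(epoch_second, {col})",
    "bigquery": "UNIX_SECONDS({col})",
}


def render_epoch_cast(adapter: str, column_sql: str) -> str:
    """Wrap an already-quoted column expression in the adapter's epoch cast."""
    try:
        template = _EPOCH_TEMPLATES[adapter]
    except KeyError:
        raise NotImplementedError(f"no epoch cast known for adapter {adapter!r}")
    return template.format(col=column_sql)
```

The percentile fragment would then be rendered over the returned numeric expression instead of the raw TIMESTAMP column.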

Payload schemas (frozen with PR 3, per DRC-3390 spec):

- Continuous: `{kind: "histogram", bin_edges, base_density,
  current_density, base_total, current_total}` — 12 edges, 11 densities,
  quantile-binned constant-area.
- Categorical: `{kind: "topk", values, base_counts, current_counts,
  trimmed}` — counts is `None` when the adapter's sketch doesn't expose
  them (DuckDB, ClickHouse).
- Unsupported tier: single envelope `{status: "unsupported", reason,
  columns: {}}`.
- Per-column failure: `{kind: null}`.
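
Because the payload shapes are frozen with PR 3, a small shape check like the one below could guard the contract on either side. It is a sketch: the function name is hypothetical, and the rule that non-`None` counts must match `values` in length is an assumption. The field names and the 12/11 sizes are taken from the schema above.

```python
N_EDGES = 12
N_BINS = 11


def validate_column_payload(payload: dict) -> None:
    """Raise ValueError if a per-column payload breaks the frozen shape."""
    kind = payload.get("kind")
    if kind is None:
        return  # per-column failure marker, nothing more to check
    if kind == "histogram":
        if len(payload["bin_edges"]) != N_EDGES:
            raise ValueError("histogram needs 12 bin edges")
        for key in ("base_density", "current_density"):
            if len(payload[key]) != N_BINS:
                raise ValueError(f"{key} needs 11 densities")
        return
    if kind == "topk":
        n_values = len(payload["values"])
        for key in ("base_counts", "current_counts"):
            counts = payload[key]
            # counts may be None when the sketch exposes no counts.
            if counts is not None and len(counts) != n_values:
                raise ValueError(f"{key} length must match values")
        return
    raise ValueError(f"unknown kind: {kind!r}")
```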

Tests cover continuous numeric, low-cardinality categorical,
high-cardinality (trimmed) categorical, UUID-cap-degenerate, timestamp
(DRC-3504 regression), deliberately-bad column (DRC-3507 regression),
strategy router, top-K post-processor across all adapter shapes, and
memoization hit/miss.

Manual warehouse smoke 2026-05-21:

- Snowflake `orders` (999 rows, 9 cols): 8.1s cold, 0ms cached.
  ORDER_DATE (TIMESTAMP) renders a histogram; STATUS top-K returns
  pairs.
- DuckDB `orders` (999 rows, 9 cols): 93ms cold, 0ms cached. STATUS
  top-K returns flat list, counts=None.
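
The 0ms cached timings are what an in-memory lookup keyed on `(model, manifest_hash, version_token)` would produce. A minimal sketch of such a cache follows; the class name and the `get_or_compute` method are hypothetical, and how the real cache is attached to `RecceContext` is not shown in this PR.

```python
from typing import Any, Callable, Dict, Tuple

CacheKey = Tuple[str, str, str]  # (model, manifest_hash, version_token)


class ProfileDistributionCache:
    """In-memory only; entries vanish when the owning context is dropped."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, Any] = {}

    def get_or_compute(self, key: CacheKey, compute: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (payload, cache_hit), running compute() only on a miss."""
        if key in self._entries:
            return self._entries[key], True
        payload = compute()
        self._entries[key] = payload
        return payload, False
```

Since the manifest hash and version token are part of the key, a changed manifest or token produces a miss rather than a stale hit. The returned `cache_hit` flag lines up with the `cache_hit` telemetry field.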

Stacks on #1389. Rebase onto main after #1389 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
@danyelf danyelf force-pushed the feature/drc-3390-pr2-backend-task branch from 51d490c to ce9e5ab Compare May 21, 2026 20:14

danyelf commented May 26, 2026

Closing as part of restructure (Ultraplan-approved). New 4-stage vertical-slice plan replaces the original 4-PR breakdown.

This PR's content splits across:

  • Stage B (DuckDB-only backend): ProfileDistributionTask orchestration, DRC-3504 epoch-cast fix, DRC-3507 rollback fix, in-memory memoization, telemetry (with docstring fix — actual routing is Amplitude not PostHog), and DuckDB-targeted tests.
  • Stage D (polyglot extension): multi-adapter strategy router branches (approx_all / percentile_only), per-adapter parse_topk_result() shape switch, render_epoch_cast() dialect branches, and per-dialect snapshot tests.

Branch feature/drc-3390-pr2-backend-task retained as a code-stash source.
