feat(profile-distribution): full approx_all pipeline (DRC-3390 PR 2)#1390
Closed
danyelf wants to merge 1 commit into
Closed
feat(profile-distribution): full approx_all pipeline (DRC-3390 PR 2)#1390danyelf wants to merge 1 commit into
danyelf wants to merge 1 commit into
Conversation
Replaces the PR 1 stub of `ProfileDistributionTask` with the full
six-query `approx_all` orchestration (HLL probe + APPROX_PERCENTILE +
APPROX_TOP_K per env), composed on top of PR 1's per-dialect SQL
renderers and capability table.
What this PR ships:
- Strategy router (`pick_strategy`) → `approx_all` / `percentile_only`
/ `unsupported`, driven by `recce.adapter.capabilities`.
- Batched HLL + min/max probe per env in one SELECT.
- Batched `APPROX_PERCENTILE` and `APPROX_TOP_K` per env.
- `cap_degenerate` skip: columns with HLL >= 0.95 * row_count emit an
empty top-K slot (effectively unique).
- Per-column failure isolation: a bad column's batch failure falls back
to per-column retry so the rest of the model still renders.
- Per-adapter top-K result-shape post-processor (`parse_topk_result`):
Snowflake `[[value, count], …]` pairs vs DuckDB / ClickHouse flat
lists vs BigQuery / Databricks STRUCTs vs Trino MAP — each
normalized to `(values, counts | None)`.
- In-memory memoization keyed by `(model, manifest_hash, version_token)`
on the active `RecceContext`.
- PostHog telemetry via `log_performance("profile_distribution", …)`
with `{strategy, total_wall_ms, phase_wall_ms, column_count,
error_count, cache_hit}`.
Bakes in two prototype-bug fixes:
- DRC-3504: `render_epoch_cast` applies a per-adapter in-SQL epoch
conversion (DuckDB `epoch()`, Snowflake `date_part(epoch_second, …)`,
BigQuery `UNIX_SECONDS(…)`, etc.) before the percentile fragment, so
TIMESTAMP histograms render instead of silently emitting empty
charts.
- DRC-3507: `_execute_with_rollback` issues an explicit ROLLBACK on any
per-statement failure, so DuckDB's transaction-cascade behaviour no
longer blanks subsequent columns.
Payload schemas (frozen with PR 3, per DRC-3390 spec):
- Continuous: `{kind: "histogram", bin_edges, base_density,
current_density, base_total, current_total}` — 12 edges, 11 densities,
quantile-binned constant-area.
- Categorical: `{kind: "topk", values, base_counts, current_counts,
trimmed}` — counts is `None` when the adapter's sketch doesn't expose
them (DuckDB, ClickHouse).
- Unsupported tier: single envelope `{status: "unsupported", reason,
columns: {}}`.
- Per-column failure: `{kind: null}`.
Tests cover continuous numeric, low-cardinality categorical,
high-cardinality (trimmed) categorical, UUID-cap-degenerate, timestamp
(DRC-3504 regression), deliberately-bad column (DRC-3507 regression),
strategy router, top-K post-processor across all adapter shapes, and
memoization hit/miss.
Manual warehouse smoke 2026-05-21:
- Snowflake `orders` (999 rows, 9 cols): 8.1s cold, 0ms cached.
ORDER_DATE (TIMESTAMP) renders a histogram; STATUS top-K returns
pairs.
- DuckDB `orders` (999 rows, 9 cols): 93ms cold, 0ms cached. STATUS
top-K returns flat list, counts=None.
Stacks on #1389. Rebase onto main after #1389 merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
51d490c to
ce9e5ab
Compare
Contributor
Author
|
Closing as part of restructure (Ultraplan-approved). New 4-stage vertical-slice plan replaces the original 4-PR breakdown. This PR's content splits across:
Branch |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR 2 of the DRC-3390 paired-histograms GA productionization. Replaces the PR 1 stub of
ProfileDistributionTaskwith the full six-queryapprox_allorchestration: batched HLL probe +APPROX_PERCENTILEper env (continuous) +APPROX_TOP_Kper env (categorical), composed on top of PR 1's per-dialect SQL renderers and capability table.Stacks on #1389. Rebase onto main after #1389 merges.
Linear: DRC-3390
What this ships
pick_strategy(capabilities) → 'approx_all' | 'percentile_only' | 'unsupported', driven byrecce.adapter.capabilities.APPROX_PERCENTILEper env (one SELECT per env).APPROX_TOP_Kper env (one SELECT per env).cap_degenerateskip: columns with HLL ≥ 0.95 × row_count get an empty top-K slot (effectively unique).parse_topk_result): Snowflake[[value, count], …]pairs vs DuckDB / ClickHouse flat lists vs BigQuery / Databricks STRUCTs vs TrinoMAP— each normalized to(values, counts | None).(model_unique_id, manifest_hash, version_token)on the activeRecceContext. No file-backed persistence at GA per spec.log_performance("profile_distribution", {strategy, total_wall_ms, phase_wall_ms, column_count, error_count, cache_hit}).Prototype-bug fixes baked in
render_epoch_castapplies a per-adapter in-SQL conversion (DuckDBepoch(), Snowflakedate_part(epoch_second, …), BigQueryUNIX_SECONDS(…), etc.) before the percentile fragment. TIMESTAMP histograms render instead of silently emitting empty charts._execute_with_rollbackissues an explicitROLLBACKon any per-statement failure so DuckDB's transaction-cascade behaviour no longer blanks subsequent columns.Payload schemas (frozen contract with PR 3)
{kind: "histogram", bin_edges (12), base_density (11), current_density (11), base_total, current_total}— quantile-binned, constant-area.{kind: "topk", values, base_counts, current_counts, trimmed}—countsisNonewhen the adapter's sketch doesn't expose them (DuckDB, ClickHouse).{status: "unsupported", reason, columns: {}}.{kind: null}.Test plan
pick_strategy(full / percentile-only / unsupported).classify_column_typeacross numeric / datetime / categorical / skip.render_epoch_castper adapter (DRC-3504 fix snapshots).parse_topk_resultagainst every adapter shape (Snowflake pairs, DuckDB flat, ClickHouse flat, BigQuery struct, Databricks struct, Trino map).black ./recce ./tests+isort ./recce ./testsclean;flake8clean.orders(8.1s cold, 0ms cached, 9 columns: 8 histogram + 1 topk).orders(93ms cold, 0ms cached; DuckDB top-K returnscounts=Noneas expected).