fix(valuediff): normalise column case before adapter.quote() for Snowflake (DRC-3464)#1388
Conversation
…flake
On Snowflake with default quoting config, physical columns are stored uppercase
(e.g. CUSTOMER_ID). user-supplied check params follow the dbt manifest
convention (lowercase, e.g. customer_id). Passing them directly to
adapter.quote() produces '"customer_id"' — a case-sensitive identifier that
Snowflake cannot resolve, raising:
SQL compilation error: invalid identifier '"customer_id"'
Fix: add ValueDiffMixin._build_column_case_lookup() that calls get_columns()
and builds a {lower(physical_name): physical_name} lookup. Before any
adapter.quote() call in ValueDiffTask._query_value_diff and
ValueDiffDetailTask._query_value_diff, each user-supplied identifier is mapped
through the lookup so adapter.quote() receives the physical name ("CUSTOMER_ID")
and produces a valid quoted identifier.
Pass-through for identifiers not found in the catalog preserves existing
behaviour for unknown columns rather than masking errors. Mirrors the pattern
profile_diff already uses via get_columns().
Regression tests: test_value_diff_snowflake_uppercase_columns and
test_value_diff_detail_snowflake_uppercase_columns patch get_columns() to return
uppercase names (simulating Snowflake) and assert the generated SQL contains
'"CUSTOMER_ID"' and NOT '"customer_id"'.
Closes DRC-3464.
Signed-off-by: Kent <kentchen@reccehq.com>
Signed-off-by: Kent <iamcxa@gmail.com>
…y_key (DRC-3464 cycle 1) When primary_key is a list, execute() now builds the column-case lookup and normalises keys BEFORE calling _verify_primary_key, so the composite path's adapter.quote() calls receive physical catalog names (e.g. CUSTOMER_ID on Snowflake) instead of user-supplied lowercase identifiers. Also adds regression tests test_value_diff_snowflake_composite_primary_key and test_value_diff_detail_snowflake_composite_primary_key that patch dbt_adapter.adapter.execute (the low-level adapter used by _verify_primary_key) to capture and assert the generated SQL uses uppercase quoted identifiers. PRE-FIX: both tests FAIL (sql contains lowercase quoted identifiers) POST-FIX: both tests PASS (sql contains uppercase quoted identifiers) Signed-off-by: Kent <iamcxa@gmail.com>
There was a problem hiding this comment.
Pull request overview
Fixes Snowflake value_diff failures caused by quoting lowercase identifiers (from dbt manifest conventions) against physically uppercase columns, by normalizing user-supplied column identifiers to catalog/physical case before adapter.quote() is applied.
Changes:
- Added a catalog-driven
{lower(physical): physical}lookup helper and identifier normalization inValueDiffTaskandValueDiffDetailTask. - Normalized
primary_keybefore_verify_primary_key()to fix the composite primary-key quoting path. - Added Snowflake-focused regression tests that patch
get_columns()to return uppercase columns and assert generated SQL uses uppercase quoted identifiers.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
recce/tasks/valuediff.py |
Adds catalog case lookup + normalizes PK/columns before quoting to avoid Snowflake invalid identifier errors. |
tests/tasks/test_valuediff.py |
Adds unit tests for the lookup/normalization helpers and Snowflake regression tests capturing executed SQL. |
Comments suppressed due to low confidence (1)
recce/tasks/valuediff.py:431
- As in
ValueDiffTask._query_value_diff,case_lookupis built unconditionally (twoget_columns()macro executions) even whencolumnsis None and the function already callsget_columns()again to compute common columns. Since value_diff can run frequently andget_columns()may hit INFORMATION_SCHEMA, consider avoiding redundant catalog calls by building the lookup only when normalising user-supplied identifiers, or reusing the already-fetched column lists to construct the lookup.
# Normalise user-supplied identifiers to physical catalog names before quoting.
# On Snowflake (default quoting config), physical columns are stored uppercase;
# adapter.quote("customer_id") → '"customer_id"' which Snowflake cannot resolve.
case_lookup = self._build_column_case_lookup(dbt_adapter, model)
if columns is None or len(columns) == 0:
base_columns = [column.column for column in dbt_adapter.get_columns(model, base=True)]
curr_columns = [column.column for column in dbt_adapter.get_columns(model, base=False)]
columns = [column for column in base_columns if column in curr_columns]
else:
columns = [self._normalise_identifier(c, case_lookup) for c in columns]
Codecov Report❌ Patch coverage is
... and 4 files with indirect coverage changes 🚀 New features to boost your workflow:
|
…umns() (DRC-3464 PR #1388 review) Signed-off-by: Kent <iamcxa@gmail.com>
wcchang1115
left a comment
There was a problem hiding this comment.
Code Review: PR #1388
SHA 543dbf5d · Verdict GO
Notes
-
recce/tasks/valuediff.py:142-160,417-434—_query_value_diffre-runs
primary_key normalisation thatexecute()already performed (lines 336-340,
541-545). Idempotent becauselookup[upper().lower()] == upper, so not a
bug, but the duplication makes the data flow harder to follow.
Evidence: every callable path goes throughexecute()first (grep confirms
no other caller of_query_value_diff); thecase_lookup=parameter is
the only reason_query_value_diffcould safely skip re-normalisation,
and it already accepts that parameter. Pass F. -
recce/tasks/valuediff.py:48—except Exception: passin
_build_column_case_lookupsilently swallows every failure (connection
errors, missing relations, permission denials). On a real outage the
lookup is empty, primary_key/columns pass through unchanged, and the
user gets the original "invalid identifier" Snowflake error this fix
was meant to prevent — with no log line explaining why the fix was
bypassed. Consider logging at debug/warning level so the bypass is
diagnosable. Pass D. -
recce/tasks/valuediff.py:142-149,417-424— Whencolumns is None,
the function callsget_columns()again (twice) even though
_build_column_case_lookupjust called it twice. That's 4
get_columns()round-trips per execute() in the default path.
case_lookupalready contains the union of base+curr physical names;
thecolumns is Nonebranch could derive the column list from
case_lookup(orcase_lookupcould cache the per-relation lists).
Pass G.
What I verified
- Local data contract preserved. On case-preserving warehouses
(Postgres, DuckDB, BigQuery),_build_column_case_lookupreturns
{lower(name): name}wherenamealready equals the user input,
so_normalise_identifieris a no-op. Test suite (116 tasks tests)
passes unchanged. - Frontend contract preserved.
js/.../toValueDiffGrid.ts:8-11
("primary_keys match actual column casing") relies on
self.params.primary_keyaligning with result column keys. The fix
feeds normalised (physical-case)primary_keyinto the
post-executionnormalize_keys_to_columns(primary_key, columns)call
atvaluediff.py:312,518—columnsis also normalised, so the
output keeps matching the result-set casing. No frontend regression. QueryDiffTask(inheritsValueDiffMixin) untouched. It only
consumes_verify_primary_keyindirectly via its own path; it does
not call the new helpers. Other check types (profile_diff,
histogram_diff, row_count_diff, query_diff, schema_diff) are not
modified.- Regression tests are evidence-grade. Both
'"CUSTOMER_ID"' in sql
and'"customer_id"' not in sqlassertions — catches partial fixes.
Composite-PK tests catch the cycle-1 regression where
_verify_primary_key's composite Jinja path was missed on the first
attempt; without those, a future refactor could silently re-break it. - No new SQL-injection surface. Identifiers are looked up in a
warehouse-derived dict before being passed toadapter.quote();
unknown identifiers fall through to the existing quoting path — same
trust boundary as before.
flake8: clean. pytest tests/tasks/ 116 passed.
Summary
value_diffwas failing on Snowflake withSQL compilation error: invalid identifier '"customer_id"'. The check params carry lowercase column names (from the dbt manifest convention), but Snowflake folds unquoted DDL identifiers to uppercase (CUSTOMER_ID). Callingadapter.quote("customer_id")produced the case-sensitive'"customer_id"'identifier, which Snowflake cannot resolve against its physical uppercase column.Root cause
Two failure sites in
recce/tasks/valuediff.py, both invoked fromexecute():_verify_primary_keycomposite path (whenprimary_keyis a list) — builds inline Jinja that callsadapter.quote(col)over the raw user-supplied list._query_value_diff— callsadapter.quote(field)/adapter.quote(column_to_compare)on the surrogate-key and column-comparison clauses.The scalar primary_key path of
_verify_primary_keyusesadapter.dispatch('test_unique', 'dbt')which renders{{ column_name }}verbatim (unquoted) and so passes Snowflake case-insensitively. Same pattern inValueDiffDetailTask. Other check tasks (profile_diff,histogram_diff,top_k_diff,row_count_diff,query_diff,schema_diff) are unaffected —profile_diffaccidentally avoids the bug by sourcing column names fromget_columns()(uppercase on Snowflake) rather than from check params.Fix
Add a
ValueDiffMixin._build_column_case_lookup()helper that callsdbt_adapter.get_columns()(base + current relations) and returns a{lower(physical_name): physical_name}lookup._normalise_identifier()maps a user-supplied lowercase identifier to its physical catalog case, falling back to the original string when the column is unknown (preserves current behaviour for unknown columns instead of masking errors).In
ValueDiffTask.execute()andValueDiffDetailTask.execute(), the lookup is built FIRST and theprimary_key(andcolumnslist when present) is normalised BEFORE_verify_primary_key()is called — so both_verify_primary_key's composite path AND_query_value_diff's adapter.quote() calls receive the physical catalog case. Mirrors theget_columns()pattern already used byprofile_diff(recce/tasks/profile.py:222).Tests
recce/tests/tasks/test_valuediff.py:TestBuildColumnCaseLookup— 4 unit tests for the new helper (case-insensitive lookup, base+current relation merge, unknown-identifier passthrough, exception fallback).test_value_diff_snowflake_uppercase_columns+test_value_diff_detail_snowflake_uppercase_columns— scalar primary_key on Snowflake: mockget_columns()returns uppercase columns; capture executed SQL; assert"CUSTOMER_ID"present,"customer_id"absent.test_value_diff_snowflake_composite_primary_key+test_value_diff_detail_snowflake_composite_primary_key— composite primary_key on Snowflake; assert the SQL executed by_verify_primary_key's composite path contains uppercase quoted identifiers.Pre-fix rigor (rebuild verified independently):
valuediff.pyto before either fix commit → all 4 Snowflake regression tests FAIL with verbatimAssertionError: '"CUSTOMER_ID"' in 'with a_query as (select "customer_id", ...)'.Full tasks suite (
pytest tests/tasks/ -q): 116 passed, 1 pre-existing portalocker warning (unrelated).Audit trail
Investigated through DataRecce's internal
debug-flywheelworkflow as case 019. Stages: reproduce → discover → hypothesize → verify → fix-and-harden (cycle 1 after codex review caught the composite path gap on the first fix attempt).Linear: DRC-3464