[SPARK-55647][SQL] Fix ConstantPropagation incorrectly replacing attributes with non-binary-stable collations #54435

Closed
ilicmarkodb wants to merge 1 commit into apache:master from ilicmarkodb:fix_collation_

Conversation

@ilicmarkodb
Contributor

What changes were proposed in this pull request?

  • The ConstantPropagation optimizer rule takes literals from equality predicates
    (e.g. c = 'hello') and substitutes them for the attribute in other conditions of the same
    conjunction. This is unsafe for non-binary-stable collations (e.g. UTF8_LCASE), where
    equality does not imply identity: c = 'hello' (case-insensitive) does not mean c holds
    exactly the bytes 'hello'; it could also be 'HELLO', 'Hello', etc.
  • Substituting c → 'hello' in a second condition like c = 'HELLO' COLLATE UNICODE turns it
    into the constant expression 'hello' = 'HELLO' COLLATE UNICODE, which is always false,
    producing incorrect results.
  • Fixed by guarding safeToReplace with isBinaryStable(ar.dataType) so propagation is skipped
    for attributes whose type is not binary-stable.
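
To make the failure mode concrete, here is a minimal sketch in plain Python rather than Spark code. Everything in it is an assumption made for illustration: `lcase_eq` and `unicode_eq` are hypothetical stand-ins that approximate UTF8_LCASE equality as a lowercase comparison and UNICODE equality as exact string comparison, and `guarded` only mimics the effect of the `isBinaryStable` check, not its actual implementation.

```python
# Toy model only: these helpers are hypothetical stand-ins, not Spark APIs.

def lcase_eq(a: str, b: str) -> bool:
    """Approximates UTF8_LCASE equality: case-insensitive comparison."""
    return a.lower() == b.lower()

def unicode_eq(a: str, b: str) -> bool:
    """Approximates UNICODE equality as exact, case-sensitive comparison."""
    return a == b

def original_predicate(c: str) -> bool:
    # c = 'hello' (UTF8_LCASE) AND c = 'HELLO' COLLATE UNICODE
    return lcase_eq(c, "hello") and unicode_eq(c, "HELLO")

def naively_propagated(c: str) -> bool:
    # Unsafe rewrite: c is replaced by the literal 'hello' in the second
    # conjunct, which becomes the constant 'hello' = 'HELLO' (always False).
    return lcase_eq(c, "hello") and unicode_eq("hello", "HELLO")

def guarded(c: str, binary_stable: bool) -> bool:
    # Substitute the literal only when the attribute's type is binary-stable;
    # otherwise keep the attribute reference untouched.
    second_operand = "hello" if binary_stable else c
    return lcase_eq(c, "hello") and unicode_eq(second_operand, "HELLO")

rows = ["HELLO", "hello", "Hello"]
print([r for r in rows if original_predicate(r)])                  # ['HELLO']
print([r for r in rows if naively_propagated(r)])                  # []
print([r for r in rows if guarded(r, binary_stable=False)])        # ['HELLO']
```

In this model the row 'HELLO' satisfies the original conjunction, is dropped by the naive substitution, and is kept again once substitution is skipped for the non-binary-stable attribute.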

Why are the changes needed?

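Bug fix. Without the guard, ConstantPropagation can rewrite a condition on an attribute with a non-binary-stable collation into an always-false constant expression, producing incorrect results.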

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

@ilicmarkodb ilicmarkodb changed the title [SPARK-55647][SQL] Fix ConstantPropagation incorrectly replacing attributes with non-binary-stable collations [SPARK-55647][SQL] Fix ConstantPropagation incorrectly replacing attributes with non-binary-stable collations Feb 24, 2026
@cloud-fan
Contributor

thanks, merging to master/4.1!

@cloud-fan cloud-fan closed this in ec35791 Feb 24, 2026
cloud-fan pushed a commit that referenced this pull request Feb 24, 2026
…tributes with non-binary-stable collations

Closes #54435 from ilicmarkodb/fix_collation_.

Authored-by: ilicmarkodb <marko.ilic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ec35791)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>