Fix Schema Duplication Errors in Self‑Referential INTERSECT/EXCEPT by Requalifying Input Sides #18814

kosiew · 2025-11-19T04:23:07Z

Which issue does this PR close?

Closes [substrait] [sqllogictest] Schema contains duplicate qualified field name #16295.

Rationale for this change

Self-referential INTERSECT and EXCEPT queries (where both sides originate from the same table) failed during Substrait round‑trip consumption with the error:

"Schema contains duplicate qualified field name"

This happened because the join-based implementation of set operations attempted to merge two identical schemas without requalification, resulting in duplicate or ambiguous field names. By ensuring both sides are requalified when needed, DataFusion can correctly construct valid logical plans for these operations.
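To make the failure mode concrete, here is a minimal sketch that does not use DataFusion's own types. Field, merge, and requalify are hypothetical stand-ins invented for illustration, and the real schema-merging code may differ in detail. It assumes only what the description above states: combining two identical schemas yields a duplicate qualified name, and giving each side its own alias removes the clash.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a qualified schema field (not a DataFusion type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Field {
    qualifier: Option<String>,
    name: String,
}

/// Convenience constructor for a qualified field.
fn field(qualifier: &str, name: &str) -> Field {
    Field {
        qualifier: Some(qualifier.to_string()),
        name: name.to_string(),
    }
}

/// Combine the two input schemas, rejecting any repeated (qualifier, name) pair.
fn merge(left: &[Field], right: &[Field]) -> Result<Vec<Field>, String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for f in left.iter().chain(right.iter()) {
        if !seen.insert(f.clone()) {
            let q = f.qualifier.as_deref().unwrap_or("");
            return Err(format!("duplicate qualified field name {}.{}", q, f.name));
        }
        out.push(f.clone());
    }
    Ok(out)
}

/// Replace every qualifier on one side with a fresh alias such as "left" or "right".
fn requalify(schema: &[Field], alias: &str) -> Vec<Field> {
    schema.iter().map(|f| field(alias, &f.name)).collect()
}
```

With a single-column schema such as alltypes_plain.int_col on both sides, merge fails on the raw inputs and succeeds once each side is passed through requalify with a distinct alias.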

Before

❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
Completed 1 test files in 0 seconds
External error: 1 errors in file /Users/kosiew/GitHub/datafusion/datafusion/sqllogictest/test_files/intersection.slt

1. query failed: DataFusion error: Schema error: Schema contains duplicate qualified field name alltypes_plain.int_col
...

After

❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.64s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
Completed 1 test files in 0 seconds

What changes are included in this PR?

- Added a requalification step (requalify_sides_if_needed) inside intersect_or_except to avoid duplicate or ambiguous field names.
- Improved conflict detection logic in requalify_sides_if_needed to handle:
  1. Duplicate qualified fields
  2. Duplicate unqualified fields
  3. Ambiguous references (qualified vs. unqualified collisions)
- Updated optimizer tests to reflect correct aliasing (left, right).
- Added new Substrait round‑trip tests for:
  - INTERSECT and EXCEPT (both DISTINCT and ALL variants)
  - Self-referential queries that previously failed
- Minor formatting and consistency improvements in Substrait consumer code.
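The three conflict cases listed above can be sketched as a pairwise check. This is a simplified illustration, not the code in builder.rs: Col, col, and has_conflict are hypothetical names, and the real function works on DataFusion's column type and also performs the aliasing.

```rust
/// Hypothetical simplified column: optional relation qualifier plus a name.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Col {
    relation: Option<String>,
    name: String,
}

/// Convenience constructor.
fn col(relation: Option<&str>, name: &str) -> Col {
    Col {
        relation: relation.map(|r| r.to_string()),
        name: name.to_string(),
    }
}

/// Returns true as soon as any left/right pair falls into one of the three cases.
fn has_conflict(left: &[Col], right: &[Col]) -> bool {
    for l in left {
        for r in right {
            if l.name != r.name {
                continue;
            }
            match (&l.relation, &r.relation) {
                // 1. Duplicate qualified fields: same qualifier and same name.
                (Some(lq), Some(rq)) if lq == rq => return true,
                // 2. Duplicate unqualified fields: same name, no qualifier on either side.
                (None, None) => return true,
                // 3. Ambiguous reference: one side qualified, the other not.
                (Some(_), None) | (None, Some(_)) => return true,
                // Same name under different qualifiers is not a conflict.
                _ => {}
            }
        }
    }
    false
}
```

For example, data.a against data.a is a conflict (case 1), while t1.a against t2.a is not, so in that situation the original plan structure would be preserved.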

Are these changes tested?

Yes. The PR includes comprehensive tests that:

- Reproduce the original failure modes.
- Validate that requalification produces stable and correct logical plans.
- Confirm correct behavior across INTERSECT, EXCEPT, ALL, and DISTINCT cases.

Are there any user-facing changes?

No user-facing behavior changes.
This is a correctness improvement ensuring that valid SQL queries—previously failing only in Substrait round‑trip mode—now work without error.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and validated.

Add automatic requalification logic to LogicalPlanBuilder::intersect_or_except to handle conflicting column qualifiers. Wrap input plans and use temporary aliases when conflicts are detected. Update the Substrait SetRel consumer to apply this logic for intersection and except operations. Add integration tests to verify the functionality with self-referential queries.

Improve the logic in builder.rs to detect conflicts across all three error cases. Return early with requalification as soon as a conflict is found, while preserving the original plan structure when no conflicts exist.

Revise intersect test snapshot to reflect correct behavior with requalification. The query now properly triggers requalification for the inner INTERSECT when both sides reference the same test source.

kosiew · 2025-11-19T04:25:55Z

datafusion/optimizer/tests/optimizer_integration.rs

-LeftSemi Join: test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8
-  Aggregate: groupBy=[[test.col_int32, test.col_utf8]], aggr=[[]]
-    LeftSemi Join: test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8
-      Aggregate: groupBy=[[test.col_int32, test.col_utf8]], aggr=[[]]


The old snapshot passed on main, but it did not distinguish the left side from the right. Both sides of every join condition carried the same qualifier:

test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8

martin-g · 2025-11-19T09:23:13Z

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

+    // is optimized away, resulting in just the LeftAnti join
+    assert_expected_plan(
+        "SELECT a FROM data WHERE a > 0 EXCEPT SELECT a FROM data WHERE a < 5",
+        "LeftAnti Join: left.a = right.a\


Is there a difference between the plans for INTERSECT (self_referential_intersect) and EXCEPT (self_referential_except)?
I don't see any.

The expected plans look almost identical in the test assertions, which is confusing. The key difference is actually in the join type, not the overall structure:

self_referential_intersect produces: **LeftSemi** Join: left.a = right.a

self_referential_except produces: **LeftAnti** Join: left.a = right.a

The rest of the plan structure is identical because:

Both operate on the same table (data) with similar filters

Both include the DISTINCT operation (via Aggregate: groupBy=[[data.a]]) because neither uses ALL

Both get requalified to left and right aliases due to the duplicate field name issue
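To show why the join type alone separates the two results, here is a small sketch over plain integer rows. The function names are made up for illustration, NULL handling is ignored, and the placement of the distinct step on the left input is an assumption based on the plan shapes quoted in this thread.

```rust
use std::collections::HashSet;

/// DISTINCT step: drop duplicate rows, keeping first-seen order.
fn distinct(rows: &[i64]) -> Vec<i64> {
    let mut seen = HashSet::new();
    rows.iter().copied().filter(|v| seen.insert(*v)).collect()
}

/// Left semi join on equality: keep left rows that have a match on the right.
fn left_semi(left: &[i64], right: &[i64]) -> Vec<i64> {
    let keys: HashSet<i64> = right.iter().copied().collect();
    left.iter().copied().filter(|v| keys.contains(v)).collect()
}

/// Left anti join on equality: keep left rows that have no match on the right.
fn left_anti(left: &[i64], right: &[i64]) -> Vec<i64> {
    let keys: HashSet<i64> = right.iter().copied().collect();
    left.iter().copied().filter(|v| !keys.contains(v)).collect()
}

/// INTERSECT (DISTINCT): distinct left rows, then a semi join.
fn intersect_distinct(left: &[i64], right: &[i64]) -> Vec<i64> {
    left_semi(&distinct(left), right)
}

/// EXCEPT (DISTINCT): distinct left rows, then an anti join.
fn except_distinct(left: &[i64], right: &[i64]) -> Vec<i64> {
    left_anti(&distinct(left), right)
}
```

Taking data.a = [-2, 1, 3, 3, 7], the side with a > 0 is [1, 3, 3, 7] and the side with a < 5 is [-2, 1, 3, 3]. The semi join version yields [1, 3] and the anti join version yields [7], even though everything around the join is identical.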

martin-g · 2025-11-19T09:28:39Z

datafusion/expr/src/logical_plan/builder.rs

+    // 2. Duplicate unqualified fields: both sides have same unqualified name
+    // 3. Ambiguous reference: one side qualified, other unqualified, same name
+    for l in &left_cols {
+        for r in &right_cols {


Here the complexity is O(n*m).
You could optimize it to O(n+m) by iterating over left_cols (O(n)) and storing them in a HashMap<ColumnName, Column>, then while iterating over right_cols (O(m)) lookup by name in the hashmap (O(1)) and do the checks when there is an entry for that name.
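A rough sketch of that idea, using hypothetical simplified types rather than DataFusion's: index the left columns by name, then probe the index once per right column. Because one side can hold several columns with the same name under different qualifiers, the map stores a list of qualifiers per name, so the bound is O(n+m) only when names are mostly unique within a side.

```rust
use std::collections::HashMap;

/// Hypothetical simplified column: optional relation qualifier plus a name.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Col {
    relation: Option<String>,
    name: String,
}

/// Convenience constructor.
fn col(relation: Option<&str>, name: &str) -> Col {
    Col {
        relation: relation.map(|r| r.to_string()),
        name: name.to_string(),
    }
}

/// Same conflict rules as a pairwise scan, but with a name index over the left side.
fn has_conflict_indexed(left: &[Col], right: &[Col]) -> bool {
    // name -> every qualifier under which that name appears on the left
    let mut by_name: HashMap<&str, Vec<Option<&str>>> = HashMap::new();
    for l in left {
        by_name
            .entry(l.name.as_str())
            .or_default()
            .push(l.relation.as_deref());
    }
    for r in right {
        if let Some(left_quals) = by_name.get(r.name.as_str()) {
            let rq = r.relation.as_deref();
            for lq in left_quals {
                // Two qualified columns conflict only if the qualifiers match;
                // any pairing that involves an unqualified column conflicts.
                let conflict = match (*lq, rq) {
                    (Some(a), Some(b)) => a == b,
                    _ => true,
                };
                if conflict {
                    return true;
                }
            }
        }
    }
    false
}
```

The early return on the first conflict is kept, so behavior should match the nested-loop version for the cases discussed here; only the lookup strategy changes.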

Excellent observation on the algorithmic complexity. You're correct that the current nested loop is O(n*m), and this can be optimized to O(n+m) using a HashMap.

Analysis:

Here are some reasons for the current implementation:

Schema size is typically small: for example, the TPC-H benchmark schemas range from 3 to 16 columns (median ~8). Even the largest table (lineitem, with 16 columns) would result in only 256 comparisons (16 × 16) in the worst case for this function, which is negligible for modern CPUs.

Early return on conflict: The function returns immediately upon finding the first conflict, so in the common case where conflicts exist (which is when this function matters), we often exit very early in the iteration.

Simple conflict detection logic: The current implementation is straightforward and easy to reason about. The match statement clearly shows all conflict scenarios.

Called infrequently: This function is only called during logical plan construction, not during execution. It's not in a hot path that runs millions of times.

Trade-offs of HashMap approach:

Pros:

O(n+m) vs O(n*m) complexity

Scales better for schemas with hundreds of columns

Cons:

More memory allocation overhead for the HashMap

More complex code that's slightly harder to understand

HashMap construction and hashing overhead may not pay off for small schemas

Need to handle the case where multiple columns have the same name in one schema (which can happen with different qualifiers)

If you feel strongly about this or if we anticipate very wide schemas (hundreds of columns), I'm happy to implement the HashMap-based optimization.

Wow! Such an answer!
Next time just tell me "We can improve it if it ever shows up in the profiler" 😄
Thank you!

kosiew added 5 commits November 19, 2025 12:17

feat: add tests for self-referential INTERSECT ALL and EXCEPT ALL

c1e0094

Enhance conflict detection in requalify_sides_if_needed

9473c04

Update test snapshot in optimizer_integration.rs

c657f56

fix: correct error handling in intersect and except functions

d8cf07b

github-actions bot added the logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and substrait (Changes to the substrait crate) labels Nov 19, 2025

docs: add implementation notes for requalify_sides_if_needed function

9fbc211

kosiew force-pushed the duplicate-schema-16295 branch from 1f7d563 to a1e2ca3 Compare November 19, 2025 13:56

docs: enhance comments for intersect and except functions to clarify join types

fbde854

kosiew force-pushed the duplicate-schema-16295 branch from a1e2ca3 to fbde854 Compare November 19, 2025 13:57
