-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix Schema Duplication Errors in Self‑Referential INTERSECT/EXCEPT by Requalifying Input Sides #18814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add automatic requalification logic to LogicalPlanBuilder::intersect_or_except to handle conflicting column qualifiers. Wrap input plans and use temporary aliases when conflicts are detected. Update the Substrait SetRel consumer to apply this logic for intersection and except operations. Add integration tests to verify the functionality with self-referential queries.
Improve the logic in builder.rs to detect conflicts across all three error cases. Return early with requalification as soon as a conflict is found, while preserving the original plan structure when no conflicts exist.
Revise intersect test snapshot to reflect correct behavior with requalification. The query now properly triggers requalification for the inner INTERSECT when both sides reference the same test source.
| LeftSemi Join: test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8 | ||
| Aggregate: groupBy=[[test.col_int32, test.col_utf8]], aggr=[[]] | ||
| LeftSemi Join: test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8 | ||
| Aggregate: groupBy=[[test.col_int32, test.col_utf8]], aggr=[[]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old snapshot passed in main but was not properly distinguishing left from right
test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8
| // is optimized away, resulting in just the LeftAnti join | ||
| assert_expected_plan( | ||
| "SELECT a FROM data WHERE a > 0 EXCEPT SELECT a FROM data WHERE a < 5", | ||
| "LeftAnti Join: left.a = right.a\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a difference between the plans for INTERSECT (self_referential_intersect) and EXCEPT (self_referential_except) ?
I don't see any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected plans look almost identical in the test assertions, which is confusing. The key difference is actually in the join type, not the overall structure:
self_referential_intersectproduces:**LeftSemi** Join: left.a = right.aself_referential_exceptproduces:**LeftAnti** Join: left.a = right.a
The rest of the plan structure is identical because:
- Both operate on the same table (
data) with similar filters - Both include the DISTINCT operation (via
Aggregate: groupBy=[[data.a]]) because neither usesALL - Both get requalified to
leftandrightaliases due to the duplicate field name issue
| // 2. Duplicate unqualified fields: both sides have same unqualified name | ||
| // 3. Ambiguous reference: one side qualified, other unqualified, same name | ||
| for l in &left_cols { | ||
| for r in &right_cols { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the complexity is O(n*m).
You could optimize it to O(n+m) by iterating over left_cols (O(n)) and storing them in a HashMap<ColumnName, Column>, then while iterating over right_cols (O(m)) lookup by name in the hashmap (O(1)) and do the checks when there is an entry for that name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent observation on the algorithmic complexity. You're correct that the current nested loop is O(n*m), and this can be optimized to O(n+m) using a HashMap.
Analysis:
Here are some reasons for the current implementation:
-
Schema size is typically small: For example, the TPC-H benchmark schemas range from 3-16 columns (median ~8). Even the largest table (lineitem with 16 columns) would only result in 256 comparisons worst-case for this function., which is negligible for modern CPUs.
-
Early return on conflict: The function returns immediately upon finding the first conflict, so in the common case where conflicts exist (which is when this function matters), we often exit very early in the iteration.
-
Simple conflict detection logic: The current implementation is straightforward and easy to reason about. The match statement clearly shows all conflict scenarios.
-
Called infrequently: This function is only called during logical plan construction, not during execution. It's not in a hot path that runs millions of times.
Trade-offs of HashMap approach:
Pros:
- O(n+m) vs O(n*m) complexity
- Scales better for schemas with hundreds of columns
Cons:
- More memory allocation overhead for the HashMap
- More complex code that's slightly harder to understand
- HashMap construction and hashing overhead may not pay off for small schemas
- Need to handle the case where multiple columns have the same name in one schema (which can happen with different qualifiers)
If you feel strongly about this or if we anticipate very wide schemas (hundreds of columns), I'm happy to implement the HashMap-based optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! Such an answer!
Next time just tell me "We can improve it if it ever shows up in the profiler" 😄
Thank you!
1f7d563 to
a1e2ca3
Compare
a1e2ca3 to
fbde854
Compare
Which issue does this PR close?
Rationale for this change
Self-referential INTERSECT and EXCEPT queries (where both sides originate from the same table) failed during Substrait round‑trip consumption with the error:
This happened because the join-based implementation of set operations attempted to merge two identical schemas without requalification, resulting in duplicate or ambiguous field names. By ensuring both sides are requalified when needed, DataFusion can correctly construct valid logical plans for these operations.
Before
After
What changes are included in this PR?
Added a requalification step (
requalify_sides_if_needed) insideintersect_or_exceptto avoid duplicate or ambiguous field names.Improved conflict detection logic in
requalify_sides_if_neededto handle:Updated optimizer tests to reflect correct aliasing (
left,right).Added new Substrait round‑trip tests for:
Minor formatting and consistency improvements in Substrait consumer code.
Are these changes tested?
Yes. The PR includes comprehensive tests that:
Are there any user-facing changes?
No user-facing behavior changes.
This is a correctness improvement ensuring that valid SQL queries—previously failing only in Substrait round‑trip mode—now work without error.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and validated.