Summary
DataFusion's SQL planner currently validates duplicate relation names/aliases in three separate, inconsistent places:
| Location |
Path covered |
Diagnostic quality |
datafusion/sql/src/cte.rs – plan_with_clause |
Duplicate CTE names in a WITH block |
Good – span-aware Diagnostic with "previously defined here" note |
datafusion/sql/src/select.rs – plan_from_tables (multi-entry branch only) |
Duplicate aliases in comma-separated FROM lists (FROM a, b, a) |
Good – span-aware Diagnostic with "first defined here" note |
datafusion/common/src/dfschema.rs – DFSchema::check_names / DFSchema::join |
All other duplicates (explicit JOIN, sub-selects, etc.) |
Poor – generic SchemaError::DuplicateQualifiedField / DuplicateUnqualifiedField with no source span |
There is no single, authoritative place where the planner records "which relation identity has been bound, at which source span". This causes three concrete problems described below.
Problem 1 – Explicit JOIN duplicate aliases are not caught by the new diagnostic
plan_from_tables only applies the new check when from.len() > 1 (the comma-join path). A single TableWithJoins goes straight to plan_table_with_joins without any alias-deduplication step:
// datafusion/sql/src/select.rs lines 691-694
match from.len() {
0 => Ok(LogicalPlanBuilder::empty(true).build()?),
1 => {
let input = from.remove(0);
self.plan_table_with_joins(input, planner_context) // ← no duplicate check
}
_ => { /* new duplicate check lives here only */ }
}
plan_table_with_joins in datafusion/sql/src/relation/join.rs calls create_relation for each join operand and then parse_relation_join, neither of which validates alias uniqueness. This means:
-- Explicit JOIN – skips the new check entirely
SELECT * FROM (SELECT 1 AS a) AS t JOIN (SELECT 2 AS b) AS t ON true
-- When column names overlap, falls through to the generic DFSchema error
-- When column names do not overlap, may plan successfully with two `t` aliases
Both outcomes differ from the comma-join behaviour, which now returns a proper positioned diagnostic.
Problem 2 – Unaliased fully-qualified table names collapse to the wrong key
The extract_table_name closure uses only the last identifier of a multi-part table name:
// datafusion/sql/src/select.rs lines 714-722
TableFactor::Table { name, alias: None, .. } => {
let span = Span::try_from_sqlparser_span(t.relation.span());
let table_name = name
.0
.iter()
.filter_map(|p| p.as_ident())
.map(|id| self.ident_normalizer.normalize(id.clone()))
.next_back()?; // ← only the final component
Some((table_name, span))
}
So catalog1.public.orders and catalog2.public.orders both produce the key "orders" and are rejected as duplicates, even though DataFusion's scan qualifier tracks the full TableReference (catalog1.public.orders vs catalog2.public.orders). Conversely, the same physical table accessed via two different schema paths could be double-scanned without triggering the guard.
The correct key for an unaliased table is object_name_to_table_reference(name.clone())?.to_string(), which produces "catalog1.public.orders" vs "catalog2.public.orders".
Problem 3 – DFSchema duplicate errors carry no source location
When a duplicate alias escapes planner-level checks (e.g. via an explicit JOIN), the error propagates from DFSchema::check_names / DFSchema::join in datafusion/common/src/dfschema.rs (lines 240–246). These paths emit SchemaError::DuplicateQualifiedField { qualifier, name } and SchemaError::DuplicateUnqualifiedField { name } with no attached Diagnostic and no source span. The user sees a schema-layer error message with no indication of which SQL token caused the conflict.
Proposed solution
Introduce a planner-level relation binding table (RelationScope or similar) that records normalized relation identity, source alias/name, and the associated Span for every relation introduced into scope. This table would be populated and consulted during relation and join planning (not only in plan_from_tables), making duplicate detection consistent across all paths.
High-level sketch:
/// Maintained inside PlannerContext during planning of a single FROM clause.
struct RelationScope {
/// Maps normalized relation key -> (display name, source span)
bindings: HashMap<String, (String, Option<Span>)>,
}
impl RelationScope {
/// Register a relation; returns an error with Diagnostic if the key is already bound.
fn bind(
&mut self,
key: String,
display: String,
span: Option<Span>,
) -> Result<()> { ... }
}
RelationScope::bind would be called from:
plan_from_tables (current comma-join path), replacing the local alias_spans map
plan_table_with_joins, after resolving each join operand's alias
- Potentially
plan_with_clause (CTE names), to have one unified binding registry
Files that would need to change:
| File |
Change |
datafusion/sql/src/select.rs |
Replace local alias_spans HashMap with RelationScope; add binding call for the single-item FROM case |
datafusion/sql/src/relation/join.rs |
Call RelationScope::bind for each join operand alias in plan_table_with_joins |
datafusion/sql/src/planner.rs or a new relation/scope.rs |
Define RelationScope / RelationBinding |
datafusion/sql/src/cte.rs |
Optionally migrate CTE duplicate check to use the same registry |
Acceptance criteria
Affected files (for reference)
This is a follow-up refactor surfaced during review of PR #20720
Summary
DataFusion's SQL planner currently validates duplicate relation names/aliases in three separate, inconsistent places:
datafusion/sql/src/cte.rs–plan_with_clauseWITHblockDiagnosticwith "previously defined here" notedatafusion/sql/src/select.rs–plan_from_tables(multi-entry branch only)FROMlists (FROM a, b, a)Diagnosticwith "first defined here" notedatafusion/common/src/dfschema.rs–DFSchema::check_names/DFSchema::joinJOIN, sub-selects, etc.)SchemaError::DuplicateQualifiedField/DuplicateUnqualifiedFieldwith no source spanThere is no single, authoritative place where the planner records "which relation identity has been bound, at which source span". This causes three concrete problems described below.
Problem 1 – Explicit
JOINduplicate aliases are not caught by the new diagnosticplan_from_tablesonly applies the new check whenfrom.len() > 1(the comma-join path). A singleTableWithJoinsgoes straight toplan_table_with_joinswithout any alias-deduplication step:plan_table_with_joinsindatafusion/sql/src/relation/join.rscallscreate_relationfor each join operand and thenparse_relation_join, neither of which validates alias uniqueness. This means:Both outcomes differ from the comma-join behaviour, which now returns a proper positioned diagnostic.
Problem 2 – Unaliased fully-qualified table names collapse to the wrong key
The
extract_table_nameclosure uses only the last identifier of a multi-part table name:So
catalog1.public.ordersandcatalog2.public.ordersboth produce the key"orders"and are rejected as duplicates, even though DataFusion's scan qualifier tracks the fullTableReference(catalog1.public.ordersvscatalog2.public.orders). Conversely, the same physical table accessed via two different schema paths could be double-scanned without triggering the guard.The correct key for an unaliased table is
object_name_to_table_reference(name.clone())?.to_string(), which produces"catalog1.public.orders"vs"catalog2.public.orders".Problem 3 – DFSchema duplicate errors carry no source location
When a duplicate alias escapes planner-level checks (e.g. via an explicit JOIN), the error propagates from
DFSchema::check_names/DFSchema::joinindatafusion/common/src/dfschema.rs(lines 240–246). These paths emitSchemaError::DuplicateQualifiedField { qualifier, name }andSchemaError::DuplicateUnqualifiedField { name }with no attachedDiagnosticand no source span. The user sees a schema-layer error message with no indication of which SQL token caused the conflict.Proposed solution
Introduce a planner-level relation binding table (
RelationScopeor similar) that records normalized relation identity, source alias/name, and the associatedSpanfor every relation introduced into scope. This table would be populated and consulted during relation and join planning (not only inplan_from_tables), making duplicate detection consistent across all paths.High-level sketch:
RelationScope::bindwould be called from:plan_from_tables(current comma-join path), replacing the localalias_spansmapplan_table_with_joins, after resolving each join operand's aliasplan_with_clause(CTE names), to have one unified binding registryFiles that would need to change:
datafusion/sql/src/select.rsalias_spansHashMap withRelationScope; add binding call for the single-item FROM casedatafusion/sql/src/relation/join.rsRelationScope::bindfor each join operand alias inplan_table_with_joinsdatafusion/sql/src/planner.rsor a newrelation/scope.rsRelationScope/RelationBindingdatafusion/sql/src/cte.rsAcceptance criteria
SELECT * FROM t JOIN t ON truereturns a span-aware "duplicate table alias" diagnostic (blocked by the missing JOIN path)SELECT * FROM t JOIN t ON t.x = t.x(non-overlapping-column case) is also rejected with a clear diagnosticSELECT * FROM catalog1.public.orders, catalog2.public.orderssucceeds (different full identities) instead of producing a false-positive duplicate errorSELECT * FROM public.orders, public.ordersis still rejected (same full identity)datafusion/sql/tests/cases/diagnostic.rsand/ordatafusion/sqllogictest/test_files/covering explicit JOIN duplicates, cross-catalog same-table-name, and unaliased same-table-nameAffected files (for reference)
plan_from_tables,extract_table_nameclosureplan_table_with_joins,parse_relation_joinplan_with_clause(reference implementation of span-aware duplicate check)DFSchema::check_names,DFSchema::join(downstream fallback path)SchemaError::DuplicateQualifiedField,SchemaError::DuplicateUnqualifiedFieldobject_name_to_table_reference(canonical normalization helper)This is a follow-up refactor surfaced during review of PR #20720