Disallow duplicated qualified field names #12608

eejbyfeldt · 2024-09-24T18:51:27Z

Which issue does this PR close?

Rationale for this change

In #6091 we relaxed the schema checks to allow duplicated qualified fields in order to support duplicated group-by expressions. This leads to confusion and bugs, since duplicated unqualified names are still not allowed. This PR makes the changes needed to keep supporting duplicated group-by expressions while disallowing duplicated qualified names.

What changes are included in this PR?

Revert allowing duplicated qualified names from #6091

Deduplicate group by exprs when creating output schema for LogicalPlan::Aggregate.

Remove test case with_column_self_join, as this test now fails already at the join. I don't think it is possible to modify it to fulfill its old purpose.

The old logic in plan_table_with_joins would create schemas with duplicated fields, as it would end up with multiple copies of the first table's fields when there are multiple joins. I am not 100% sure this is the right fix, but it passes all existing test cases.
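To illustrate the group-by deduplication listed above, here is a minimal sketch using only the standard library. It is not the code from this PR (the diff further down uses an `IndexSet`); the function name `dedup_preserving_order` and the sample column names are hypothetical. It keeps the first occurrence of each name so that the output field order stays stable.

```rust
use std::collections::HashSet;

/// Keep the first occurrence of each name, preserving input order.
/// Hypothetical helper for illustration; not part of DataFusion.
fn dedup_preserving_order(names: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for name in names {
        // `insert` returns false when the value was already present.
        if seen.insert(*name) {
            out.push(name.to_string());
        }
    }
    out
}

fn main() {
    // Assumed schema names for something like `GROUP BY t.a, t.b, t.a`.
    let group_by = ["t.a", "t.b", "t.a"];
    let fields = dedup_preserving_order(&group_by);
    assert_eq!(fields, vec!["t.a", "t.b"]);
    println!("{fields:?}");
}
```

With the duplicate removed, each group-by expression contributes at most one field, so the resulting schema has no repeated qualified name.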

Are these changes tested?

Existing test cases.

Are there any user-facing changes?

Since this adds new restrictions, it is possible that it will break downstream users depending on the old behavior.

jonahgao · 2024-09-25T02:29:49Z

I think this is the right way; DuckDB has a RemoveDuplicateGroups that does the same thing.

There might be a minor issue with volatile expressions, such as group by random(), random(), but I think we can disregard it for now.

eejbyfeldt · 2024-09-25T09:11:48Z

datafusion/sql/src/planner.rs

-        self.outer_from_schema = match self.outer_from_schema.as_ref() {
-            Some(from_schema) => Some(Arc::new(from_schema.join(schema)?)),
-            None => Some(Arc::clone(schema)),
+        match self.outer_from_schema.as_mut() {
+            Some(from_schema) => Arc::make_mut(from_schema).merge(schema),
+            None => self.outer_from_schema = Some(Arc::clone(schema)),


@aalexandrov Can you help verify that this is correct?

I am not fully following your comment here: #11456 (comment). At least in the current code, merge will still keep both values for j1_id since they have different qualifiers.
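The difference between the two calls in the diff can be sketched with a toy model. This is an assumption-laden illustration, not DataFusion's `DFSchema`: `ToySchema` is hypothetical, it only models fields that have a qualifier, and its `merge` and `join` only mimic the behavior discussed here (merge skips fields that are already present, join rejects them).

```rust
/// Toy stand-in for a schema of qualified fields: (qualifier, name).
/// Hypothetical type for illustration; not DataFusion's DFSchema.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    fields: Vec<(String, String)>,
}

impl ToySchema {
    fn new(fields: &[(&str, &str)]) -> Self {
        ToySchema {
            fields: fields
                .iter()
                .map(|(q, n)| (q.to_string(), n.to_string()))
                .collect(),
        }
    }

    /// Lenient: append fields from `other`, skipping exact
    /// (qualifier, name) pairs that are already present.
    fn merge(&mut self, other: &ToySchema) {
        for field in &other.fields {
            if !self.fields.contains(field) {
                self.fields.push(field.clone());
            }
        }
    }

    /// Strict: fail if `other` repeats an existing (qualifier, name) pair.
    fn join(&self, other: &ToySchema) -> Result<ToySchema, String> {
        let mut fields = self.fields.clone();
        for field in &other.fields {
            if fields.contains(field) {
                return Err(format!(
                    "duplicate qualified field name {}.{}",
                    field.0, field.1
                ));
            }
            fields.push(field.clone());
        }
        Ok(ToySchema { fields })
    }
}

fn main() {
    let j1 = ToySchema::new(&[("j1", "j1_id")]);
    let j2 = ToySchema::new(&[("j2", "j1_id")]);

    // Different qualifiers: both j1_id fields survive the merge.
    let mut outer = j1.clone();
    outer.merge(&j2);
    assert_eq!(outer.fields.len(), 2);

    // Merging the first table again adds nothing.
    outer.merge(&j1);
    assert_eq!(outer.fields.len(), 2);

    // A strict join of the same table is rejected instead.
    assert!(outer.join(&j1).is_err());
}
```

In this model, repeatedly folding the first table into the outer schema is harmless with `merge` but becomes an error with `join` once duplicated qualified names are disallowed.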

alamb · 2024-09-25T10:34:58Z

datafusion/core/src/dataframe/mod.rs

    }

-    // Table 't1' self join
-    // Supplementary test of issue: https://github.com/apache/datafusion/issues/7790


Why is this test removed?

I should have explained that more in the description. Before, it tested that calling with_column after a self join without an alias fails in a certain way. After this change it fails already at the self join. So there is no way to make it test what it tested before, and I believe the self join already has test coverage elsewhere.

comphead · 2024-09-25T16:47:39Z

datafusion/expr/src/utils.rs

    } else {
-        Ok(group_expr.iter().collect())
+        Ok(group_expr
+            .iter()


Wondering if we can get rid of the double collection iteration?

The double iteration is used to deduplicate the values.

comphead · 2024-09-25T18:54:20Z

We probably need to check how it works when the schema and catalog are provided.

alamb

Thank you @eejbyfeldt -- I reviewed this code and the tests and it looks good to me.

@comphead what do you mean by this comment:

We probably need to check how it works when the schema and catalog are provided.

Do you mean we should add a test with different schemas?

alamb · 2024-09-27T12:30:18Z

datafusion/sql/src/planner.rs

-            Some(from_schema) => Some(Arc::new(from_schema.join(schema)?)),
-            None => Some(Arc::clone(schema)),
+        match self.outer_from_schema.as_mut() {
+            Some(from_schema) => Arc::make_mut(from_schema).merge(schema),


TIL Arc::make_mut 📓
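For readers who have not used it, `Arc::make_mut` gives clone-on-write access: it mutates in place when the `Arc` is uniquely owned and clones the inner value first when it is shared. A minimal standard-library sketch follows; the helper `push_field` and the field names are hypothetical and only stand in for a schema.

```rust
use std::sync::Arc;

/// Append a name to a shared list, cloning the list only if another
/// `Arc` handle still points at it. Hypothetical helper for illustration.
fn push_field(schema: &mut Arc<Vec<String>>, name: &str) {
    Arc::make_mut(schema).push(name.to_string());
}

fn main() {
    // Uniquely owned: make_mut mutates in place, no clone happens.
    let mut schema = Arc::new(vec!["t1.a".to_string()]);
    push_field(&mut schema, "t2.b");
    assert_eq!(schema.len(), 2);

    // Shared: make_mut clones the inner Vec first, so `other` is unchanged
    // and the two handles no longer point at the same allocation.
    let other = Arc::clone(&schema);
    push_field(&mut schema, "t3.c");
    assert_eq!(schema.len(), 3);
    assert_eq!(other.len(), 2);
    assert!(!Arc::ptr_eq(&schema, &other));
}
```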

jonahgao

LGTM, I left a small suggestion, but I don't think it's necessary.

jonahgao · 2024-09-29T07:15:32Z

datafusion/expr/src/logical_plan/plan.rs

            .iter()
            .map(|item| item.schema_name().to_string())
+            .collect::<IndexSet<_>>()
+            .into_iter()


Can we directly remove duplicates from group_expr so that we don't need to perform duplicate removal here again?

I played around with trying to make this work and I found it was not easy -- perhaps we can do it as a follow on PR?

comphead · 2024-09-30T19:41:29Z

Thank you @eejbyfeldt -- I reviewed this code and the tests and it looks good to me.

@comphead what do you mean by this comment:

We probably need to check how it works when the schema and catalog are provided.

Do you mean we should add a test with different schemas?

I was thinking about playing with fully qualified columns, like schema/catalog etc

alamb · 2024-10-02T22:00:32Z

Thank you @eejbyfeldt -- I reviewed this code and the tests and it looks good to me.
@comphead what do you mean by this comment:

We probably need to check how it works when the schema and catalog are provided.

Do you mean we should add a test with different schemas?

I was thinking about playing with fully qualified columns, like schema/catalog etc

Let's do it as a follow on so this PR doesn't hang out unmerged for too long.

alamb · 2024-10-02T22:00:46Z

Thanks again @eejbyfeldt @jonahgao and @comphead

github-actions bot added the sql, logical-expr, optimizer, core, sqllogictest, and common labels Sep 24, 2024

eejbyfeldt force-pushed the i-11464 branch from 34a7842 to ee024ec Compare September 24, 2024 18:52

Disallow duplicated qualified field names

5515063

eejbyfeldt force-pushed the i-11464 branch from ee024ec to 5515063 Compare September 24, 2024 18:55

eejbyfeldt force-pushed the i-11464 branch from 103d38f to e0e1d06 Compare September 25, 2024 08:52

Fix tests

2bb322f

eejbyfeldt force-pushed the i-11464 branch from e0e1d06 to 2bb322f Compare September 25, 2024 09:07

eejbyfeldt commented Sep 25, 2024

eejbyfeldt marked this pull request as ready for review September 25, 2024 09:21

alamb reviewed Sep 25, 2024

comphead reviewed Sep 25, 2024

alamb reviewed Sep 27, 2024

jonahgao approved these changes Sep 29, 2024

alamb merged commit dcc018e into apache:main Oct 2, 2024

eejbyfeldt deleted the i-11464 branch October 5, 2024 07:50

jonahgao mentioned this pull request Oct 8, 2024

Remove unnecessary DFSchema::check_ambiguous_name #12805

Merged

jonahgao mentioned this pull request Jan 14, 2025

Can no longer easily join duplicate schemas as of version 43 #14112

Closed
