Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject CREATE TABLE/VIEW with duplicate column names #13517

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

findepi
Copy link
Member

@findepi findepi commented Nov 21, 2024

Reject CREATE TABLE/VIEW with duplicate column names

DFSchema checks column for uniqueness, but allows duplicate column names
when they are qualified differently. This is because DFSchema plays
central role during query planning as a identifier resolution scope.

Those checks in their current form should not be there, since they
prevent execution of queries with duplicate column aliases, which is
legal in SQL. But even with these checks present, they are not
sufficient to ensure CREATE TABLE/VIEW is well structured. Table or
view columns need to have unique names and there is no qualification
involved.

This commit adds necessary checks in CREATE TABLE/VIEW DDL structs,
ensuring that CREATE TABLE/VIEW logical plans are valid in that regard.

This PR includes

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Nov 21, 2024
@findepi findepi force-pushed the findepi/protect-create-table branch from 57aba78 to 65e6c46 Compare November 21, 2024 14:44
/// Whether the table is an infinite streams
pub unbounded: bool,
/// Table(provider) specific options
pub options: HashMap<String, String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add comments what is a key and what is the value in the hashmaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I don't know what these are yet. This is copied from CreateExternalTable. This whole struct allows public full destruction of CreateExternalTable, without allowing public construction of that struct, so that validation can take place. Would it help if i removed all these doc comments, so that documentation is checked at CreateExternalTable only?

]))
)?)
.unwrap_err().strip_backtrace().to_string(),
"Schema error: Schema contains qualified fields with duplicate unqualified names t1.c1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I feel lower case col names are not easy to read in the error message. Will tweak the schema_err! output

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can wrap them in brackets [...] to clearly separate from surrounding text, while making it kind of obvious we're not bothering to produce proper quoted identifiers.

@findepi findepi force-pushed the findepi/protect-create-table branch from 65e6c46 to be87611 Compare November 22, 2024 08:37
DFSchema checks column for uniqueness, but allows duplicate column names
when they are qualified differently. This is because DFSchema plays
central role during query planning as a identifier resolution scope.

Those checks in their current form should not be there, since they
prevent execution of queries with duplicate column aliases, which is
legal in SQL. But even with these checks present, they are not
sufficient to ensure CREATE TABLE/VIEW is well structured. Table or
view columns need to have unique names and there is no qualification
involved.

This commit adds necessary checks in CREATE TABLE/VIEW DDL structs,
ensuring that CREATE TABLE/VIEW logical plans are valid in that regard.
@findepi findepi force-pushed the findepi/protect-create-table branch from be87611 to 7dafc6f Compare November 22, 2024 08:38
@findepi findepi requested a review from comphead November 22, 2024 08:39
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi and @comphead -- I think this is a very nice improvement

I am not sure about the *Fields pattern. Let me know what you think

constraints: Constraints::empty(),
column_defaults: Default::default(),
},
&CreateExternalTable::builder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

datafusion/expr/src/logical_plan/ddl.rs Outdated Show resolved Hide resolved
/// Creates an in memory table.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
#[non_exhaustive]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To disallow public direct construction. If we allow public direct construction, we can't enforce invariants in the factory method, since it may or may not get called.

@@ -192,11 +193,12 @@ impl DdlStatement {

/// Creates an external table.
#[derive(Debug, Clone, PartialEq, Eq)]
#[non_exhaustive]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make it non exhaustive? That might make it harder to write match statements where all the fields are matched (and the compiler tells you when new ones are added)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make it non exhaustive?

To disallow public direct construction. If we allow public direct construction, we can't enforce invariants in the factory method, since it may or may not get called.

That might make it harder to write match statements where all the fields are matched (and the compiler tells you when new ones are added)

💯 and i think this is important use-case. that's why those new into_fields() are added

@@ -288,8 +290,234 @@ impl PartialOrd for CreateExternalTable {
}
}

impl CreateExternalTable {
pub fn new(fields: CreateExternalTableFields) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the fields in CreateExternalTable are pub anyway, I don't think introducing CreateExternalTableFields requires all construction to go through new 🤔

Therefore I would personally suggest removing the CreateExternalTableFields (and equivalent for MemoryTable) and staying with the struct and the builder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the fields in CreateExternalTable are pub anyway, I don't think introducing CreateExternalTableFields requires all construction to go through new 🤔

i think it's disallowed by non_exhaustive.
in fact, i did revert this PR changes in datafusion/core/src/catalog_common/listing_schema.rs and cargo check failed with

error[E0639]: cannot create non-exhaustive struct using struct expression
   --> datafusion/core/src/catalog_common/listing_schema.rs:134:26
    |
134 |                           &CreateExternalTable {
    |  __________________________^
135 | |                             schema: Arc::new(DFSchema::empty()),
136 | |                             name,
137 | |                             location: table_url,
...   |
147 | |                             column_defaults: Default::default(),
148 | |                         },
    | |_________________________^

For more information about this error, try `rustc --explain E0639`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see -- that was non obvious.

Given that what do you think then about making all the fields not pub instead? I think that would be less confusing / more obvious.

Also I do think it is worth considering if we could make a smaller / less breaking change here -- I understand the desire to ensure only semantically correct CreateExternalTable structures can be created, but in my opinion the extra cognative overhead of the *Field structures makes it significantly harder to work with this code.

What if we

  1. Kept the CreateExternalTableBuilder (to generate errors during construction)
  2. Added another check somewhere when it was processed -- for example in that there were no duplicate columns in the schema

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that what do you think then about making all the fields not pub instead? I think that would be less confusing / more obvious.

This would be more breaking.
And we would need to add accessor for all these fields, for cases when someone wants to inspect some attributes without deconstructing

Added another check somewhere when it was processed -- for example in
datafusion/datafusion/core/src/datasource/listing_table_factory.rs

That could work, but where could this check be?
The link is currently in core byt the listing table is something we would want to move out.
And this should be the core's guarantee, regardless which table provider is used.

@findepi findepi requested a review from alamb November 28, 2024 21:19
@findepi findepi force-pushed the findepi/protect-create-table branch from 8065096 to 70740c6 Compare November 28, 2024 21:31
@findepi findepi force-pushed the findepi/protect-create-table branch from 70740c6 to e0ecbb8 Compare November 28, 2024 22:08
@@ -288,8 +290,234 @@ impl PartialOrd for CreateExternalTable {
}
}

impl CreateExternalTable {
pub fn try_new(fields: CreateExternalTableFields) -> Result<Self> {
let CreateExternalTableFields {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonahgao I wonder if you have some time to offer an opinion on this pattern (a way to allow pattern matching but still ensure a struct has to be valudated as part of construction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CreateExternalTableFields makes things too complex. It has exactly the same fields as CreateExternalTable, and users will feel confused and need to spend time distinguishing and using them.

In my opinion, it is enough to perform all the validity checks in CreateExternalTableBuilder::build, such as missing required fields and duplicate column names. CreateExternalTables typically come from the SQL planner of DataFusion, where we have already used the builder. For other scenarios, we can encourage users in the documentation to use the Builder API to create CreateExternalTables .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't block public construction, the downstream projects won't know to migrate to the builder API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside of not migrating is a delayed error message, right? Maybe it is ok if the error doesn't immediately happen on construction for all downstream users

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verification is in the try_new() method used by the builder. If the builder is not used, and this new try_new method isn't used directly either, the verification won't run.
I believe the downside of not migrating may be no error message at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CREATE TABLE succeeds when schema has duplicate names, resulting in a table that cannot be selected from
4 participants