feat: extend unnest to support Struct datatype#10429
Conversation
datafusion/expr/src/expr_schema.rs
Outdated
| } | ||
| DataType::Struct(_) => { | ||
| not_impl_err!("unnest() does not support struct yet") | ||
| // TODO: this is not correct, because unnest(struct) wll result into multiple data_type |
There was a problem hiding this comment.
I'm still finding a way to solve this TODO
There was a problem hiding this comment.
When do we end up with an unest Expr (vs an LogicalPlan::Unnest).
Could we simply return not_yet_implemented error here?
There was a problem hiding this comment.
if we returns error here, the error will stop the query plan from continue.
Initially unnest comes into projection step as an Expr here:
datafusion/datafusion/expr/src/utils.rs
Line 734 in d58bae4
try_process_unnest to do the transformation from unnest(struct) -> struct.field1, struct.field 2 ...
I think unnest is the only expr so far that represent a set of fields instead of singular value here, so i'm thinking of adding an extra check at this function
datafusion/datafusion/expr/src/utils.rs
Line 734 in d58bae4
There was a problem hiding this comment.
Thank you @duongcongtoai -- I think this is really cool.
I played around with this PR a bit locally and it works nicely. I have a few suggested tests and some simplification suggestions, but I also think those could be done as follow on PRs. Please let me know what you think
datafusion/expr/src/expr_schema.rs
Outdated
| } | ||
| DataType::Struct(_) => { | ||
| not_impl_err!("unnest() does not support struct yet") | ||
| // TODO: this is not correct, because unnest(struct) wll result into multiple data_type |
There was a problem hiding this comment.
When do we end up with an unest Expr (vs an LogicalPlan::Unnest).
Could we simply return not_yet_implemented error here?
| unnest_with_options(input, columns, UnnestOptions::default()) | ||
| } | ||
|
|
||
| pub fn get_unnested_columns( |
There was a problem hiding this comment.
Could you please add some documentation explaining what this function does?
Also, I wonder if it needs to be pub or if we could leave it pub(crate) or not pub?
| 5 [4, 5] 3 4 {c0: 3, c1: 4} | ||
| 6 [6] NULL NULL NULL | ||
| 12 [12] 7 8 {c0: 7, c1: 8} | ||
|
|
There was a problem hiding this comment.
Can we please add some other tests:
- Test the output type (e.g.
select arrow_typeof(unnest(column1))) - Test nested structs (I tested it works well, but I think we should have some coverage)
Nested structs
> create or replace table t as values (struct('a', 'b', struct('c'))), (struct('d', 'e', struct('f')));
0 row(s) fetched.
Elapsed 0.031 seconds.
> select * from t;
+-----------------------------+
| column1 |
+-----------------------------+
| {c0: a, c1: b, c2: {c0: c}} |
| {c0: d, c1: e, c2: {c0: f}} |
+-----------------------------+
2 row(s) fetched.
Elapsed 0.006 seconds.
> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a | b | {c0: c} |
| d | e | {c0: f} |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
Elapsed 0.013 seconds.And
> create or replace table t as values (struct('a', 'b', [1,2,3])), (struct('x', 'y', [10,20]));
0 row(s) fetched.
Elapsed 0.010 seconds.
> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a | b | [1, 2, 3] |
| x | y | [10, 20] |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.There was a problem hiding this comment.
Interestingly the output type doesn't seem to work (which probably makes sense as unnest needs to be converted to a LogicalPlan). I don't think we have to fix this in the PR (but adding coverage would be good)
> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a | b | [1, 2, 3] |
| x | y | [10, 20] |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
> select arrow_typeof(unnest(column1)) from t;
Internal error: unnest on struct can ony be applied at the root level of select expression.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue trackerThere was a problem hiding this comment.
thank you, i added more coverage
datafusion/sql/src/utils.rs
Outdated
|
|
||
| // unnest(struct_col) | ||
| let original_expr = Expr::Unnest(Unnest { | ||
| expr: Box::new(Expr::Column(Column::from_name("struct_col"))), |
There was a problem hiding this comment.
I think you can make this a lot more concise using the expr_fn API, like:
| expr: Box::new(Expr::Column(Column::from_name("struct_col"))), | |
| expr: Box::new(col("struct_col")) |
In fact maybe we should add an expr_fn for unnest
/// Call `unnest` on the expression
fn unnest(arg: Expr) -> Expr {
Expr::Unnest(Unnest { expr:: Box::new(arg) } )
}
datafusion/sql/src/utils.rs
Outdated
| let original_expr = Expr::Unnest(Unnest { | ||
| expr: Box::new(Expr::Column(Column::from_name("array_col"))), | ||
| }) | ||
| .add(Expr::Literal(ScalarValue::Int64(Some(1)))); |
There was a problem hiding this comment.
| .add(Expr::Literal(ScalarValue::Int64(Some(1)))); | |
| .add(lit(1i64)); |
There are similar simplifications we could do below too
unnest to support Struct datatype
alamb
left a comment
There was a problem hiding this comment.
Thanks again @duongcongtoai -- really nice improvement
* feat: extend unnest for struct * compile err * debugging * finish basic * chore: complete impl * chore: clean garbage * chore: more test * test: fix df test * prettify display * fix unit test * chore: compile err * chore: fix physical exec err * add sqllogic test * chore: more doc * chore: refactor * fix doc * fmt * fix doc * ut for recursive transform unnest * a small integration test * fix comment
Which issue does this PR close?
Closes #10264.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?