-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fixes 3 bugs during serialization and deserialization of physical plans #16858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| bool distinct = 3; | ||
| bool ignore_nulls = 6; | ||
| optional bytes fun_definition = 7; | ||
| string human_display = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix for bug 1 that does not serialize human_display for aggregate function
| optional bytes fun_definition = 3; | ||
| datafusion_common.ArrowType return_type = 4; | ||
| bool nullable = 5; | ||
| string return_field_name = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix for bug 3 that does not serialize ScarlarFunction name
| // Only 4 queries pass: q3, q5, q10, q12 | ||
| // Ignore the test until the bug is fixed | ||
| #[ignore] | ||
| // Bugs: https://github.com/apache/datafusion/issues/16772 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turn on all 22 tpc-h queries
| } | ||
| } else if group_expr.is_empty() { | ||
| // No GROUP BY clause - create empty PhysicalGroupBy | ||
| Ok(PhysicalGroupBy::new(vec![], vec![], vec![])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the places that fixes bug 2 that should return nothing for non-groupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should look into changing group_expr: Option<&[Expr]> or something so the compiler makes it more likely that we handle the difference between an emptyVecvsNone`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb : I had a look. This would be very challenging and a lot of changes needed in physical plan, logical plan, parsing logic, serialization/deserialization. I do not think we want to make a big change just for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something we can ask an AI agent to try in a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did, and those were the results I got. It even flagged a semantic error due to not distinguishing between no group and an empty one. Coding without AI these days feels almost impossible. 😄
| Field::new( | ||
| &e.return_field_name, | ||
| convert_required!(e.return_type)?, | ||
| true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this field could ever be non nullable, for example for udfs defined by DF users, then the deserialization would result in a mismatch (not in the scope of this fixes though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change here is to use e.return_field_name in deserialization. So whatever name of the UDF we serialize at line 356 of the file to_proto.rs in this PR, it will be deserialized here; and it will match. Does this answer your concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was refering to the nullability of the return Field, that is always being set to true (even before this PR) i was wondering if this could be an issue later if the output of the scalar function could contain nulls, but is unrelated to the changes & fixes of this PR, sorry for the confusion 😄
LiaCastaneda
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are nice finds, thanks! 🙇♀️
|
@alamb : When you have a moment, could you take a quick look? Thanks! |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good change to me -- thank you @NGA-TRAN and @LiaCastaneda
| } | ||
| } else if group_expr.is_empty() { | ||
| // No GROUP BY clause - create empty PhysicalGroupBy | ||
| Ok(PhysicalGroupBy::new(vec![], vec![], vec![])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should look into changing group_expr: Option<&[Expr]> or something so the compiler makes it more likely that we handle the difference between an emptyVecvsNone`
|
Thanks again everyone! |
Which issue does this PR close?
Rationale for this change
Serialization and deserialization of physical plans are currently incomplete. This PR introduces the necessary fixes to ensure all TPC-H queries pass.
What changes are included in this PR?
Summary of Changes:
Are these changes tested?
Yes
Are there any user-facing changes?
The explain for the roundtrip query is now more accurate