Skip to content

Conversation

@NGA-TRAN
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Serialization and deserialization of physical plans are currently incomplete. This PR introduces the necessary fixes to ensure all TPC-H queries pass.

What changes are included in this PR?

Summary of Changes:

  1. Introduced pretty_assertions in the proto crate to highlight test output differences more clearly. This addition aids debugging, though I’m open to reverting it if it noticeably impacts crate size.
  2. Resolved the missing aggregate in human_display by adding a corresponding field to the proto definition.
  3. Corrected the date_part display function by introducing a new proto field to handle it properly.
  4. Fixed the issue of extra [] in non-group-by aggregates by checking for empty group-by conditions.
  5. Enabled roundtrip tests for all TPC-H queries to ensure correctness and completeness.

Are these changes tested?

Yes

  • Added three targeted tests—one for each case—using minimal queries to isolate issues effectively.
  • You're encouraged to copy these tests into your main branch before applying the fix to confirm that they fail as expected.
  • Also enabled all currently available TPC-H tests to ensure comprehensive coverage.

Are there any user-facing changes?

The explain for the roundtrip query is now more accurate

@github-actions github-actions bot added core Core DataFusion crate proto Related to proto crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate labels Jul 22, 2025
bool distinct = 3;
bool ignore_nulls = 6;
optional bytes fun_definition = 7;
string human_display = 8;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix for bug 1 that does not serialize human_display for aggregate function

optional bytes fun_definition = 3;
datafusion_common.ArrowType return_type = 4;
bool nullable = 5;
string return_field_name = 6;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix for bug 3 that does not serialize ScarlarFunction name

// Only 4 queries pass: q3, q5, q10, q12
// Ignore the test until the bug is fixed
#[ignore]
// Bugs: https://github.com/apache/datafusion/issues/16772
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turn on all 22 tpc-h queries

}
} else if group_expr.is_empty() {
// No GROUP BY clause - create empty PhysicalGroupBy
Ok(PhysicalGroupBy::new(vec![], vec![], vec![]))
Copy link
Contributor Author

@NGA-TRAN NGA-TRAN Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the places that fixes bug 2 that should return nothing for non-groupby

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should look into changing group_expr: Option<&[Expr]> or something so the compiler makes it more likely that we handle the difference between an emptyVecvsNone`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try

Copy link
Contributor Author

@NGA-TRAN NGA-TRAN Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb : I had a look. This would be very challenging and a lot of changes needed in physical plan, logical plan, parsing logic, serialization/deserialization. I do not think we want to make a big change just for this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something we can ask an AI agent to try in a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did, and those were the results I got. It even flagged a semantic error due to not distinguishing between no group and an empty one. Coding without AI these days feels almost impossible. 😄

Field::new(
&e.return_field_name,
convert_required!(e.return_type)?,
true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this field could ever be non nullable, for example for udfs defined by DF users, then the deserialization would result in a mismatch (not in the scope of this fixes though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here is to use e.return_field_name in deserialization. So whatever name of the UDF we serialize at line 356 of the file to_proto.rs in this PR, it will be deserialized here; and it will match. Does this answer your concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was refering to the nullability of the return Field, that is always being set to true (even before this PR) i was wondering if this could be an issue later if the output of the scalar function could contain nulls, but is unrelated to the changes & fixes of this PR, sorry for the confusion 😄

Copy link
Contributor

@LiaCastaneda LiaCastaneda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are nice finds, thanks! 🙇‍♀️

@NGA-TRAN
Copy link
Contributor Author

@alamb : When you have a moment, could you take a quick look? Thanks!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good change to me -- thank you @NGA-TRAN and @LiaCastaneda

}
} else if group_expr.is_empty() {
// No GROUP BY clause - create empty PhysicalGroupBy
Ok(PhysicalGroupBy::new(vec![], vec![], vec![]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should look into changing group_expr: Option<&[Expr]> or something so the compiler makes it more likely that we handle the difference between an emptyVecvsNone`

@alamb alamb merged commit a6d4798 into apache:main Jul 25, 2025
29 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 25, 2025

Thanks again everyone!

adriangb pushed a commit to pydantic/datafusion that referenced this pull request Jul 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Only 4 tpc-h queries have matching physical plans before serialization and after deserialization

3 participants