Skip to content

feat: Add Aggregate UDF to FFI crate #14775

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Jun 5, 2025

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Feb 19, 2025

Which issue does this PR close?

This PR addresses part of #14562

Rationale for this change

This change allows for using user defined aggregate functions across FFI boundaries. It is useful for enabling shared libraries to pass functions back and forth. This feature will unlock:

  • Modules to provide DataFusion based FFI aggregates that can be reused in projects such as datafusion-python
  • Allows for use across different DataFusion versions between the function provider and consumer
  • Is a major step towards getting to the ability to export a function registry via FFI, a large blocker on sharing a session context

What changes are included in this PR?

This PR follows the same pattern as the previous FFI code. It exposes AggregateUDF and the Accumulators and AccumulatorArgs that go along with it.

Are these changes tested?

Included in this PR are both unit tests and two integration tests. The integration tests cover both grouping and non-grouping UDAFs. For the unit tests here is a tarpaulin report showing the coverage. The 12% not covered is almost entirely error handling.

ffi_coverage ffi_udaf_coverage

Are there any user-facing changes?

There are no changes to existing API but new functions are exposed in the datafusion-ffi crate.

@timsaucer timsaucer force-pushed the feat/aggregate-udf-ffi branch 2 times, most recently from 9aae315 to 77fd002 Compare February 27, 2025 07:00
@timsaucer timsaucer marked this pull request as ready for review February 27, 2025 07:16
@timsaucer timsaucer changed the title feat: Add Aggregate UDF to FFI interface feat: Add Aggregate UDF to FFI crate Feb 27, 2025
m09526
m09526 previously requested changes Feb 28, 2025
Copy link
Contributor

@m09526 m09526 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my first substantial review of someone's else code. Please let me know if I am not following proper steps in reviewing!

@timsaucer
Copy link
Contributor Author

@m09526 Thank you for the review! I've been out of country all last week, so I will try to get back to this soon now that I'm back.

@alamb alamb mentioned this pull request Mar 4, 2025
12 tasks
@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

Let me know when this is ready for review

@timsaucer timsaucer marked this pull request as draft March 12, 2025 18:57
@timsaucer
Copy link
Contributor Author

Moved to draft until I get time to address the comments above. Additionally I discovered that Nullary type signatures do not make it properly through the type coercion, so that needs fixing before this can be widely used.

@timsaucer
Copy link
Contributor Author

The above Nullary issue also impacts the existing Scalar UDFs.

@timsaucer
Copy link
Contributor Author

Now that #15487 is resolved, this is unblocked.

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate ffi Changes to the ffi crate labels Apr 2, 2025
@alamb
Copy link
Contributor

alamb commented Apr 7, 2025

Now that #15487 is resolved, this is unblocked.

Shall we mark it as "ready for review"?

@crystalxyz
Copy link
Contributor

Now that #15487 is resolved, this is unblocked.

Shall we mark it as "ready for review"?

@alamb I'm helping Tim on this and I'm still addressing some comments. It shouldn't take too long. Will mark it ready for review when I'm done!

@crystalxyz
Copy link
Contributor

@timsaucer @alamb Should be good for review now. Thanks!

Copy link
Contributor Author

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! Thank you so much for pulling this over the line.

I have a few suggestions, but the core functionality looks correct to me. As one of the authors I cannot approve the pull request.

Comment on lines +42 to +50
pub struct FFI_GroupsAccumulator {
pub update_batch: unsafe extern "C" fn(
accumulator: &mut Self,
values: RVec<WrappedArray>,
group_indices: RVec<usize>,
opt_filter: ROption<WrappedArray>,
total_num_groups: usize,
) -> RResult<(), RString>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add documentation for the struct, but it should be very similar to the others except pointing to GroupsAccumulator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally suggest not implementing the GroupsAccumulator in the first FFI implementation -- it is hard to implement correctly as is.

Perhaps we can avoid this for now 🤔

Though I suppose GroupsAccumulator is basically required for high performance aggregation 🤔

@timsaucer
Copy link
Contributor Author

@robtandy As someone keenly interested in the FFI work, would you be able to do a review?

@timsaucer timsaucer marked this pull request as ready for review April 10, 2025 15:01
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks @timsaucer -- this looks like it is pretty close. It has a few unresolved comments, but then would be ready to go it seems

Comment on lines +42 to +50
pub struct FFI_GroupsAccumulator {
pub update_batch: unsafe extern "C" fn(
accumulator: &mut Self,
values: RVec<WrappedArray>,
group_indices: RVec<usize>,
opt_filter: ROption<WrappedArray>,
total_num_groups: usize,
) -> RResult<(), RString>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally suggest not implementing the GroupsAccumulator in the first FFI implementation -- it is hard to implement correctly as is.

Perhaps we can avoid this for now 🤔

Though I suppose GroupsAccumulator is basically required for high performance aggregation 🤔

@timsaucer timsaucer force-pushed the feat/aggregate-udf-ffi branch from ec4dda9 to 4282c2a Compare June 4, 2025 19:22
@timsaucer
Copy link
Contributor Author

@alamb I believe this one is ready.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @timsaucer -- I think this looks good to me. I didn't have time to review all the APIs in detail, but I think we need to get the big initial PR in and then work on fixing any blemishes over time

@alamb alamb dismissed m09526’s stale review June 5, 2025 18:17

Requested changes were made

@alamb
Copy link
Contributor

alamb commented Jun 5, 2025

Thank you very much @m09526 for the review

@alamb
Copy link
Contributor

alamb commented Jun 5, 2025

I merged up one more time to make sure the CI tests pass but assuming they do I plan to merge this one in

@alamb
Copy link
Contributor

alamb commented Jun 5, 2025

It seems github is experiencing issues. I will close/reopen this PR to restart the checks

https://www.githubstatus.com/

Screenshot 2025-06-05 at 3 07 37 PM

@alamb alamb closed this Jun 5, 2025
@alamb alamb reopened this Jun 5, 2025
@alamb alamb merged commit 5d3ed9c into apache:main Jun 5, 2025
45 of 58 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 5, 2025

gogogo

kosiew pushed a commit to kosiew/datafusion that referenced this pull request Jun 9, 2025
* Work in progress adding user defined aggregate function FFI support

* Intermediate work. Going through groups accumulator

* MVP for aggregate udf via FFI

* Clean up after rebase

* Add unit test for FFI Accumulator Args

* Adding unit tests and fixing memory errors in aggregate ffi udf

* Working through additional unit and integration tests for UDAF ffi

* Switch to a accumulator that supports convert to state to get a little better coverage

* Set feature so we do not get an error warning in stable rustc

* Add more options to test

* Add unit test for FFI RecordBatchStream

* Add a few more args to ffi accumulator test fn

* Adding more unit tests on ffi aggregate udaf

* taplo format

* Update code comment

* Correct function name

* Temp fix record batch test dependencies

* Address some comments

* Revise comments and address PR comments

* Remove commented code

* Refactor GroupsAccumulator

* Add documentation

* Split integration tests

* Address comments to refactor error handling for opt filter

* Fix linting errors

* Fix linting and add deref

* Remove extra tests and unnecessary code

* Adjustments to FFI aggregate functions after rebase on main

* cargo fmt

* cargo clippy

* Re-implement cleaned up code that was removed in last push

* Minor review comments

---------

Co-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate ffi Changes to the ffi crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants