WASM UDFs #9326

@crepererum

Is your feature request related to a problem or challenge?

Databases / ETL solutions built on top of DataFusion can use UDFs (in their various forms) to extend the functionality of DataFusion, e.g. to add new scalar or aggregation functions. This extensibility, however, does NOT automatically extend to the end users of those solutions, since end users cannot and (for security reasons) should not add code to the running system. So the U in UDF currently stands for "DataFusion API User", not for "End-User".

WASM provides a way to run user code in a secure sandbox under an "unknown" host (i.e. the user does not need to know about the operating system or CPU architecture). A DataFusion-based solution can use that to implement UDFs. However, since the calling convention between the host and the WASM payload is then defined by each solution individually, the end user is likely to have a hard time with it, and the ecosystem of tooling for developing UDFs is likely to be small or nonexistent.

Defining the UDF WASM interface in DataFusion -- potentially in collaboration with the Arrow project (since we need to get Arrow data across the WASM memory boundary) -- would likely facilitate a wider ecosystem and a more streamlined solution. Prior art for this is Arrow Flight, which is now being integrated into more and more server and client implementations.
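To illustrate what "across the WASM memory boundary" involves, here is a minimal sketch in plain Rust with no WASM runtime at all. Everything in it is an assumption made for illustration: `LinearMemory` is a stand-in for a module's linear memory, `alloc` stands in for an allocator the guest would export, and packing pointer and length into one `u64` is just one commonly used convention, not something this issue proposes as the final design. The payload is a bare run of little-endian `i64` values rather than real Arrow buffers.

```rust
/// Stand-in for a WASM module's linear memory: one flat, growable byte array.
struct LinearMemory {
    bytes: Vec<u8>,
}

impl LinearMemory {
    fn new() -> Self {
        LinearMemory { bytes: Vec::new() }
    }

    /// Hypothetical guest-exported allocator: reserves `len` bytes, returns the offset.
    fn alloc(&mut self, len: u32) -> u32 {
        let ptr = self.bytes.len() as u32;
        self.bytes.resize(ptr as usize + len as usize, 0);
        ptr
    }

    fn write(&mut self, ptr: u32, data: &[u8]) {
        let start = ptr as usize;
        self.bytes[start..start + data.len()].copy_from_slice(data);
    }

    fn read(&self, ptr: u32, len: u32) -> &[u8] {
        let start = ptr as usize;
        &self.bytes[start..start + len as usize]
    }
}

/// Pack a (pointer, length) pair into a single 64-bit return value.
fn pack(ptr: u32, len: u32) -> u64 {
    ((ptr as u64) << 32) | (len as u64)
}

fn unpack(value: u64) -> (u32, u32) {
    ((value >> 32) as u32, value as u32)
}

fn encode_i64s(values: &[i64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 8);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn decode_i64s(bytes: &[u8]) -> Vec<i64> {
    let mut out = Vec::with_capacity(bytes.len() / 8);
    for chunk in bytes.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        out.push(i64::from_le_bytes(buf));
    }
    out
}

/// Hypothetical guest-side scalar UDF: adds one to every value (wrapping on overflow).
/// It only ever sees offsets into linear memory, never host pointers.
fn guest_add_one(mem: &mut LinearMemory, ptr: u32, len: u32) -> u64 {
    let input = decode_i64s(mem.read(ptr, len));
    let output: Vec<i64> = input.iter().map(|v| v.wrapping_add(1)).collect();
    let out_bytes = encode_i64s(&output);
    let out_ptr = mem.alloc(out_bytes.len() as u32);
    mem.write(out_ptr, &out_bytes);
    pack(out_ptr, out_bytes.len() as u32)
}

/// Host side: copy the input in, call the guest, copy the result back out.
fn host_call(values: &[i64]) -> Vec<i64> {
    let mut mem = LinearMemory::new();
    let in_bytes = encode_i64s(values);
    let in_ptr = mem.alloc(in_bytes.len() as u32);
    mem.write(in_ptr, &in_bytes);
    let (out_ptr, out_len) = unpack(guest_add_one(&mut mem, in_ptr, in_bytes.len() as u32));
    decode_i64s(mem.read(out_ptr, out_len))
}
```

Calling `host_call(&[1, 2, 3])` copies the three values into the simulated memory and reads `[2, 3, 4]` back. The two copies per call are exactly the cost that a shared, well-defined layout for Arrow data would need to keep small.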

Describe the solution you'd like

  1. Define/find a way to pass Arrow data into and out of a WASM payload.
  2. Define the WASM calling convention for the different types of UDFs (scalar, aggregation, window functions, ...). Make sure to version that interface so we can advance it later (e.g. by using new WASM features).
  3. Implement UDFs using wasmtime in DataFusion.
  4. Offer some easy blueprint / framework to develop UDFs in at least two languages.
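As a rough sketch of the versioning idea in step 2, the host could require every guest function to declare its UDF kind and the interface version it was built against, and reject anything it does not understand. All names here (`UdfKind`, `UdfDescriptor`, `describe`, the numeric kind tags, and the version range 1..=2) are hypothetical and chosen only for illustration; none of them exist in DataFusion.

```rust
/// Hypothetical range of interface versions this host understands.
const MIN_SUPPORTED_VERSION: u32 = 1;
const MAX_SUPPORTED_VERSION: u32 = 2;

/// The UDF flavours named in step 2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UdfKind {
    Scalar,
    Aggregate,
    Window,
}

impl UdfKind {
    /// Decode the numeric tag a guest would export (tag values are an assumption).
    fn from_tag(tag: u32) -> Option<UdfKind> {
        match tag {
            0 => Some(UdfKind::Scalar),
            1 => Some(UdfKind::Aggregate),
            2 => Some(UdfKind::Window),
            _ => None,
        }
    }
}

/// What the host records about one function exported by a guest module.
#[derive(Debug, Clone, PartialEq, Eq)]
struct UdfDescriptor {
    name: String,
    kind: UdfKind,
    interface_version: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum RegistrationError {
    EmptyName,
    UnknownKind(u32),
    UnsupportedVersion(u32),
}

/// Validate the raw values declared by a guest before registering the UDF.
fn describe(
    name: &str,
    kind_tag: u32,
    interface_version: u32,
) -> Result<UdfDescriptor, RegistrationError> {
    if name.is_empty() {
        return Err(RegistrationError::EmptyName);
    }
    let kind = UdfKind::from_tag(kind_tag).ok_or(RegistrationError::UnknownKind(kind_tag))?;
    if interface_version < MIN_SUPPORTED_VERSION || interface_version > MAX_SUPPORTED_VERSION {
        return Err(RegistrationError::UnsupportedVersion(interface_version));
    }
    Ok(UdfDescriptor {
        name: name.to_string(),
        kind,
        interface_version,
    })
}
```

Checking the version at registration time, rather than at call time, means an incompatible payload fails early with a clear error, and the host can later widen the supported range when the interface starts using new WASM features.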

Describe alternatives you've considered

  • Do this as part of the DataFusion-based solution (i.e. downstream). See the drawbacks described in the introduction above.
  • Use other UDF interface types, such as Arrow IPC with a Python payload. That clearly has security issues and is harder to deploy/manage.

Additional context

Projects that might be helpful:
