Skip to content

Support User Defined Table Functions / Table Value Functions #7926

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

It is sometimes helpful to have a custom table functions to extend DataFusion's functionality. As we continue to get more feature requests, such as #7859 it is important to support such usecases without having to add everything to the DataFusion core.

For example in the following query my_custom_fun is a table function

SELECT foo, bar FROM my_custom_fun(city='NYC', year=2020)

A specific example might be a function that fetches the contents of remote csv file and parses it into a table.

SELECT date, value FROM parse_remote_csv('https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD')

You can do something similar to this with a TableProvider, but the main differences are:

  1. A TableProvider has no way to pass parameters
  2. A TableProvider's schema is fixed (it can't be a function of the parameters)

Prior Art

Other examples include the read_parquet etc functions in DuckDB

SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
...
SELECT * FROM parquet_schema('test.parquet');

Describe the solution you'd like

I would like to be able to have a table function that supported everything a TableProvider does, including filter and projection pushdown. One way to do so would be to actually return a TableProvider:

Option 1: Add to FunctionRegistry:

We could add Table Functions to the datafusion::execution::FunctionRegistry along with the UDFs, UDAs, etc which arguably would make them easier to discover

Something like

trait FunctionRegistry  { 
...
/// Return a  `TableProvider` for executing the `name` table function
fn udtf(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;

}

We would probably also need a

trait TableUDF {
  /// Return a  `TableProvider` for executing this  table function, given the specified
  /// arguments
  fn invoke(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
}

Describe alternatives you've considered

This API is very powerful and would allow Table Functions to do anything a table provider does. We could also offer a stripped down version of the API potentially

We can probably add something like datafusion::logical_expr::create_udf to make it easier to construct basic table functions (e.g that produce a single SendableRecordBatchStream)

Add to SchemaProvider:

We could also add Table Functions to datafusion::catalog::schema::SchemaProvider

This might make sense given how similar TableFunctions are to TableProviders

trait SchemaProvider  { 
...
/// Return a  `TableProvider` for executing the `name` table function, given the specified
/// arguments
fn table_function(name: &str) -> Result<Arc<dyn TableProvider>>;

/// Register the `TableFunction` with the specified name, returning the previously registered function, if any
fn register_table_function(name: &str) -> Result<Option<Arc<dyn TableProvider>>>;
...
}

Additional context

I thought there was an existing ticket for this, but I can not find one

This came up several times, including:

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions