Description
Is your feature request related to a problem or challenge?
It is sometimes helpful to have a custom table functions to extend DataFusion's functionality. As we continue to get more feature requests, such as #7859 it is important to support such usecases without having to add everything to the DataFusion core.
For example in the following query my_custom_fun
is a table function
SELECT foo, bar FROM my_custom_fun(city='NYC', year=2020)
A specific example might be a function that fetches the contents of remote csv file and parses it into a table.
SELECT date, value FROM parse_remote_csv('https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD')
You can do something similar to this with a TableProvider, but the main differences are:
- A
TableProvider
has no way to pass parameters - A
TableProvider
's schema is fixed (it can't be a function of the parameters)
Prior Art
Other examples include the read_parquet
etc functions in DuckDB
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
...
SELECT * FROM parquet_schema('test.parquet');
Describe the solution you'd like
I would like to be able to have a table function that supported everything a TableProvider
does, including filter and projection pushdown. One way to do so would be to actually return a TableProvider
:
Option 1: Add to FunctionRegistry
:
We could add Table Functions to the datafusion::execution::FunctionRegistry along with the UDFs, UDAs, etc which arguably would make them easier to discover
Something like
trait FunctionRegistry {
...
/// Return a `TableProvider` for executing the `name` table function
fn udtf(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
}
We would probably also need a
trait TableUDF {
/// Return a `TableProvider` for executing this table function, given the specified
/// arguments
fn invoke(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
}
Describe alternatives you've considered
This API is very powerful and would allow Table Functions to do anything a table provider does. We could also offer a stripped down version of the API potentially
We can probably add something like datafusion::logical_expr::create_udf to make it easier to construct basic table functions (e.g that produce a single SendableRecordBatchStream
)
Add to SchemaProvider
:
We could also add Table Functions to datafusion::catalog::schema::SchemaProvider
This might make sense given how similar TableFunction
s are to TableProvider
s
trait SchemaProvider {
...
/// Return a `TableProvider` for executing the `name` table function, given the specified
/// arguments
fn table_function(name: &str) -> Result<Arc<dyn TableProvider>>;
/// Register the `TableFunction` with the specified name, returning the previously registered function, if any
fn register_table_function(name: &str) -> Result<Option<Arc<dyn TableProvider>>>;
...
}
Additional context
I thought there was an existing ticket for this, but I can not find one
This came up several times, including: